Build an Automated Machine Learning Pipeline

S3 data is extracted, transformed, and fed into a trained AutoML model for prediction before the output is loaded into a target database.

This guide walks you through creating an end-to-end, automated machine learning workflow on the BDB Platform using a no-code/low-code approach.

You will learn how to:

  1. Read data from Amazon S3

  2. Clean and transform data using Data Preparation

  3. Train and run an AutoML model

  4. Write prediction results into a target database (ClickHouse)

Prerequisites:

Before you begin, ensure that:

  • You have access to the BDB Platform with Data Center, DS Lab, and Data Pipeline modules enabled.

  • You have valid credentials for the S3 bucket and the target database.

  • A CSV dataset (e.g., hiring, telecom, or customer data) is ready to upload.

Step 1: Create a Sandbox and Upload the CSV File

Purpose

The Sandbox serves as a secure staging area for uploading and exploring data before transformation.

Procedure

  1. From the BDB Platform homepage, click the Apps icon and open the Data Center module.

  2. Inside the Data Center, select the Sandbox tab and click Create.

  3. Upload your CSV file using one of the following methods:

    • Drag and drop the file into the upload area, or

    • Click Browse and select the file from your local system.

  4. After the file has been added, click Upload to complete the upload.
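
Before uploading, you can optionally sanity-check the CSV on your local machine. The snippet below is a minimal sketch, assuming Python and pandas are available and that the file is named hiring_data.csv (a placeholder):

```python
import pandas as pd

# Placeholder file name; use the CSV you intend to upload to the Sandbox.
df = pd.read_csv("hiring_data.csv")

print(df.shape)          # number of rows and columns
print(df.dtypes)         # column names and data types
print(df.isna().sum())   # missing values per column
```

This check is not required by the platform; it simply helps you confirm the schema and spot missing values before the transformations in Step 2.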

Step 2: Prepare and Transform the Data

Purpose

Data Preparation allows you to clean, filter, and structure your dataset for AutoML model training.

Procedure

  1. From the Sandbox List, click the three dots (⋮) beside your sandbox name.

  2. Select Create Data Preparation to launch the Data Preparation workspace.

  3. Perform the following transformations:

    • Delete the “Gender” column:

      • Select the column → Click Transforms → Delete Column.

    • Remove empty rows:

      • Choose the Previous CTC and Offered CTC columns.

      • Click Transforms → Delete Empty Rows.

  4. Rename your Data Preparation for easier identification.

  5. Review the Steps Tab to confirm all transformations.

  6. Click Save to finalize the data preparation.
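
For reference, the two transformations above are equivalent to the following pandas sketch (file and column names follow the example hiring dataset and are placeholders):

```python
import pandas as pd

df = pd.read_csv("hiring_data.csv")  # placeholder file name

# Delete the "Gender" column.
df = df.drop(columns=["Gender"])

# Remove rows where Previous CTC or Offered CTC is empty.
df = df.dropna(subset=["Previous CTC", "Offered CTC"])

df.to_csv("hiring_data_prepared.csv", index=False)
```

The Data Preparation workspace applies the same logic visually and records each operation in the Steps Tab.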

Step 3: Create and Run an AutoML Experiment

Purpose

AutoML automates model training by testing multiple algorithms, comparing results, and selecting the most accurate model.

Procedure

  1. Navigate to the Data Science Lab module.

  2. Click on the AutoML tab and select Create Experiment.

  3. Configure your experiment:

    • Experiment Name: Hiring Data

    • Experiment Type: Classification

  4. Under Configure Dataset:

    • Select Sandbox as the data source.

    • Choose CSV as the file type.

    • Select your uploaded Sandbox file.

  5. Under Advanced Information:

    • Choose the Data Preparation process you created earlier.

    • Set Target Column to Gender.

  6. Click Save to start the experiment.

Monitor the AutoML Run

  • AutoML automatically tests multiple models, tunes hyperparameters, and evaluates metrics.

  • Once the run completes, click View Report to see:

    • Model accuracy and evaluation metrics

    • Comparison of all tested models

    • Recommended best-fit model
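
Conceptually, the AutoML run automates a model-selection loop like the scikit-learn sketch below. This is an illustration only, not the platform's implementation; the file name and the target column Status are placeholders, and the sketch assumes the prepared dataset contains only numeric feature columns:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("hiring_data_prepared.csv")   # placeholder file name
X = df.drop(columns=["Status"])                # "Status" is a placeholder target column
y = df["Status"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Try several candidate algorithms and keep the most accurate one.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
print(scores)
print("Best-fit model:", best)
```

AutoML additionally tunes hyperparameters and reports richer evaluation metrics than plain accuracy.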

Step 4: Register the Best Model

Purpose

Registering the model makes it reusable across pipelines for real-time or batch prediction.

Procedure

  1. Open the Model section within DS Lab.

  2. Locate the model you trained with AutoML.

  3. Click the Register (arrow) icon.

  4. Confirm registration.

Step 5: Create a Data Pipeline

Purpose

The Data Pipeline automates data ingestion, transformation, model execution, and output storage.

Procedure

  1. From the Apps Menu, select the Data Pipeline module.

  2. Click Pipeline → Create.

  3. Enter the following:

    • Pipeline Name

    • Description

    • Resource Allocation: Low, Medium, or High

  4. Click Save to create the pipeline.

Step 6: Add and Configure Components

Each component in the pipeline represents a functional step — from data ingestion to model inference and output writing.

6.1 Add an S3 Reader Component

Purpose: Read source data from S3 into the pipeline.

Steps:

  1. From the Components Palette, search for S3 Reader.

  2. Drag and drop the component onto the canvas.

  3. In the Basic Information tab, set:

    • Invocation Type: Real-Time

  4. In the Meta Information tab, enter:

    • Region

    • File Type: CSV

    • Bucket Name

    • Access Key

    • Path Info (full S3 path to the file)

  5. Save the configuration.

  6. From the Event Panel, click + to create a Kafka Event, then connect it to the S3 Reader.
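
If the component fails to read the file, you can verify the same Region, Bucket Name, Access Key, and Path Info outside the platform. Below is a minimal sketch using boto3 and pandas; all values are placeholders:

```python
import boto3
import pandas as pd

# Use the same values you entered in the S3 Reader's Meta Information tab.
s3 = boto3.client(
    "s3",
    region_name="us-east-1",                  # Region
    aws_access_key_id="YOUR_ACCESS_KEY",      # Access Key
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Bucket Name plus the object-key portion of Path Info.
obj = s3.get_object(Bucket="your-bucket", Key="path/to/hiring_data.csv")
df = pd.read_csv(obj["Body"])
print(df.head())
```

If this read succeeds but the component does not, recheck the Path Info format and the component's Region and File Type settings.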

6.2 Add a Data Preparation Component

Purpose: Apply the transformations defined earlier.

Steps:

  1. From the Transformation section, drag and drop the Data Preparation component.

  2. Set Invocation Type: Batch

  3. Under Meta Information, specify:

    • Data Center Type: Data Sandbox

    • Sandbox Name: Select your Sandbox

    • Preparation: Choose your saved Data Preparation

  4. Save the component.

  5. Add a Kafka Event and connect it to the Data Preparation component.

6.3 Add an AutoML Component

Purpose: Execute the trained model for predictions.

Steps:

  1. Search for the AutoML Component and drag it onto the canvas.

  2. Set Invocation Type: Batch

  3. In Meta Information, select:

    • Model Name: Choose your registered model

  4. Save the configuration.

  5. Add and connect a Kafka Event.
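
Conceptually, the AutoML component performs batch inference: it loads the registered model and appends a prediction to every incoming record. The sketch below illustrates the idea with a locally saved scikit-learn model; the file names are placeholders, and registered BDB models are not exported this way:

```python
import joblib
import pandas as pd

# Placeholder files; illustration of batch scoring only.
model = joblib.load("registered_model.pkl")
batch = pd.read_csv("hiring_data_prepared.csv")

# Assumes the batch contains the same feature columns the model was trained on.
batch["prediction"] = model.predict(batch)
print(batch.head())
```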

6.4 Add a DB Writer Component

Purpose: Store the final model output in a target database (ClickHouse).

Steps:

  1. From the Writer Components, drag and drop the DB Writer onto the canvas.

  2. Set Invocation Type: Batch

  3. In the Meta Information tab, fill in:

    • Host

    • Port

    • Database Name

    • Table Name

    • Username

    • Password

    • Driver: ClickHouse

    • Save Mode: Append

  4. Click Validate Connection, then Save.
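
If Validate Connection fails, you can test the same Host, Port, Database Name, Username, and Password from outside the platform. A minimal sketch using the clickhouse-driver package (all values are placeholders; port 9000 is ClickHouse's native-protocol default):

```python
from clickhouse_driver import Client

client = Client(
    host="your-clickhouse-host",
    port=9000,
    user="your_username",
    password="your_password",
    database="your_database",
)

# A trivial query confirms connectivity and credentials.
print(client.execute("SELECT 1"))

# Confirm the target table exists before the pipeline writes to it.
print(client.execute("SHOW TABLES"))
```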

Step 7: Activate and Run the Pipeline

Purpose

Activating the pipeline executes the end-to-end data flow and lets you validate the model predictions.

Procedure

  1. Click the Activate icon on the toolbar.

  2. Wait for all pipeline pods to deploy.

  3. Once active, go to the Logs tab to monitor execution.

  4. Check each component’s logs for:

    • Status updates

    • Record counts

    • Errors or warnings

Monitor Data Flow

  • S3 Reader: Confirms successful data ingestion.

  • Data Preparation: Validates data cleaning and transformations.

  • AutoML Component: Displays prediction output and model metrics.

  • DB Writer: Confirms successful data insertion into the database.

Tip: Click on any Event → Preview Tab to view sample data processed at that stage. Review logs for confirmation messages.

Deactivate the Pipeline

Once the pipeline run is complete and verified:

  • Click Deactivate to release resources and prevent additional compute costs.

Step 8: Verify and Analyze the Results

  1. Open your target database (ClickHouse).

  2. Verify that the output table contains prediction results.

  3. Use these records for:

    • BI dashboards

    • Reporting

    • Real-time monitoring
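
A quick way to confirm that the predictions landed is to query the target table directly. The sketch below uses clickhouse-driver; the connection values and the table name prediction_results are placeholders, so substitute the Table Name you configured in the DB Writer:

```python
from clickhouse_driver import Client

client = Client(
    host="your-clickhouse-host",
    port=9000,
    user="your_username",
    password="your_password",
    database="your_database",
)

row_count = client.execute("SELECT count(*) FROM prediction_results")[0][0]
sample = client.execute("SELECT * FROM prediction_results LIMIT 5")

print("Rows written:", row_count)
print(sample)
```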

Troubleshooting

  • S3 Reader fails to connect. Possible cause: incorrect credentials or an invalid bucket path. Solution: verify the AWS access keys and the bucket policy.

  • Data Preparation error. Possible cause: schema mismatch. Solution: ensure the Sandbox schema matches the columns referenced by the Data Preparation transforms.

  • AutoML model not found. Possible cause: the model was not registered. Solution: register the model in DS Lab before creating the pipeline.

  • DB Writer error. Possible cause: incorrect database configuration. Solution: validate the credentials, driver, and table name.

Next Steps

  • Integrate the generated ClickHouse table into a BDB Report or Dashboard Designer for visualization.

  • Schedule the pipeline using the BDB Job Scheduler for recurring execution.

  • Add monitoring alerts for automatic failure notifications.

Summary

You have successfully built a fully automated ML pipeline using the BDB Platform. The workflow:

  1. Reads data from S3

  2. Applies Data Preparation transforms

  3. Runs AutoML predictions

  4. Writes results into a target database

This pipeline provides a scalable, production-ready ML framework for continuous data-to-insight operations — all without writing a single line of code.
