Build an Automated Machine Learning Pipeline
This guide walks you through creating an end-to-end, automated machine learning workflow on the BDB Platform using a no-code/low-code approach: data is extracted from S3, transformed, and fed into a trained AutoML model for prediction before the output is loaded into a target database.
You will learn how to:
Read data from Amazon S3
Clean and transform data using Data Preparation
Train and run an AutoML model
Write prediction results into a target database (ClickHouse)
Step 1: Create a Sandbox and Upload the CSV File
Purpose
The Sandbox serves as a secure staging area for uploading and exploring data before transformation.
Procedure
From the BDB Platform homepage, click the Apps icon and open the Data Center module.
Inside the Data Center, select the Sandbox tab and click Create.
Upload your CSV file using one of the following methods:
Drag and drop the file into the upload area, or
Click Browse and select the file from your local system.
Click Upload and wait for the success confirmation.

Step 2: Prepare and Transform the Data
Purpose
Data Preparation allows you to clean, filter, and structure your dataset for AutoML model training.
Procedure
From the Sandbox List, click the three dots (⋮) beside your sandbox name.
Select Create Data Preparation to launch the Data Preparation workspace.
Perform the following transformations:
Delete the “Gender” column:
Select the column → Click Transforms → Delete Column.
Remove empty rows:
Choose the Previous CTC and Offered CTC columns.
Click Transforms → Delete Empty Rows.
Rename your Data Preparation for easier identification.
Review the Steps Tab to confirm all transformations.
Click Save to finalize the data preparation.
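For reference, the same two transformations can be expressed in a few lines of pandas. This is only an illustrative sketch using the column names from this guide (the file names are hypothetical), not what the Data Preparation module runs internally:

```python
import pandas as pd

# Load the raw hiring dataset (file name is hypothetical).
df = pd.read_csv("hiring_data.csv")

# Delete the "Gender" column.
df = df.drop(columns=["Gender"])

# Remove rows where Previous CTC or Offered CTC is empty.
df = df.dropna(subset=["Previous CTC", "Offered CTC"])

df.to_csv("hiring_data_prepared.csv", index=False)
```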

Step 3: Create and Run an AutoML Experiment
Purpose
AutoML automates model training by testing multiple algorithms, comparing results, and selecting the most accurate model.
Procedure
Navigate to the Data Science Lab module.
Click on the AutoML tab and select Create Experiment.
Configure your experiment:
Experiment Name: Hiring Data
Experiment Type: Classification
Under Configure Dataset:
Select Sandbox as the data source.
Choose CSV as the file type.
Select your uploaded Sandbox file.
Under Advanced Information:
Choose the Data Preparation process you created earlier.
Set Target Column to Gender.
Click Save to start the experiment.


Monitor the AutoML Run
AutoML automatically tests multiple models, tunes hyperparameters, and evaluates metrics.
Once the run completes, click View Report to see:
Model accuracy and evaluation metrics
Comparison of all tested models
Recommended best-fit model
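Conceptually, an AutoML run resembles the scikit-learn sketch below: several candidate algorithms are trained, scored with cross-validation, and the best performer is kept. This is a simplified illustration of the idea with a stand-in dataset, not the platform's actual search:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in dataset; in the real experiment this is the prepared Sandbox data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# A small, illustrative pool of candidate algorithms.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Score each candidate with 5-fold cross-validation and keep the best.
scores = {name: cross_val_score(est, X, y, cv=5, scoring="accuracy").mean()
          for name, est in candidates.items()}
best = max(scores, key=scores.get)
print(f"Best model: {best} (mean accuracy {scores[best]:.3f})")
```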
Step 4: Register the Best Model
Purpose
Registering the model makes it reusable across pipelines for real-time or batch prediction.
Procedure
Open the Model section within DS Lab.
Locate the model you trained with AutoML.
Click the Register (arrow) icon.
Confirm registration.

Step 5: Create a Data Pipeline
Purpose
The Data Pipeline automates data ingestion, transformation, model execution, and output storage.
Procedure
From the Apps Menu, select the Data Pipeline module.
Click Pipeline → Create.
Enter the following:
Pipeline Name
Description
Resource Allocation: Low, Medium, or High
Click Save to create the pipeline.
Step 6: Add and Configure Components
Each component in the pipeline represents a functional step — from data ingestion to model inference and output writing.
6.1 Add an S3 Reader Component
Purpose: Read source data from S3 into the pipeline.
Steps:
From the Components Palette, search for S3 Reader.
Drag and drop the component onto the canvas.
In the Basic Information tab, set:
Invocation Type: Real-Time
In the Meta Information tab, enter:
Region
File Type: CSV
Bucket Name
Access Key
Path Info (full S3 path to the file)
Save the configuration.
From the Event Panel, click + to create a Kafka Event, then connect it to the S3 Reader.
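Under the hood, reading a CSV from S3 amounts to the boto3 call sketched below; the region, bucket, key, and credentials are placeholders mirroring the Meta Information fields above:

```python
import boto3
import pandas as pd

# Placeholder values matching the S3 Reader's Meta Information fields.
s3 = boto3.client(
    "s3",
    region_name="us-east-1",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

obj = s3.get_object(Bucket="your-bucket", Key="path/to/hiring_data.csv")
df = pd.read_csv(obj["Body"])  # stream the CSV body straight into pandas
print(df.shape)
```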
6.2 Add a Data Preparation Component
Purpose: Apply the transformations defined earlier.
Steps:
From the Transformation section, drag and drop the Data Preparation component.
Set Invocation Type: Batch
Under Meta Information, specify:
Data Center Type: Data Sandbox
Sandbox Name: Select your Sandbox
Preparation: Choose your saved Data Preparation
Save the component.
Add a Kafka Event and connect it to the Data Preparation component.
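The Kafka Events that link components are essentially topics: one stage produces its output records to a topic and the next stage consumes them. A minimal kafka-python sketch of that handoff (the topic name and broker address are hypothetical):

```python
from kafka import KafkaProducer, KafkaConsumer

# One stage publishes its output records to a topic...
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("prepared-data-event", value=b'{"Previous CTC": 50000, "Offered CTC": 65000}')
producer.flush()

# ...and the next stage consumes them.
consumer = KafkaConsumer(
    "prepared-data-event",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value)
```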
6.3 Add an AutoML Component
Purpose: Execute the trained model for predictions.
Steps:
Search for the AutoML Component and drag it onto the canvas.
Set Invocation Type: Batch
In Meta Information, select:
Model Name: Choose your registered model
Save the configuration.
Add and connect a Kafka Event.
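Conceptually, the AutoML component loads the registered model and scores each incoming batch, along the lines of this hypothetical sketch (the model file name and input file are placeholders):

```python
import joblib
import pandas as pd

# Load a previously trained and registered model (file name is hypothetical).
model = joblib.load("registered_automl_model.joblib")

# Score the prepared batch and attach predictions to the records.
batch = pd.read_csv("hiring_data_prepared.csv")
batch["prediction"] = model.predict(batch)
```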
6.4 Add a DB Writer Component
Purpose: Store the final model output in a target database (ClickHouse).
Steps:
From the Writer Components, drag and drop the DB Writer onto the canvas.
Set Invocation Type: Batch
In the Meta Information tab, fill in:
Host
Port
Database Name
Table Name
Username
Password
Driver: ClickHouse
Save Mode: Append
Click Validate Connection, then Save.
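For reference, appending rows to a ClickHouse table looks like the sketch below using the clickhouse-driver package; the host, credentials, and table schema are placeholders mirroring the Meta Information fields:

```python
from clickhouse_driver import Client

# Connection details mirror the DB Writer's Meta Information fields (placeholders).
client = Client(
    host="your-clickhouse-host",
    port=9000,
    user="default",
    password="your-password",
    database="predictions_db",
)

# Append prediction rows; the table name and columns here are hypothetical.
rows = [(50000, 65000, "hired"), (42000, 48000, "not_hired")]
client.execute(
    "INSERT INTO hiring_predictions (previous_ctc, offered_ctc, prediction) VALUES",
    rows,
)
```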
Step 7: Activate and Run the Pipeline
Purpose
To execute the end-to-end data flow and validate model predictions.
Procedure
Click the Activate icon on the toolbar.
Wait for all pipeline pods to deploy.
Once active, go to the Logs tab to monitor execution.
Check each component’s logs for:
Status updates
Record counts
Errors or warnings

Monitor Data Flow
| Stage | Check |
| --- | --- |
| S3 Reader | Confirms successful data ingestion |
| Data Preparation | Validates data cleaning and transformations |
| AutoML Component | Displays prediction output and model metrics |
| DB Writer | Confirms successful data insertion into the database |
Deactivate the Pipeline
Once the pipeline run is complete and verified:
Click Deactivate to release resources and prevent additional compute costs.
Step 8: Verify and Analyze the Results
Open your target database (ClickHouse).
Verify that the output table contains prediction results.
Use these records for:
BI dashboards
Reporting
Real-time monitoring
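A quick way to spot-check the output table, assuming the same placeholder connection details as in the DB Writer sketch:

```python
from clickhouse_driver import Client

client = Client(host="your-clickhouse-host", database="predictions_db")

# Spot-check the freshly written prediction rows (table name is hypothetical).
row_count = client.execute("SELECT count() FROM hiring_predictions")[0][0]
sample = client.execute("SELECT * FROM hiring_predictions LIMIT 5")
print(f"{row_count} rows written; sample: {sample}")
```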
Troubleshooting
| Issue | Possible Cause | Solution |
| --- | --- | --- |
| S3 Reader fails to connect | Incorrect credentials or invalid bucket path | Verify the AWS Access Keys and Bucket Policy |
| Data Prep error | Schema mismatch | Ensure the Sandbox schema matches the one the transformations expect |
| AutoML model not found | Model not registered | Register the model before creating the pipeline |
| DB Writer error | Wrong database configuration | Validate the credentials, driver, and table name |
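For the first issue, S3 credentials and bucket access can be tested outside the platform with a quick boto3 check (placeholders as before):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)
try:
    # head_bucket fails fast on bad credentials or a missing/forbidden bucket.
    s3.head_bucket(Bucket="your-bucket")
    print("Bucket reachable; credentials look valid.")
except ClientError as err:
    print(f"S3 access check failed: {err.response['Error']['Code']}")
```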
Next Steps
Integrate the generated ClickHouse table into a BDB Report or Dashboard Designer for visualization.
Schedule the pipeline using the BDB Job Scheduler for recurring execution.
Add monitoring alerts for automatic failure notifications.
Summary
You have successfully built a fully automated ML pipeline using the BDB Platform. The workflow:
Reads data from S3
Applies Data Preparation transforms
Runs AutoML predictions
Writes results into a target database
This pipeline provides a scalable, production-ready ML framework for continuous data-to-insight operations — all without writing a single line of code.