Workflow 1

Read data from S3, apply transformations, run a trained model via AutoML, and write the output into a target database.

This workflow showcases the complete machine learning lifecycle — from raw data to production-ready insights — using a no-code/low-code approach. By the end of this workflow, users will have created a fully functional, automated pipeline that reads data from S3, applies transformations, runs the trained model, and writes the output into a target database.

Create a Sandbox and Upload the CSV File

· From the BDB Platform homepage, click on the Apps icon and navigate to the Data Center.

· Inside the Data Center, click the Sandbox button and then click Create.

· Upload your CSV file by dragging and dropping or browsing your system.

· After the file loads successfully, click Upload, and the sandbox will be created and available for use.

· In the sandbox list, click on the three dots next to your created Sandbox.

· Choose “Create Data Preparation” to begin cleaning and processing your data.

· This simplifies the cleaning process and sets a solid foundation for machine learning.

o Delete the Gender column: select the column → click Transforms → search for the Delete Column transform and click it. The Gender column is now deleted.

o Remove empty rows from Previous CTC and Offered CTC: click Transforms again → search for the Delete Empty Rows transform and apply it to each of the two columns. (A minimal pandas sketch of these two steps follows this list.)

· Once transformation steps are complete:

o Rename the data preparation for easy reference.

o All transformations are tracked in the Steps tab and can be removed from there.

o Click the Save button to finalize.
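For reference, here is a minimal pandas sketch of what these two transforms do, assuming the uploaded file is named hiring_data.csv (the file names are placeholders; the column names are the ones used above):

```python
# Minimal pandas sketch of the two preparation steps above.
# "hiring_data.csv" and "hiring_data_prepared.csv" are placeholder file names.
import pandas as pd

df = pd.read_csv("hiring_data.csv")

# Delete Column transform: remove the Gender column
df = df.drop(columns=["Gender"])

# Delete Empty Rows transform: drop rows with empty Previous CTC or Offered CTC
df = df.dropna(subset=["Previous CTC", "Offered CTC"])

df.to_csv("hiring_data_prepared.csv", index=False)
```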

Create and Run an AutoML Experiment Using the Data Science Lab

· Navigate to the AutoML section in the DS Lab Module and click the Create Experiment button.

· Configure the experiment with the following details:

o Experiment Name: Hiring Data

o Experiment Type: Classification

· Under the Configure Dataset option:

o Select Sandbox as the dataset source.

o Select the File Type as CSV from the dropdown.

o Choose the required sandbox from the dropdown menu.

· In the Advanced Information section:

o Select the Data Preparation process you created earlier.

o Set the Target Column to Gender.

· Click Save to initialize and start running the experiment.

· AutoML will automatically train and evaluate multiple candidate models in the background and select the best-performing one. (A rough scikit-learn sketch of this model-comparison step follows this section.)

· Once training is complete, click View Report to explore:

o Model accuracy and performance metrics

o Comparative results of the tested models

o Recommendations for the best-fit model
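As a rough illustration of what the AutoML experiment does behind the scenes, the sketch below trains a few candidate classifiers with scikit-learn and keeps the one with the best cross-validated score. This is not the platform's implementation; the file name is a placeholder and the target column is the one configured above:

```python
# Hedged sketch of an AutoML-style model comparison for a classification experiment.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

df = pd.read_csv("hiring_data_prepared.csv")      # placeholder file name
target = "Gender"                                 # target column as configured above
X = pd.get_dummies(df.drop(columns=[target]))     # simple encoding, for the sketch only
y = df[target]

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Evaluate each candidate with 5-fold cross-validation and pick the best score.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores)
print("best model:", best_name)
```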

Register the Best Model

· From the Model section, select the model you created using the AutoML workflow.

· Click the Register button (arrow icon) to register the model.

· Once registered, the model becomes available for use in both real-time and batch pipelines for prediction. (See the sketch below for the general idea.)
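Conceptually, registering a model amounts to persisting the winning estimator so that downstream batch or real-time scoring can load it; the platform handles this for you when you click Register. A hedged joblib sketch of the idea, with placeholder file and column names:

```python
# Illustrative only: fit the winning candidate on the full dataset and persist it.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("hiring_data_prepared.csv")      # placeholder file name
target = "Gender"                                 # target column as configured earlier
X = pd.get_dummies(df.drop(columns=[target]))
y = df[target]

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
joblib.dump(model, "hiring_model.joblib")         # placeholder artifact name
```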

Create a New Pipeline

· From the Apps menu, select the Data Pipeline Plugin.

· Create a new pipeline by clicking on the Pipeline option, then click Create.

· Provide an appropriate name for the pipeline and allocate the required resources.

· Click Save to store the pipeline configuration.

Add the S3 Reader Component

· Click the ‘+’ icon in the toolbar on the right side of the screen to open the Components Palette if it is not already visible on the canvas.

· In the search bar, type S3 Reader.

· Drag and drop the S3 Reader component onto the pipeline canvas.

· Set the Invocation Type to Real-Time, then move to the Meta Information tab.

· Enter the required details in Meta Information, including:

o Region

o File Type (CSV)

o Bucket Name

o Access Key

o Path Info

· Save the component once it has been configured. (A boto3 sketch of the equivalent S3 read follows this section.)

· From the Event Panel (located next to the Components Palette), click the ‘+’ icon to create a Kafka Event.

· Drag and drop the event onto the pipeline canvas.

· The event will automatically connect to the source component.
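The sketch below shows a boto3 equivalent of what the S3 Reader is configured to do: fetch a CSV object from the bucket and path given in the Meta Information tab. The region, bucket, key, and credentials are placeholders for the values you enter there:

```python
# Minimal boto3 sketch of reading a CSV from S3 into a DataFrame.
import io

import boto3
import pandas as pd

s3 = boto3.client(
    "s3",
    region_name="us-east-1",                  # Region (placeholder)
    aws_access_key_id="YOUR_ACCESS_KEY",      # Access Key (placeholder)
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Bucket Name and Path Info as configured in the Meta Information tab (placeholders).
obj = s3.get_object(Bucket="your-bucket-name", Key="path/to/hiring_data.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))
print(df.head())
```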

Add a Data Preparation Component

· From the Components Palette, go to the Transformation section and drag and drop the Data Preparation component onto the pipeline canvas.

· Set the Invocation Type to Batch, then move to the Meta Information tab.

· In Meta Information:

o Select the Data Center Type as Data Sandbox.

o Choose the appropriate Sandbox Name.

o Select the required Preparation.

· Save the component once configured.

· From the Event Panel (next to the Components Palette), click the ‘+’ icon to add a Kafka Event.

· Drag and drop the newly created event onto the pipeline canvas. (A minimal Kafka producer/consumer sketch of how events pass records between components follows this section.)

· Finally, add the AutoML component from the Components Palette.
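As a rough picture of how Kafka Events carry records from one component to the next, here is a minimal kafka-python sketch that consumes from an upstream topic, applies the two preparation steps, and publishes to a downstream topic. The topic names and broker address are placeholders; the platform creates and wires the real events for you:

```python
# Illustrative Kafka plumbing between pipeline components (not the platform's code).
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "s3-reader-event",                                   # placeholder upstream topic
    bootstrap_servers="localhost:9092",                  # placeholder broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    record = message.value
    record.pop("Gender", None)                           # Delete Column transform
    # Delete Empty Rows transform: skip records missing either CTC value
    if record.get("Previous CTC") in (None, "") or record.get("Offered CTC") in (None, ""):
        continue
    producer.send("data-preparation-event", value=record)   # placeholder downstream topic
```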

Add the AutoML Component

· Navigate to the search bar of the Components Palette and search for the AutoML component.

· Drag and drop the AutoML component onto the pipeline canvas:

o Set the Invocation Type to Batch in the Basic Information panel.

o Move to the Meta Information tab and select the registered model from the Model Name dropdown.

o Save the component once done.

· From the Event Panel next to the Components Palette, click the ‘+’ icon to add a Kafka Event, then drag and drop the event onto the pipeline canvas. (A batch-scoring sketch of what this component does follows this section.)
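In batch mode the AutoML component essentially loads the registered model and scores the incoming records. A hedged sketch of that step, reusing the placeholder artifact and file names from the earlier sketches:

```python
# Illustrative batch scoring with the registered model.
import joblib
import pandas as pd

model = joblib.load("hiring_model.joblib")            # placeholder registered-model artifact
batch = pd.read_csv("hiring_data_prepared.csv")       # placeholder for records from the upstream event

# Encode features and align the dummy columns with those seen at training time.
features = pd.get_dummies(batch.drop(columns=["Gender"], errors="ignore"))
features = features.reindex(columns=model.feature_names_in_, fill_value=0)

batch["prediction"] = model.predict(features)
batch.to_csv("hiring_predictions.csv", index=False)   # placeholder output for the DB Writer sketch
```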

Add DB Writer Component to Store Final Results

· Drag and drop the DB Writer component from the Writer section of the Components Palette.

· Set the Invocation Type to Batch and move to the Meta Information tab.

· Enter the required DB credentials in the Meta Information tab:

o Host, Port, Database Name, Table Name, Username, Password.

· Use ClickHouse as the driver and Append as the save mode.

· Validate the connection and save the component. (A sketch of the equivalent ClickHouse insert follows this section.)
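The DB Writer step corresponds roughly to appending a DataFrame to the target ClickHouse table. Below is a sketch using the clickhouse-connect client; the connection details and table name are placeholders for the values entered in the Meta Information tab:

```python
# Illustrative append of the scored records into ClickHouse.
import clickhouse_connect
import pandas as pd

results = pd.read_csv("hiring_predictions.csv")       # placeholder AutoML output

client = clickhouse_connect.get_client(
    host="your-clickhouse-host",                      # Host (placeholder)
    port=8123,                                        # Port (placeholder, HTTP default)
    username="your-user",
    password="your-password",
    database="your-database",
)

# Append save mode: insert new rows without touching existing data.
# Assumes the target table already exists with matching columns.
client.insert_df("hiring_predictions", results)       # Table Name (placeholder)
```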

Activate the Pipeline

· Click the Activate icon located in the toolbar.

· Once activated, the pipeline will begin executing and the pods will start deploying.

· After all pods are up and running, move to the Logs section to track pipeline execution in detail.

· Check the Component Logs to monitor the status of each component.

· Monitor the logs as the data flows from S3 → Data Prep → AutoML → DB Writer.

· Users can view the sample records within the respective component events to verify data flow and transformations at each stage.

· Click on the event, go to the Preview tab, and view the sample data processed by the component.

· Check the Logs Panel on the right to troubleshoot issues or verify progress.

· Confirm that data ingestion and transformation have been completed successfully.

· The AutoML component will generate the results, and the DB Writer will load them into the target database.

· Watch for confirmation messages such as “DB Writer written data successfully” to ensure proper execution.

· Once the pipeline run is complete, deactivate the pipeline to avoid unnecessary compute usage.

· The processed data is now available for dashboards, analytics, and further reporting.
