Workflow 4

Goal: process hiring data from Kafka, prepare it, analyze it with AutoML, and store the results in ClickHouse for insights and easy access.

This workflow is a streamlined process for handling hiring data. It starts by receiving the data from a specified Kafka topic. Next, the data undergoes data preparation to ensure its quality and consistency. AutoML techniques are then applied to extract valuable insights from the data. Finally, the processed data is stored in a ClickHouse database for further analysis and easy retrieval.

This workflow enables organizations to efficiently handle hiring data, utilize machine learning to uncover insights, and maintain a structured and accessible data repository.
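At a high level, the pipeline chains four stages. The sketch below outlines them as plain Python stubs; every name in it is illustrative, since the actual pipeline is assembled visually from BDB components rather than written by hand.

```python
# Illustrative outline of the four pipeline stages (stubs, not BDB APIs).

def consume_from_kafka(topic: str) -> list[dict]:
    """Stage 1: receive raw hiring records from a Kafka topic."""
    ...

def prepare(records: list[dict]) -> list[dict]:
    """Stage 2: clean the records (drop columns, remove empty rows)."""
    ...

def score_with_automl(records: list[dict]) -> list[dict]:
    """Stage 3: apply the registered AutoML model to each record."""
    ...

def write_to_clickhouse(records: list[dict], table: str) -> None:
    """Stage 4: store the scored records for analysis and retrieval."""
    ...

# Hypothetical topic and table names.
records = consume_from_kafka("hiring-data")
write_to_clickhouse(score_with_automl(prepare(records)), "hiring_results")
```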

Create a Sandbox and Upload the CSV File

  • From the BDB Platform homepage, click on the Apps icon and navigate to the Data Center.

  • Inside the Data Center, click the “Sandbox” button, then click “Create”.

  • Upload your CSV file by dragging and dropping or browsing your system.

  • After the file loads successfully, click Upload, and the sandbox will be created and available for use.

  • In the sandbox list, click the three dots next to your created sandbox.

  • Choose “Create Data Preparation” to begin cleaning and processing your data.

  • This simplifies the cleaning process and sets a solid foundation for machine learning.

o Delete the Gender column: Select the column → click Transforms → search for the Delete Column transform and click it. The Gender column is now deleted.

o Remove empty rows from Previous CTC and Offered CTC: Click Transforms → search for the Delete Empty Rows transform and apply it to each column in turn (a pandas equivalent of both transforms is sketched after this list).

  • Once the transformation steps are complete:

o Rename the data preparation for easy reference.

o All transformations are tracked and can be removed via the step option.

o Click Save to finalize.
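For reference, the two transforms above map to a short pandas script. This is only an illustrative equivalent (the file name is hypothetical); the Data Preparation module applies the same operations through the UI.

```python
import pandas as pd

# Load the uploaded hiring CSV (hypothetical file name).
df = pd.read_csv("hiring_data.csv")

# Equivalent of the Delete Column transform on Gender.
df = df.drop(columns=["Gender"])

# Equivalent of the Delete Empty Rows transform on both CTC columns:
# drop any row where either value is missing.
df = df.dropna(subset=["Previous CTC", "Offered CTC"])

df.to_csv("hiring_data_prepared.csv", index=False)
```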

Create and Run AutoML Experiment

  • Navigate to the AutoML section in the DS Lab Module and click the Create Experiment button.

  • Configure the experiment with the following details:

o Experiment Name: Hiring Data

o Experiment Type: Classification

  • Under the Configure Dataset option:

o Select Sandbox as the dataset source.

o Select the File Type as CSV from the dropdown.

o Choose the required sandbox from the dropdown menu.

  • In the Advanced Information section:

o Select the Data Preparation process you created earlier.

o Set the Target Column to Gender.

  • Click Save to initialize and start running the experiment.

  • AutoML automatically tests multiple models in the background, evaluates their performance, and selects the most optimal approach (a conceptual sketch of this model comparison appears after the list below).

  • Once training is complete, click View Report to explore:

o Model accuracy and performance metrics

o Comparative results of the tested models

o Recommendations for the best-fit model
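Conceptually, the AutoML run behaves like the sketch below: fit several candidate classifiers, cross-validate each, and keep the best scorer. The estimators, file name, and the generic Target column are stand-ins (BDB does not expose which models it tries), and the features are assumed to be numeric.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("hiring_data_prepared.csv")  # hypothetical prepared file
X = df.drop(columns=["Target"])  # "Target" stands in for the experiment's target column
y = df["Target"]

# Two stand-in candidates; a real AutoML engine tries many more.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100),
}

# Score every candidate with 5-fold cross-validation and keep the best.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(f"best model: {best} (mean accuracy {scores[best]:.3f})")
```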

Register the Best Model

  • From the Model section, choose the model created by the AutoML experiment.

  • Click the “Register” (arrow) icon to register the model.

  • This will allow the model to be used in real-time or batch pipelines for prediction.

Create a New Pipeline

  • Navigate to the Apps menu and go to the Data Pipeline module.

  • Create a new pipeline by clicking the Pipeline option and then clicking “Create”; name it appropriately and allocate resources.

  • Click Save.

Add an Event and Map the Event

  • From the Event Panel (next to the Components Palette on the right side of the screen), click the ‘+’ icon to add a Kafka Event.

  • If the panel is not visible, click the ‘+’ icon in the top-right corner to open the Components Palette.

  • Enable the Event Mapping option and create the event.

  • Drag and drop the event onto the pipeline canvas (a minimal producer sketch for this Kafka event follows).
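For context, the pipeline's Kafka event is fed by any producer writing JSON records to its topic. A minimal sketch with the kafka-python client; the broker address, topic name, and record values are placeholders:

```python
import json
from kafka import KafkaProducer

# Broker address is a placeholder; use your cluster's address.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# One hiring record as a JSON message; field names mirror the CSV columns.
producer.send("hiring-data", {"Previous CTC": 850000, "Offered CTC": 1100000})
producer.flush()  # ensure the message is sent before the script exits
```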

Add a Data Preparation Component

  • Navigate to the Components Palette on the right side of the screen.

  • In the Transformation section, drag and drop the Data Preparation component onto the canvas.

  • Set the Invocation Type to Batch.

  • In Meta Information, select Data Sandbox as the data center type, then select the sandbox name and the data preparation.

  • Save the component once configured.

  • From the Event Panel next to the Components Palette, click the ‘+’ icon to add a Kafka event, then drag and drop the event onto the pipeline canvas.

Add an AutoML Component

  • In the search bar of the Components Palette, search for the AutoML component.

  • Drag and drop the AutoML component onto the canvas:

o Set the Invocation Type to Batch in the Basic Information panel.

o In Meta Information, select the registered model from the Model Name dropdown.

o Save the component once done.

  • From the Event Panel next to the Components Palette, click the ‘+’ icon to add a Kafka event, then drag and drop the event onto the pipeline canvas (a conceptual scoring sketch follows).
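Behind the scenes, the AutoML component scores each incoming batch with the registered model. A conceptual equivalent, assuming a scikit-learn-style model persisted with joblib; the file paths are hypothetical, since BDB stores registered models internally:

```python
import joblib
import pandas as pd

# Hypothetical path to a persisted model (analogy for the model registry).
model = joblib.load("registered/hiring_classifier.joblib")

batch = pd.read_csv("incoming_batch.csv")   # one batch from the Kafka event
batch["prediction"] = model.predict(batch)  # append a prediction per record
print(batch.head())
```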

Add a ClickHouse Writer Component

  • In the Writer section of the Components Palette (or via its search bar), drag and drop the ClickHouse Writer component.

  • In Basic Information, set the Invocation Type to Batch.

  • In Meta Information, fill in the necessary configuration: Host, Port, Database Name, Table Name, Username, and Password (a client-level sketch of this write follows this list).

  • Save the component once configured.
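To understand what the writer needs, the same insert can be exercised from code. A minimal sketch with the clickhouse-connect client; every connection value, the table, and the columns are placeholders:

```python
import clickhouse_connect

# Placeholders: use the values configured in the writer's Meta Information.
client = clickhouse_connect.get_client(
    host="localhost",
    port=8123,
    username="default",
    password="",
    database="hiring",
)

# Hypothetical table and columns matching the pipeline's output.
client.insert(
    "hiring_results",
    [[850000, 1100000, "predicted_label"]],
    column_names=["previous_ctc", "offered_ctc", "prediction"],
)
```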

Save and Activate the Pipeline

Connect the components so the flow is:

Event → Data Prep → Event → AutoML → Event → ClickHouse Writer

  • Click the “Activate” icon located in the toolbar.

  • Once activated, the pipeline will begin executing and the pods will start deploying.

  • After all pods are up and running, open the Logs section to track pipeline execution in detail.

  • Use the Preview Panel of each event to inspect the data after each component.

  • Check the Logs Panel on the right to troubleshoot or verify progress.

  • The Component Status panel next to the logs shows the status of each component.

  • If the logs show “DB Writer started successfully”, the data has been correctly ingested, transformed, and stored (a quick verification query is sketched at the end of this section).

  • The pipeline can now handle live or batch data; deactivate it when idle to save resources.

You have successfully built a modern AI-powered pipeline in BDB that cleans raw data, uses AutoML to model insights, and writes predictions to ClickHouse.
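As a final sanity check, query the target table directly to confirm that rows landed. A sketch with clickhouse-connect, using placeholder connection values and the hypothetical hiring.hiring_results table from the writer sketch above:

```python
import clickhouse_connect

# Placeholder connection values.
client = clickhouse_connect.get_client(host="localhost", port=8123)

# command() returns a scalar for a single-value query.
count = client.command("SELECT count() FROM hiring.hiring_results")
print(f"rows ingested: {count}")
```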
