Kafka-to-ClickHouse Pipeline with Real-Time AutoML Inference
To process hiring data from Kafka, prepare and analyze it using AutoML, and store the results in ClickHouse for insights and easy access.
This guide demonstrates how to build an end-to-end machine learning workflow on the BDB Platform to process real-time hiring data streamed from Kafka.
The workflow enables organizations to:
Ingest hiring data from Kafka topics in real time.
Clean and standardize the data using Data Preparation.
Apply AutoML models to analyze and generate predictions.
Store processed outputs in ClickHouse for analytics and reporting.
The process ensures seamless data orchestration and operationalizes ML-driven insights for recruitment optimization, trend analysis, and business intelligence.
Architecture Overview
Data Flow Sequence
| Stage | Component | Mode | Purpose |
|---|---|---|---|
| 1 | Kafka Event | Real-Time | Ingests hiring data from a Kafka topic |
| 2 | Data Preparation | Batch | Cleans and transforms hiring data |
| 3 | AutoML | Batch | Applies predictive models to generate insights |
| 4 | ClickHouse Writer | Batch | Writes predictions and structured data into ClickHouse |
Pipeline Flow:
Kafka Event → Data Preparation → Event → AutoML → Event → ClickHouse Writer

Step 1: Create a Sandbox and Upload CSV File
Purpose
To create a controlled environment for managing, previewing, and preparing hiring data.
Procedure
From the BDB Platform Homepage, click the Apps icon → select Data Center.
Inside the Data Center, click the Sandbox tab → click Create.
Upload your CSV file by:
Dragging and dropping the file, or
Clicking Browse to select from your system.
Once the file is added, click Upload to finalize. The Sandbox will appear in the Sandbox List.
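For reference, the sketch below generates a small sample hiring CSV with pandas that could be uploaded to the Sandbox. The column names (Candidate ID, Gender, Previous CTC, Offered CTC, Offer Status) are assumptions for illustration; only Gender, Previous CTC, and Offered CTC are referenced elsewhere in this guide.

```python
# Minimal sketch: generate a sample hiring CSV for the Sandbox upload.
# Column names are illustrative, not a required platform schema.
import pandas as pd

sample = pd.DataFrame(
    {
        "Candidate ID": [101, 102, 103],
        "Gender": ["Female", "Male", "Female"],
        "Previous CTC": [650000, 820000, None],   # None simulates an incomplete record
        "Offered CTC": [780000, 950000, 880000],
        "Offer Status": ["Accepted", "Declined", "Accepted"],
    }
)
sample.to_csv("hiring_data.csv", index=False)
print(sample.head())
```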
Step 2: Apply Data Preparation to Clean the Data
Purpose
To transform raw hiring data into a consistent, analysis-ready format.
Procedure
In the Sandbox List, click the three dots (⋮) next to your sandbox.
Select Create Data Preparation to launch the Data Preparation workspace.
Perform Data Cleaning
| Action | Steps | Result |
|---|---|---|
| Delete a Column | Select the Gender column → click Transforms → Delete Column | Removes irrelevant field |
| Remove Empty Rows | Select Previous CTC and Offered CTC → click Transforms → Delete Empty Rows | Removes incomplete records |
Rename your Data Preparation for easier identification.
Review all actions in the Steps tab (transformations are reversible).
Click Save to finalize.
Result: The dataset is now standardized for downstream ML processing.
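Outside the platform, the same two transformations can be expressed in a few lines of pandas. This is only a conceptual sketch of what the Data Preparation steps do, not the platform's implementation; the column names follow the table above.

```python
# Conceptual equivalent of the two Data Preparation actions above.
import pandas as pd

df = pd.read_csv("hiring_data.csv")

# Delete a Column: drop the Gender field.
df = df.drop(columns=["Gender"])

# Remove Empty Rows: discard records missing Previous CTC or Offered CTC.
df = df.dropna(subset=["Previous CTC", "Offered CTC"])

df.to_csv("hiring_data_clean.csv", index=False)
```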
Step 3: Create and Run an AutoML Experiment
Purpose
To train and validate an ML model that identifies patterns in hiring data (e.g., salary prediction, offer acceptance likelihood).
Procedure
Navigate to the DS Lab Module from the Apps Menu.
Go to the AutoML section → click Create Experiment.
Configure the experiment:
Experiment Name: Hiring Data
Experiment Type: Classification
Under Configure Dataset:
Select Sandbox as the dataset source.
Set File Type: CSV
Choose the created Sandbox.
Under Advanced Information:
Select the Data Preparation created in Step 2.
Set Target Column: Gender.
Click Save to start the experiment.
Monitor and Review
AutoML tests multiple algorithms and hyperparameters.
After completion, click View Report to review:
Model accuracy and evaluation metrics.
Comparative model performance.
Recommended best-fit model.
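Conceptually, a classification AutoML experiment trains several candidate models against the chosen target column and compares their evaluation metrics. The sketch below illustrates the idea with scikit-learn; it is not the BDB AutoML engine, and the feature and target columns are the assumed ones from the sample dataset, applied to a reasonably sized hiring file.

```python
# Illustrative sketch of what a classification AutoML experiment does:
# train several models and compare their accuracy on a hold-out set.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("hiring_data.csv").dropna()
X = df[["Previous CTC", "Offered CTC"]]   # assumed feature columns
y = df["Gender"]                          # target column from the experiment config

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "gradient_boosting": GradientBoostingClassifier(),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```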
Step 4: Register the Best Model
Purpose
To make the trained model available for use in pipelines and batch inference.
Procedure
Open the Model section within the DS Lab.
Select the desired model from the list of trained models.
Click the Register (arrow) icon.
Confirm registration.
Result: The registered model is now accessible within the AutoML Component for real-time or batch predictions.
Step 5: Create a New Pipeline
Purpose
To automate data ingestion, preparation, model inference, and result storage.
Procedure
From the Apps Menu, navigate to the Data Pipeline module.
Click Pipeline → Create.
Provide the following details:
Pipeline Name
Description
Resource Allocation (Low/Medium/High)
Click Save to create the pipeline.
Step 6: Configure Kafka Event (Source)
Purpose
To connect the pipeline with a Kafka topic for real-time hiring data ingestion.
Procedure
From the Event Panel (right side of the screen), click + to add a Kafka Event.
If the Event Panel isn’t visible, click the + icon at the top-right corner to open the Components Palette.
Configure:
Event Type: Kafka
Event Mapping: Enable
Topic: Enter Kafka topic name
Broker URL: Specify the Kafka server address
Save the event and drag it onto the pipeline canvas.
Result: Kafka Event is now connected as the entry point for live hiring data.
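To test the ingestion path, you can publish sample hiring records to the configured topic from outside the platform. The sketch below uses the kafka-python library; the broker address, topic name, and record fields are placeholders, assuming the pipeline receives JSON-encoded messages.

```python
# Publish a few JSON-encoded hiring records to the pipeline's Kafka topic.
# Broker address, topic name, and record fields are placeholders.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="<broker_host>:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

records = [
    {"Candidate ID": 201, "Gender": "Male", "Previous CTC": 700000, "Offered CTC": 820000},
    {"Candidate ID": 202, "Gender": "Female", "Previous CTC": 640000, "Offered CTC": 760000},
]
for record in records:
    producer.send("<kafka_topic_name>", value=record)

producer.flush()  # block until all messages are delivered
```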
Step 7: Add the Data Preparation Component
Purpose
To apply the standardized data cleaning logic to streaming or batch data.
Procedure
From the Components Palette, navigate to Transformation → Data Preparation.
Drag and drop it onto the pipeline canvas.
Configure:
Invocation Type: Batch
Meta Information:
Data Center Type: Data Sandbox
Sandbox Name: Select your Sandbox
Preparation: Choose the saved Data Preparation
Click Save.
From the Event Panel, click + to add another Kafka Event.
Drag and drop it onto the canvas and connect it after the Data Preparation component.
Step 8: Add the AutoML Component
Purpose
To apply the registered model to processed hiring data for prediction and analysis.
Procedure
From the Components Palette, search for AutoML Component.
Drag and drop onto the pipeline canvas.
Configure:
Invocation Type: Batch
Model Name: Select your registered AutoML model from the dropdown.
Save the component.
Add and connect another Kafka Event after the AutoML component.
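Conceptually, the AutoML component performs batch inference: it loads the registered model and appends a prediction column to each incoming batch. The sketch below shows that idea with a locally pickled scikit-learn model; it is not how the platform stores or loads registered models, and the file path and column names are assumptions.

```python
# Conceptual batch-inference sketch: load a trained model and score a batch.
# The pickle path and column names are placeholders for illustration only.
import pickle
import pandas as pd

with open("best_model.pkl", "rb") as f:
    model = pickle.load(f)

batch = pd.read_csv("hiring_data_clean.csv")
batch["prediction"] = model.predict(batch[["Previous CTC", "Offered CTC"]])
print(batch.head())
```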
Step 9: Add the ClickHouse Writer Component
Purpose
To store prediction results and processed data for analysis, dashboards, or reporting.
Procedure
In the Components Palette, navigate to Writer → ClickHouse Writer.
Drag and drop onto the pipeline canvas.
Configure:
Invocation Type: Batch
Meta Information:
Host: <hostname>
Port: <port>
Database Name: <database_name>
Table Name: <target_table>
Username: <db_username>
Password: <db_password>
Validate the connection.
Click Save.
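For reference, the same write can be performed from Python with the clickhouse-connect driver, which is useful for checking connectivity and the target table schema before activating the pipeline. The host, credentials, table definition, and sample rows below are placeholders, not values required by the platform.

```python
# Sketch: create a target table and insert a few rows with clickhouse-connect.
# Host, credentials, database, table names, and columns are placeholders.
import clickhouse_connect  # pip install clickhouse-connect

client = clickhouse_connect.get_client(
    host="<hostname>",
    port=8123,                      # default HTTP port; use your configured port
    username="<db_username>",
    password="<db_password>",
    database="<database_name>",
)

client.command(
    """
    CREATE TABLE IF NOT EXISTS <target_table> (
        candidate_id UInt32,
        previous_ctc Float64,
        offered_ctc Float64,
        prediction String
    ) ENGINE = MergeTree ORDER BY candidate_id
    """
)

client.insert(
    "<target_table>",
    [[201, 700000.0, 820000.0, "Accepted"], [202, 640000.0, 760000.0, "Declined"]],
    column_names=["candidate_id", "previous_ctc", "offered_ctc", "prediction"],
)
```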
Step 10: Connect and Activate the Pipeline
Pipeline Sequence
Event → Data Preparation → Event → AutoML → Event → ClickHouse Writer

Procedure
Ensure all components are properly linked.
Click Activate (▶) on the top toolbar.
Wait until all pods deploy successfully.
The Logs Panel will open automatically, showing live status and execution details.
Step 11: Validate the Pipeline Execution
Purpose
To confirm that data is processed, analyzed, and stored successfully.
Procedure
Use the Preview Panel of each event to inspect intermediate data.
Use the Logs Panel to monitor:
Kafka event ingestion
Data Preparation transformations
AutoML inference activity
ClickHouse write operations

Expected Log Messages:
| Component | Log Message |
|---|---|
| Data Preparation | “Transformation applied successfully.” |
| AutoML | “Model inference completed successfully.” |
| ClickHouse Writer | “DB Writer started successfully.” |
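In addition to the Preview and Logs panels, you can spot-check what flows through an intermediate Kafka event with a simple consumer. The sketch below uses kafka-python; the topic name and broker address are placeholders, and it assumes the events carry JSON-encoded records.

```python
# Sketch: consume a few messages from an intermediate event topic to spot-check
# the data leaving Data Preparation or AutoML. Topic and broker are placeholders.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "<intermediate_topic_name>",
    bootstrap_servers="<broker_host>:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,     # stop iterating after 10 s of inactivity
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for i, message in enumerate(consumer):
    print(message.value)
    if i >= 4:                     # print the first five records only
        break
```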
Step 12: Verify Data in ClickHouse
Purpose
To confirm that results are stored and accessible for analytics and dashboards.
Procedure
Log in to your ClickHouse database.
Execute a query on the target table:
SELECT * FROM <database_name>.<target_table> LIMIT 10;

Verify:
Data consistency
Correct schema
Presence of model predictions
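The same verification can be scripted with clickhouse-connect, for example to confirm the row count and inspect a few stored predictions. The names below are placeholders consistent with the writer sketch in Step 9.

```python
# Sketch: verify that predictions landed in ClickHouse.
# Host, credentials, database, and table names are placeholders.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="<hostname>", username="<db_username>", password="<db_password>"
)

count = client.query("SELECT count() FROM <database_name>.<target_table>").result_rows
print("row count:", count[0][0])

sample = client.query(
    "SELECT * FROM <database_name>.<target_table> LIMIT 10"
).result_rows
for row in sample:
    print(row)
```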
Step 13: Deactivate Pipeline
Once testing or scheduled runs are complete:
Click Deactivate on the toolbar to stop the pipeline.
This releases system resources and halts active processes.
Troubleshooting
| Issue | Possible Cause | Recommended Action |
|---|---|---|
| No data received from Kafka | Incorrect broker or topic configuration | Verify Kafka connectivity |
| AutoML fails to execute | Model not registered | Ensure the model is registered in DS Lab |
| Database write failure | Invalid ClickHouse credentials | Verify connection and authentication |
| Data schema mismatch | Unaligned Data Preparation logic | Re-check schema consistency across components |
Outcome
After successful completion:
The Kafka stream is continuously ingested.
Data is cleaned, standardized, and enriched using Data Preparation.
AutoML models process the data to generate insights.
Predictions are stored in ClickHouse for reporting and BI tools.
This workflow provides a fully automated, ML-powered data pipeline for hiring analytics, enabling continuous insight generation with scalable, low-latency performance.
Key Benefits
| Feature | Business Advantage |
|---|---|
| Real-Time Kafka Ingestion | Continuous data flow and low-latency analytics |
| Data Preparation | Ensures quality and uniformity of data |
| AutoML Integration | Automates insight extraction |
| ClickHouse Storage | Enables fast querying and dashboard integration |
Summary
You have successfully implemented a modern, AI-driven data processing pipeline that:
Integrates Kafka ingestion for real-time data flow
Utilizes Data Preparation for structured cleaning
Employs AutoML for predictive analytics
Stores final results in ClickHouse for BI and dashboard visualization
This implementation showcases BDB’s capability to unify streaming data, machine learning, and analytics into a single, low-code environment — streamlining enterprise-grade recruitment insights.