Kafka-to-ClickHouse Pipeline with Real-Time AutoML Inference

The goal is to process hiring data from Kafka, prepare and analyze it using AutoML, and store the results in ClickHouse for insights and easy access.

This guide demonstrates how to build an end-to-end machine learning workflow on the BDB Platform to process real-time hiring data streamed from Kafka.

The workflow enables organizations to:

  • Ingest hiring data from Kafka topics in real time.

  • Clean and standardize the data using Data Preparation.

  • Apply AutoML models to analyze and generate predictions.

  • Store processed outputs in ClickHouse for analytics and reporting.

The process ensures seamless data orchestration and operationalizes ML-driven insights for recruitment optimization, trend analysis, and business intelligence.

Architecture Overview

Data Flow Sequence

Stage | Component | Mode | Purpose
------|-----------|------|--------
1 | Kafka Event | Real-Time | Ingests hiring data from the Kafka topic
2 | Data Preparation | Batch | Cleans and transforms hiring data
3 | AutoML | Batch | Applies predictive models to generate insights
4 | ClickHouse Writer | Batch | Writes predictions and structured data into ClickHouse

Pipeline Flow:

Kafka Event → Data Preparation → Event → AutoML → Event → ClickHouse Writer

Prerequisites:

Before implementing this workflow, ensure the following:

  • Access to the BDB Platform with permissions for:

    • Data Center

    • DS Lab (AutoML)

    • Data Pipeline

  • Kafka topic credentials and connectivity details (e.g., broker URL, topic name).

  • ClickHouse credentials (host, port, database, table, username, password).

  • A hiring dataset (CSV format) to create a sandbox and test transformations.

  • Active AutoML model trained and registered within DS Lab.

Step 1: Create a Sandbox and Upload CSV File

Purpose

To create a controlled environment for managing, previewing, and preparing hiring data.

Procedure

  1. From the BDB Platform Homepage, click the Apps icon → select Data Center.

  2. Inside the Data Center, click the Sandbox tab → click Create.

  3. Upload your CSV file by:

    • Dragging and dropping the file, or

    • Clicking Browse to select from your system.

  4. Once uploaded, click Upload to finalize. The Sandbox will appear in the Sandbox List.
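
Optionally, you can sanity-check the CSV locally before uploading it. The sketch below is a minimal pandas check; the file name hiring_data.csv and the column names mentioned in the comments are assumptions for illustration, not requirements of the platform.

    # Quick local check of the hiring CSV before creating the Sandbox.
    # Assumption: the file is named hiring_data.csv and contains columns such as
    # Gender, Previous CTC, and Offered CTC (referenced later in this guide).
    import pandas as pd

    df = pd.read_csv("hiring_data.csv")
    print(df.shape)               # row and column counts
    print(df.columns.tolist())    # confirm expected column names
    print(df.head())              # preview the first few records
    print(df.isna().sum())        # spot columns with missing values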

Step 2: Apply Data Preparation to Clean the Data

Purpose

To transform raw hiring data into a consistent, analysis-ready format.

Procedure

  1. In the Sandbox List, click the three dots (⋮) next to your sandbox.

  2. Select Create Data Preparation to launch the Data Preparation workspace.

Perform Data Cleaning

Action | Steps | Result
-------|-------|-------
Delete a Column | Select Gender column → click Transforms → Delete Column | Removes irrelevant field
Remove Empty Rows | Select Previous CTC and Offered CTC → click Transforms → Delete Empty Rows | Removes incomplete records

  1. Rename your Data Preparation for easier identification.

  2. Review all actions in the Steps tab (transformations are reversible).

  3. Click Save to finalize.

Result: The dataset is now standardized for downstream ML processing.
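
For reference only, the same cleaning logic can be approximated locally with pandas to preview the expected output shape; this is a sketch under assumed file names, not the platform's implementation.

    # Local approximation of the Data Preparation transforms above:
    # drop the Gender column and remove rows with empty Previous CTC or Offered CTC.
    import pandas as pd

    df = pd.read_csv("hiring_data.csv")                       # assumed input file
    df = df.drop(columns=["Gender"])                          # Delete Column transform
    df = df.dropna(subset=["Previous CTC", "Offered CTC"])    # Delete Empty Rows transform
    df.to_csv("hiring_data_clean.csv", index=False)           # assumed output file
    print(f"{len(df)} rows remain after cleaning")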

Step 3: Create and Run an AutoML Experiment

Purpose

To train and validate an ML model that identifies patterns in hiring data (e.g., salary prediction, offer acceptance likelihood).

Procedure

  1. Navigate to the DS Lab Module from the Apps Menu.

  2. Go to the AutoML section → click Create Experiment.

  3. Configure the experiment:

    • Experiment Name: Hiring Data

    • Experiment Type: Classification

  4. Under Configure Dataset:

    • Select Sandbox as the dataset source.

    • Set File Type: CSV

    • Choose the created Sandbox.

  5. Under Advanced Information:

    • Select the Data Preparation created in Step 2.

    • Set Target Column: Gender.

  6. Click Save to start the experiment.

Monitor and Review

  • AutoML tests multiple algorithms and hyperparameters.

  • After completion, click View Report to review:

    • Model accuracy and evaluation metrics.

    • Comparative model performance.

    • Recommended best-fit model.
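
AutoML selects algorithms and hyperparameters internally; if you want an external point of comparison for the report, a rough scikit-learn baseline such as the sketch below can be trained on the same dataset. The target column, encoding, and algorithm choice here are simplifying assumptions and do not reflect how the platform trains its models.

    # Rough local baseline for comparison with the AutoML report (illustrative only).
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    TARGET = "Gender"                                # target column set in the experiment

    df = pd.read_csv("hiring_data.csv").dropna()     # assumed file name; keep it simple
    X = pd.get_dummies(df.drop(columns=[TARGET]))    # naive one-hot encoding of features
    y = df[TARGET]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print("Baseline accuracy:", accuracy_score(y_test, model.predict(X_test)))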

Step 4: Register the Best Model

Purpose

To make the trained model available for use in pipelines and batch inference.

Procedure

  1. Open the Model section within the DS Lab.

  2. Select the desired model from the list of trained models.

  3. Click the Register (arrow) icon.

  4. Confirm registration.

Result: The registered model is now accessible within the AutoML Component for real-time or batch predictions.

Step 5: Create a New Pipeline

Purpose

To automate data ingestion, preparation, model inference, and result storage.

Procedure

  1. From the Apps Menu, navigate to the Data Pipeline module.

  2. Click Pipeline → Create.

  3. Provide the following details:

    • Pipeline Name

    • Description

    • Resource Allocation (Low/Medium/High)

  4. Click Save to create the pipeline.

Step 6: Configure Kafka Event (Source)

Purpose

To connect the pipeline with a Kafka topic for real-time hiring data ingestion.

Procedure

  1. From the Event Panel (right side of the screen), click + to add a Kafka Event.

  2. If the Event Panel isn’t visible, click the + icon at the top-right corner to open the Components Palette.

  3. Configure:

    • Event Type: Kafka

    • Event Mapping: Enable

    • Topic: Enter Kafka topic name

    • Broker URL: Specify the Kafka server address

  4. Save the event and drag it onto the pipeline canvas.

Result: Kafka Event is now connected as the entry point for live hiring data.
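
To confirm the broker and topic details independently of the pipeline, you can publish a test record from outside the platform. The sketch below uses the kafka-python package; the broker address, topic name, and record fields are placeholders to replace with your own values.

    # Publish a sample hiring record to the configured topic (kafka-python package).
    # Broker URL, topic name, and field names are placeholders/assumptions.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="<broker_host>:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    sample_record = {
        "candidate_id": 101,       # illustrative fields only
        "previous_ctc": 850000,
        "offered_ctc": 1100000,
    }
    producer.send("<kafka_topic_name>", sample_record)
    producer.flush()               # block until the record is delivered
    print("Test record published")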

Step 7: Add the Data Preparation Component

Purpose

To apply the standardized data cleaning logic to streaming or batch data.

Procedure

  1. From the Components Palette, navigate to Transformation → Data Preparation.

  2. Drag and drop it onto the pipeline canvas.

  3. Configure:

    • Invocation Type: Batch

    • Meta Information:

      • Data Center Type: Data Sandbox

      • Sandbox Name: Select your Sandbox

      • Preparation: Choose the saved Data Preparation

  4. Click Save.

  5. From the Event Panel, click + to add another Kafka Event.

  6. Drag and drop it onto the canvas and connect it after the Data Preparation component.

Step 8: Add the AutoML Component

Purpose

To apply the registered model to processed hiring data for prediction and analysis.

Procedure

  1. From the Components Palette, search for AutoML Component.

  2. Drag and drop onto the pipeline canvas.

  3. Configure:

    • Invocation Type: Batch

    • Model Name: Select your registered AutoML model from the dropdown.

  4. Save the component.

  5. Add and connect another Kafka Event after the AutoML component.
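
Because the AutoML component is followed by another Kafka Event, its output can be inspected from that intermediate topic while testing. The minimal consumer sketch below uses kafka-python; the topic name, broker address, and the assumption that predictions arrive as JSON records are placeholders to adapt to your setup.

    # Peek at records flowing out of the AutoML stage via its downstream Kafka Event.
    # Topic name, broker URL, and the JSON payload format are assumptions.
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "<post_automl_event_topic>",
        bootstrap_servers="<broker_host>:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=10000,                               # stop after 10 s of silence
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)       # expect the input fields plus the prediction output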

Step 9: Add the ClickHouse Writer Component

Purpose

To store prediction results and processed data for analysis, dashboards, or reporting.

Procedure

  1. In the Components Palette, navigate to Writer → ClickHouse Writer.

  2. Drag and drop onto the pipeline canvas.

  3. Configure:

    • Invocation Type: Batch

    • Meta Information:

      • Host: <hostname>

      • Port: <port>

      • Database Name: <database_name>

      • Table Name: <target_table>

      • Username: <db_username>

      • Password: <db_password>

  4. Validate the connection.

  5. Click Save.
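
The same connection details can be smoke-tested outside the pipeline with the clickhouse-driver Python package, as sketched below. Host, port, credentials, and table name are placeholders matching the Meta Information fields above; note that clickhouse-driver connects over the native TCP port (9000 by default), which may differ from the HTTP port used elsewhere.

    # Smoke-test the ClickHouse connection details used by the Writer component.
    # All connection values are placeholders (clickhouse-driver package).
    from clickhouse_driver import Client

    client = Client(
        host="<hostname>",
        port=9000,                  # native protocol port (assumption)
        user="<db_username>",
        password="<db_password>",
        database="<database_name>",
    )
    print(client.execute("SELECT version()"))             # verify connectivity
    print(client.execute("EXISTS TABLE <target_table>"))  # returns 1 if the table exists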

Step 10: Connect and Activate the Pipeline

Pipeline Sequence

Kafka Event → Data Preparation → Event → AutoML → Event → ClickHouse Writer

Procedure

  1. Ensure all components are properly linked.

  2. Click Activate (▶) on the top toolbar.

  3. Wait until all pods deploy successfully.

  4. The Logs Panel will open automatically, showing live status and execution details.

Step 11: Validate the Pipeline Execution

Purpose

To confirm that data is processed, analyzed, and stored successfully.

Procedure

  • Use the Preview Panel of each event to inspect intermediate data.

  • Use the Logs Panel to monitor:

    • Kafka event ingestion

    • Data Preparation transformations

    • AutoML inference activity

    • ClickHouse write operations

Expected Log Messages:

Component | Log Message
----------|------------
Data Preparation | “Transformation applied successfully.”
AutoML | “Model inference completed successfully.”
ClickHouse Writer | “DB Writer started successfully.”


Step 12: Verify Data in ClickHouse

Purpose

To confirm that results are stored and accessible for analytics and dashboards.

Procedure

  1. Log in to your ClickHouse database.

  2. Execute a query on the target table:

    SELECT * FROM <database_name>.<target_table> LIMIT 10;
  3. Verify:

    • Data consistency

    • Correct schema

    • Presence of model predictions
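
If you prefer to script this check, a small clickhouse-driver sketch along the following lines works as well; the connection values are the same placeholders used earlier.

    # Scripted version of the verification query (clickhouse-driver package).
    # Connection values and table names are placeholders.
    from clickhouse_driver import Client

    client = Client(
        host="<hostname>",
        port=9000,
        user="<db_username>",
        password="<db_password>",
        database="<database_name>",
    )
    rows = client.execute("SELECT * FROM <database_name>.<target_table> LIMIT 10")
    print(f"Fetched {len(rows)} rows")
    for row in rows:
        print(row)                  # inspect schema and prediction values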

Step 13: Deactivate Pipeline

Once testing or scheduled runs are complete:

  • Click Deactivate on the toolbar to stop the pipeline.

  • This releases system resources and halts active processes.

Troubleshooting

Issue | Possible Cause | Recommended Action
------|----------------|-------------------
No data received from Kafka | Incorrect broker or topic configuration | Verify Kafka connectivity
AutoML fails to execute | Model not registered | Ensure model registration in DS Lab
Database write failure | Invalid ClickHouse credentials | Verify connection and authentication
Data schema mismatch | Unaligned Data Preparation logic | Re-check schema consistency across components

Outcome

After successful completion:

  • The Kafka stream is continuously ingested.

  • Data is cleaned, standardized, and enriched using Data Preparation.

  • AutoML models process the data to generate insights.

  • Predictions are stored in ClickHouse for reporting and BI tools.

This workflow provides a fully automated, ML-powered data pipeline for hiring analytics, enabling continuous insight generation with scalable and low-latency performance.

Key Benefits

Feature | Business Advantage
--------|-------------------
Real-Time Kafka Ingestion | Continuous data flow and low-latency analytics
Data Preparation | Ensures quality and uniformity of data
AutoML Integration | Automates insight extraction
ClickHouse Storage | Enables fast querying and dashboard integration

Summary

You have successfully implemented a modern, AI-driven data processing pipeline that:

  • Integrates Kafka ingestion for real-time data flow

  • Utilizes Data Preparation for structured cleaning

  • Employs AutoML for predictive analytics

  • Stores final results in ClickHouse for BI and dashboard visualization

This implementation showcases BDB’s capability to unify streaming data, machine learning, and analytics into a single, low-code environment — streamlining enterprise-grade recruitment insights.