Creating a Python Data Processing Job

End-to-End Python Job Creation and Execution on BDB Decision Platform

This guide demonstrates how to create and execute a complete Python-based data processing workflow within the BDB Decision Platform. The workflow covers ingesting data into a Data Sandbox, cleaning and transforming it with the Data Preparation plugin and Python scripts, and loading the refined dataset into ClickHouse, a high-performance analytical database.

Overview

This process follows a batch data processing approach, ideal for use cases that require consolidating and cleaning large data volumes from multiple sources at scheduled intervals.

Unlike real-time pipelines that rely on event streaming, batch processing applies transformations to complete datasets in scheduled runs, which simplifies data handling and improves efficiency for large volumes.

The workflow integrates multiple BDB plugins:

  • Data Center – for sandbox creation and data ingestion.

  • Data Preparation – for automated data cleaning and transformation.

  • DS Lab – for Python-based development and job registration.

  • Data Pipeline – for automated scheduling, execution, and monitoring of the registered job.

Workflow Summary

The workflow consists of the following stages:

  1. Create a Data Sandbox using the Data Center plugin.

  2. Perform data transformations using Data Preparation.

  3. Process and transform data in Python via DS Lab.

  4. Register the notebook as a batch job.

  5. Schedule and verify job execution through the Data Pipeline plugin.

Step 1: Create a Data Sandbox (Using Data Center Plugin)

Procedure

  1. Log in to the BDB Decision Platform.

  2. Navigate to Data Center → Data Connector.

  3. Navigate to the Sandbox section and click "Create."

  4. Enter a Sandbox Name.

  5. Upload the data file (CSV format) using the Upload option.

  6. Click Save to finalize creation.

Step 2: Perform Data Transformations (Using Data Preparation Plugin)

Procedure

  1. Navigate to Data Preparation.

  2. Locate the created sandbox and click the three-dot menu → Create Data Preparation.

  3. The dataset opens in a grid view with automatic data profiling.

  4. Review charts, patterns, and column-level statistics in the right-side panel.

  5. Open the Transform tab to view transformation options.

  6. Select AutoPrep Transform and perform the following (a pandas sketch of equivalent operations follows this procedure):

    • Remove special characters.

    • Handle missing values.

    • Normalize fields.

  7. Click Save Preparation and close the module.
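
AutoPrep performs these operations inside the platform, so no code is required here. The sketch below is only a pandas approximation of the same three steps; the fill strategy and min-max scaling are illustrative assumptions, not the platform's actual logic.

```python
import pandas as pd

def autoprep_like_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative equivalents of the AutoPrep steps above (not the platform's code)."""
    cleaned = df.copy()

    # Remove special characters from text columns, keeping letters, digits, and spaces.
    for col in cleaned.select_dtypes(include="object").columns:
        cleaned[col] = cleaned[col].str.replace(r"[^A-Za-z0-9 ]", "", regex=True)

    # Handle missing values: fill numeric gaps with the column median, text gaps with "unknown".
    for col in cleaned.select_dtypes(include="number").columns:
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    cleaned = cleaned.fillna("unknown")

    # Normalize numeric fields to a 0-1 range (min-max scaling).
    for col in cleaned.select_dtypes(include="number").columns:
        col_min, col_max = cleaned[col].min(), cleaned[col].max()
        if col_max != col_min:
            cleaned[col] = (cleaned[col] - col_min) / (col_max - col_min)

    return cleaned
```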

Best Situation to Use: Use Data Preparation when data requires automated cleaning and consistency enforcement before analytics or ML modeling.

Step 3: Execute Data Transformation in Python (Using DS Lab)

Procedure

  1. Open Data Science Lab from the App Menu.

  2. Create a new project or open an existing one.

  3. Ensure the environment is set to Python TensorFlow and activate the project.

  4. In the Repo section, click Create to start a new notebook.

  5. Provide a notebook name and short description, then click Save.

  6. Add a dataset:

    • Click Add Dataset → Add to your dataset.

    • Choose the Data Sandbox File as the source and select the Hiring Data sandbox.

  7. Run the first cell to load the dataset into the notebook.

  8. From the dataset’s three-dot menu, select Data Preparation and choose the predefined preparation.

  9. Generate the script and, if necessary, rename the dataframe variable.

  10. Run the cell to apply the transformation and output the cleaned dataframe.

  11. In a new cell, define a Python function named loaddata() and paste the preparation script inside it (a minimal sketch of such a function follows this procedure).

  12. Ensure the function returns the final prepared dataframe.

  13. Run print(loaddata()) to verify successful data processing.
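
For reference, a minimal sketch of what the finished loaddata() function can look like is shown below. The file name and the two cleaning lines are placeholders for the dataset cell and the preparation script that DS Lab generates for your sandbox.

```python
import pandas as pd

def loaddata():
    # Placeholder for the dataset-loading cell generated in step 7;
    # the actual path/reader comes from the sandbox dataset added to the notebook.
    df = pd.read_csv("hiring_data.csv")

    # Placeholder for the generated Data Preparation script from steps 8-10,
    # pasted here so the whole flow runs inside one callable function.
    df = df.dropna()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # The function must return the final prepared dataframe (step 12).
    return df

# Step 13: verify the function end to end.
print(loaddata())
```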

Best Situation to Use: Use DS Lab when you need custom Python data transformation logic, advanced ML preprocessing, or integration with external Python libraries.
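
The overview notes that the refined dataset is ultimately loaded into ClickHouse. That step is not shown in the notebook walkthrough above, but a hedged sketch using the clickhouse-connect Python client is included here; the host, credentials, and table name are assumptions to be replaced with your environment's values.

```python
import clickhouse_connect  # assumes the clickhouse-connect package is available

def load_to_clickhouse(df):
    # Connection details are placeholders; use the values configured for your environment.
    client = clickhouse_connect.get_client(
        host="clickhouse.example.internal",
        username="etl_user",
        password="***",
    )
    # Insert the prepared dataframe into a target table (assumed to exist already).
    client.insert_df("hiring_data_clean", df)

load_to_clickhouse(loaddata())
```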

Step 4: Register Notebook as a Job

Procedure

  1. Save the notebook using the Save icon.

  2. Go to the notebook’s Repo section → click the three-dot menu → select Register.

  3. Validate the function details.

  4. On the Register as Job screen, configure the following:

    • Scheduler Name: workflow2

    • Description: Python Job

    • Start Function: Select the entry function.

    • Resource Allocation: Choose from Low, Medium, or High.

    • Limit Section: Define maximum CPU and memory usage.

    • Request Section: Define CPU and memory for job initiation.

    • Instances: Specify the number of parallel instances.

    • On-Demand Execution: Enable if needed.

    • Success Alerts: Configure a webhook URL and channel (Slack, Teams, etc.); the note after this procedure shows the kind of message such a webhook delivers.

    • Failure Alerts: Configure similarly for error notifications.

  5. Click Save to register the job.
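
The platform delivers these alerts automatically once a webhook URL is saved. The sketch below, assuming a Slack-style incoming webhook, only illustrates the kind of message the configured channel receives.

```python
import requests

# Illustration only: the platform posts alerts like this automatically once the
# webhook URL is configured; the URL below is a placeholder.
webhook_url = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

payload = {"text": "Job workflow2 finished successfully at the scheduled run."}
response = requests.post(webhook_url, json=payload, timeout=10)
response.raise_for_status()
```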

Best Situation to Use: Use Job Registration when workflows must be scheduled, version-controlled, and monitored via the centralized pipeline system.

Step 5: Schedule and Verify Job Execution (Using Data Pipeline Plugin)

Procedure

  1. Open the Data Pipeline plugin from the App Menu.

  2. Navigate to the Jobs tab.

  3. Locate the newly registered job from DS Lab.

  4. Click the job to view logs and execution status.

  5. Confirm that the job is scheduled to run every 1 minute.

  6. Ensure Forbid Concurrency is enabled to allow only one active instance.

  7. Click View to confirm configuration details.

  8. Go to the Logs section and review timestamps to verify periodic execution.

Best Situation to Use: Use this setup for batch workloads that require recurring execution, such as ETL processes, data refresh jobs, or analytics pipeline automation.

Notes and Recommendations

  • Ensure data preparation scripts are validated before registering the job.

  • Enable Forbid Concurrency for consistency and to prevent overlapping job runs.

  • Use alerts and logs for proactive monitoring and debugging.

  • Regularly review ClickHouse load performance to maintain optimal analytical query speeds.

Best Situation to Use

  • Data Center Sandbox: When ingesting raw data from multiple external or file-based sources.

  • Data Preparation: When automatic cleaning and standardization are needed before model training or reporting.

  • Python Transformation (DS Lab): For applying custom logic, ML feature engineering, or using Python libraries on bulk datasets.

  • Job Scheduling (Data Pipeline): When batch processing jobs must run on fixed intervals with monitoring and alerting enabled.

Conclusion

By combining the Data Center, Data Preparation, DS Lab, and Pipeline plugins, the BDB Decision Platform provides a fully integrated environment for Python-based batch processing workflows. This setup ensures reliable, reusable, and scalable data transformation pipelines suitable for data warehousing, analytics, and enterprise reporting.