Creating a Python Data Processing Job

End-to-End Python Job Creation and Execution on BDB Decision Platform

This guide demonstrates how to create and execute a complete Python-based data processing workflow within the BDB Decision Platform. The workflow covers ingesting data into a Data Sandbox, cleaning and transforming it with the Data Preparation plugin and Python scripts, and loading the refined dataset into ClickHouse, a high-performance analytical database.

Overview

This process follows a batch data processing approach, ideal for use cases that require consolidating and cleaning large data volumes from multiple sources at scheduled intervals.

Unlike real-time pipelines that rely on event streaming, batch processing applies transformations to complete datasets in scheduled runs, which simplifies data handling and improves efficiency for large volumes.

The workflow integrates multiple BDB plugins:

  • Data Center – for sandbox creation and data ingestion.

  • Data Preparation – for automated data cleaning and transformation.

  • DS Lab – for Python-based development and job registration.

  • Data Pipeline – for automated scheduling, execution, and monitoring of the registered job.

Workflow Summary

The workflow consists of the following stages:

  1. Create a Data Sandbox using the Data Center plugin.

  2. Perform data transformations using Data Preparation.

  3. Process and transform data in Python via DS Lab.

  4. Register the notebook as a batch job.

  5. Schedule and verify job execution through the Data Pipeline plugin.

Step 1: Create a Data Sandbox (Using Data Center Plugin)

Procedure

  1. Log in to the BDB Decision Platform.

  2. Navigate to Data Center → Data Connector.

  3. Navigate to the Sandbox section and click "Create."

  4. Enter a Sandbox Name.

  5. Upload the data file (CSV format) using the Upload option.

  6. Click Save to finalize creation.

Step 2: Perform Data Transformations (Using Data Preparation Plugin)

Procedure

  1. Navigate to Data Preparation.

  2. Locate the created sandbox and click the three-dot menu → Create Data Preparation.

  3. The dataset opens in a grid view with automatic data profiling.

  4. Review charts, patterns, and column-level statistics in the right-side panel.

  5. Open the Transform tab to view transformation options.

  6. Select AutoPrep Transform and perform the following (a pandas sketch of equivalent operations follows this procedure):

    • Remove special characters.

    • Handle missing values.

    • Normalize fields.

  7. Click Save Preparation and close the module.
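
AutoPrep performs these operations inside the platform, so no code is required here. The sketch below is only a pandas approximation of the same three steps; the fill strategy and min-max scaling are illustrative assumptions, not the platform's actual logic.

```python
import pandas as pd

def autoprep_like_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative equivalents of the AutoPrep steps above (not the platform's code)."""
    cleaned = df.copy()

    # Remove special characters from text columns, keeping letters, digits, and spaces.
    for col in cleaned.select_dtypes(include="object").columns:
        cleaned[col] = cleaned[col].str.replace(r"[^A-Za-z0-9 ]", "", regex=True)

    # Handle missing values: fill numeric gaps with the column median, text gaps with "unknown".
    for col in cleaned.select_dtypes(include="number").columns:
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    cleaned = cleaned.fillna("unknown")

    # Normalize numeric fields to a 0-1 range (min-max scaling).
    for col in cleaned.select_dtypes(include="number").columns:
        col_min, col_max = cleaned[col].min(), cleaned[col].max()
        if col_max != col_min:
            cleaned[col] = (cleaned[col] - col_min) / (col_max - col_min)

    return cleaned
```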

Best Situation to Use: Use Data Preparation when data requires automated cleaning and consistency enforcement before analytics or ML modeling.

Step 3: Execute Data Transformation in Python (Using DS Lab)

Procedure

  1. Open Data Science Lab from the App Menu.

  2. Create a new project or open an existing one.

  3. Ensure the environment is set to Python TensorFlow and activate the project.

  4. In the Repo section, click Create to start a new notebook.

  5. Provide a notebook name and short description, then click Save.

  6. Add a dataset:

    • Click Add Dataset → Add to your dataset.

    • Choose the Data Sandbox File as the source and select the Hiring Data sandbox.

  7. Run the first cell to load the dataset into the notebook.

  8. From the dataset’s three-dot menu, select Data Preparation and choose the predefined preparation.

  9. Generate the script and, if necessary, rename the dataframe variable.

  10. Run the cell to apply the transformation and output the cleaned dataframe.

  11. In a new cell, define a Python function named loaddata() and paste the preparation script inside it (a minimal sketch of such a function follows this procedure).

  12. Ensure the function returns the final prepared dataframe.

  13. Run print(loaddata()) to verify successful data processing.
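
For reference, a minimal sketch of what the finished loaddata() function can look like is shown below. The file name and the two cleaning lines are placeholders for the dataset cell and the preparation script that DS Lab generates for your sandbox.

```python
import pandas as pd

def loaddata():
    # Placeholder for the dataset-loading cell generated in step 7;
    # the actual path/reader comes from the sandbox dataset added to the notebook.
    df = pd.read_csv("hiring_data.csv")

    # Placeholder for the generated Data Preparation script from steps 8-10,
    # pasted here so the whole flow runs inside one callable function.
    df = df.dropna()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # The function must return the final prepared dataframe (step 12).
    return df

# Step 13: verify the function end to end.
print(loaddata())
```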

Best Situation to Use: Use DS Lab when you need custom Python data transformation logic, advanced ML preprocessing, or integration with external Python libraries.
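
The overview notes that the refined dataset is ultimately loaded into ClickHouse. That step is not shown in the notebook walkthrough above, but a hedged sketch using the clickhouse-connect Python client is included here; the host, credentials, and table name are assumptions to be replaced with your environment's values.

```python
import clickhouse_connect  # assumes the clickhouse-connect package is available

def load_to_clickhouse(df):
    # Connection details are placeholders; use the values configured for your environment.
    client = clickhouse_connect.get_client(
        host="clickhouse.example.internal",
        username="etl_user",
        password="***",
    )
    # Insert the prepared dataframe into a target table (assumed to exist already).
    client.insert_df("hiring_data_clean", df)

load_to_clickhouse(loaddata())
```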

Step 4: Register Notebook as a Job

Procedure

  1. Save the notebook using the Save icon.

  2. Go to the notebook’s Repo section → click the three-dot menu → select Register.

  3. Validate the function details.

  4. On the Register as Job screen, configure the following:

    • Scheduler Name: workflow2

    • Description: Python Job

    • Start Function: Select the entry function.

    • Resource Allocation: Choose from Low, Medium, or High.

    • Limit Section: Define maximum CPU and memory usage.

    • Request Section: Define CPU and memory for job initiation.

    • Instances: Specify the number of parallel instances.

    • On-Demand Execution: Enable if needed.

    • Success Alerts: Configure a webhook URL and channel (Slack, Teams, etc.); the note after this procedure shows the kind of message such a webhook delivers.

    • Failure Alerts: Configure similarly for error notifications.

  5. Click Save to register the job.
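
The platform delivers these alerts automatically once a webhook URL is saved. The sketch below, assuming a Slack-style incoming webhook, only illustrates the kind of message the configured channel receives.

```python
import requests

# Illustration only: the platform posts alerts like this automatically once the
# webhook URL is configured; the URL below is a placeholder.
webhook_url = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

payload = {"text": "Job workflow2 finished successfully at the scheduled run."}
response = requests.post(webhook_url, json=payload, timeout=10)
response.raise_for_status()
```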

Best Situation to Use: Use Job Registration when workflows must be scheduled, version-controlled, and monitored via the centralized pipeline system.

Step 5: Schedule and Verify Job Execution (Using Data Pipeline Plugin)

Procedure

  1. Open the Data Pipeline plugin from the App Menu.

  2. Navigate to the Jobs tab.

  3. Locate the newly registered job from DS Lab.

  4. Click the job to view logs and execution status.

  5. Confirm that the job is scheduled to run every 1 minute.

  6. Ensure Forbid Concurrency is enabled to allow only one active instance.

  7. Click View to confirm configuration details.

  8. Go to the Logs section and review timestamps to verify periodic execution.

Best Situation to Use: Use this setup for batch workloads that require recurring execution, such as ETL processes, data refresh jobs, or analytics pipeline automation.

Notes and Recommendations

  • Ensure data preparation scripts are validated before registering the job.

  • Enable Forbid Concurrency for consistency and to prevent overlapping job runs.

  • Use alerts and logs for proactive monitoring and debugging.

  • Regularly review ClickHouse load performance to maintain optimal analytical query speeds.

Best Situation to Use

  • Data Center Sandbox: When ingesting raw data from multiple external or file-based sources.

  • Data Preparation: When automatic cleaning and standardization are needed before model training or reporting.

  • Python Transformation (DS Lab): For applying custom logic, ML feature engineering, or using Python libraries on bulk datasets.

  • Job Scheduling (Data Pipeline): When batch processing jobs must run on fixed intervals with monitoring and alerting enabled.

Conclusion

By combining the Data Center, Data Preparation, DS Lab, and Pipeline plugins, the BDB Decision Platform provides a fully integrated environment for Python-based batch processing workflows. This setup ensures reliable, reusable, and scalable data transformation pipelines suitable for data warehousing, analytics, and enterprise reporting.