Workflow 2
End-to-End Python Job Creation and Execution on BDB Decision Platform
This workflow demonstrates how to create and execute an end-to-end Python data processing job using the BDB Decision Platform. The job ingests a raw CSV file from a sandbox environment, applies data cleaning and transformation logic written in Python, and loads the refined dataset into ClickHouse, a high-performance analytical database.
This setup follows a batch processing approach, best suited for scenarios where data from multiple sources must be consolidated, cleansed, and prepared for analytics or reporting. Unlike real-time pipelines that rely on event queues, the Python batch job approach streamlines operations and improves efficiency by running transformations on bulk data at scheduled intervals.
The implementation makes use of several BDB plugins—Data Center, Data Preparation, DS Lab, and Pipeline—to deliver a smooth, automated workflow from raw input to Python-based transformation and final job execution.
Create a Data Sandbox Using the Data Center Plugin
· Log in to the BDB Decision Platform using your username and password.
· From the left-side menu, navigate to Data Center → Data Connector.
· Navigate to the Sandbox section and click Create.
· Enter a Sandbox Name and upload the required data file using the Upload option.
· Click Save to finalize and create the sandbox.
· The sandbox is now ready and will serve as the data source for all subsequent operations.
Perform Data Transformations Using Data Preparation
To clean the Hiring Data Sandbox, navigate to the Data Preparation option.
Select the created sandbox, click the three-dot menu, and choose Create Data Preparation.
The dataset will appear in a grid-like structure with automatic data profiling.
On the right panel, view charts, patterns, and column-level statistics.
Go to the Transform tab to access available transformation options.

Use the AutoPrep Transform to:
Remove special characters
Handle missing values
Normalize fields
Once transformations are applied, click Save Preparation and close the module.
This preparation is now reusable across various modules and pipelines.
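For reference, the AutoPrep steps above correspond roughly to the pandas operations sketched below. This is only an illustration of the cleaning logic under assumed column names (Name, Salary) and an assumed file name; it is not the code the platform generates.

import pandas as pd

# Load the raw Hiring Data file (hypothetical file name).
df = pd.read_csv("hiring_data.csv")

# Remove special characters from a text column (hypothetical column "Name").
df["Name"] = df["Name"].str.replace(r"[^A-Za-z0-9 ]", "", regex=True)

# Handle missing values: fill numeric gaps with the median and drop rows
# that are still missing required fields.
df["Salary"] = df["Salary"].fillna(df["Salary"].median())
df = df.dropna(subset=["Name"])

# Normalize a numeric field to the 0-1 range (min-max scaling).
df["Salary"] = (df["Salary"] - df["Salary"].min()) / (df["Salary"].max() - df["Salary"].min())

print(df.head())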
Develop the Python Script Using DS Lab
Open the Data Science Lab plugin from the App Menu.
Either create a new project or open an existing one.
Ensure the environment is set to Python TensorFlow, then activate the project.
Go to the Repo section and click Create to start a new notebook.
Provide a notebook name and short description, then click Save.
· Click the Add Dataset tab, then click Add to your dataset.
· Choose Data Sandbox File as the source and select Hiring Data.
· The dataset will now appear in the data section of your notebook.
· Click the first cell in your notebook.
· Select the dataset (Hiring Data) to auto-generate the code.
· Run the cell to load the dataset into the notebook.
· Click the three-dot menu of the dataset you have added, select Data Preparation, and choose the predefined preparation.

· Generate the script and edit the dataframe name if needed.
· Run the cell to apply the transformation and output the cleaned dataframe.
· In a new cell, define a function called loaddata().
· Paste the generated preparation script inside this function.
· Ensure the function returns the final prepared dataframe.
· Run print(loaddata()) in another cell to test and verify output.
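Putting these steps together, loaddata() might look like the sketch below. The body stands in for the preparation script generated in the notebook; the file name and cleaning step are placeholders, so substitute the script DS Lab produced for your dataset.

import pandas as pd

def loaddata():
    # Placeholder for the generated preparation script: load the Hiring Data
    # sandbox file and apply the saved preparation steps.
    df = pd.read_csv("hiring_data.csv")  # hypothetical file name
    df = df.dropna()  # stand-in for the generated cleaning logic
    # Return the final prepared dataframe so the registered job can use it.
    return df

# In another cell, test and verify the output.
print(loaddata())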

Register Notebook as a Job
Save the notebook using the Save icon.
Go to the notebook's Repo section, click the three-dot menu, and select Register.
Validate the function details and proceed.
· On the Register as a Job screen:
o Set Scheduler Name as workflow2
o Set Description as Python Job
o Select the Start Function
o Choose a Resource Allocation level: Low, Medium, or High
o In the Limit section, define the maximum CPU and memory the job may consume
o In the Request section, define the CPU and memory reserved when the job starts
o Specify the number of instances
o Enable On Demand if needed
o Set up Success Alerts with a Webhook URL and a Channel (Slack, Teams, etc.)
o Set up Failure Alerts in the same way
o Click Save to register the job

· The job is now created and activated for pipeline execution.
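Before relying on the alert configuration, it can help to confirm that the Webhook URL accepts a simple JSON message. The snippet below is an independent sanity check against a Slack-style incoming webhook; the URL is a placeholder, and this is not part of the platform's own alerting code.

import json
import urllib.request

# Placeholder webhook URL - replace with the URL configured in your alert channel.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

payload = {"text": "workflow2: test alert from the job registration setup"}
req = urllib.request.Request(
    WEBHOOK_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200 means the webhook accepted the message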
Schedule and Verify Job Execution
Open the Data Pipeline plugin from the App Menu.
Navigate to the Jobs tab.
View the list of registered jobs.
Locate your newly registered job from the DS Lab.
Click the job to view logs and monitor its execution status.
Confirm that the job is scheduled to run every 1 minute.
Ensure the Forbid Concurrency setting is enabled so that only one instance runs at a time.
Click View to double-check the job configuration.
· Go to the Logs section and verify timestamp entries to confirm that the job is executing at the specified interval.
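If you want to double-check the cadence outside the UI, copy a few timestamps from the Logs section and compute the gaps between them. The timestamps below are made-up examples; consecutive runs should be roughly 60 seconds apart for a one-minute schedule.

from datetime import datetime

# Hypothetical timestamps copied from the job's log entries.
timestamps = [
    "2024-05-01 10:00:05",
    "2024-05-01 10:01:05",
    "2024-05-01 10:02:05",
]

runs = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]
gaps = [(b - a).total_seconds() for a, b in zip(runs, runs[1:])]
print(gaps)  # each gap should be close to 60 seconds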

Conclusion
In this workflow, we successfully designed, developed, and scheduled an end-to-end Python job using the BDB Decision Platform. Starting with sandbox creation in Data Center, followed by data cleaning and transformation in Data Preparation, and extending to Python scripting and experimentation in DS Lab, the job was finally automated and scheduled for execution through the Pipeline plugin.