Workflow 1
End-to-End Spark Job Creation and Execution on the BDB Platform
In this workflow, we design a complete end-to-end data processing job using the BDB Decision Platform, where raw CSV files from a sandbox environment are ingested, transformed with Spark SQL for cleansing and enrichment, and then seamlessly loaded into ClickHouse, a high-performance analytical database optimized for advanced analytics and reporting.
This workflow follows a batch data processing approach, purpose-built for scenarios where data from multiple sources is ingested, cleansed, and consolidated into a unified format for analytics and reporting. Unlike real-time processing that depends on queues and continuous event handling, the batch model streamlines operations and accelerates performance by executing large-scale transformations in scheduled Spark jobs.
Create a Sandbox Using the Data Center Plugin
· Log in to the BDB Platform.
· From the left-side menu, navigate to Data Center → Data Connector.
· Open the Sandbox section.
· Click Create within the Sandbox section.

1. Enter a Sandbox Name (e.g., Hiring Data) and provide an optional description.
2. Upload a CSV File by clicking the Upload button.
3. Once uploaded, the file name will appear in the Reference section.
4. Your sandbox is now created and ready to use for further processing.
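For reference, the SQL query used later in this workflow expects the uploaded CSV to contain columns such as the following (illustrative header row only; your file may carry additional fields):

name,gender,source,designation,team,previous_organisation,skills,expected_joining_date,joining_status,experience,previous_ctc,offered_ctc,monthly_salary,current_status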
Create a Spark Job
· Go to the App Menu and select the Data Pipeline plugin.
· Select the Job option and click the Create icon to start a new job.
· Enter the following details:
o Job Name: workflow1
o Description: Spark job for hiring data

· Under Job Base Info, select Spark Job.
· Configure the resources under Resource Configuration:
o Choose Resource Allocation: Low, Medium, or High depending on workload.
o Set CPU and Memory for the Driver and Executor according to the expected data volume.
Build the Job Workflow
Once you are on the Job Canvas, follow these steps to add and configure the components:
Add and Configure Reader Component
· In the Search Bar, type Sandbox Reader.
· Drag the Sandbox Reader component onto the canvas.
· Open the configuration panel and set the following options:
o Storage Type: Platform
o File Type: CSV
o Sandbox: Select your sandbox name and the uploaded file
o Options: Enable Header and Infer Schema to correctly interpret column names and data types
· Rename the component to Hiring Input for clarity.
· Click Save to store the configuration.
Your reader component is now ready to serve as the data source for the Spark job.
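Conceptually, the Sandbox Reader performs the equivalent of a standard Spark CSV read. The PySpark sketch below is illustrative only; the session setup and file path are assumptions, not the platform's internals:

from pyspark.sql import SparkSession

# Illustrative sketch: the Sandbox Reader handles this internally.
spark = SparkSession.builder.appName("workflow1").getOrCreate()

# Header and Infer Schema mirror the options enabled in the reader configuration.
hiring_input = (
    spark.read
    .option("header", True)           # treat the first row as column names
    .option("inferSchema", True)      # detect column data types automatically
    .csv("/sandbox/hiring_data.csv")  # assumed path to the uploaded sandbox file
)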
Add and Configure Transformation Component
· Drag and drop the Query Transform component onto the canvas.
· In the configuration panel, set the following:
o View Name: inputDF
o Enter the following SQL query:
SELECT
name,
gender,
source,
designation,
team,
previous_organisation,
skills,
expected_joining_date,
joining_status,
experience,
COALESCE(previous_ctc, 0) AS previous_ctc,
COALESCE(offered_ctc, 0) AS offered_ctc,
COALESCE(monthly_salary, 0) AS monthly_salary,
current_status
FROM inputDF
· Rename the component to Hiring Query.
· Save the component.
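The View Name and SQL query work together the same way a temporary view does in Spark: the incoming data is registered under the view name, and the query runs against it. A minimal PySpark sketch of that behavior, assuming the hiring_input DataFrame from the previous step:

# Register the incoming DataFrame under the configured view name.
hiring_input.createOrReplaceTempView("inputDF")

# Run the same cleansing query; COALESCE replaces missing salary figures
# with 0 so downstream aggregations do not break on NULL values.
hiring_query = spark.sql("""
    SELECT
        name, gender, source, designation, team,
        previous_organisation, skills, expected_joining_date,
        joining_status, experience,
        COALESCE(previous_ctc, 0)   AS previous_ctc,
        COALESCE(offered_ctc, 0)    AS offered_ctc,
        COALESCE(monthly_salary, 0) AS monthly_salary,
        current_status
    FROM inputDF
""")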
Add and Configure Writer Component
· Drag the DB Writer component from the Writer section of the component palette onto the canvas.
· In the configuration panel, provide the following details:
o Host: [ClickHouse Host]
o Port: [ClickHouse Port]
o Username: [ClickHouse Username]
o Password: [ClickHouse Password]
o Database Name: [Target DB]
o Table Name: [Target Table]
o Select Driver: ClickHouse
o Save Mode: Append / Overwrite (as required)
· Rename the component to Hiring Writer.
· Validate the connection and click Save.
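Behind the scenes, a ClickHouse load from Spark is typically a JDBC write. The sketch below is an assumption for illustration; the connection placeholders and driver class are examples, and the DB Writer component manages these details for you:

# Illustrative JDBC write; replace the placeholders with your ClickHouse details.
(
    hiring_query.write
    .format("jdbc")
    .option("url", "jdbc:clickhouse://<clickhouse-host>:<clickhouse-port>/<target-db>")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")  # assumed driver class
    .option("dbtable", "<target-table>")
    .option("user", "<clickhouse-username>")
    .option("password", "<clickhouse-password>")
    .mode("append")  # or "overwrite", matching the Save Mode selected above
    .save()
)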

Save, Activate, and Monitor the Job
· Click the Update Job icon to save your workflow changes.
· Click Activate Job to start running the Spark job.
· Open the Log Panel to monitor execution in real time:
o Verify that the driver and executor pods are running correctly.
o Track the progress of the data read, transformation, and load operations.
o Confirm successful job completion without errors.

Development Mode Preview
Click the Development Mode icon to enable dev execution.
Confirm execution and monitor logs to ensure the job runs successfully.
Select the Output Component and open the Data Preview tab.
Review up to 10 sample records for quick verification of transformations and data quality.

Note: In Development Mode, only 10 rows are written for preview purposes. Full data loads occur only when the job is activated in production mode.
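In plain Spark terms, the preview behaves roughly like limiting the output before writing. A hedged sketch of the idea, not the platform's actual implementation:

# Development-style preview: inspect a small sample instead of loading everything.
hiring_query.limit(10).show(truncate=False)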
Monitoring Jobs
Open the Job Monitoring tab.
View job status, activation time, resource usage (CPU, memory), and logs.
Click on any instance to see a graphical timeline of performance metrics.
System Logs are also available for deeper debugging (only during active runs).

Conclusion
You’ve successfully created and executed an end-to-end Spark job using the BDB Platform. The job read raw data from a sandbox, applied SQL-based transformations, and loaded the results into ClickHouse. This approach highlights the efficiency of batch processing for enterprise data pipelines, enabling better data governance, performance, and downstream reporting across your analytics stack.