Creating and Executing a PySpark Job
End-to-End Creation and Execution of a PySpark Job on the BDB Platform
This workflow demonstrates the end-to-end process of creating and executing a PySpark job on the BDB Platform. It includes setting up a sandbox, building a DS Lab project, developing and registering a PySpark script, exporting it to the Pipeline, configuring job parameters, and activating it for execution and monitoring.
Overview
The goal of this workflow is to showcase how PySpark scripts can be seamlessly integrated into the BDB Platform for large-scale data operations, automation, and analytics. By following these steps, users can:
Ingest and process raw data.
Develop and register PySpark scripts in the DS Lab.
Schedule, run, and monitor PySpark jobs via the Data Pipeline.
Prerequisites
Before starting, ensure that:
You have valid access to the Data Center, DS Lab, and Data Pipeline modules.
You have a ClickHouse database (or another JDBC-compatible destination) configured.
A valid CSV dataset (e.g., Restaurant_Sales_Data.csv) is available for upload.
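As an optional sanity check before uploading, you can preview the CSV locally. The sketch below assumes pandas is installed on your machine and that Restaurant_Sales_Data.csv sits in the current working directory; it is not part of the platform workflow itself.

# Optional local preview of the dataset before uploading it to the sandbox.
# Assumes pandas is installed and Restaurant_Sales_Data.csv is in the working directory.
import pandas as pd

df = pd.read_csv("Restaurant_Sales_Data.csv")
print(df.shape)    # row and column counts
print(df.dtypes)   # column data types
print(df.head())   # first few records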
Step 1: Create a Data Sandbox
Procedure
Navigate to Data Center → Sandbox from the Apps menu.
Click Create.
Provide a Sandbox Name and upload your CSV file (e.g., Restaurant_Sales_Data.csv).
Click Upload, then Done.

Step 2: Create a PySpark Project in DS Lab
Procedure
Open DS Lab from the Apps menu.
Click Create, and enter the following details:
Name: Job Workflow
Description: PySpark Job Workflow 3
Environment: PySpark
Algorithm: Classification
Resource Allocation: Low
Idle Shutdown: 1 hour
Click Save, then Activate the project.

Step 3: Create a PySpark Notebook
Procedure
Navigate to the Repo folder inside the project.
Click Create Notebook, name it (e.g., Job_Workflow3), and click Save.
Open the notebook, then:
Click Data → Select Data Sandbox File.
Choose Restaurant_Sales_Data, then click Add.
Check the box next to the dataset to auto-generate PySpark read code.
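The generated code uses the platform's own reader (the NotebookExecutor call shown in Step 4). Conceptually, it is equivalent to a standard Spark CSV read; the sketch below is illustrative only and assumes a plain local file path rather than the sandbox reference.

# Illustrative only: conceptually what the auto-generated read does.
# The actual generated code calls the platform's NotebookExecutor (see Step 4).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("Restaurant_Sales_Data.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)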

Step 4: Add and Register the PySpark Script
Script Example
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def pyspark_write(df, table, user, password, host, port, database):
    # Build the ClickHouse JDBC URL from the supplied connection details.
    url = "jdbc:clickhouse://" + host + ":" + str(port) + "/" + database
    try:
        # Append the DataFrame to the target table over JDBC.
        df.write \
            .format("jdbc") \
            .mode("append") \
            .option("url", url) \
            .option("dbtable", table) \
            .option("user", user) \
            .option("password", password) \
            .option("driver", "com.clickhouse.jdbc.ClickHouseDriver") \
            .option("createTableOptions", "ENGINE=Memory") \
            .save()
        print("Saved Successfully")
    except Exception as e:
        print(f"Error writing to ClickHouse: {e}")

def writer(table, user, password, host, port, database):
    # Load the sandbox dataset via the platform's NotebookExecutor helper.
    from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
    nb = NotebookExecutor()
    df = nb.get_data(
        '56031752215145392',
        '@SYS.USERID',
        'True',
        {},
        [],
        limit=None,
        spark=spark,
        sheet_name=''
    )
    pyspark_write(df, table, user, password, host, port, database)

Procedure
Paste the above code into the created PySpark notebook.
Click the three dots (⋮) in the notebook header.
Select Register → choose Export as Script → click Next → Finish.
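Before registering, you can optionally smoke-test pyspark_write from the notebook with a small in-memory DataFrame. This is a sketch only: the host, port, credentials, and table name below are placeholders, and it assumes the ClickHouse JDBC driver is available on the Spark classpath.

# Optional smoke test for pyspark_write using a tiny in-memory DataFrame.
# All connection values below are placeholders; substitute your own.
test_df = spark.createDataFrame(
    [("Item A", 120.0), ("Item B", 75.5)],
    ["item_name", "sale_amount"]
)
pyspark_write(
    test_df,
    table="sales_data_test",
    user="default",
    password="",
    host="localhost",
    port=8123,
    database="restaurantdb"
)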
Step 5: Create and Configure the PySpark Job
Procedure
Navigate to Data Pipeline from the Apps menu.
In the Job section, click Create and enter the following details:
Job Name: Workflow3
Description: PySpark Restaurant Job
Job Base Info: Select PySpark
Click Save.
Configure Script Meta Information
Project: Job Workflow
Script: Auto-loaded from export
Start Function: writer
Input Arguments
Provide the following inputs:
Host: localhost
Port: 8123
Username: (database user)
Password: (database password)
Database Name: (e.g., restaurantdb)
Table Name: (e.g., sales_data)
Click Save to confirm the configuration.
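At runtime, the pipeline passes these input arguments to the configured Start Function. Conceptually, the call is equivalent to the sketch below; the pipeline performs it internally, and the values shown are the example placeholders from the table above.

# Conceptual illustration of how the job's input arguments map to the start function.
# You do not write this call yourself; the pipeline invokes it when the job runs.
writer(
    table="sales_data",
    user="<database user>",
    password="<database password>",
    host="localhost",
    port=8123,
    database="restaurantdb"
)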

Step 6: Activate Job and Monitor Logs
Procedure
Click the Activate icon to trigger job execution.
Monitor progress in the Logs tab.
Logs display the execution status, data writing progress, and system-level messages.

If the script runs successfully, a “Saved Successfully” message appears.
Step 7: Configure Notifications (Optional)
To receive real-time status updates:
Configure Microsoft Teams Webhooks or other alert channels for success/failure notifications in job settings.
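If you prefer to send a status message from the script itself rather than relying solely on the job settings, a minimal sketch using Python's requests library and a Teams Incoming Webhook might look like the following. The webhook URL and the notify_teams helper are hypothetical placeholders, not part of the platform.

# Minimal sketch: post a status message to a Microsoft Teams Incoming Webhook.
# The webhook URL is a placeholder; use the URL generated for your channel.
import requests

def notify_teams(message, webhook_url="https://example.webhook.office.com/webhookb2/..."):
    try:
        requests.post(webhook_url, json={"text": message}, timeout=10)
    except Exception as e:
        print(f"Notification failed: {e}")

# Example usage after a successful write:
# notify_teams("PySpark job Workflow3: Saved Successfully")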
Results
The job reads the sandbox dataset through the registered script and writes it to the configured ClickHouse table. Execution status and the "Saved Successfully" confirmation appear in the Logs tab.
Notes and Recommendations
Always validate sandbox data before execution.
For production workloads, configure appropriate resource allocation and failure alerts.
Monitor job history for performance trends and optimize Spark configurations accordingly.
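As a starting point for tuning, common Spark settings can be adjusted when building the session. The sketch below uses standard Spark configuration keys; the values are illustrative examples only, not recommendations for your workload.

# Illustrative Spark tuning knobs; values are examples, not recommendations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.shuffle.partitions", "64")   # fewer shuffle partitions for modest data volumes
    .config("spark.executor.memory", "4g")          # executor memory in line with the allocated resources
    .config("spark.executor.cores", "2")            # cores per executor
    .getOrCreate()
)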
Best Situation to Use
Use this workflow when:
You need to execute PySpark scripts for ETL or large-scale data transformation.
You want to automate data ingestion and integration with ClickHouse or other databases.
You are building a reproducible machine learning or analytics workflow that requires unified orchestration of data and computation in BDB.