Creating and Executing a PySpark Job

End-to-End Creation and Execution of a PySpark Job on the BDB Platform

This workflow demonstrates the end-to-end process of creating and executing a PySpark job on the BDB Platform. It includes setting up a sandbox, building a DS Lab project, developing and registering a PySpark script, exporting it to the Pipeline, configuring job parameters, and activating it for execution and monitoring.

Overview

The goal of this workflow is to showcase how PySpark scripts can be seamlessly integrated into the BDB Platform for large-scale data operations, automation, and analytics. By following these steps, users can:

  • Ingest and process raw data.

  • Develop and register PySpark scripts in the DS Lab.

  • Schedule, run, and monitor PySpark jobs via the Data Pipeline.

Step 1: Create a Data Sandbox

Procedure

  1. Navigate to Data Center → Sandbox from the Apps menu.

  2. Click Create.

  3. Provide a Sandbox Name and upload your CSV file (e.g., Restaurant_Sales_Data.csv).

  4. Click Upload, then Done.

Step 2: Create a PySpark Project in DS Lab

Procedure

  1. Open DS Lab from the Apps menu.

  2. Click Create, and enter the following details:

    • Name: Job Workflow

    • Description: PySpark Job Workflow 3

    • Environment: PySpark

    • Algorithm: Classification

    • Resource Allocation: Low

    • Idle Shutdown: 1 hour

  3. Click Save, then Activate the project.

Step 3: Create a PySpark Notebook

Procedure

  1. Navigate to the Repo folder inside the project.

  2. Click Create Notebook, name it (e.g., Job_Workflow3), and click Save.

  3. Open the notebook, then:

    • Click Data → Select Data Sandbox File.

    • Choose Restaurant_Sales_Data.

    • Click Add.

  4. Check the box next to the dataset to auto-generate PySpark read code.
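
The generated snippet is produced by the platform, but it typically mirrors the NotebookExecutor-based read that appears in the Step 4 script below. The following is a minimal sketch for orientation only; the dataset ID is a placeholder that the platform fills in for the selected sandbox file.

from pyspark.sql import SparkSession
from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor

spark = SparkSession.builder.getOrCreate()

# Placeholder dataset ID; the platform inserts the real ID of the selected sandbox file.
nb = NotebookExecutor()
df = nb.get_data('<sandbox_dataset_id>', '@SYS.USERID', 'True', {}, [],
                 limit=None, spark=spark, sheet_name='')
df.show(5)  # quick sanity check of the loaded data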

Step 4: Add and Register the PySpark Script

Script Example

from pyspark.sql import SparkSession

# Reuse the Spark session provided by the DS Lab environment.
spark = SparkSession.builder.getOrCreate()

def pyspark_write(df, table, user, password, host, port, database):
    """Append a DataFrame to a ClickHouse table over JDBC."""
    url = f"jdbc:clickhouse://{host}:{port}/{database}"
    try:
        df.write \
            .format("jdbc") \
            .mode("append") \
            .option("url", url) \
            .option("dbtable", table) \
            .option("user", user) \
            .option("password", password) \
            .option("driver", "com.clickhouse.jdbc.ClickHouseDriver") \
            .option("createTableOptions", "ENGINE=Memory") \
            .save()
        print("Saved Successfully")
    except Exception as e:
        print(f"Error writing to ClickHouse: {e}")

def writer(table, user, password, host, port, database):
    """Job start function: read the sandbox dataset and write it to ClickHouse."""
    from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
    nb = NotebookExecutor()
    # Auto-generated sandbox read: loads the selected sandbox file
    # (referenced by its dataset ID) as a Spark DataFrame.
    df = nb.get_data(
        '56031752215145392',
        '@SYS.USERID',
        'True',
        {},
        [],
        limit=None,
        spark=spark,
        sheet_name=''
    )
    pyspark_write(df, table, user, password, host, port, database)
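
Before registering the script, the start function can be exercised directly in the notebook with placeholder connection details. All values below are illustrative; in the actual job they are supplied through the Input Arguments configured in Step 5.

# Illustrative values only; real credentials come from the job's Input Arguments (Step 5).
writer(
    table="sales_data",
    user="default",
    password="password",
    host="localhost",
    port=8123,
    database="restaurantdb"
)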

Procedure

  1. Paste the above code into the created PySpark notebook.

  2. Click the three dots (⋮) in the notebook header.

  3. Select Register → choose Export as Script → click Next → Finish.

Step 5: Create and Configure the PySpark Job

Procedure

  1. Navigate to Data Pipeline from the Apps menu.

  2. In the Job section, click Create and enter the following details:

    • Job Name: Workflow3

    • Description: PySpark Restaurant Job

    • Job Base Info: Select PySpark

  3. Click Save.

Configure Script Meta Information

  • Project: Job Workflow

  • Script: Auto-loaded from export

  • Start Function: writer

Input Arguments

Provide the following inputs:

  • Host: localhost

  • Port: 8123

  • Username: (database user)

  • Password: (database password)

  • Database Name: (e.g., restaurantdb)

  • Table Name: (e.g., sales_data)

Click Save to confirm the configuration.
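
Before activating the job, it can be worth confirming that the ClickHouse endpoint referenced above is reachable. The snippet below is an optional check against ClickHouse's HTTP interface, using the example host and port from the list above; it assumes the requests package is available in your environment.

import requests  # assumed to be available in the environment

# ClickHouse responds with "Ok." on its HTTP ping endpoint when it is reachable.
resp = requests.get("http://localhost:8123/ping", timeout=5)
print(resp.status_code, resp.text)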

Step 6: Activate Job and Monitor Logs

Procedure

  1. Click the Activate icon to trigger job execution.

  2. Monitor progress in the Logs tab.

    • Logs display the execution status, data writing progress, and system-level messages.

  3. If the script runs successfully, a “Saved Successfully” message appears.
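
To confirm that the rows actually landed in ClickHouse, the table can be read back over the same JDBC connection, for example from a notebook in the project. This is an optional sketch; the connection values shown are the Step 5 examples, not real credentials.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Optional read-back check; connection values mirror the example Input Arguments from Step 5.
verify_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:clickhouse://localhost:8123/restaurantdb")  # example host/port/database
    .option("dbtable", "sales_data")                                 # example table
    .option("user", "default")                                       # example user
    .option("password", "password")                                  # example password
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .load()
)
print("Row count:", verify_df.count())
verify_df.show(5)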

Step 7: Configure Notifications (Optional)

To receive real-time status updates:

  • Configure Microsoft Teams Webhooks or other alert channels for success/failure notifications in job settings.
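
If alerting from within the script itself is also useful, a status message can be posted to a Microsoft Teams incoming webhook directly. The sketch below is illustrative only: the webhook URL is a placeholder, and the requests package is assumed to be available in the job environment.

import requests  # assumed to be available in the job environment

def notify_teams(webhook_url, message):
    """Post a simple text message to a Microsoft Teams incoming webhook."""
    payload = {"text": message}
    resp = requests.post(webhook_url, json=payload, timeout=10)
    resp.raise_for_status()

# Placeholder URL; use the incoming-webhook URL generated for your Teams channel.
# notify_teams("https://example.webhook.office.com/...", "Workflow3: Saved Successfully")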

Results

On successful execution, the data from Restaurant_Sales_Data is appended to the configured ClickHouse table, and the job's Logs tab shows the "Saved Successfully" message.

Notes and Recommendations

  • Always validate sandbox data before execution.

  • For production workloads, configure appropriate resource allocation and failure alerts.

  • Monitor job history for performance trends and optimize Spark configurations accordingly.
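
As one example of the last point, shuffle parallelism and adaptive query execution are common starting points when tuning a PySpark job. The settings below are illustrative only; suitable values depend on data volume and cluster size, and are not platform defaults.

from pyspark.sql import SparkSession

# Illustrative tuning knobs; adjust to the data volume and available resources.
spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")   # let Spark adapt shuffle partitioning at runtime
    .config("spark.sql.shuffle.partitions", "64")   # fewer shuffle partitions for modest data volumes
    .getOrCreate()
)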

Best Situation to Use

Use this workflow when:

  • You need to execute PySpark scripts for ETL or large-scale data transformation.

  • You want to automate data ingestion and integration with ClickHouse or other databases.

  • You are building a reproducible machine learning or analytics workflow that requires unified orchestration of data and computation in BDB.