Creating and Executing a PySpark Job

End-to-End Creation and Execution of a PySpark Job on the BDB Platform

This workflow demonstrates the end-to-end process of creating and executing a PySpark job on the BDB Platform. It includes setting up a sandbox, building a DS Lab project, developing and registering a PySpark script, exporting it to the Pipeline, configuring job parameters, and activating it for execution and monitoring.

Overview

The goal of this workflow is to showcase how PySpark scripts can be seamlessly integrated into the BDB Platform for large-scale data operations, automation, and analytics. By following these steps, users can:

  • Ingest and process raw data.

  • Develop and register PySpark scripts in the DS Lab.

  • Schedule, run, and monitor PySpark jobs via the Data Pipeline.

Step 1: Create a Data Sandbox

Procedure

  1. Navigate to Data Center → Sandbox from the Apps menu.

  2. Click Create.

  3. Provide a Sandbox Name and upload your CSV file (e.g., Restaurant_Sales_Data.csv).

  4. Click Upload, then Done.

Step 2: Create a PySpark Project in DS Lab

Procedure

  1. Open DS Lab from the Apps menu.

  2. Click Create, and enter the following details:

    • Name: Job Workflow

    • Description: PySpark Job Workflow 3

    • Environment: PySpark

    • Algorithm: Classification

    • Resource Allocation: Low

    • Idle Shutdown: 1 hour

  3. Click Save, then Activate the project.

Step 3: Create a PySpark Notebook

Procedure

  1. Navigate to the Repo folder inside the project.

  2. Click Create Notebook, name it (e.g., Job_Workflow3), and click Save.

  3. Open the notebook, then:

    • Click Data → Select Data Sandbox File.

    • Choose Restaurant_Sales_Data.

    • Click Add.

  4. Check the box next to the dataset to auto-generate PySpark read code.
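
The generated snippet is produced by the platform, but it typically mirrors the NotebookExecutor-based read that appears in the Step 4 script below. The following is a minimal sketch for orientation only; the dataset ID is a placeholder that the platform fills in for the selected sandbox file.

from pyspark.sql import SparkSession
from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor

spark = SparkSession.builder.getOrCreate()

# Placeholder dataset ID; the platform inserts the real ID of the selected sandbox file.
nb = NotebookExecutor()
df = nb.get_data('<sandbox_dataset_id>', '@SYS.USERID', 'True', {}, [],
                 limit=None, spark=spark, sheet_name='')
df.show(5)  # quick sanity check of the loaded data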

Step 4: Add and Register the PySpark Script

Script Example

from pyspark.sql import SparkSession

# Reuse the Spark session provided by the DS Lab environment.
spark = SparkSession.builder.getOrCreate()

def pyspark_write(df, table, user, password, host, port, database):
    """Append a DataFrame to a ClickHouse table over JDBC."""
    url = f"jdbc:clickhouse://{host}:{port}/{database}"
    try:
        df.write \
            .format("jdbc") \
            .mode("append") \
            .option("url", url) \
            .option("dbtable", table) \
            .option("user", user) \
            .option("password", password) \
            .option("driver", "com.clickhouse.jdbc.ClickHouseDriver") \
            .option("createTableOptions", "ENGINE=Memory") \
            .save()
        print("Saved Successfully")
    except Exception as e:
        print(f"Error writing to ClickHouse: {e}")

def writer(table, user, password, host, port, database):
    """Job start function: read the sandbox dataset and write it to ClickHouse."""
    from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
    nb = NotebookExecutor()
    # Auto-generated sandbox read: loads the selected sandbox file
    # (referenced by its dataset ID) as a Spark DataFrame.
    df = nb.get_data(
        '56031752215145392',
        '@SYS.USERID',
        'True',
        {},
        [],
        limit=None,
        spark=spark,
        sheet_name=''
    )
    pyspark_write(df, table, user, password, host, port, database)
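
Before registering the script, the start function can be exercised directly in the notebook with placeholder connection details. All values below are illustrative; in the actual job they are supplied through the Input Arguments configured in Step 5.

# Illustrative values only; real credentials come from the job's Input Arguments (Step 5).
writer(
    table="sales_data",
    user="default",
    password="password",
    host="localhost",
    port=8123,
    database="restaurantdb"
)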

Procedure

  1. Paste the above code into the created PySpark notebook.

  2. Click the three dots (⋮) in the notebook header.

  3. Select Register → choose Export as Script → click Next → Finish.

Step 5: Create and Configure the PySpark Job

Procedure

  1. Navigate to Data Pipeline from the Apps menu.

  2. In the Job section, click Create and enter the following details:

    • Job Name: Workflow3

    • Description: PySpark Restaurant Job

    • Job Base Info: Select PySpark

  3. Click Save.

Configure Script Meta Information

  • Project: Job Workflow

  • Script: Auto-loaded from export

  • Start Function: writer

Input Arguments

Provide the following inputs:

  • Host: localhost

  • Port: 8123

  • Username: (database user)

  • Password: (database password)

  • Database Name: (e.g., restaurantdb)

  • Table Name: (e.g., sales_data)

Click Save to confirm the configuration.
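
Before activating the job, it can be worth confirming that the ClickHouse endpoint referenced above is reachable. The snippet below is an optional check against ClickHouse's HTTP interface, using the example host and port from the list above; it assumes the requests package is available in your environment.

import requests  # assumed to be available in the environment

# ClickHouse responds with "Ok." on its HTTP ping endpoint when it is reachable.
resp = requests.get("http://localhost:8123/ping", timeout=5)
print(resp.status_code, resp.text)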

Step 6: Activate Job and Monitor Logs

Procedure

  1. Click the Activate icon to trigger job execution.

  2. Monitor progress in the Logs tab.

    • Logs display the execution status, data writing progress, and system-level messages.

  3. If the script runs successfully, a “Saved Successfully” message appears.
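
To confirm that the rows actually landed in ClickHouse, the table can be read back over the same JDBC connection, for example from a notebook in the project. This is an optional sketch; the connection values shown are the Step 5 examples, not real credentials.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Optional read-back check; connection values mirror the example Input Arguments from Step 5.
verify_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:clickhouse://localhost:8123/restaurantdb")  # example host/port/database
    .option("dbtable", "sales_data")                                 # example table
    .option("user", "default")                                       # example user
    .option("password", "password")                                  # example password
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .load()
)
print("Row count:", verify_df.count())
verify_df.show(5)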

Step 7: Configure Notifications (Optional)

To receive real-time status updates:

  • Configure Microsoft Teams Webhooks or other alert channels for success/failure notifications in job settings.
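
If alerting from within the script itself is also useful, a status message can be posted to a Microsoft Teams incoming webhook directly. The sketch below is illustrative only: the webhook URL is a placeholder, and the requests package is assumed to be available in the job environment.

import requests  # assumed to be available in the job environment

def notify_teams(webhook_url, message):
    """Post a simple text message to a Microsoft Teams incoming webhook."""
    payload = {"text": message}
    resp = requests.post(webhook_url, json=payload, timeout=10)
    resp.raise_for_status()

# Placeholder URL; use the incoming-webhook URL generated for your Teams channel.
# notify_teams("https://example.webhook.office.com/...", "Workflow3: Saved Successfully")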

Results

On successful execution, the data from Restaurant_Sales_Data is appended to the configured ClickHouse table, and the job's Logs tab shows the "Saved Successfully" message.

Notes and Recommendations

  • Always validate sandbox data before execution.

  • For production workloads, configure appropriate resource allocation and failure alerts.

  • Monitor job history for performance trends and optimize Spark configurations accordingly.
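
As one example of the last point, shuffle parallelism and adaptive query execution are common starting points when tuning a PySpark job. The settings below are illustrative only; suitable values depend on data volume and cluster size, and are not platform defaults.

from pyspark.sql import SparkSession

# Illustrative tuning knobs; adjust to the data volume and available resources.
spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")   # let Spark adapt shuffle partitioning at runtime
    .config("spark.sql.shuffle.partitions", "64")   # fewer shuffle partitions for modest data volumes
    .getOrCreate()
)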

Best Situation to Use

Use this workflow when:

  • You need to execute PySpark scripts for ETL or large-scale data transformation.

  • You want to automate data ingestion and integration with ClickHouse or other databases.

  • You are building a reproducible machine learning or analytics workflow that requires unified orchestration of data and computation in BDB.