Workflow 3

End-to-End Creation and Execution of a PySpark Job on BDB Platform

This workflow demonstrates the end-to-end process of creating and executing a PySpark job on the BDB Platform. The steps include setting up a data sandbox, creating a DS Lab project, writing and registering PySpark code, exporting the script to the Data Pipeline, configuring the job, and finally activating it for execution and monitoring.

The primary goal is to showcase how PySpark scripts can be efficiently integrated within the BDB platform for large-scale data operations and automation.

Create Data Sandbox

  1. Navigate to the Data Center module from the Apps menu.

  2. Select Sandbox and click Create.

  3. Provide a name and upload the CSV file (e.g., Restaurant_Sales_Data.csv).

  4. Click Upload, then Done. A sandbox file is now created.

Create PySpark Project in DS Lab

  1. Go to DS Lab from the Apps menu.

  2. Click Create and fill in the project details:

       • Name: Job Workflow

       • Description: PySpark Job Workflow 3

       • Environment: PySpark

       • Algorithm: Classification

       • Resource Allocation: Low

       • Idle Shutdown: 1 hour

  3. Click Save, then Activate the project.

Create PySpark Notebook

  1. Inside the project, go to the Repo folder.

  2. Click Create Notebook, name it (e.g., Job_Workflow3), and save.

  3. Open the notebook and add a dataset:

       • Click Data and select Data Sandbox File.

       • Choose Restaurant_Sales_Data and click Add.

  4. Click the checkbox next to the added dataset to auto-generate PySpark read code.
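The generated read snippet is platform-specific; in the script below, the nb.get_data(...) call plays this role. As a generic, illustrative alternative only, the same CSV could be read directly with Spark's built-in reader (the path is a placeholder, not the real sandbox location):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path: the auto-generated code points at the actual sandbox file.
df = spark.read.csv(
    "Restaurant_Sales_Data.csv",
    header=True,       # first row holds the column names
    inferSchema=True   # let Spark infer column types
)
df.printSchema()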

Add and Register PySpark Script

PySpark Script:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def pyspark_write(df, table, user, password, host, port, database):
    # Build the ClickHouse JDBC URL and append the DataFrame to the target table.
    url = "jdbc:clickhouse://" + host + ":" + str(port) + "/" + database
    try:
        df.write \
            .format("jdbc") \
            .mode("append") \
            .option("url", url) \
            .option("dbtable", table) \
            .option("user", user) \
            .option("password", password) \
            .option("driver", "com.clickhouse.jdbc.ClickHouseDriver") \
            .option("createTableOptions", "ENGINE=Memory") \
            .save()
        print("Saved Successfully")
    except Exception as e:
        print(f"Error writing to ClickHouse: {e}")

def writer(table, user, password, host, port, database):
    # Start function for the job: load the sandbox dataset, then write it to ClickHouse.
    from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
    nb = NotebookExecutor()
    df = nb.get_data(
        '56031752215145392',
        '@SYS.USERID',
        'True',
        {},
        [],
        limit=None,
        spark=spark,
        sheet_name=''
    )
    pyspark_write(df, table, user, password, host, port, database)
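Optionally, the write can be sanity-checked from the notebook by reading the table back over the same JDBC connection. The helper below is an illustrative sketch only (verify_write is not part of the registered script) and uses Spark's standard JDBC reader with the same connection details:

def verify_write(table, user, password, host, port, database):
    # Read the target table back from ClickHouse and report the row count.
    url = "jdbc:clickhouse://" + host + ":" + str(port) + "/" + database
    check_df = spark.read \
        .format("jdbc") \
        .option("url", url) \
        .option("dbtable", table) \
        .option("user", user) \
        .option("password", password) \
        .option("driver", "com.clickhouse.jdbc.ClickHouseDriver") \
        .load()
    print(f"Rows in {table}: {check_df.count()}")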

  1. Save and register the notebook script:

       • Click the three-dot menu and select Register.

       • Select the Export as Script option, then click Next > Finish.

Create and Configure PySpark Job

  • Go to Data Pipeline from the Apps menu.

  • Navigate to the Job section, click Create, and fill in:

       • Job Name: Workflow3

       • Description: PySpark Restaurant Job

       • Job Base Info: PySpark

  • Save the job.

Configure Script Meta Info:

  • Project: Job Workflow

  • Script: Auto-loaded from export

  • Start Function: writer

Input Arguments:

  • Host: (e.g., localhost)

  • Port: (e.g., 8123)

  • Username

  • Password

  • Database name

  • Table name

Click Save.
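For context, the values entered under Input Arguments are what the platform passes to the writer start function when the job runs. The call below is purely illustrative, with placeholder values, and is not something to execute manually:

# Illustration only: the job runtime supplies these arguments to the start function.
writer(
    table="restaurant_sales",   # placeholder target table
    user="default",             # placeholder ClickHouse user
    password="********",        # placeholder password
    host="localhost",           # example host from the instructions above
    port=8123,                  # example ClickHouse HTTP port
    database="default"          # placeholder database name
)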

Activate Job and Monitor Logs

  • Click the Activate icon to trigger the job.

  • Monitor progress via the Logs tab.

  • If the script runs without issues, a success message appears.

  • The PySpark job writes the sandbox CSV data into ClickHouse as expected.

  • Notifications (e.g., via a Microsoft Teams webhook) can be configured for success/failure alerts; a minimal webhook sketch follows this list.
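Notification setup itself is configured in the Pipeline UI rather than in the script. For reference, a Microsoft Teams incoming webhook simply accepts an HTTP POST with a JSON payload, as in the minimal sketch below (the webhook URL is a placeholder):

import requests

# Placeholder: use the incoming-webhook URL generated for your Teams channel.
WEBHOOK_URL = "https://example.webhook.office.com/webhookb2/your-webhook-id"

def notify(message):
    # Post a plain-text message to the Teams channel; fail loudly if rejected.
    response = requests.post(WEBHOOK_URL, json={"text": message}, timeout=10)
    response.raise_for_status()

notify("Workflow3 PySpark job succeeded: sandbox data written to ClickHouse.")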

Conclusion

This workflow showcases the seamless integration of PySpark scripts into the BDB Platform — covering raw data ingestion, scalable job execution, and result validation. By following these structured steps, organizations can ensure reliable, efficient, and automated big data processing using the BDB unified analytics platform.
