PySpark Job
Write and execute custom PySpark scripts as Jobs in the Data Pipeline module. Scripts are authored in the Data Science Lab and exported to the Pipeline for operational runs.
Prerequisites
A Project created in Data Science Lab under the PySpark Environment.
The Notebook/script exported to Pipeline.
The script’s entry logic must be wrapped inside a function (see “Export to Pipeline” notes in DS Lab).
Access to required data sources and any external libraries used by the script.
Part 1 — Create a PySpark Job
Steps
Open the Pipeline homepage and click Create.
In the right-hand panel, under Job, click Create.
Fill in Base Information:
Enter name: Provide a unique job name.
Job Description: (Optional) Short description of purpose/logic.
Job Baseinfo: Select PySpark Job.
Configure Trigger By (optional):
On Success: Select a job whose success triggers this PySpark job.
On Failure: Select a job whose failure triggers this PySpark job.
(Optional) Is Schedule: Check to schedule the job via the platform’s scheduler.
Spark config: Select compute resources for the job (cores, memory, etc.).
Click Save. You are redirected to the PySpark Job workspace.
Part 2 — Configure the PySpark Job
Open the job’s Meta Information tab and complete the fields below.
Project Name
Choose the same DS Lab Project where the notebook/script was created.
Script Name
Select the exported Notebook/script (exported from DS Lab to Pipeline).
External Library
Comma-separated list of extra Python packages (e.g., requests, pyjwt). The runtime attempts to make these available during execution.
Start Function
Select the function within your script that acts as the entry point (must exist in the exported script).
Script
Read-only view of the exported script contents.
Input Data
Provide key-value pairs for function parameters. Keys = parameter names; Values = the input values passed at runtime.
Important: Your DS Lab script must define a callable function to serve as the Start Function (e.g., def main(params): …). Business logic placed only at module top level will not be executed.
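Below is a minimal sketch of how such a script might be structured, assuming each Input Data key maps to a parameter of the same name on the Start Function; the function name, parameter names, and column used in the filter (run_job, source_path, target_path, amount) are illustrative, not platform-defined.

```python
# Minimal sketch of an exported DS Lab script. Names are illustrative only;
# it assumes Input Data keys (source_path, target_path) become keyword arguments.
from pyspark.sql import SparkSession

def run_job(source_path, target_path):
    # Reuse the session provided by the job runtime, or create one if none exists.
    spark = SparkSession.builder.appName("pyspark_job_example").getOrCreate()

    # Keep all orchestration inside the Start Function, not at module top level.
    df = spark.read.parquet(source_path)
    df.filter("amount > 0").write.mode("overwrite").parquet(target_path)
```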
JDBC Connectors (Supported)
You can use the following JDBC connectors within your PySpark code:
MySQL
MSSQL
Oracle
MongoDB
PostgreSQL
ClickHouse
Configure connector URLs, credentials, and driver options inside your script or via platform secrets/configs, as appropriate for your environment.
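As a reference, the sketch below uses Spark's standard JDBC reader inside the Start Function; the PostgreSQL URL, table name, credentials, and driver class are placeholders and would normally come from platform secrets/configs rather than being hard-coded.

```python
# Sketch of a JDBC read (PostgreSQL shown); connection details are placeholders.
from pyspark.sql import SparkSession

def read_orders(jdbc_url, user, password):
    spark = SparkSession.builder.getOrCreate()
    return (
        spark.read.format("jdbc")
        .option("url", jdbc_url)                    # e.g., jdbc:postgresql://host:5432/db
        .option("dbtable", "public.orders")         # table (or subquery) to load
        .option("user", user)
        .option("password", password)
        .option("driver", "org.postgresql.Driver")  # matching JDBC driver class
        .load()
    )
```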
Run & Monitor
Run
From the job workspace, Run the job on demand, or rely on Trigger By and/or Schedule if configured.
Monitor
Use the job’s Logs panel for driver/executor output, errors, and progress.
Validate expected artifacts (tables/files/metrics) downstream.
Best Practices
Entry point: Keep all orchestration inside the Start Function; parse Input Data arguments there.
Idempotency: For scheduled or triggered runs, guard against duplicate processing (e.g., use watermarks or upserts); see the sketch after this list.
Dependencies: Pin library versions in External Library to avoid environment drift.
Observability: Log key parameters and checkpoints; expose row counts and timings.
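The following sketch combines a watermark-style idempotency guard with basic observability logging. It assumes the target is an existing Parquet path and that the event timestamp column is named event_ts; both names are illustrative.

```python
# Sketch: skip already-loaded rows via a watermark and log row counts/timings.
# Paths and the event_ts column are assumptions for illustration.
import logging
import time

from pyspark.sql import SparkSession, functions as F

log = logging.getLogger("pyspark_job")

def incremental_load(source_path, target_path):
    spark = SparkSession.builder.getOrCreate()
    start = time.time()

    # Determine the high-water mark already present in the target.
    try:
        watermark = spark.read.parquet(target_path).agg(F.max("event_ts")).first()[0]
    except Exception:  # first run: target does not exist yet
        watermark = None

    df = spark.read.parquet(source_path)
    if watermark is not None:
        df = df.filter(F.col("event_ts") > F.lit(watermark))  # drop already-loaded rows

    rows = df.count()
    df.write.mode("append").parquet(target_path)

    # Expose row counts and timings in the job Logs panel.
    log.info("loaded %d rows in %.1fs (watermark=%s)", rows, time.time() - start, watermark)
```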
Next Steps
Version your script in DS Lab and re-export on changes.
Use Trigger By chains to orchestrate multi-stage pipelines (e.g., ingest → transform → model).
Apply scheduling for periodic or SLA-bound execution.