PySpark Job

Write and execute custom PySpark scripts as Jobs in the Data Pipeline module. Scripts are authored in the Data Science Lab and exported to the Pipeline for operational runs.

Prerequisites

  • A Project created in Data Science Lab under the PySpark Environment.

  • The Notebook/script exported to Pipeline.

    The script’s entry logic must be wrapped inside a function (see “Export to Pipeline” notes in DS Lab).

  • Access to required data sources and any external libraries used by the script.

Part 1 — Create a PySpark Job

Steps

  1. Open the Pipeline homepage and click Create.

  2. In the right-hand panel, under Job, click Create.

  3. Fill in Base Information:

    • Enter name: Provide a unique job name.

    • Job Description: (Optional) Short description of purpose/logic.

    • Job Baseinfo: Select PySpark Job.

  4. Configure Trigger By (optional):

    • On Success: Select a job whose success triggers this PySpark job.

    • On Failure: Select a job whose failure triggers this PySpark job.

  5. (Optional) Is Schedule: Check to schedule the job via the platform’s scheduler.

  6. Spark config: Select compute resources for the job (cores, memory, etc.).

  7. Click Save. You are redirected to the PySpark Job workspace.

Part 2 — Configure the PySpark Job

Open the job’s Meta Information tab and complete the fields below.

  • Project Name: Choose the same DS Lab Project where the notebook/script was created.

  • Script Name: Select the Notebook/script exported from DS Lab to Pipeline.

  • External Library: Comma-separated list of extra Python packages (e.g., requests, pyjwt). The runtime attempts to make these available during execution.

  • Start Function: Select the function within your script that acts as the entry point (must exist in the exported script).

  • Script: Read-only view of the exported script contents.

  • Input Data: Provide key-value pairs for the start function’s parameters. Keys = parameter names; Values = the input values passed at runtime.

Important: Your DS Lab script must define a callable function for Start Function (e.g., def main(params): …). Business logic placed only at module top level will not be executed.
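For reference, here is a minimal sketch of an exported script, assuming that Input Data keys map to the named parameters of the Start Function; the function name main and the parameters source_path and target_table are illustrative, not platform requirements.

    # Illustrative sketch only; adapt names and logic to your own script.
    from pyspark.sql import SparkSession

    def main(source_path, target_table):
        # source_path and target_table are supplied at runtime via the job's Input Data.
        spark = SparkSession.builder.appName("pyspark_job_example").getOrCreate()

        df = spark.read.parquet(source_path)        # read the input dataset
        cleaned = df.dropDuplicates().na.drop()     # example transformation

        cleaned.write.mode("overwrite").saveAsTable(target_table)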

JDBC Connectors (Supported)

You can use the following JDBC connectors within your PySpark code:

  • MySQL

  • MSSQL

  • Oracle

  • MongoDB

  • PostgreSQL

  • ClickHouse

Configure connector URLs, credentials, and driver options inside your script or via platform secrets/configs, as appropriate for your environment.
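As an illustration, the sketch below reads from one of these sources with Spark’s generic JDBC reader; the URL, table, driver class, and credentials are placeholders and should come from your environment or platform secrets. PostgreSQL is shown, and MySQL, MSSQL, Oracle, and ClickHouse follow the same pattern with their own JDBC URLs and drivers.

    from pyspark.sql import SparkSession

    def read_orders(spark: SparkSession):
        # Generic Spark JDBC read; all option values below are placeholders.
        return (
            spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://db-host:5432/sales")  # connection URL
            .option("dbtable", "public.orders")                     # source table or subquery
            .option("user", "svc_user")                             # prefer platform secrets
            .option("password", "********")
            .option("driver", "org.postgresql.Driver")              # matching JDBC driver class
            .load()
        )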

Run & Monitor

Run

  • From the job workspace, click Run to execute the job on demand, or rely on Trigger By and/or the schedule if configured.

Monitor

  • Use the job’s Logs panel for driver/executor output, errors, and progress; a logging sketch follows below.

  • Validate expected artifacts (tables/files/metrics) downstream.
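A small sketch of the kind of logging a start function can emit so that checkpoints, row counts, and timings appear alongside the driver output; the logger name and metric fields are illustrative.

    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pyspark_job")

    def log_step(name, df):
        # Log a row count and elapsed time for a checkpoint. count() triggers a Spark
        # action, so use it only where the extra pass over the data is acceptable.
        start = time.time()
        rows = df.count()
        log.info("step=%s rows=%d elapsed_s=%.1f", name, rows, time.time() - start)
        return df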

Best Practices

  • Entry point: Keep all orchestration inside the Start Function; parse Input Data args there.

  • Idempotency: For scheduled/triggered runs, guard against duplicates (e.g., use watermarks or upserts; see the sketch after this list).

  • Dependencies: Pin library versions in External Library to avoid environment drift.

  • Observability: Log key parameters and checkpoints; expose row counts and timings.
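As referenced above, a sketch of a watermark-style idempotency guard, assuming the target table already exists and both tables carry an updated_at column (the table and column names are hypothetical).

    from pyspark.sql import SparkSession, functions as F

    def load_increment(spark: SparkSession, source_table, target_table):
        # Only process rows newer than what the target already holds, so a re-run
        # or an overlapping scheduled run does not append duplicates.
        high_watermark = (
            spark.table(target_table)
            .agg(F.max("updated_at").alias("wm"))
            .collect()[0]["wm"]
        )

        incoming = spark.table(source_table)
        if high_watermark is not None:
            incoming = incoming.filter(F.col("updated_at") > F.lit(high_watermark))

        incoming.write.mode("append").saveAsTable(target_table)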

Next Steps

  • Version your script in DS Lab and re-export on changes.

  • Use Trigger By chains to orchestrate multi-stage pipelines (e.g., ingest → transform → model).

  • Apply scheduling for periodic or SLA-bound execution.
