PySpark Script

The PySpark Script Component allows users to write, manage, and execute custom PySpark scripts within a pipeline. It supports both user-defined scripts and scripts exported from DSLab notebooks, enabling distributed data processing at scale.

Key Capabilities

  • Run custom PySpark scripts directly in the pipeline.

  • Import and execute scripts exported from DSLab notebooks.

  • Manage scripts with Push/Pull to VCS for version control.

  • Accept input in DataFrame or List of Dictionary format.

  • Use external PySpark libraries by specifying them in configuration.

Configuration Overview

All PySpark Script configuration options are grouped into three sections:

  • Basic Information

  • Meta Information

  • Resource Configuration

Configuring Meta Information

Component Name

  • Provide a unique name for the component.

  • Restrictions:

    • No spaces or special characters (use _ for spacing).

    • Must not be "test" or start with "test" (reserved for backend processes).

Start Function Name

  • Displays all function names from the PySpark script.

  • Select one as the entry point.
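For instance, a script such as the sketch below (function names and logic are illustrative only) would list both load_data and summarize in this dropdown, and either could be chosen as the entry point.

# Illustrative script: both functions appear under Start Function Name
def load_data():
    # Candidate entry point: returns sample records
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

def summarize(records):
    # Another candidate entry point: aggregates the records
    return sum(row["value"] for row in records)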

In Event Data Type

  • Select the format of input data:

    • DataFrame

    • List of Dictionary
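Assuming the selected format is handed to the start function as its argument (a sketch of the idea, not the platform's exact contract), the two options might be consumed like this:

def process_dataframe(df):
    # In Event Data Type = DataFrame: drop rows with missing values
    return df.dropna()

def process_records(records):
    # In Event Data Type = List of Dictionary: filter out incomplete records
    return [row for row in records if row.get("value") is not None]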

External Libraries

  • Specify additional PySpark libraries required in the script.

  • Multiple libraries can be entered, separated by commas.
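As a hypothetical illustration, entering requests in this field would make the library available so the script can import and use it; the library name and function below are assumptions for the sketch, not part of the component itself.

# Assumes "requests" was listed in the External Libraries field (illustrative)
import requests

def fetch_status(url="https://example.com"):
    # Call the externally specified library from within the script
    return requests.get(url, timeout=10).status_code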

Execution Type

Two execution modes are supported:

  1. Custom Script

    • Write your own PySpark script directly in the Script field.

    • Requirements:

      • Must contain at least one function.

      • Functions must be validated using Validate Script.

    • Start Function – Select which function to execute.

    • Input Data – Define key-value pairs for the start function's parameters (see the sketch after this list).

  2. DSLab Script

    • Execute a script exported from a DSLab notebook.

    • Required fields:

      • Project Name – Select the project containing the notebook.

      • Script Name – Choose the exported notebook script.

      • Start Function – Select the entry-point function.

    • The exported script appears in the editor and can be validated.

    • Input Data – Pass parameters as key-value pairs.
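In both modes, the Input Data key-value pairs correspond to the parameters of the selected start function. The sketch below assumes the pairs are supplied as keyword arguments; all names and values are illustrative.

import pandas as pd

# Input Data such as threshold = 50 and column = "value" would be supplied as
# key-value pairs and passed to the start function (assumed behaviour)
def filter_rows(threshold=0, column="value"):
    df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 60, 80]})
    # Keep only rows where the configured column exceeds the threshold
    return df[df[column] > threshold]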

VCS Integration

  • Pull Script from VCS – Retrieve a previously committed script version.

  • Push Script to VCS – Commit the current script with a version tag and commit message.

Custom PySpark Script Example

import pandas as pd
import json
from datetime import datetime

def a():
    # Start function for this example: builds and returns a sample DataFrame
    data = {
        'Lists': [[1, 2, 3], [4, 5], [6]],
        'JSON': [json.dumps({"name": "Alice", "age": 30}),
                 json.dumps({"name": "Bob", "age": 25}),
                 json.dumps({"name": "Charlie", "age": 35})],
        'Numbers': [10, 20, 30],
        'Timestamps': [datetime.now(), datetime.now(), datetime.now()]
    }

    # Create DataFrame
    df = pd.DataFrame(data)

    # Return the DataFrame for pipeline use
    return df

In this example:

  • A pandas DataFrame is created with lists, JSON objects, numbers, and timestamps.

  • The function returns the DataFrame for downstream pipeline processing.

Best Practices

  • Define functions clearly and validate them before saving.

  • Keep external libraries lightweight to avoid performance issues.

  • Use VCS versioning to manage and roll back scripts easily.

  • Ensure the input data type matches the expected schema of the script.

Example Use Cases

  • Clean and transform large datasets using PySpark operations.

  • Parse and flatten semi-structured JSON data.

  • Enrich datasets with external data sources.

  • Apply statistical computations or machine learning preprocessing before model training.
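To make the JSON use case concrete, the following is a minimal PySpark sketch of parsing and flattening a JSON string column; the function name, schema, and sample payloads are illustrative only.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

def flatten_json(records=None):
    # Illustrative sketch: parse a JSON string column and promote its fields to top-level columns
    spark = SparkSession.builder.getOrCreate()
    records = records or [
        {"payload": '{"name": "Alice", "age": 30}'},
        {"payload": '{"name": "Bob", "age": 25}'},
    ]
    df = spark.createDataFrame(records)
    schema = StructType([
        StructField("name", StringType()),
        StructField("age", IntegerType()),
    ])
    parsed = df.withColumn("parsed", from_json(col("payload"), schema))
    return parsed.select(col("parsed.name").alias("name"), col("parsed.age").alias("age"))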