PySpark Script

The PySpark Script Component allows users to write, manage, and execute custom PySpark scripts within a pipeline. It supports both user-defined scripts and scripts exported from DSLab notebooks, enabling distributed data processing at scale.

Key Capabilities

  • Run custom PySpark scripts directly in the pipeline.

  • Import and execute scripts exported from DSLab notebooks.

  • Manage scripts with Push/Pull to VCS for version control.

  • Accept input in DataFrame or List of Dictionary format.

  • Use external PySpark libraries by specifying them in configuration.

Configuration Overview

All PySpark Script configuration options are grouped into three sections:

  • Basic Information

  • Meta Information

  • Resource Configuration

Configuring Meta Information

Component Name

  • Provide a unique name for the component.

  • Restrictions:

    • No spaces or special characters (use _ for spacing).

    • Must not be "test" or start with "test" (reserved for backend processes).

Start Function Name

  • Displays all function names from the PySpark script.

  • Select one as the entry point.
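For instance, a script such as the sketch below (function names and logic are illustrative only) would list both load_data and summarize in this dropdown, and either could be chosen as the entry point.

# Illustrative script: both functions appear under Start Function Name
def load_data():
    # Candidate entry point: returns sample records
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

def summarize(records):
    # Another candidate entry point: aggregates the records
    return sum(row["value"] for row in records)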

In Event Data Type

  • Select the format of input data:

    • DataFrame

    • List of Dictionary
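Assuming the selected format is handed to the start function as its argument (a sketch of the idea, not the platform's exact contract), the two options might be consumed like this:

def process_dataframe(df):
    # In Event Data Type = DataFrame: drop rows with missing values
    return df.dropna()

def process_records(records):
    # In Event Data Type = List of Dictionary: filter out incomplete records
    return [row for row in records if row.get("value") is not None]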

External Libraries

  • Specify additional PySpark libraries required in the script.

  • Multiple libraries can be entered, separated by commas.
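As a hypothetical illustration, entering requests in this field would make the library available so the script can import and use it; the library name and function below are assumptions for the sketch, not part of the component itself.

# Assumes "requests" was listed in the External Libraries field (illustrative)
import requests

def fetch_status(url="https://example.com"):
    # Call the externally specified library from within the script
    return requests.get(url, timeout=10).status_code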

Execution Type

Two execution modes are supported:

  1. Custom Script

    • Write your own PySpark script directly in the Script field.

    • Requirements:

      • Must contain at least one function.

      • Functions must be validated using Validate Script.

    • Start Function – Select which function to execute.

    • Input Data – Define key-value pairs for the start function's parameters (see the sketch after this list).

  2. DSLab Script

    • Execute a script exported from a DSLab notebook.

    • Required fields:

      • Project Name – Select the project containing the notebook.

      • Script Name – Choose the exported notebook script.

      • Start Function – Select the entry-point function.

    • The exported script appears in the editor and can be validated.

    • Input Data – Pass parameters as key-value pairs.
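In both modes, the Input Data key-value pairs correspond to the parameters of the selected start function. The sketch below assumes the pairs are supplied as keyword arguments; all names and values are illustrative.

import pandas as pd

# Input Data such as threshold = 50 and column = "value" would be supplied as
# key-value pairs and passed to the start function (assumed behaviour)
def filter_rows(threshold=0, column="value"):
    df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 60, 80]})
    # Keep only rows where the configured column exceeds the threshold
    return df[df[column] > threshold]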

VCS Integration

  • Pull Script from VCS – Retrieve a previously committed script version.

  • Push Script to VCS – Commit the current script with a version tag and commit message.

Custom PySpark Script Example

import pandas as pd
import json
from datetime import datetime

def a():
    # Start function for this example: builds and returns a sample DataFrame
    data = {
        'Lists': [[1, 2, 3], [4, 5], [6]],
        'JSON': [json.dumps({"name": "Alice", "age": 30}),
                 json.dumps({"name": "Bob", "age": 25}),
                 json.dumps({"name": "Charlie", "age": 35})],
        'Numbers': [10, 20, 30],
        'Timestamps': [datetime.now(), datetime.now(), datetime.now()]
    }

    # Create DataFrame
    df = pd.DataFrame(data)

    # Return the DataFrame for pipeline use
    return df

In this example:

  • A pandas DataFrame is created with lists, JSON objects, numbers, and timestamps.

  • The function returns the DataFrame for downstream pipeline processing.

Best Practices

  • Define functions clearly and validate them before saving.

  • Keep external libraries lightweight to avoid performance issues.

  • Use VCS versioning to manage and roll back scripts easily.

  • Ensure the input data type matches the expected schema of the script.

Example Use Cases

  • Clean and transform large datasets using PySpark operations.

  • Parse and flatten semi-structured JSON data.

  • Enrich datasets with external data sources.

  • Apply statistical computations or machine learning preprocessing before model training.
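To make the JSON use case concrete, the following is a minimal PySpark sketch of parsing and flattening a JSON string column; the function name, schema, and sample payloads are illustrative only.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

def flatten_json(records=None):
    # Illustrative sketch: parse a JSON string column and promote its fields to top-level columns
    spark = SparkSession.builder.getOrCreate()
    records = records or [
        {"payload": '{"name": "Alice", "age": 30}'},
        {"payload": '{"name": "Bob", "age": 25}'},
    ]
    df = spark.createDataFrame(records)
    schema = StructType([
        StructField("name", StringType()),
        StructField("age", IntegerType()),
    ])
    parsed = df.withColumn("parsed", from_json(col("payload"), schema))
    return parsed.select(col("parsed.name").alias("name"), col("parsed.age").alias("age"))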