PySpark Script
The PySpark Script Component allows users to write, manage, and execute custom PySpark scripts within a pipeline. It supports both user-defined scripts and scripts exported from DSLab notebooks, enabling distributed data processing at scale.
Key Capabilities
Run custom PySpark scripts directly in the pipeline.
Import and execute scripts exported from DSLab notebooks.
Manage scripts with Push/Pull to VCS for version control.
Accept input in DataFrame or List of Dictionary format.
Use external PySpark libraries by specifying them in configuration.
Configuration Overview
All PySpark Script configurations are grouped into:
Basic Information
Meta Information
Resource Configuration
Configuring Meta Information
Component Name
Provide a unique name for the component.
Restrictions:
No spaces or special characters (use _ for spacing).
Must not be "test" or start with "test" (reserved for backend processes).
For example, customer_cleanup is valid, while customer cleanup and test_cleanup are not.
Start Function Name
Displays all function names from the PySpark script.
Select one as the entry point.
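As a hedged sketch (the function names here are hypothetical, and it assumes the platform passes the in-event data as the function's argument): both functions below would appear in the list, and run_pipeline would be selected as the entry point.

def clean(df):
    # Helper function: drop rows with missing values
    # Assumes the in-event data arrives as a pandas DataFrame
    return df.dropna()

def run_pipeline(df):
    # Entry point: receives the in-event data and returns the result
    return clean(df)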
In Event Data Type
Select the format of input data:
DataFrame
List of Dictionary
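The two formats represent the same records in different shapes. A minimal sketch (field names are illustrative):

import pandas as pd

# List of Dictionary: one dictionary per record
records = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
]

# DataFrame: the same records in tabular form
df = pd.DataFrame(records)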
External Libraries
Specify additional PySpark libraries required in the script.
Multiple libraries can be entered, separated by commas.
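For example, the field might contain (library names here are purely illustrative):

requests, openpyxl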
Execution Type
Two execution modes are supported:
Custom Script
Write your own PySpark script directly in the Script field.
Requirements:
Must contain at least one function.
Functions must be validated using Validate Script.
Start Function – Select which function to execute.
Input Data – Define key-value pairs for function parameters.
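As a hedged sketch of how Input Data maps to function parameters (the function name, key name, and column are hypothetical): a parameter such as threshold would receive its value from an Input Data entry with key threshold.

def filter_by_threshold(df, threshold):
    # Assumes a pandas DataFrame input with an 'age' column
    # 'threshold' arrives as an Input Data key-value pair,
    # e.g. key: threshold, value: 25
    return df[df["age"] > int(threshold)]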
DSLab Script
Execute a script exported from a DSLab notebook.
Required fields:
Project Name – Select the project containing the notebook.
Script Name – Choose the exported notebook script.
Start Function – Select the entry-point function.
The exported script appears in the editor and can be validated.
Input Data – Pass parameters as key-value pairs.
VCS Integration
Pull Script from VCS – Retrieve a previously committed script version.
Push Script to VCS – Commit the current script with a version tag and commit message.
Custom PySpark Script Example
import pandas as pd
import json
from datetime import datetime

def create_sample_dataframe():
    # Build sample columns with lists, JSON strings, numbers, and timestamps
    data = {
        'Lists': [[1, 2, 3], [4, 5], [6]],
        'JSON': [json.dumps({"name": "Alice", "age": 30}),
                 json.dumps({"name": "Bob", "age": 25}),
                 json.dumps({"name": "Charlie", "age": 35})],
        'Numbers': [10, 20, 30],
        'Timestamps': [datetime.now(), datetime.now(), datetime.now()]
    }
    # Create the DataFrame
    df = pd.DataFrame(data)
    # Return the DataFrame for downstream pipeline use
    return df
In this example:
A pandas DataFrame is created with lists, JSON objects, numbers, and timestamps.
The function returns the DataFrame for downstream pipeline processing.
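Because the component runs PySpark, a similar result could also be produced as a Spark DataFrame. The sketch below is illustrative only (a simplified subset of the columns above) and assumes a SparkSession can be created or reused in the execution environment:

from datetime import datetime
from pyspark.sql import SparkSession

def create_spark_dataframe():
    # Assumption: the runtime permits creating or reusing a SparkSession
    spark = SparkSession.builder.getOrCreate()
    rows = [
        (10, datetime.now()),
        (20, datetime.now()),
        (30, datetime.now()),
    ]
    # A simplified variant of the sample data as a Spark DataFrame
    return spark.createDataFrame(rows, ["Numbers", "Timestamps"])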
Best Practices
Define functions clearly and validate them before saving.
Keep external libraries lightweight to avoid performance issues.
Use VCS versioning to manage and roll back scripts easily.
Ensure the input data type matches the expected schema of the script.
Example Use Cases
Clean and transform large datasets using PySpark operations.
Parse and flatten semi-structured JSON data (see the sketch after this list).
Enrich datasets with external data sources.
Apply statistical computations or machine learning preprocessing before model training.
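As a concrete illustration of the JSON use case above, a minimal sketch assuming the input is a pandas DataFrame with a column of JSON strings (like the 'JSON' column in the custom script example):

import json
import pandas as pd

def flatten_json_column(df):
    # Assumes 'df' has a 'JSON' column containing JSON strings
    parsed = df["JSON"].apply(json.loads)
    flat = pd.json_normalize(parsed.tolist())
    # Replace the raw JSON column with its flattened fields
    return df.drop(columns=["JSON"]).join(flat)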