PySpark Job

Please go through the demonstration given below to create and configure a PySpark Job.

Creating a PySpark Job

  • Open the pipeline homepage and click on the Create Job option.

  • The New Job page opens.

  • Provide the following information:

    • Enter name: Provide a name for the Job.

    • Job Description: Add a description of the Job (optional).

    • Job Baseinfo: Select the PySpark Job option from the Job Base Information drop-down.

    • Trigger By: The PySpark Job can be triggered by another Job or PySpark Job in either of two scenarios:

      • On Success: Select a Job from the drop-down. Once the selected Job completes successfully, it triggers the PySpark Job.

      • On Failure: Select a Job from the drop-down. If the selected Job fails, it triggers the PySpark Job.

    • Is Schedule: Check the given box to schedule the new Job.

    • Spark config: Select the resource configuration for the Job.

    • Click the Save option to save the Job.

  • The PySpark Job is saved, and the user is redirected to the Job workspace.

Configuring a PySpark Job

Once the PySpark Job is created, follow the steps below to configure the Meta Information tab of the PySpark Job.

  • Project Name: Use the drop-down menu to select the Project in which the concerned Notebook was created.

  • Script Name: This field lists the names of the Notebooks exported from the Data Science Lab module to the Data Pipeline.

  • External Library: If any external libraries are used in the script, mention them here. Multiple libraries can be listed by separating the names with commas (,).

  • Start Function: Select the name of the function inside which the script logic is written.

  • Script: The exported script appears in this space.

  • Input Data: If the start function takes parameters, provide each parameter's name as the Key and the corresponding value as the Value in this field (a minimal sketch follows this list).
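As an illustration of how these fields fit together, below is a minimal sketch of what an exported script might look like. The start function name start_job, the parameters input_path and min_amount, and the column amount are hypothetical examples for this sketch, not names defined by the product.

```python
from pyspark.sql import SparkSession


def start_job(input_path, min_amount):
    # "start_job" is the name that would be selected in the Start Function field;
    # input_path and min_amount would be supplied as Key/Value pairs in Input Data.
    spark = SparkSession.builder.appName("pyspark_job_example").getOrCreate()

    # Read the source data and apply a simple filter transformation.
    df = spark.read.option("header", "true").csv(input_path)
    filtered = df.filter(df["amount"] > float(min_amount))

    # Any third-party package imported in this script would also be listed,
    # comma-separated, in the External Library field.
    filtered.show(10)
    return filtered
```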

Please Note: Currently, JDBC connectors such as ClickHouse and PostgreSQL are supported, along with the MongoDB Spark connector.
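For reference, reads through these connectors inside the start function might look like the sketch below. The hosts, credentials, database, table, and collection names are placeholders, and the MongoDB options follow the v10.x Spark connector syntax rather than anything specific to this platform.

```python
from pyspark.sql import SparkSession


def start_job():
    spark = SparkSession.builder.appName("connector_read_example").getOrCreate()

    # JDBC read (PostgreSQL shown; ClickHouse works the same way with its own
    # JDBC URL and driver class).
    pg_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/sales_db")
        .option("dbtable", "public.orders")
        .option("user", "db_user")
        .option("password", "db_password")
        .option("driver", "org.postgresql.Driver")
        .load()
    )

    # MongoDB read via the MongoDB Spark connector.
    mongo_df = (
        spark.read.format("mongodb")
        .option("connection.uri", "mongodb://db-host:27017")
        .option("database", "sales_db")
        .option("collection", "orders")
        .load()
    )

    return pg_df.count(), mongo_df.count()
```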

Please Note: The script written in the DS Lab module should be inside a function. Refer to the Export to Pipeline page for more details on how to export a PySpark script to the Data Pipeline module.

[Image: PySpark Job component in the Pipeline environment]
[Image: Accessing the Create Job option from the Pipeline Homepage]
[Image: PySpark Job workspace]
[Image: Configuring Meta Information of PySpark Job]