
Data Preparation (Docker)



The Data Preparation component allows users to run data preparation scripts on selected datasets. These datasets can be created from sources such as the Data Sandbox or by using a data connector. With Data Preparation, you can apply a saved preparation with a single click, automating common data cleaning and transformation tasks such as filtering, aggregation, mapping, and joining.
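
As an illustration of the kinds of operations a Data Preparation automates, the equivalent steps can be sketched in plain pandas. This is a hypothetical example only; the file, column, and key names below are placeholders and not part of the product.

```python
# Illustrative sketch: common data preparation operations expressed in pandas.
# File names, column names, and the join key are hypothetical placeholders.
import pandas as pd

orders = pd.read_csv("orders.csv")        # hypothetical input dataset
customers = pd.read_csv("customers.csv")  # hypothetical lookup dataset

# Filtering: keep only completed orders
completed = orders[orders["status"] == "COMPLETED"]

# Mapping: rename a column and derive a new one
completed = completed.rename(columns={"amt": "amount"})
completed["amount_usd"] = completed["amount"] * 0.012  # hypothetical conversion rate

# Joining: enrich orders with customer attributes
enriched = completed.merge(customers, on="customer_id", how="left")

# Aggregation: total amount per customer segment
summary = enriched.groupby("segment", as_index=False)["amount_usd"].sum()
print(summary.head())
```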

All component configurations are broadly classified into three sections:

  • Basic Information

  • Meta Information

  • Resource Configuration

Follow the steps given below to configure the Data Preparation component.

Steps to configure the Data Preparation component

  • Select the Data Preparation component from the Transformations group and drag it to the Pipeline Editor Workspace.

  • The user needs to connect the Data Preparation component with an In-Event and an Out-Event to create a workflow, as displayed below:
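
If the In-Event is backed by a Kafka topic, a few sample records can be pushed into it to exercise the workflow before real data arrives. The following is a minimal sketch using the kafka-python client; the broker address, topic name, and record fields are assumptions and must be replaced with the values configured for your pipeline's event.

```python
# Minimal sketch, assuming the In-Event is a Kafka topic and kafka-python is installed.
# The broker address, topic name, and record fields are hypothetical placeholders.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # replace with your Kafka broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

sample_records = [
    {"customer_id": 101, "status": "COMPLETED", "amt": 250},
    {"customer_id": 102, "status": "PENDING", "amt": 90},
]

for record in sample_records:
    producer.send("data_preparation_in_event", value=record)  # hypothetical topic name

producer.flush()
producer.close()
```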

Meta Information

  • The following two options are provided under the Data Center Type field:

    • Data Set

    • Data Sandbox

Please Note: The configuration fields displayed under the Meta Information tab depend on the option selected for the Data Center Type field.

When Data Set is selected as Data Center Type

  • Navigate to the Meta Information tab.

  • Data Center Type: Select Data Set as the Data Center Type.

  • Data Set Name: Select a Data Set Name using the drop-down menu.

  • Preparation(s): The available Data Preparations for the selected Data Set are listed under the Preparation(s) field. Select a Preparation using the checkbox.

When Data Sandbox is selected as Data Center Type

  • Navigate to the Meta Information tab.

  • Data Center Type: Select Data Sandbox as the Data Center Type.

  • Data Sandbox Name: Select a Data Sandbox Name using the drop-down menu.

  • Preparation(s): The available Data Preparations for the selected Data Sandbox are listed under the Preparation(s) field. Select a Preparation using the checkbox.

Please Note: Once the Meta Information is configured, the transformations defined while creating the Data Preparation are applied to the in-Event data.
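
For reference, the two Meta Information variants can be summarized as simple key-value structures. This is an illustrative sketch only; the field keys, dataset names, and preparation names are hypothetical and do not reflect the exact schema stored by the platform.

```python
# Illustrative summary of the Meta Information fields for each Data Center Type.
# All values below are hypothetical placeholders.
meta_information_data_set = {
    "data_center_type": "Data Set",
    "data_set_name": "sales_orders",          # selected from the drop-down menu
    "preparations": ["clean_order_amounts"],  # selected using the checkbox
}

meta_information_data_sandbox = {
    "data_center_type": "Data Sandbox",
    "data_sandbox_name": "orders_sandbox",    # selected from the drop-down menu
    "preparations": ["clean_order_amounts"],  # selected using the checkbox
}
```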

Saving the Component

  • Click the Save Component in Storage icon.

  • A success notification message appears when the component gets saved.

  • Save and run the Pipeline workflow.

Please Note: Once the Pipeline workflow is saved and activated, the related component logs appear under the Logs panel. A Preview tab appears for the concerned component displaying a preview of the data, and the schema preview can be accessed under the Preview Schema tab.
