Data Preparation (Docker)
The Data Preparation Component enables users to run data preparation scripts on selected datasets. These datasets can be sourced from:
Data Sets created using Data Connectors.
The Data Sandbox, which holds uploaded files.
With this component, you can automate common data cleaning and transformation tasks, including:
Filtering
Aggregation
Mapping
Joining
This makes it easier to prepare datasets for downstream analysis or machine learning pipelines.
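The four operation types can be pictured with a short pandas sketch; this is an illustration only, not the component's internal engine, and all column names and values are invented.

```python
import pandas as pd

# Hypothetical transactions; in the pipeline, data arrives via the In-Event.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [120.0, -5.0, 80.0, 200.0],
    "region_code": ["us-e", "eu-w", "eu-w", "us-w"],
})
regions = pd.DataFrame({
    "region_code": ["US-E", "EU-W", "US-W"],
    "region_name": ["US East", "EU West", "US West"],
})

df = df[df["amount"] > 0]                             # Filtering: drop invalid rows
df["region_code"] = df["region_code"].str.upper()     # Mapping: standardize a field
df = df.merge(regions, on="region_code", how="left")  # Joining: enrich via lookup
summary = df.groupby("region_name", as_index=False)["amount"].sum()  # Aggregation
print(summary)
```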
Key Capabilities
Apply predefined preparation scripts to in-event data.
Support both Data Set and Data Sandbox sources.
Reuse transformations previously created in the Data Center module.
Preview data and inspect the schema after pipeline execution.
Configuration Overview
All Data Preparation configurations are organized into the following sections:
Basic Information
Meta Information
Resource Configuration
Steps to Configure the Data Preparation Component
Add the Component
From the Transformations group, drag and drop the Data Preparation Component into the Pipeline Editor Workspace.
Connect Events
Connect the component to an In-Event (source) and an Out-Event (destination).
This creates a complete workflow where the prepared data flows through the pipeline.
Configure Meta Information
Navigate to the Meta Information tab.
Under Data Center Type, select one of the following:
Option 1: Data Set
If Data Set is selected as the Data Center Type:
Data Center Type – Select Data Set.
Data Set Name – Choose a dataset from the drop-down menu.
Preparation(s) – Select one or more available preparations associated with the dataset.
Selecting a preparation displays the list of transformations applied during its creation.
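As a mental model, the Data Set configuration can be represented as the following key-value sketch; the keys are hypothetical, and the actual values are set through the Meta Information tab, not in code.

```python
# Hypothetical representation of the Data Set meta information.
# Keys are illustrative; the component is configured through the UI.
meta_information = {
    "data_center_type": "Data Set",
    "data_set_name": "customer_master",                       # from the drop-down
    "preparations": ["trim_whitespace", "dedupe_customers"],  # one or more
}
```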
Option 2: Data Sandbox
If Data Sandbox is selected as the Data Center Type:
Data Center Type – Select Data Sandbox.
Data Sandbox Name – Choose a sandbox from the drop-down menu.
Preparation(s) – Select one or more available preparations associated with the sandbox.
Selecting a preparation displays the list of transformations applied during its creation.
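The Data Sandbox variant follows the same shape; again, the keys and values below are hypothetical.

```python
# Hypothetical representation of the Data Sandbox meta information.
meta_information = {
    "data_center_type": "Data Sandbox",
    "data_sandbox_name": "uploaded_files_q3",             # from the drop-down
    "preparations": ["parse_dates", "normalize_headers"],
}
```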
Note
For consistent transformations, the In-Event must supply the same source data that was used to create the preparation.
Files uploaded to the Data Sandbox by Admin users are not visible in the Data Sandbox Name field for non-admin users.
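Because a preparation records transformations against the columns of its original source, a quick pre-flight check along these lines can catch mismatched In-Event data early. This is a standalone pandas sketch, not part of the component, and the column names are invented.

```python
import pandas as pd

def assert_schema_matches(incoming: pd.DataFrame, expected: set) -> None:
    """Raise if the In-Event data lacks columns the preparation expects."""
    missing = expected - set(incoming.columns)
    if missing:
        raise ValueError(f"In-Event data missing expected columns: {sorted(missing)}")

# Hypothetical In-Event sample and expected columns.
sample = pd.DataFrame({"customer_id": [1], "amount": [9.99]})
assert_schema_matches(sample, {"customer_id", "amount"})
```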
Saving the Component
Click the Save Component in Storage icon.
A success notification confirms that the component configuration has been saved.
Save and run the pipeline workflow.
Logs and Preview
Once the pipeline workflow is saved and activated:
Related component logs appear in the Logs Panel.
A Preview tab is available for the component, displaying the transformed data.
The Preview Schema tab provides a schema view of the transformed dataset.
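Locally, the Preview and Preview Schema tabs correspond roughly to sampling rows and inspecting column types, as in this pandas analogy (the dataset is invented):

```python
import pandas as pd

transformed = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "total_amount": [120.0, 80.0, 200.0],
})

print(transformed.head())   # analogous to the Preview tab: sample rows
print(transformed.dtypes)   # analogous to the Preview Schema tab: column types
```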
Example Use Cases
Standardize customer data across multiple sources (map inconsistent fields).
Join transactional data with lookup tables for enrichment.
Filter out invalid or duplicate records before downstream processing.
Aggregate raw logs into summarized datasets for analytics.
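As a concrete illustration of the enrichment and filtering use cases, the sketch below joins transactions with a lookup table and removes duplicates; it assumes pandas, and all names and data are invented.

```python
import pandas as pd

transactions = pd.DataFrame({
    "order_id": [101, 101, 102],
    "sku": ["A1", "A1", "B2"],
    "qty": [2, 2, 1],
})
catalog = pd.DataFrame({"sku": ["A1", "B2"], "product": ["Widget", "Gadget"]})

deduped = transactions.drop_duplicates()                 # filter duplicate records
enriched = deduped.merge(catalog, on="sku", how="left")  # enrich via lookup join
print(enriched)
```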