Data Preparation (Docker)

The Data Preparation Component enables users to run data preparation scripts on selected datasets. These datasets can be sourced from:

  • Data Sets created using Data Connectors.

  • The Data Sandbox, which holds uploaded files.

With this component, you can automate common data cleaning and transformation tasks, including:

  • Filtering

  • Aggregation

  • Mapping

  • Joining

This makes it easier to prepare datasets for downstream analysis or machine learning pipelines.
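
As a rough illustration of these four operations, the pandas sketch below applies each one to a small invented dataset. This is not the platform's preparation API; all column names are made up for the example.

    import pandas as pd

    orders = pd.DataFrame({
        "customer_id": [1, 1, 2, 3],
        "region": ["north", "NORTH", "south", "east"],
        "status": ["ok", "ok", "invalid", "ok"],
        "amount": [100.0, 250.0, 75.0, 30.0],
    })
    customers = pd.DataFrame({"customer_id": [1, 2, 3],
                              "name": ["Ada", "Ben", "Cleo"]})

    valid = orders[orders["status"] == "ok"]                           # filtering
    mapped = valid.assign(region=valid["region"].str.lower())          # mapping
    joined = mapped.merge(customers, on="customer_id", how="left")     # joining
    totals = joined.groupby("region", as_index=False)["amount"].sum()  # aggregation
    print(totals)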

Key Capabilities

  • Applies predefined preparation scripts to In-Event data.

  • Supports both Data Set and Data Sandbox sources.

  • Reuses transformations previously created in the Data Center module.

  • Provides data preview and schema inspection after pipeline execution.

Configuration Overview

All Data Preparation configurations are organized into the following sections:

  • Basic Information

  • Meta Information

  • Resource Configuration
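
As a mental model, these three sections might be pictured as the sketch below. All section and field names are invented for illustration and are not the platform's actual configuration schema.

    # Hypothetical sketch of how a saved configuration might be grouped.
    # Section and field names are illustrative, not the platform's schema.
    component_config = {
        "basic_information": {
            "component_name": "prepare_orders",
            "description": "Cleans incoming order records",
        },
        "meta_information": {
            "data_center_type": "Data Set",          # or "Data Sandbox"
            "data_set_name": "customer_orders",
            "preparations": ["standardize_regions"],
        },
        "resource_configuration": {
            "cpu": "1",          # illustrative values only
            "memory": "2Gi",
        },
    }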

Steps to Configure the Data Preparation Component

  1. Add the Component

    • From the Transformations group, drag and drop the Data Preparation Component into the Pipeline Editor Workspace.

  2. Connect Events

    • Connect the component to an In-Event (source) and an Out-Event (destination).

    • This creates a complete workflow where the prepared data flows through the pipeline.

  3. Configure Meta Information

    • Navigate to the Meta Information tab.

    • Under Data Center Type, select one of the following:

Option 1: Data Set

If Data Set is selected as the Data Center Type:

  • Data Center Type – Select Data Set.

  • Data Set Name – Choose a dataset from the drop-down menu.

  • Preparation(s) – Select one or more available preparations associated with the dataset.

    • Selecting a preparation displays the list of transformations that were applied when it was created.

Option 2: Data Sandbox

If Data Sandbox is selected as the Data Center Type:

  • Data Center Type – Select Data Sandbox.

  • Data Sandbox Name – Choose a sandbox from the drop-down menu.

  • Preparation(s) – Select one or more available preparations associated with the sandbox.

    • Selecting a preparation displays the list of transformations that were applied when it was created.
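
The two options differ only in where the data comes from. Continuing the hypothetical sketch from above, the Meta Information for the sandbox case would swap a single field (names remain illustrative, not the platform's schema):

    # Hypothetical sketch: relative to Option 1, only the source field
    # changes. Field names are illustrative, not the platform's schema.
    meta_information = {
        "data_center_type": "Data Sandbox",
        "data_sandbox_name": "uploads_q3",        # picked from the drop-down
        "preparations": ["standardize_regions"],  # one or more preparations
    }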

Note

  • The input event must supply the same source data that was used to create the preparation; otherwise the transformations may not be applied consistently.

  • Files uploaded to the Data Sandbox by Admin users will not be visible in the Data Sandbox Name field for non-admin users.
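
The first note is, in effect, a schema-compatibility requirement. A minimal local sketch of the idea, assuming pandas and an invented expected_columns list recorded when the preparation was created:

    import pandas as pd

    # Columns the preparation was built against (invented for the example).
    expected_columns = ["customer_id", "region", "status", "amount"]

    def check_input_matches_preparation(df: pd.DataFrame) -> None:
        """Fail fast if the In-Event data lacks columns the preparation expects."""
        missing = [c for c in expected_columns if c not in df.columns]
        if missing:
            raise ValueError(f"Input event is missing columns: {missing}")

    incoming = pd.DataFrame(columns=["customer_id", "region", "amount"])
    try:
        check_input_matches_preparation(incoming)
    except ValueError as err:
        print(err)  # Input event is missing columns: ['status']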

Saving the Component

  1. Click the Save Component in Storage icon.

  2. A success notification confirms that the component configuration has been saved.

  3. Save and run the pipeline workflow.

Logs and Preview

  • Once the pipeline workflow is saved and activated:

    • Related component logs appear in the Logs Panel.

    • A Preview tab is available for the component, displaying the transformed data.

    • The Preview Schema tab provides a schema view of the transformed dataset.
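
Outside the platform, these two tabs correspond roughly to looking at sample rows and at inferred column types. A pandas analogue, with invented data:

    import pandas as pd

    transformed = pd.DataFrame({
        "region": ["north", "south"],
        "amount": [350.0, 75.0],
    })

    print(transformed.head())   # rough analogue of the Preview tab
    print(transformed.dtypes)   # rough analogue of the Preview Schema tab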

Example Use Cases

  • Standardize customer data across multiple sources (map inconsistent fields).

  • Join transactional data with lookup tables for enrichment.

  • Filter out invalid or duplicate records before downstream processing.

  • Aggregate raw logs into summarized datasets for analytics.
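
Tying these use cases together, a minimal pandas sketch (all names invented) that standardizes an inconsistent field, drops duplicates, enriches via a lookup join, and aggregates:

    import pandas as pd

    logs = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "country": ["US", "US", "de"],  # inconsistent casing
        "bytes": [512, 512, 2048],      # first two rows are duplicates
    })
    lookup = pd.DataFrame({"customer_id": [1, 2], "plan": ["pro", "free"]})

    clean = (
        logs.assign(country=logs["country"].str.upper())  # standardize fields
            .drop_duplicates()                            # drop duplicate records
            .merge(lookup, on="customer_id", how="left")  # enrich via lookup join
    )
    summary = clean.groupby("plan", as_index=False)["bytes"].sum()  # aggregate
    print(summary)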