Create Data Preparation

Users can access the Data Preparation landing page using the Create Data Preparation option on the Data Sandbox list page.

The Data Sandbox serves as an isolated workspace where users can upload, explore, and prepare datasets before moving them into enterprise data pipelines. Applying Data Preparation steps to a Data Sandbox file transforms raw uploaded files—such as CSV, Excel, Parquet, or JSON—into high-quality, analysis-ready data suitable for modeling, reporting, and further processing.

1. File Upload & Initial Exploration

Users upload a dataset into the Sandbox and perform initial inspection:

  • Previewing columns and sample rows

  • Detecting schema and automatic data types

  • Checking file integrity, row counts, and basic patterns

This step provides a quick understanding of the structure and quality of the uploaded file.
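The Sandbox performs this inspection automatically in the UI. For illustration, the equivalent checks can be sketched in pandas (the file contents and column names below are hypothetical examples):

```python
import io
import pandas as pd

# Hypothetical uploaded CSV, represented as an in-memory file
csv_data = io.StringIO(
    "order_id,amount,order_date\n"
    "1001,250.5,2024-01-15\n"
    "1002,99.0,2024-01-16\n"
)
df = pd.read_csv(csv_data)

print(df.head())    # preview columns and sample rows
print(df.dtypes)    # automatically detected data types
print(len(df))      # row count as a basic integrity check
```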

2. Data Profiling

The Sandbox automatically scans the dataset to generate profiling statistics:

  • Missing value percentages

  • Min/Max/Mean values

  • Unique counts and frequency distributions

  • Data type mismatches and anomalies

Profiling helps identify what transformations or cleaning steps are required.
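The profiling statistics listed above map to standard dataframe operations. A minimal pandas sketch of what the Sandbox computes (sample data is hypothetical):

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
    "region,sales\n"
    "North,100\n"
    "South,\n"       # a deliberately missing value
    "North,300\n"
))

missing_pct = df.isna().mean() * 100                # missing value percentages
stats = df["sales"].agg(["min", "max", "mean"])     # min/max/mean values
unique_counts = df.nunique()                        # unique counts per column
freq = df["region"].value_counts()                  # frequency distribution
```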

3. Data Cleaning Operations

Cleaning steps can be applied directly on the Sandbox file, including:

  • Removing duplicate records

  • Handling missing values (fill, drop, or replace)

  • Standardizing date/time formats

  • Fixing inconsistent or invalid entries

  • Trimming whitespace and normalizing text fields

  • Converting data types (e.g., string → integer, string → date)

These steps convert raw files into consistent, trustworthy data.
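The cleaning operations above are applied through the Sandbox interface; conceptually they correspond to transformations like the following pandas sketch (data is illustrative):

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
    "name,age,signup\n"
    " alice ,30,2024-01-05\n"
    " alice ,30,2024-01-05\n"   # exact duplicate record
    "bob,,2024-02-05\n"         # missing age
))

df = df.drop_duplicates()                           # remove duplicate records
df["name"] = df["name"].str.strip().str.title()     # trim whitespace, normalize text
df["age"] = df["age"].fillna(df["age"].median())    # handle missing values (fill)
df["signup"] = pd.to_datetime(df["signup"])         # standardize date/time format
df["age"] = df["age"].astype(int)                   # convert data type (float → int)
```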

4. Transformations & Derivations

Users can apply a wide range of transformations to refine and derive new insights:

  • Column creation (KPIs, ratios, flags, classifications)

  • Aggregation and grouping

  • Filtering based on business rules

  • Splitting or merging columns

  • Conditional logic (IF/ELSE)

  • Reordering or dropping columns

  • Renaming fields for clarity and standardization

This step reshapes the Sandbox file into a usable analytical dataset.
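Several of the transformations above (derived columns, conditional flags, filtering, grouping, renaming) can be sketched together in pandas; the business rules and column names here are hypothetical:

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
    "region,product,units,price\n"
    "North,A,10,5.0\n"
    "North,B,4,12.5\n"
    "South,A,7,5.0\n"
))

df["revenue"] = df["units"] * df["price"]   # derived column (a simple KPI)
df["high_value"] = df["revenue"] > 40       # conditional logic (IF/ELSE-style flag)
df = df[df["units"] >= 5]                   # filtering on a business rule

# Aggregation and grouping, then renaming for clarity
summary = df.groupby("region", as_index=False)["revenue"].sum()
summary = summary.rename(columns={"revenue": "total_revenue"})
```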

5. Data Enrichment

The Sandbox allows enriching the file using:

  • Reference/master data available in the environment

  • Lookup mappings (regions, codes, taxonomy values)

  • Historical data from previous Sandbox sessions

  • Lightweight AI/ML-generated features (optional)

Enrichment adds depth and business relevance to the dataset.
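Lookup-based enrichment is essentially a left join of the Sandbox file against reference data. A minimal sketch, assuming a hypothetical store-to-region mapping table:

```python
import pandas as pd

# The Sandbox file being prepared (illustrative)
sandbox = pd.DataFrame({"store_code": ["S01", "S02", "S03"],
                        "sales": [120, 80, 95]})

# Hypothetical reference/master data with a region mapping
regions = pd.DataFrame({"store_code": ["S01", "S02", "S03"],
                        "region": ["North", "South", "North"]})

# Left join keeps every Sandbox row and attaches the lookup value
enriched = sandbox.merge(regions, on="store_code", how="left")
```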

6. Validation & Quality Checks

Before publishing, users can run validation rules:

  • Schema validation

  • Threshold checks (e.g., sales > 0, dates < today)

  • Null/duplicate checks

  • Referential integrity validation (if master data is used)

Validation ensures the Sandbox file is reliable before it moves to production workflows.
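The validation rules above can be expressed as simple boolean checks. A pandas sketch, with a hypothetical expected schema and sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "sale_id": [1, 2, 3],
    "sales": [100.0, 250.0, 75.0],
    "sale_date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

# Hypothetical expected schema for this file
expected_schema = {"sale_id": "int64", "sales": "float64",
                   "sale_date": "datetime64[ns]"}

checks = {
    # Schema validation
    "schema": all(str(df[c].dtype) == t for c, t in expected_schema.items()),
    # Threshold checks: sales > 0, dates < today
    "sales_positive": bool((df["sales"] > 0).all()),
    "dates_in_past": bool((df["sale_date"] < pd.Timestamp.today()).all()),
    # Null/duplicate checks
    "no_nulls": bool(df.notna().all().all()),
    "no_duplicates": not bool(df.duplicated().any()),
}
```

A file would pass validation only when every entry in `checks` is true.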

7. Publishing the Prepared Sandbox Dataset

Once prepared, the dataset can be:

  • Exported back to the platform under the Data Sandbox list

  • Pushed into the Data Pipeline as a source for scheduled workflows

  • Consumed by Data Agents or Data Science Lab notebooks

It becomes an enterprise-ready dataset that can power dashboards, ML models, and automated processes.
