Create Data Preparation
Users can access the Data Preparation landing page by selecting this option from the Data Sandbox list page.
The Data Sandbox provides an isolated workspace where users can upload, explore, and prepare datasets before moving them into enterprise data pipelines. Applying Data Preparation steps to a Data Sandbox file transforms raw uploads, such as CSV, Excel, Parquet, or JSON files, into high-quality, analysis-ready data suitable for modeling, reporting, and further processing.
1. File Upload & Initial Exploration
Users upload a dataset into the Sandbox and perform initial inspection:
Previewing columns and sample rows
Detecting schema and automatic data types
Checking file integrity, row counts, and basic patterns
This step provides a quick understanding of the structure and quality of the uploaded file.
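The initial inspection described above can be sketched in pandas. This is a minimal illustration, assuming a CSV upload; the column names and sample values are hypothetical, and the actual Sandbox performs these checks through its UI:

```python
import io
import pandas as pd

# Hypothetical in-memory sample standing in for an uploaded Sandbox file.
raw_csv = io.StringIO(
    "order_id,region,sales,order_date\n"
    "1,North,120.5,2024-01-05\n"
    "2,South,87.0,2024-01-06\n"
    "3,North,,2024-01-07\n"
)

df = pd.read_csv(raw_csv)

# Preview columns and sample rows
preview = df.head(3)

# Detect schema and automatically inferred data types
schema = df.dtypes.astype(str).to_dict()

# Basic integrity check: row and column counts
row_count, col_count = df.shape
```

Note that pandas infers `sales` as a float column because of the empty cell in row 3, which is exactly the kind of detail this first pass is meant to surface.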
2. Data Profiling
The Sandbox automatically scans the dataset to generate profiling statistics:
Missing value percentages
Min/Max/Mean values
Unique counts and frequency distributions
Data type mismatches and anomalies
Profiling helps identify what transformations or cleaning steps are required.
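The profiling statistics listed above can be approximated with a few pandas calls. A minimal sketch on hypothetical data (the real Sandbox generates these automatically on scan):

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
    "customer,amount\n"
    "A,10\n"
    "B,\n"
    "A,30\n"
    "C,20\n"
))

# Missing value percentages per column
missing_pct = (df.isna().mean() * 100).round(1).to_dict()

# Min / max / mean for a numeric column
stats = {
    "min": df["amount"].min(),
    "max": df["amount"].max(),
    "mean": df["amount"].mean(),
}

# Unique counts and frequency distributions
unique_counts = df.nunique().to_dict()
freq = df["customer"].value_counts().to_dict()
```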
3. Data Cleaning Operations
Cleaning steps can be applied directly on the Sandbox file, including:
Removing duplicate records
Handling missing values (fill, drop, or replace)
Standardizing date/time formats
Fixing inconsistent or invalid entries
Trimming whitespace and normalizing text fields
Converting data types (e.g., string → integer, string → date)
These steps convert raw files into consistent, trustworthy data.
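The cleaning operations above map onto standard dataframe calls. A hedged sketch with made-up data, showing duplicate removal, whitespace trimming, missing-value handling, date standardization, and type conversion in sequence:

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
    "name,qty,signup\n"
    "  alice ,5,2024/01/05\n"
    "bob,,2024/01/06\n"
    "bob,,2024/01/06\n"
))

# Remove duplicate records
df = df.drop_duplicates()

# Trim whitespace and normalize text fields
df["name"] = df["name"].str.strip().str.title()

# Handle missing values (fill with a default)
df["qty"] = df["qty"].fillna(0)

# Standardize date/time formats
df["signup"] = pd.to_datetime(df["signup"], format="%Y/%m/%d")

# Convert data types (string/float → integer)
df["qty"] = df["qty"].astype(int)
```

Order matters here: duplicates are dropped before filling missing values, so a fill default cannot mask two rows that were only identical because both were blank.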
4. Transformations & Derivations
Users can apply a wide range of transformations to refine and derive new insights:
Column creation (KPIs, ratios, flags, classifications)
Aggregation and grouping
Filtering based on business rules
Splitting or merging columns
Conditional logic (IF/ELSE)
Reordering or dropping columns
Renaming fields for clarity and standardization
This step reshapes the Sandbox file into a usable analytical dataset.
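Several of the transformations listed above can be sketched together in pandas. The column names and business rule are illustrative assumptions, not the platform's actual syntax:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South"],
    "sales": [120.0, 80.0, 200.0],
    "cost": [60.0, 50.0, 150.0],
})

# Column creation: derive a margin ratio, plus a flag via conditional logic
df["margin"] = (df["sales"] - df["cost"]) / df["sales"]
df["high_value"] = df["sales"].apply(lambda s: "yes" if s >= 100 else "no")

# Filtering based on a business rule
profitable = df[df["margin"] > 0]

# Aggregation and grouping
by_region = profitable.groupby("region")["sales"].sum().to_dict()

# Renaming fields for clarity; dropping a column no longer needed
df = df.rename(columns={"sales": "revenue"}).drop(columns=["cost"])
```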
5. Data Enrichment
The Sandbox allows enriching the file using:
Reference/master data available in the environment
Lookup mappings (regions, codes, taxonomy values)
Historical data from previous Sandbox sessions
Lightweight AI/ML-generated features (optional)
Enrichment adds depth and business relevance to the dataset.
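A lookup-mapping enrichment of the kind described above amounts to a left join against reference data. A minimal sketch, with a hypothetical region-code mapping standing in for the environment's master data:

```python
import pandas as pd

# Sandbox file with raw region codes
df = pd.DataFrame({"order_id": [1, 2, 3], "region_code": ["N", "S", "X"]})

# Hypothetical lookup / master data mapping codes to region names
lookup = pd.DataFrame({
    "region_code": ["N", "S", "E", "W"],
    "region_name": ["North", "South", "East", "West"],
})

# Left join so unmatched codes survive (as nulls) and can be reviewed
enriched = df.merge(lookup, on="region_code", how="left")
unmatched = int(enriched["region_name"].isna().sum())
```

A left join (rather than an inner join) is the safer default here: codes missing from the master data are flagged instead of silently dropping rows.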
6. Validation & Quality Checks
Before publishing, users can run validation rules:
Schema validation
Threshold checks (e.g., sales > 0, dates < today)
Null/duplicate checks
Referential integrity validation (if master data is used)
Validation ensures the Sandbox file is reliable before it moves to production workflows.
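The validation rules above can be expressed as simple checks that accumulate failure counts. This is an assumed sketch of the pattern, not the platform's rule engine; the sample data is deliberately seeded with one failure of each kind:

```python
import pandas as pd

df = pd.DataFrame({
    "sale_id": [1, 2, 2],
    "sales": [150.0, -20.0, 90.0],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-06", "2099-01-01"]),
})

failures = {}

# Schema validation: required columns present
expected = {"sale_id", "sales", "order_date"}
failures["schema"] = not expected.issubset(df.columns)

# Threshold checks (e.g., sales > 0, dates < today)
failures["sales_positive"] = int((df["sales"] <= 0).sum())
failures["future_dates"] = int((df["order_date"] >= pd.Timestamp.today()).sum())

# Null / duplicate checks on the key column
failures["null_ids"] = int(df["sale_id"].isna().sum())
failures["duplicate_ids"] = int(df["sale_id"].duplicated().sum())

# Dataset passes only when no rule reports a violation
passed = failures["schema"] is False and all(
    v == 0 for k, v in failures.items() if k != "schema"
)
```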
7. Publishing the Prepared Sandbox Dataset
Once prepared, the dataset can be:
Exported back to the platform under the Data Sandbox list
Pushed into the Data Pipeline as a source for scheduled workflows
Consumed by Data Agents or Data Science Lab notebooks
It becomes an enterprise-ready dataset that can power dashboards, ML models, and automated processes.