Create Data Preparation
Use this option to create a Preparation on top of a Data Set.
Data Preparation is the foundational step in any analytics, machine learning, or data-driven decision workflow. It transforms raw or semi-structured data into a clean, enriched, and analysis-ready dataset that can be reliably consumed by downstream applications like BI dashboards, AI models, and Data Agents.
Creating a Data Preparation layer on top of a dataset typically involves the following processes:
1. Data Ingestion & Profiling
Data is first ingested from source systems (databases, files, APIs, or pipelines) and profiled to understand its structure, data types, and patterns. Profiling surfaces data quality issues such as missing values, anomalies, duplicates, outliers, and incorrect data types.
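As a rough illustration of what profiling computes, the sketch below summarizes a small in-memory dataset using only the Python standard library. The sample rows, the column names, and the `profile` helper are hypothetical and are not part of the product; the platform's own profiling is likely to report more than this.

```python
from collections import Counter

# Hypothetical sample rows standing in for an ingested dataset.
rows = [
    {"order_id": "1001", "amount": "250.00", "country": "IN"},
    {"order_id": "1002", "amount": "", "country": "US"},
    {"order_id": "1002", "amount": "", "country": "US"},
    {"order_id": "1003", "amount": "abc", "country": None},
]


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def is_number(value):
    """Return True if the value parses as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def profile(rows):
    """Summarize missing values, numeric values, and exact duplicate rows."""
    columns = list(rows[0].keys()) if rows else []
    report = {"row_count": len(rows), "columns": {}}
    for col in columns:
        values = [r.get(col) for r in rows]
        present = [v for v in values if not is_missing(v)]
        report["columns"][col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "numeric": sum(1 for v in present if is_number(v)),
        }
    # Count rows that repeat an earlier row exactly.
    counts = Counter(tuple(sorted(r.items(), key=lambda kv: kv[0])) for r in rows)
    report["duplicate_rows"] = sum(n - 1 for n in counts.values())
    return report


report = profile(rows)
```

In this sample, a column whose `numeric` count is lower than its count of present values (such as `amount`) points to an incorrect or mixed data type.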
2. Data Cleaning
The dataset undergoes systematic cleaning operations such as:
Removing duplicates
Handling missing values (drop/impute)
Standardizing formats (dates, currencies, phone numbers)
Correcting inconsistent entries
Fixing schema or column-level issues (renaming, type casting)
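The cleaning operations above can be sketched in plain Python as follows. This is a minimal example under stated assumptions: the column names, the two accepted date formats, and the choice of mean imputation are illustrative and do not describe how the product implements cleaning.

```python
from datetime import datetime

# Hypothetical raw rows; column names and formats are illustrative only.
raw = [
    {"Order ID": "1001", "order_date": "2024-01-05", "amount": "250.00"},
    {"Order ID": "1001", "order_date": "2024-01-05", "amount": "250.00"},
    {"Order ID": "1002", "order_date": "05/02/2024", "amount": ""},
    {"Order ID": "1003", "order_date": "2024-03-10", "amount": "150.00"},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats


def parse_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(rows):
    # 1. Remove exact duplicates while keeping first-seen order.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # 2. Rename columns, cast types, and standardize dates.
    cleaned = []
    for row in unique:
        amount = row["amount"].strip()
        cleaned.append({
            "order_id": int(row["Order ID"]),
            "order_date": parse_date(row["order_date"]),
            "amount": float(amount) if amount else None,
        })

    # 3. Impute missing amounts with the mean of the known amounts.
    known = [r["amount"] for r in cleaned if r["amount"] is not None]
    mean_amount = sum(known) / len(known) if known else 0.0
    for r in cleaned:
        if r["amount"] is None:
            r["amount"] = mean_amount
    return cleaned


cleaned = clean(raw)
```

Dropping rows with missing values instead of imputing them is an equally valid choice; which one fits depends on how the column is used downstream.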
3. Data Transformation
Transformations are applied to reshape and enhance the dataset, including:
Column derivation (KPIs, flags, ratios, classifications)
Normalization, aggregation, and filtering
Joins and merges with reference/master tables
Splitting or combining columns
Converting raw attributes into meaningful business metrics
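A small sketch of three of these transformations (filtering, column derivation, and aggregation) is shown below. The order rows, the `HIGH_MARGIN` threshold, and the `transform` function are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical cleaned order rows.
orders = [
    {"region": "North", "revenue": 1200.0, "cost": 900.0, "status": "complete"},
    {"region": "North", "revenue": 800.0, "cost": 700.0, "status": "complete"},
    {"region": "South", "revenue": 500.0, "cost": 450.0, "status": "cancelled"},
    {"region": "South", "revenue": 2000.0, "cost": 1000.0, "status": "complete"},
]

HIGH_MARGIN = 0.25  # assumed business threshold


def transform(rows):
    # Filter: keep only completed orders.
    completed = [r for r in rows if r["status"] == "complete"]

    # Derive: add a margin ratio and a high-margin flag to each row.
    derived = []
    for r in completed:
        margin = (r["revenue"] - r["cost"]) / r["revenue"] if r["revenue"] else 0.0
        derived.append({**r, "margin": margin, "high_margin": margin >= HIGH_MARGIN})

    # Aggregate: total revenue and order count per region.
    totals = defaultdict(lambda: {"revenue": 0.0, "orders": 0})
    for r in derived:
        totals[r["region"]]["revenue"] += r["revenue"]
        totals[r["region"]]["orders"] += 1
    return derived, dict(totals)


derived, by_region = transform(orders)
```

Here the raw `revenue` and `cost` attributes become a business metric (`margin`) and a classification flag (`high_margin`), while `by_region` holds the aggregated view.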
4. Data Enrichment
The dataset is augmented using additional internal or external data sources:
Master data (Customer, Product, Location)
Lookup tables (Codes, Categories)
Behavioral or transactional histories
AI-generated features (sentiment, affinity scores, churn likelihood)
These additional attributes give downstream analysis more context than the source dataset provides on its own.
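Enrichment from master data and lookup tables often amounts to a keyed lookup that keeps every input row. The sketch below assumes hypothetical `customer_master` and `category_lookup` tables and a hypothetical `enrich` helper; none of these names come from the product.

```python
# Hypothetical master data keyed by customer ID.
customer_master = {
    "C1": {"segment": "Enterprise", "city": "Pune"},
    "C2": {"segment": "SMB", "city": "Austin"},
}

# Hypothetical lookup table mapping codes to categories.
category_lookup = {"EL": "Electronics", "GR": "Grocery"}

transactions = [
    {"txn_id": 1, "customer_id": "C1", "category_code": "EL"},
    {"txn_id": 2, "customer_id": "C2", "category_code": "GR"},
    {"txn_id": 3, "customer_id": "C9", "category_code": "XX"},
]


def enrich(rows, master, lookup):
    """Left-join style enrichment: every input row is kept."""
    enriched = []
    for row in rows:
        customer = master.get(row["customer_id"], {})
        enriched.append({
            **row,
            "segment": customer.get("segment"),
            "city": customer.get("city"),
            "category": lookup.get(row["category_code"], "Unknown"),
            # Flag rows with no master record so they can be reviewed.
            "master_matched": row["customer_id"] in master,
        })
    return enriched


enriched = enrich(transactions, customer_master, category_lookup)
```

Keeping unmatched rows and flagging them, rather than silently dropping them, makes gaps in the master data visible to later validation steps.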
5. Validation & Quality Checks
Before publishing, validation and rule-based checks ensure:
Schema accuracy
Business rule conformance
Referential integrity
Threshold-based data quality scores
Any failed rule triggers an alert or a log entry so that the issue can be corrected.
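The four kinds of checks listed above can be sketched as simple row-level rules plus a dataset-level score. Everything named here (the expected schema, the reference set of customers, the 0.9 threshold, and the logger name) is an assumption chosen for the example and is not taken from the product.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("prep.validation")  # hypothetical logger name

EXPECTED_SCHEMA = {"order_id": int, "customer_id": str, "amount": float}
KNOWN_CUSTOMERS = {"C1", "C2"}   # stands in for a reference table
QUALITY_THRESHOLD = 0.9          # assumed minimum share of passing rows

rows = [
    {"order_id": 1, "customer_id": "C1", "amount": 120.0},
    {"order_id": 2, "customer_id": "C2", "amount": -5.0},
    {"order_id": 3, "customer_id": "C7", "amount": 40.0},
    {"order_id": 4, "customer_id": "C1", "amount": 75.5},
]


def check_row(row):
    """Return the names of the rules this row fails."""
    failures = []
    # Schema accuracy: expected columns with expected types.
    if set(row) != set(EXPECTED_SCHEMA) or any(
        not isinstance(row[c], t) for c, t in EXPECTED_SCHEMA.items() if c in row
    ):
        failures.append("schema")
    # Business rule: amounts must not be negative.
    if isinstance(row.get("amount"), float) and row["amount"] < 0:
        failures.append("non_negative_amount")
    # Referential integrity: customer must exist in the reference set.
    if row.get("customer_id") not in KNOWN_CUSTOMERS:
        failures.append("known_customer")
    return failures


def validate(rows):
    failed = {}
    for row in rows:
        failures = check_row(row)
        if failures:
            failed[row.get("order_id")] = failures
            log.warning("order %s failed rules: %s", row.get("order_id"), failures)
    # Quality score: share of rows that pass every rule.
    score = 1 - len(failed) / len(rows) if rows else 1.0
    return {"score": score, "passed": score >= QUALITY_THRESHOLD, "failed": failed}


result = validate(rows)
```

With this sample, two of the four rows fail a rule, so the score falls below the assumed threshold and the dataset would not be published until the logged issues are fixed.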
6. Publishing the Prepared Dataset
The final cleaned, transformed, and enriched dataset is stored in a structured zone (often a Gold Layer, DataMart, or Semantic Layer) and can be consumed by:
BI Dashboards
Data Science Models
Data Agents
Pipeline automation
APIs for external applications
It becomes the “single source of truth” for business analytics.