Create Data Preparation

Users can create Preparations on top of a Data Set using this option.

Data Preparation is the foundational step in any analytics, machine learning, or data-driven decision workflow. It transforms raw or semi-structured data into a clean, enriched, and analysis-ready dataset that can be reliably consumed by downstream applications like BI dashboards, AI models, and Data Agents.

Creating a Data Preparation layer on top of a dataset typically involves the following processes:

1. Data Ingestion & Profiling

Data is first ingested from source systems (databases, files, APIs, or pipelines) and profiled to understand its structure, data types, missing values, patterns, and data quality issues. Profiling helps identify anomalies, duplicates, outliers, and incorrect data types.
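As a rough illustration of what profiling reports, the sketch below summarizes a small list of records using only the Python standard library. The `profile` function and the sample rows are hypothetical and are not part of the product; they only show the kind of per-column statistics (missing values, distinct values, observed types) and duplicate counts that profiling typically surfaces.

```python
from collections import Counter

def profile(rows):
    """Return a simple profile for a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing values.
        present = [v for v in values if v not in (None, "")]
        summary[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            # Mixed types in one column usually point to a casting problem.
            "types": dict(Counter(type(v).__name__ for v in present)),
        }
    # Count fully identical records beyond their first occurrence.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    return {"row_count": len(rows), "duplicate_rows": duplicates, "columns": summary}

# Hypothetical sample records, for illustration only.
rows = [
    {"id": 1, "amount": 10.0, "country": "IN"},
    {"id": 2, "amount": None, "country": "in"},
    {"id": 2, "amount": None, "country": "in"},
    {"id": 3, "amount": "12", "country": ""},
]
report = profile(rows)
```

Here the `amount` column mixes `float` and `str` values and the second record appears twice, which are the kinds of issues the cleaning step then has to address.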

2. Data Cleaning

The dataset undergoes systematic cleaning operations such as:

  • Removing duplicates

  • Handling missing values (drop/impute)

  • Standardizing formats (dates, currencies, phone numbers)

  • Correcting inconsistent entries

  • Fixing schema or column-level issues (renaming, type casting)
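The cleaning operations listed above could be sketched as follows. This is a minimal example under stated assumptions, not the product's implementation: the column names (`OrderID`, `date`, `amount`, `country`), the accepted date formats, and the default used for imputation are all made up for illustration.

```python
from datetime import datetime

# Assumed input date formats; extend this tuple for other sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def parse_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean(rows, default_amount=0.0):
    cleaned, seen_ids = [], set()
    for row in rows:
        # Schema fix: rename OrderID to order_id and cast it to int.
        order_id = int(row["OrderID"])
        if order_id in seen_ids:  # remove duplicates, keeping the first record
            continue
        seen_ids.add(order_id)
        # Standardize dates; drop rows whose date is missing or unparseable.
        order_date = parse_date(row.get("date") or "")
        if order_date is None:
            continue
        # Impute a missing amount with a default value.
        raw_amount = row.get("amount")
        amount = float(raw_amount) if raw_amount not in (None, "") else default_amount
        cleaned.append({
            "order_id": order_id,
            "order_date": order_date,
            "amount": amount,
            # Correct inconsistent entries by normalizing case and whitespace.
            "country": row.get("country", "").strip().upper(),
        })
    return cleaned

# Hypothetical raw records, for illustration only.
raw = [
    {"OrderID": "1", "date": "2024-01-05", "amount": "19.99", "country": " in "},
    {"OrderID": "1", "date": "2024-01-05", "amount": "19.99", "country": "IN"},
    {"OrderID": "2", "date": "06/01/2024", "amount": "", "country": "us"},
    {"OrderID": "3", "date": "", "amount": "5", "country": "US"},
]
cleaned = clean(raw)
```

Whether to drop or impute a missing value depends on the column: in this sketch a missing date removes the row, while a missing amount is filled with a default.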

3. Data Transformation

Transformations are applied to reshape and enhance the dataset, including:

  • Column derivation (KPIs, flags, ratios, classifications)

  • Normalization, aggregation, and filtering

  • Joins and merges with reference/master tables

  • Splitting or combining columns

  • Converting raw attributes into meaningful business metrics
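To make the transformation types above concrete, the following sketch derives a ratio and a flag for each record and then filters and aggregates by region. The field names (`revenue`, `cost`, `region`) and the high-value threshold of 100 are assumptions chosen for the example.

```python
from collections import defaultdict

def derive(order):
    """Add derived columns: a margin ratio and a classification flag."""
    revenue = order["revenue"]
    margin = (revenue - order["cost"]) / revenue if revenue else 0.0
    return {
        **order,
        "margin": round(margin, 2),
        "is_high_value": revenue >= 100,  # assumed business threshold
    }

def revenue_by_region(orders, min_revenue=0):
    """Filter out small orders, then aggregate revenue per region."""
    totals = defaultdict(float)
    for order in orders:
        if order["revenue"] >= min_revenue:
            totals[order["region"]] += order["revenue"]
    return dict(totals)

# Hypothetical order records, for illustration only.
orders = [
    {"order_id": 1, "region": "EMEA", "revenue": 200.0, "cost": 150.0},
    {"order_id": 2, "region": "EMEA", "revenue": 50.0, "cost": 40.0},
    {"order_id": 3, "region": "APAC", "revenue": 120.0, "cost": 60.0},
]
derived = [derive(order) for order in orders]
totals = revenue_by_region(derived, min_revenue=100)
```

The derived columns are kept alongside the raw attributes, so later steps can use either the original values or the business metrics built from them.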

4. Data Enrichment

The dataset is augmented using additional internal or external data sources:

  • Master data (Customer, Product, Location)

  • Lookup tables (Codes, Categories)

  • Behavioral or transactional histories

  • AI-generated features (sentiment, affinity scores, churn likelihood)

Enrichment adds context that the source data lacks, which improves insight generation.
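Enrichment from master data is often a keyed lookup. The sketch below left-joins customer attributes onto transactions; the `enrich` helper, the `customer_master` table, and all field names are hypothetical. Rows without a match keep their original values and receive `None` for the added attributes rather than being dropped.

```python
def enrich(rows, lookup, key, prefix):
    """Left-join lookup attributes onto each row; unmatched rows get None."""
    fields = sorted({field for attrs in lookup.values() for field in attrs})
    enriched = []
    for row in rows:
        attrs = lookup.get(row[key], {})
        # Prefix the added columns so they cannot overwrite existing ones.
        extra = {f"{prefix}{field}": attrs.get(field) for field in fields}
        enriched.append({**row, **extra})
    return enriched

# Hypothetical master data and transactions, for illustration only.
customer_master = {
    "C1": {"segment": "Enterprise", "city": "Pune"},
    "C2": {"segment": "SMB", "city": "Austin"},
}
transactions = [
    {"txn_id": 10, "customer_id": "C1", "amount": 250.0},
    {"txn_id": 11, "customer_id": "C9", "amount": 40.0},
]
enriched = enrich(transactions, customer_master, key="customer_id", prefix="customer_")
```

Keeping unmatched rows makes missing master data visible to the validation step instead of silently shrinking the dataset.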

5. Validation & Quality Checks

Before publishing, validation and rule-based checks ensure:

  • Schema accuracy

  • Business rule conformance

  • Referential integrity

  • Threshold-based data quality scores

Any failed rule triggers an alert or is logged so that the issue can be corrected.
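A rule-based check with a threshold score might look like the sketch below. The expected schema, the rule names, the logger name, and the 0.95 threshold are assumptions for illustration; the score here is simply the fraction of rule evaluations that passed.

```python
import logging

logger = logging.getLogger("data_prep.validation")  # hypothetical logger name

# Assumed target schema: column name -> expected Python type.
EXPECTED_SCHEMA = {"order_id": int, "customer_id": str, "amount": float}

def schema_ok(row):
    """Schema accuracy: exact column set and expected types."""
    return set(row) == set(EXPECTED_SCHEMA) and all(
        isinstance(row[col], typ) for col, typ in EXPECTED_SCHEMA.items()
    )

def validate(rows, known_customers, min_score=0.95):
    """Run every rule on every row; return (score, failures, passed)."""
    rules = {
        "schema": schema_ok,
        # Business rule: amounts must be non-negative numbers.
        "amount_non_negative": lambda r: isinstance(r.get("amount"), float) and r["amount"] >= 0,
        # Referential integrity: the customer must exist in the master data.
        "customer_exists": lambda r: r.get("customer_id") in known_customers,
    }
    failures = []
    for index, row in enumerate(rows):
        for name, rule in rules.items():
            if not rule(row):
                failures.append((index, name))
                logger.warning("row %d failed rule %s", index, name)
    total_checks = len(rows) * len(rules)
    score = 1.0 if total_checks == 0 else 1 - len(failures) / total_checks
    return score, failures, score >= min_score

# Hypothetical records, for illustration only.
candidate_rows = [
    {"order_id": 1, "customer_id": "C1", "amount": 10.0},
    {"order_id": 2, "customer_id": "C9", "amount": -5.0},
]
score, failures, passed = validate(candidate_rows, known_customers={"C1", "C2"})
```

In this example the second record fails two of its three checks, so the dataset scores below the threshold and would not be published until the failures are corrected.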

6. Publishing the Prepared Dataset

The final, cleaned, transformed, and enriched dataset is stored in a structured zone (often a Gold Layer, DataMart, or Semantic Layer) and can be consumed by:

  • BI Dashboards

  • Data Science Models

  • Data Agents

  • Pipeline automation

  • APIs for external applications

It becomes the “single source of truth” for business analytics.
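As a simplified picture of publishing, the sketch below writes a prepared dataset to a CSV file in a directory standing in for a Gold Layer, then reads it back the way a consumer would. The `publish` helper, the `gold` directory, and the file name are hypothetical, and the function assumes a non-empty list of rows with identical keys; a real deployment would usually target a warehouse table or a columnar file format instead.

```python
import csv
import tempfile
from pathlib import Path

def publish(rows, zone_dir, name):
    """Write rows to <zone_dir>/<name>.csv and return the final path."""
    zone_dir = Path(zone_dir)
    zone_dir.mkdir(parents=True, exist_ok=True)
    target = zone_dir / f"{name}.csv"
    tmp = zone_dir / f"{name}.csv.tmp"
    with tmp.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    # Swap the finished file into place so consumers never read a partial write.
    tmp.replace(target)
    return target

# Hypothetical prepared records, for illustration only.
prepared = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 2, "amount": 0.0},
]
with tempfile.TemporaryDirectory() as root:
    path = publish(prepared, Path(root) / "gold", "orders")
    with path.open(newline="", encoding="utf-8") as handle:
        published = list(csv.DictReader(handle))
```

Note that CSV stores every value as text, so consumers reading `published` get strings and must cast them back to numeric types.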
