# Create Data Preparation

The Data Sandbox serves as an isolated workspace where users can upload, explore, and prepare datasets before moving them into enterprise data pipelines. Applying Data Preparation steps to a Data Sandbox file transforms raw uploaded files (such as CSV, Excel, Parquet, or JSON) into clean, analysis-ready data suitable for modeling, reporting, and further processing.

#### **1. File Upload & Initial Exploration**

Users upload a dataset into the Sandbox and perform initial inspection:

* Previewing columns and sample rows
* Detecting schema and automatic data types
* Checking file integrity, row counts, and basic patterns

This step provides a quick understanding of the structure and quality of the uploaded file.
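
The platform performs this inspection through its interface. As a rough illustration of what the inspection involves, the sketch below previews a small CSV, infers a simple type for each column, and counts rows using only the Python standard library. The sample data, the `infer_type` and `explore` helpers, and the three type labels are assumptions made for this example and are not part of the product.

```python
import csv
import io

# Hypothetical sample standing in for an uploaded CSV file.
RAW_CSV = """order_id,region,amount,order_date
1001,North,250.50,2024-01-15
1002,South,99.00,2024-01-16
1003,North,,2024-01-17
"""


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    # Try the strictest type first, then fall back.
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def explore(text, sample_size=2):
    """Return column names, inferred types, row count, and a small sample."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    schema = {c: infer_type([r[c] for r in rows]) for c in columns}
    return {
        "columns": columns,
        "schema": schema,
        "row_count": len(rows),
        "sample": rows[:sample_size],
    }


summary = explore(RAW_CSV)
print(summary["schema"])
print("rows:", summary["row_count"])
```

Dates are reported as `string` here because this sketch only distinguishes integers, floats, and text; the Sandbox's own type detection may recognize more types.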

#### **2. Data Profiling**

The Sandbox automatically scans the dataset to generate profiling statistics:

* Missing value percentages
* Min/Max/Mean values
* Unique counts and frequency distributions
* Data type mismatches and anomalies

Profiling helps identify what transformations or cleaning steps are required.
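
To make the profiling statistics above concrete, the following sketch computes them for a handful of in-memory rows. It is an illustration under stated assumptions, not the Sandbox's profiling engine: the sample rows, the `profile_column` helper, and the use of `None` to mark a missing value are all hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"region": "North", "amount": 250.5},
    {"region": "South", "amount": 99.0},
    {"region": "North", "amount": None},
    {"region": None, "amount": 120.0},
]


def profile_column(rows, column):
    """Compute missing percentage, unique count, frequencies, and numeric stats."""
    values = [r.get(column) for r in rows]
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    stats = {
        "missing_pct": 100.0 * missing / len(values) if values else 0.0,
        "unique_count": len(set(present)),
        "frequencies": Counter(present),
    }
    # Min/max/mean only make sense when every present value is numeric.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric and len(numeric) == len(present):
        stats.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
    return stats


profile = {c: profile_column(rows, c) for c in ("region", "amount")}
for column, stats in profile.items():
    print(column, stats)
```

A column that mixes numeric and non-numeric values gets no min/max/mean in this sketch, which is one simple way to surface a data type mismatch.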

#### **3. Data Cleaning Operations**

Cleaning steps can be applied directly on the Sandbox file, including:

* Removing duplicate records
* Handling missing values (fill, drop, or replace)
* Standardizing date/time formats
* Fixing inconsistent or invalid entries
* Trimming whitespace and normalizing text fields
* Converting data types (string → integer, string → date, etc.)

These steps convert raw files into consistent, trustworthy data.
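
The cleaning operations listed above can be pictured with a small standard-library sketch. It trims and normalizes text, fills a missing amount with a default, converts strings to integers, standardizes two assumed date formats to ISO 8601, and then drops duplicates. The field names, the `DATE_FORMATS` tuple, and the default fill value are assumptions for illustration; the Sandbox applies the equivalent steps through its own interface.

```python
from datetime import datetime

# Hypothetical raw rows as they might arrive from an uploaded file.
raw_rows = [
    {"customer": "  Alice ", "amount": "100", "order_date": "15/01/2024"},
    {"customer": "alice", "amount": "100", "order_date": "2024-01-15"},
    {"customer": "Bob", "amount": "", "order_date": "2024-01-16"},
]

# Assumed input formats; extend as needed for a real file.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def standardize_date(text):
    """Return an ISO 8601 date string, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(rows, default_amount=0):
    """Normalize each row, then keep only the first copy of each duplicate."""
    seen, cleaned = set(), []
    for row in rows:
        amount_text = row["amount"].strip()
        record = {
            # Trim whitespace and normalize capitalization.
            "customer": row["customer"].strip().title(),
            # Fill missing values, then convert string to integer.
            "amount": int(amount_text) if amount_text else default_amount,
            "order_date": standardize_date(row["order_date"].strip()),
        }
        key = tuple(record.values())
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


cleaned_rows = clean(raw_rows)
for record in cleaned_rows:
    print(record)
```

Deduplication runs after normalization in this sketch so that records differing only in whitespace, letter case, or date format are treated as the same record.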

#### **4. Transformations & Derivations**

Users can apply a wide range of transformations to refine the data and derive new fields:

* Column creation (KPIs, ratios, flags, classifications)
* Aggregation and grouping
* Filtering based on business rules
* Splitting or merging columns
* Conditional logic (IF/ELSE)
* Reordering or dropping columns
* Renaming fields for clarity and standardization

This step reshapes the Sandbox file into a usable analytical dataset.
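
As an illustration of several transformations from the list above, the sketch below filters rows by a business rule, derives a margin ratio, adds a flag with conditional logic, renames a field, and aggregates revenue by group. The column names, the rule that revenue must be positive, and the `transform` and `revenue_by_region` helpers are hypothetical examples rather than Sandbox functions.

```python
from collections import defaultdict

# Hypothetical order rows.
orders = [
    {"region": "North", "revenue": 200.0, "cost": 150.0},
    {"region": "North", "revenue": 100.0, "cost": 120.0},
    {"region": "South", "revenue": 300.0, "cost": 180.0},
    {"region": "South", "revenue": 0.0, "cost": 10.0},
]


def transform(rows):
    """Filter, derive a ratio column, add a conditional flag, and rename a field."""
    out = []
    for r in rows:
        # Filtering: assumed business rule that revenue must be positive.
        if r["revenue"] <= 0:
            continue
        # Column creation: margin as a ratio of revenue.
        margin = (r["revenue"] - r["cost"]) / r["revenue"]
        out.append({
            "sales_region": r["region"],  # renamed from "region"
            "revenue": r["revenue"],
            "margin": round(margin, 2),
            # Conditional logic (IF/ELSE) producing a classification.
            "status": "profitable" if margin > 0 else "loss",
        })
    return out


def revenue_by_region(rows):
    """Aggregation and grouping: total revenue per sales region."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["sales_region"]] += r["revenue"]
    return dict(totals)


prepared = transform(orders)
totals = revenue_by_region(prepared)
print(prepared)
print(totals)
```

Dropping the `cost` column in the output also shows how a transformation step can narrow a dataset to the fields needed downstream.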

#### **5. Data Enrichment**

The Sandbox allows enriching the file using:

* Reference/master data available in the environment
* Lookup mappings (regions, codes, taxonomy values)
* Historical data from previous Sandbox sessions
* Lightweight AI/ML-generated features (optional)

Enrichment adds depth and business relevance to the dataset.
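
A lookup mapping is the simplest form of enrichment. The sketch below joins a hypothetical region-code table onto each record and labels unmatched codes explicitly. The `REGION_LOOKUP` table, the field names, and the `enrich` helper are assumptions for this example; in the Sandbox the reference data would come from the environment.

```python
# Hypothetical reference (master) data: region code to region name.
REGION_LOOKUP = {"N": "North", "S": "South"}

# Hypothetical Sandbox records to enrich.
records = [
    {"order_id": 1, "region_code": "N"},
    {"order_id": 2, "region_code": "X"},
]


def enrich(rows, lookup, key="region_code", target="region_name", default="Unknown"):
    """Return new rows with a looked-up field added; inputs are not modified."""
    return [{**r, target: lookup.get(r[key], default)} for r in rows]


enriched = enrich(records, REGION_LOOKUP)
for record in enriched:
    print(record)
```

Returning new rows instead of editing the originals keeps the raw records available for comparison, and the explicit `Unknown` label makes unmatched codes easy to find in a later quality check.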

#### **6. Validation & Quality Checks**

Before publishing, users can run validation rules:

* Schema validation
* Threshold checks (e.g., sales > 0, dates < today)
* Null/duplicate checks
* Referential integrity validation (if master data is used)

Validation ensures the Sandbox file is reliable before it moves to production workflows.
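
The validation rules above can be expressed as simple per-row checks. The following sketch covers schema validation, a null and duplicate check on an identifier, the two threshold examples (sales > 0, dates < today), and a referential check against known regions. The expected columns, the `KNOWN_REGIONS` set, the error messages, and the `validate` helper are hypothetical; the Sandbox defines and runs its own rules.

```python
from datetime import date

# Assumed schema and master data for this illustration.
EXPECTED_COLUMNS = {"order_id", "sales", "order_date", "region"}
KNOWN_REGIONS = {"North", "South"}


def validate(rows, today):
    """Return a list of (row_index, message) pairs for every failed rule."""
    errors = []
    seen_ids = set()
    for i, r in enumerate(rows):
        # Schema validation: the row must have exactly the expected columns.
        if set(r) != EXPECTED_COLUMNS:
            errors.append((i, "schema mismatch"))
            continue
        # Null and duplicate checks on the identifier.
        if r["order_id"] is None:
            errors.append((i, "null order_id"))
        elif r["order_id"] in seen_ids:
            errors.append((i, "duplicate order_id"))
        else:
            seen_ids.add(r["order_id"])
        # Threshold checks.
        if r["sales"] is None or r["sales"] <= 0:
            errors.append((i, "sales must be greater than 0"))
        if r["order_date"] is None or r["order_date"] >= today:
            errors.append((i, "order_date must be before today"))
        # Referential integrity against master data.
        if r["region"] not in KNOWN_REGIONS:
            errors.append((i, "unknown region"))
    return errors


rows = [
    {"order_id": 1, "sales": 10.0, "order_date": date(2024, 1, 15), "region": "North"},
    {"order_id": 1, "sales": -5.0, "order_date": date(2024, 1, 16), "region": "East"},
    {"order_id": 2, "sales": 20.0, "order_date": date(2024, 1, 17)},
]

# A fixed date keeps the example reproducible; date.today() would be typical in practice.
errors = validate(rows, today=date(2024, 6, 1))
for index, message in errors:
    print(index, message)
```

Collecting every failure instead of stopping at the first one gives a complete picture of what needs fixing before the dataset is published.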

#### **7. Publishing the Prepared Sandbox Dataset**

Once prepared, the dataset can be:

* Exported back to the platform under the Data Sandbox list
* Pushed into the Data Pipeline as a source for scheduled workflows
* Consumed by Data Agents or Data Science Lab notebooks

The published dataset is then enterprise-ready and can power dashboards, ML models, and automated processes.

{% hint style="info" %}
**Note:**

* Refer to the [**Data Preparation**](/bdb-user-documentation/platform-modules/10.0/data-center/my-connectors-data-flow-initiation-and-management/data-preparation.md) section of this documentation for a comprehensive understanding of the entire process.
* Refer to the [**Launch Data Preparation from the Data Sandbox list**](/bdb-user-documentation/platform-modules/10.0/data-center/my-connectors-data-flow-initiation-and-management/data-preparation.md#launch-from-the-data-sandbox-list) section to learn how to access the Data Preparation landing page for a dataset.
  {% endhint %}

