Data Cleansing Transforms in Data Preparation
Data Cleansing transforms standardize and sanitize datasets by removing inconsistencies and duplicates, preparing data for analysis.
The Data Cleansing transforms help standardize, clean, and sanitize datasets by removing or correcting unwanted values, duplicates, and inconsistencies. These transforms ensure the dataset is ready for analysis, modeling, or reporting.
Best Situations to Use
Remove invalid, empty, or negative values from columns.
Fill missing values to ensure dataset completeness.
Flag or remove duplicate values or rows for data integrity.
Remove letters, numbers, special characters, or whitespace from columns for consistency.
Prepare data for feature engineering, modeling, or visualization by cleaning metadata and content.
Clear Cells on Matching Value
The Clear Cells on Matching Value transform clears the contents of cells that match a specified condition or value in a column.
Best Situations to Use
Remove specific unwanted or placeholder values from columns (e.g., zeros, default entries).
Standardize columns before applying other transformations or analytics.
Steps
Select a column in the Data Grid.
Navigate to Transforms > Data Cleansing > Clear Cells on Matching Value.
Specify the following:
Operator: equals, starts with, ends with, or regex.
Value: The value or pattern to search for.
Click Submit.
Example: Cells with values 0
or 7
in the Quantity column are cleared, resulting in blank cells.
Delete Rows on Matching Value
The Delete Rows on Matching Value transform removes entire rows that match a specified condition in a column.
Best Situations to Use
Remove invalid or unwanted rows (e.g., placeholders, test data).
Prepare a dataset for analysis by eliminating rows that do not meet criteria.
Steps
Select a column in the Data Grid.
Navigate to Transforms > Data Cleansing > Delete Rows on Matching Value.
Specify the following:
Operator: contains, equals, starts with, ends with, regex.
Value: The value or pattern to search for.
Click Submit.
Example: Row with a value 5
in the Sl. No.
column is removed.
Delete Rows with Empty Cell
Removes rows containing empty cells in a selected column.
Best Situations to Use
Eliminate incomplete rows from datasets.
Ensure datasets are clean for aggregation, reporting, or modeling.
Steps
Select a column.
Navigate to Transforms > Data Cleansing > Delete Rows with Empty Cell.
Click the transform name to apply.
Delete Rows with Invalid Cell
Removes rows containing invalid or inconsistent data in a column.
Best Situations to Use
Filter out corrupted or non-conforming values.
Prepare high-quality datasets for analysis or modeling.
Steps
Select a column.
Navigate to Transforms > Data Cleansing > Delete Rows with Invalid Cell.
Click the transform name to apply.
Delete Rows with Negative Values
Removes rows containing negative numeric values in a selected column.
Best Situations to Use
Remove invalid negative values from datasets where only positive numbers are meaningful (e.g., counts, salaries).
Prepare data for analytics, aggregation, or modeling.
Steps
Select a column with numeric values.
Navigate to Transforms > Data Cleansing > Delete Rows with Negative Values.
Click the transform name to apply.
Example: Two rows with negative values in the id
column are deleted.
Flag Duplicates in Columns
The Flag Duplicates in Columns transform adds a Boolean column indicating duplicate values in a column. Original values are flagged as false
, duplicates as true
.
Best Situations to Use
Identify repeated entries in a single column.
Prepare datasets for cleaning or validation.
Steps
Select a column.
Navigate to Transforms > Data Cleansing > Flag Duplicates in Columns.
Click Submit.
Flag Duplicates in Table
The Flag Duplicates in Table transform adds a Boolean column marking duplicate rows in the table. Original rows are flagged as false
, duplicates as true
.
Best Situations to Use
Identify duplicate rows in a dataset.
Maintain data integrity before analysis or reporting.
Steps
Select a column to check for duplicates.
Navigate to Transforms > Data Cleansing > Flag Duplicates in Table.
Click Submit.
Remove Duplicates from Column
Removes duplicate values from selected column(s). Can be applied to single or multiple columns.
Best Situations to Use
Clean up repeated entries in a column.
Prepare columns for aggregation or grouping operations.
Steps
Select a column.
Navigate to Transforms > Data Cleansing > Remove Duplicates from Column.
Click Submit.
Remove Duplicates from Table
Removes duplicate rows from the dataset entirely.
Best Situations to Use
Clean datasets with identical rows to ensure uniqueness.
Prepare data for modeling, reporting, or aggregation.
Steps
Select a column.
Navigate to Transforms > Data Cleansing > Remove Duplicates from Table.
Click Submit.
Remove Letters
Removes letters from the selected column, optionally creating a new column with the transformed values.
Best Situations to Use
Extract numeric-only content from mixed data columns.
Clean columns for numerical analysis or aggregation.
Steps
Select a column.
Navigate to Transforms > Data Cleansing > Remove Letters.
Enable Create New Column (optional).
Click Submit.
Remove Numbers
The Remove Numbers transform removes all numeric characters from a selected column, leaving only letters, spaces, and other text content.
Best Situations to Use
Clean text columns that accidentally contain numeric values.
Prepare columns for categorical analysis or text processing.
Steps
Select a column containing numeric and text values.
Navigate to Transforms > Data Cleansing > Remove Numbers.
Enable Create New Column (optional) to preserve the original column.
Click Submit.
Remove Special Characters from Metadata
The Remove Special Characters transform cleans column headers by removing punctuation and special characters, retaining only letters, numbers, and spaces.
Best Situations to Use
Standardize column names for analysis, feature engineering, or reporting.
Ensure column names are compatible with other modules or APIs.
Steps
Select a column or metadata header.
Navigate to Transforms > Data Cleansing > Remove Special Characters.
Enable Create New Column (optional).
Click Submit.
Remove Trail and Lead Quotes
This transform removes leading and trailing single or double quotes from string columns.
Best Situations to Use
Clean text data imported from CSV, Excel, or external sources that contain extraneous quotes.
Prepare text for parsing or analysis without manual editing.
Steps
Select a string-type column with leading/trailing quotes.
Navigate to Transforms > Data Cleansing > Remove Trail and Lead Quotes.
Click Submit.
Remove Trail and Lead Whitespace
Removes leading and trailing spaces from string columns.
Best Situations to Use
Standardize text columns for data consistency.
Prepare string data for joins, comparisons, or analytics.
Steps
Select a string-type column with leading/trailing whitespace.
Navigate to Transforms > Data Cleansing > Remove Trail and Lead Whitespace.
Click Submit.
Remove Whitespace
Removes all whitespace from a string, including leading, trailing, and in-between spaces.
Best Situations to Use
Standardize string values that must not contain spaces (e.g., codes, IDs).
Clean data for feature engineering or text-based operations.
Steps
Select a string-type column.
Navigate to Transforms > Data Cleansing > Remove Whitespace.
Enable Create New Column (optional).
Provide a New Column Name.
Click Submit.
Last updated