Machine Learning (ML) Transforms
The ML transforms provide operations commonly used in feature engineering, preprocessing, and time-series analysis for machine learning workflows.
They restructure raw data into a form that is easier to train models on, for example by converting categories to numbers, grouping continuous values, or deriving window-based features, and they help keep the data consistent with the assumptions of the chosen algorithm.
Binarizer
Converts a numerical column into binary values based on a threshold.
Values ≤ threshold → 0
Values > threshold → 1
Best Situations to Use
Prepare binary features for classification models.
Transform continuous variables into categorical binary variables.
Steps
Select a numerical column.
Open Transforms > ML > Binarizer.
Provide the Threshold value.
Click Submit.
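The thresholding rule above can be sketched in a few lines of plain Python. This is only an illustration of the rule (values ≤ threshold become 0, values > threshold become 1), not the product's implementation; the function name `binarize` and the sample data are hypothetical.

```python
def binarize(values, threshold):
    """Return 1 for values strictly above the threshold, otherwise 0."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical numeric column with a threshold of 18.
ages = [12, 18, 18.5, 25, 40]
is_above = binarize(ages, threshold=18)
print(is_above)  # [0, 0, 1, 1, 1]
```

Note that a value exactly equal to the threshold maps to 0.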
Binning / Discretize Values
Converts continuous data into discrete bins or categories.
Best Situations to Use
Group continuous variables for histograms or categorical modeling.
Prepare features for classification or decision tree models.
Steps
Select a column with continuous values.
Open Transforms > ML > Binning / Discretize Values.
Specify the number of bins.
Click Submit.
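As a rough sketch of what binning does, the snippet below assigns each value to one of a fixed number of equal-width bins. The equal-width strategy is an assumption made for illustration; the guide only says that the number of bins is specified, and the transform may use a different strategy (such as quantile-based bins). The function name `equal_width_bins` is hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using equal-width bins."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would compute to index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical continuous column split into 3 bins of width 10.
prices = [0, 5, 10, 15, 20, 25, 30]
bins = equal_width_bins(prices, n_bins=3)
print(bins)  # [0, 0, 1, 1, 2, 2, 2]
```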
Expanding Window Transform
Overview
Creates features from cumulative aggregates over an expanding window: the value for each row is computed over all rows from the start of the column up to that row. Methods: Min, Max, Mean.
Best Situations to Use
Feature engineering for time series data.
Compute cumulative statistics across a column.
Steps
Select a numeric column.
Open Transforms > ML > Expanding Window Transform.
Choose method(s): Min, Max, Mean.
Click Submit.
New columns are added for each method applied (e.g., col1_Expanding_Min).
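A minimal sketch of an expanding window, assuming the window for each row covers every value from the first row up to and including that row. The helper names are hypothetical; only the output column naming (col1_Expanding_Min) follows the example above.

```python
def _mean(window):
    return sum(window) / len(window)


AGGREGATES = {"Min": min, "Max": max, "Mean": _mean}


def expanding(values, method):
    """Aggregate values[0..i] for every row i (the window grows by one row per step)."""
    aggregate = AGGREGATES[method]
    return [aggregate(values[: i + 1]) for i in range(len(values))]


col1 = [3, 1, 4, 1, 5]
new_columns = {
    f"col1_Expanding_{method}": expanding(col1, method)
    for method in ("Min", "Max", "Mean")
}
print(new_columns["col1_Expanding_Min"])  # [3, 1, 1, 1, 1]
print(new_columns["col1_Expanding_Max"])  # [3, 3, 4, 4, 5]
```

This version recomputes each window from scratch for clarity; a running minimum, maximum, and sum would avoid the repeated work on long columns.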
Feature Agglomeration
Overview
Reduces dimensionality by clustering correlated features into representative columns.
Best Situations to Use
Reduce feature redundancy in datasets with many correlated numeric columns.
Simplify datasets for modeling or analysis.
Steps
Select multiple numeric columns.
Open Transforms > ML > Feature Agglomeration.
Specify the number of clusters/samples.
Click Submit.
Label Encoding
Converts categorical string columns into numeric labels, starting from 0.
Best Situations to Use
Encode categorical variables for machine learning models that require numeric inputs.
Steps
Select a categorical column.
Open Transforms > ML > Label Encoding.
Click Submit.
Example: "Tall, Medium, Short, Tall" → 0, 1, 2, 0
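The example above is consistent with labels being assigned in order of first appearance. The sketch below makes that assumption explicit; the function name `label_encode` is hypothetical, and other tools may order labels differently (scikit-learn's `LabelEncoder`, for instance, sorts the categories first).

```python
def label_encode(values):
    """Assign integer labels starting from 0, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)  # next unused label
        encoded.append(codes[value])
    return encoded, codes


heights = ["Tall", "Medium", "Short", "Tall"]
encoded, mapping = label_encode(heights)
print(encoded)  # [0, 1, 2, 0]
print(mapping)  # {'Tall': 0, 'Medium': 1, 'Short': 2}
```

Keep in mind that the resulting integers carry no real ordering, so models that treat inputs as magnitudes may read meaning into them that is not there.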
Lag Transform
Shifts a time series column by a specified number of units (lags).
Best Situations to Use
Analyze temporal dependencies or trends.
Generate lagged features for forecasting models.
Steps
Select a numeric column.
Open Transforms > ML > Lag Transform.
Specify Lag value (number of units to shift).
Click Submit.
Example: Sales data shifted by 2 months → first two cells empty, remaining cells contain lagged values
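The shift in the example above can be sketched as follows. Empty cells are represented here as `None`, which is an assumption about how missing values appear; the function name `lag` and the sales figures are hypothetical.

```python
def lag(values, periods):
    """Shift values down by `periods` rows, leaving the first rows empty (None)."""
    if periods < 0:
        raise ValueError("periods must be non-negative")
    if periods == 0:
        return list(values)
    empty = [None] * min(periods, len(values))
    return empty + list(values[:-periods])


# Hypothetical monthly sales shifted by 2 months.
sales = [100, 120, 90, 150, 130]
sales_lag_2 = lag(sales, periods=2)
print(sales_lag_2)  # [None, None, 100, 120, 90]
```

Each row now holds the value observed two rows earlier, which is the form a forecasting model needs in order to use past values as inputs.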
Leave One Out Encoding
Encodes a categorical column using a target column while limiting target leakage: each row is encoded as the mean of the target values of the other rows in the same category, excluding the row itself.
Best Situations to Use
Prepare categorical variables for classification models.
Avoid overfitting or bias in target-based encoding.
Steps
Select a categorical column.
Open Transforms > ML > Leave One Out Encoding.
Select an integer column as the target variable.
Click Submit.
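Example:

| Category | Target | Encoded |
|----------|--------|---------|
| A        | 1      | 0.5     |
| B        | 0      | 0.5     |
| A        | 1      | 0.5     |
| B        | 1      | 0       |
| A        | 0      | 1       |
| B        | 0      | 0.5     |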
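The encoded values in the example above can be reproduced with a short sketch: for each row, take the sum of the target over its category, subtract the row's own target, and divide by the number of remaining rows. The function name is hypothetical, and returning `None` for a category that occurs only once is an assumption, since the guide does not say how that case is handled.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row as the mean target of the other rows in its category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    encoded = []
    for category, target in zip(categories, targets):
        others = counts[category] - 1
        if others == 0:
            encoded.append(None)  # assumption: no other rows to average over
        else:
            encoded.append((sums[category] - target) / others)
    return encoded


categories = ["A", "B", "A", "B", "A", "B"]
targets = [1, 0, 1, 1, 0, 0]
print(leave_one_out_encode(categories, targets))
# [0.5, 0.5, 0.5, 0.0, 1.0, 0.5]
```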
One-Hot Encoding
Converts a categorical column into binary indicator columns, where each unique category is represented as a separate column with 1 if present and 0 otherwise.
Best Situations to Use
Encode categorical variables for ML models requiring numeric input.
Prepare dummy variables for regression or classification tasks.
Steps
Select a categorical column.
Open Transforms > ML > One-Hot Encoding.
Provide New Column Name(s).
Click Submit.
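Example: Column "Color" with ["Red","Blue","Green","Red"] → Color_Red, Color_Blue, Color_Green.

| Color | Color_Red | Color_Blue | Color_Green |
|-------|-----------|------------|-------------|
| Red   | 1         | 0          | 0           |
| Blue  | 0         | 1          | 0           |
| Green | 0         | 0          | 1           |
| Red   | 1         | 0          | 0           |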
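The "Color" example above can be sketched in plain Python. The function name `one_hot` is hypothetical, and naming the new columns as prefix plus category is taken from the example rather than from any documented rule.

```python
def one_hot(values, prefix):
    """Build one 0/1 indicator column per unique category, in order of first appearance."""
    categories = list(dict.fromkeys(values))  # de-duplicate while preserving order
    return {
        f"{prefix}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


colors = ["Red", "Blue", "Green", "Red"]
indicators = one_hot(colors, prefix="Color")
for name, column in indicators.items():
    print(name, column)
# Color_Red [1, 0, 0, 1]
# Color_Blue [0, 1, 0, 0]
# Color_Green [0, 0, 1, 0]
```

Every row has exactly one 1 across the indicator columns, so a column with many unique categories produces an equally wide set of new columns.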
Principal Component Analysis (PCA)
Reduces dimensionality by transforming numerical columns into orthogonal principal components.
Best Situations to Use
Reduce correlated features for modeling or visualization.
Identify patterns in high-dimensional datasets.
Steps
Select multiple numerical columns.
Open Transforms > ML > Principal Component Analysis (PCA).
Specify Output Features (number of principal components).
Click Submit.
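To make the idea of orthogonal principal components concrete, here is a minimal sketch using NumPy (an assumption; the guide does not say which library the transform uses). It centers the columns, takes the singular value decomposition, and keeps the first `n_components` directions. The function name `pca` and the sample data are hypothetical.

```python
import numpy as np


def pca(X, n_components):
    """Project rows of X onto the top principal components of the centered data."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    scores = centered @ components.T
    explained_variance = singular_values[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance


# Two strongly correlated hypothetical columns reduced to one output feature.
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]
scores, components, explained_variance = pca(X, n_components=1)
print(scores.shape)  # (10, 1)
```

Because PCA is sensitive to column scale, columns measured in very different units are usually standardized first; whether the transform does this automatically is not stated here.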
Rolling Data
Applies rolling window computations on numeric columns, generating new features like Min, Max, or Mean over a moving window.
Best Situations to Use
Feature engineering for time series analysis.
Calculate moving averages or trends.
Steps
Select a numeric column.
Open Transforms > ML > Rolling Data.
Specify Window Size (≥2) and Method (Min, Max, Mean).
Click Submit.
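A minimal sketch of a rolling window follows. It assumes a trailing window (the current row plus the rows before it) and leaves the first rows empty until a full window is available; both are assumptions, as the guide does not specify alignment or how incomplete windows are handled. The helper names are hypothetical.

```python
def _mean(window):
    return sum(window) / len(window)


AGGREGATES = {"Min": min, "Max": max, "Mean": _mean}


def rolling(values, window, method):
    """Aggregate the last `window` values at each row; None until the window is full."""
    if window < 2:
        raise ValueError("window size must be at least 2")
    aggregate = AGGREGATES[method]
    return [
        aggregate(values[i - window + 1 : i + 1]) if i >= window - 1 else None
        for i in range(len(values))
    ]


readings = [10, 20, 30, 40, 50]
print(rolling(readings, window=3, method="Mean"))  # [None, None, 20.0, 30.0, 40.0]
print(rolling(readings, window=3, method="Max"))   # [None, None, 30, 40, 50]
```

Unlike the expanding window, the window here has a fixed size, so older values drop out as it moves down the column.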
Singular Value Decomposition (SVD)
Performs matrix decomposition to reduce dimensionality or extract latent features from numeric columns.
Best Situations to Use
Feature extraction for dimensionality reduction.
Reduce noise in datasets while retaining essential information.
Steps
Select multiple numeric columns.
Open Transforms > ML > Singular Value Decomposition (SVD).
Specify Latent Factors.
Click Submit.
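The sketch below illustrates truncated SVD with NumPy (an assumption about tooling): decompose the selected columns, keep the top `n_factors` singular directions, and use the projected rows as latent features. Unlike the PCA sketch, the data is not centered first. The function name `svd_reduce` and the sample matrix are hypothetical.

```python
import numpy as np


def svd_reduce(X, n_factors):
    """Return latent features from the top singular directions and the low-rank approximation."""
    X = np.asarray(X, dtype=float)
    U, singular_values, Vt = np.linalg.svd(X, full_matrices=False)
    latent = U[:, :n_factors] * singular_values[:n_factors]  # one row of factors per input row
    approximation = latent @ Vt[:n_factors]                  # best rank-n_factors reconstruction
    return latent, approximation


# Hypothetical columns that are exact multiples of each other (rank 1).
X = [[1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, 12]]
latent, approximation = svd_reduce(X, n_factors=1)
print(latent.shape)                # (4, 1)
print(np.allclose(approximation, X))  # True
```

Here a single latent factor reconstructs all three columns because they carry the same information; on real data, the discarded factors are where the noise reduction comes from.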
Target Encoding
Encodes categorical columns using the mean of a target numeric column per category.
Best Situations to Use
Encode categorical variables for regression or classification tasks.
Prevent overfitting by using target statistics cautiously.
Steps
Select a categorical column.
Open Transforms > ML > Target Encoding.
Specify the Target Column (numeric/integer).
Click Submit.
Example: Category "A" with target values [1,1,0] → mean = 0.667.
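A minimal sketch of plain mean target encoding, matching the example above. It applies no smoothing or regularization, which is an assumption; the transform may well apply some. The function name `target_encode` and the sample rows are hypothetical.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean of the target over all rows in that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    return [means[category] for category in categories], means


categories = ["A", "B", "A", "A", "B"]
targets = [1, 0, 1, 0, 1]
encoded, means = target_encode(categories, targets)
print(round(means["A"], 3))  # 0.667
print(means["B"])            # 0.5
```

Since every row's own target contributes to its encoded value, the encoding is normally fitted on training data only and then applied to validation or test data.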
Target-based Quantile Encoding
Encodes categorical columns using quantiles of a numeric target variable for regression tasks.
Best Situations to Use
Encode categorical variables for continuous target regression.
Capture the distribution of target values per category.
Steps
Select a categorical column.
Open Transforms > ML > Target-based Quantile Encoding.
Specify a numeric target column.
Click Submit.
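As an illustration, the sketch below encodes each category with a quantile of its target values, defaulting to the median. The quantile level `q`, the linear interpolation between neighboring values, and the absence of smoothing are all assumptions, since the guide does not specify them. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def quantile(sorted_values, q):
    """Quantile q in [0, 1] with linear interpolation between neighboring values."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def quantile_encode(categories, targets, q=0.5):
    """Replace each category with the q-quantile of the target within that category."""
    groups = defaultdict(list)
    for category, target in zip(categories, targets):
        groups[category].append(target)
    stats = {category: quantile(sorted(values), q) for category, values in groups.items()}
    return [stats[category] for category in categories], stats


cities = ["X", "Y", "X", "X", "Y", "Y", "X"]
prices = [100, 50, 300, 120, 70, 60, 110]
encoded, stats = quantile_encode(cities, prices, q=0.5)
print(stats)  # {'X': 115.0, 'Y': 60.0}
```

In this data the mean for category X would be 157.5, pulled up by the single value of 300, while the median stays at 115.0; that robustness to outliers is the usual reason for preferring a quantile over the mean for continuous targets.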
Weight of Evidence (WoE) Encoding
Encodes categorical variables for binary classification based on predictive power relative to the target variable.
Best Situations to Use
Feature engineering for binary classification tasks.
Assess the relationship between categories and the target variable.
Steps
Select a categorical column.
Open Transforms > ML > Weight of Evidence Encoding.
Select a binary target column.
Click Submit.
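One common definition of Weight of Evidence for a category is the natural log of its share of all events (target = 1) divided by its share of all non-events (target = 0). The sketch below uses that definition; the sign convention and the handling of categories with zero events or non-events vary between tools and are assumptions here, as is the function name `woe_encode`.

```python
import math
from collections import defaultdict


def woe_encode(categories, targets):
    """Encode each category as ln(share of events / share of non-events)."""
    events = defaultdict(int)
    non_events = defaultdict(int)
    for category, target in zip(categories, targets):
        if target == 1:
            events[category] += 1
        else:
            non_events[category] += 1
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())

    woe = {}
    for category in sorted(set(categories)):
        if events[category] == 0 or non_events[category] == 0:
            # Assumption: without smoothing, WoE is undefined for such a category.
            raise ValueError(f"category {category!r} needs both events and non-events")
        event_share = events[category] / total_events
        non_event_share = non_events[category] / total_non_events
        woe[category] = math.log(event_share / non_event_share)
    return [woe[category] for category in categories], woe


categories = ["A", "A", "A", "A", "B", "B", "B", "B"]
targets = [1, 1, 1, 0, 1, 0, 0, 0]
encoded, woe = woe_encode(categories, targets)
print({category: round(value, 4) for category, value in woe.items()})
# {'A': 1.0986, 'B': -1.0986}
```

Under this convention a positive value means the category is over-represented among events, and a value near zero means it says little about the target.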