Machine Learning (ML) Transforms

The ML transforms provide operations commonly used in feature engineering, preprocessing, and time-series analysis for machine learning workflows.

ML transforms restructure raw data into a form better suited to training machine learning models. They are part of the feature engineering and data preprocessing workflow and help make the data clean, consistent, and compatible with the assumptions of the chosen algorithm.

Binarizer

Converts a numerical column into binary values based on a threshold.

Values ≤ threshold → 0
Values > threshold → 1

Best Situations to Use

Prepare binary features for classification models.
Transform continuous variables into categorical binary variables.

Steps

Select a numerical column.
Open Transforms > ML > Binarizer.
Provide the Threshold value.
Click Submit.

A new column is added with 0 and 1 values.
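
The thresholding rule can be sketched in a few lines of Python. This is an illustration of the rule only, not the tool's implementation; the function name binarize and the sample values are hypothetical.

```python
def binarize(values, threshold):
    """Map each value to 1 if it is strictly greater than threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

ages = [12, 18, 25, 18.5, 40]  # hypothetical sample column
binary = binarize(ages, threshold=18)
print(binary)  # [0, 0, 1, 1, 1]
```

Note that a value exactly equal to the threshold (18 here) maps to 0.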

Binning / Discretize Values

Converts continuous data into discrete bins or categories.

Best Situations to Use

Group continuous variables for histograms or categorical modeling.
Prepare features for classification or decision tree models.

Steps

Select a column with continuous values.
Open Transforms > ML > Binning / Discretize Values.
Specify the number of bins.
Click Submit.

A new column is added representing the discretized categories.
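
The page does not say how bin edges are chosen. The sketch below assumes equal-width bins between the column's minimum and maximum, which is one common strategy; the function name equal_width_bins and the data are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 (equal-width bins assumed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # all values identical: a single bin
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

prices = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # hypothetical column
bins = equal_width_bins(prices, n_bins=4)
print(bins)  # [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3]
```

A quantile-based strategy (equal numbers of rows per bin) would give different bins for skewed data.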

Expanding Window Transform

Overview

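Creates features by computing cumulative aggregates over an expanding window of numeric data: the value for each row is calculated from the first row up to and including the current row. Methods: Min, Max, Mean.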

Best Situations to Use

Feature engineering for time series data.
Compute cumulative statistics across a column.

Steps

Select a numeric column.
Open Transforms > ML > Expanding Window Transform.
Choose method(s): Min, Max, Mean.
Click Submit.

New columns are added for each method applied (e.g., col1_Expanding_Min).
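
To make the expanding window concrete, here is a minimal Python sketch of the three methods. The function name expanding and the sample data are hypothetical; the variable names follow the col1_Expanding_Min naming pattern mentioned above.

```python
def expanding(values, method):
    """Return the cumulative Min, Max, or Mean of values, one result per row."""
    out, running_sum = [], 0.0
    for i, v in enumerate(values):
        running_sum += v
        if method == "Min":
            out.append(v if i == 0 else min(out[-1], v))
        elif method == "Max":
            out.append(v if i == 0 else max(out[-1], v))
        elif method == "Mean":
            out.append(running_sum / (i + 1))
        else:
            raise ValueError(f"unsupported method: {method}")
    return out

col1 = [4, 2, 6, 8]  # hypothetical column
col1_Expanding_Min = expanding(col1, "Min")    # [4, 2, 2, 2]
col1_Expanding_Max = expanding(col1, "Max")    # [4, 4, 6, 8]
col1_Expanding_Mean = expanding(col1, "Mean")  # [4.0, 3.0, 4.0, 5.0]
```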

Feature Agglomeration

Overview

Reduces dimensionality by clustering correlated features into representative columns.

Best Situations to Use

Reduce feature redundancy in datasets with many correlated numeric columns.
Simplify datasets for modeling or analysis.

Steps

Select multiple numeric columns.
Open Transforms > ML > Feature Agglomeration.
Specify the number of clusters/samples.
Click Submit.

New columns are created representing feature clusters.

Label Encoding

Converts categorical string columns into numeric labels, starting from 0.

Best Situations to Use

Encode categorical variables for machine learning models that require numeric inputs.

Steps

Select a categorical column.
Open Transforms > ML > Label Encoding.
Click Submit.

Original categories are replaced with numeric labels.

Example: "Tall, Medium, Short, Tall" → 0, 1, 2, 0
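
The example above is consistent with labels being assigned in order of first appearance. The sketch below makes that assumption explicit; the function name label_encode is hypothetical, and encoders that sort categories alphabetically would produce different numbers for the same input.

```python
def label_encode(values):
    """Assign integer labels starting at 0, in order of first appearance (assumed)."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return [mapping[v] for v in values], mapping

labels, mapping = label_encode(["Tall", "Medium", "Short", "Tall"])
print(labels)   # [0, 1, 2, 0]
print(mapping)  # {'Tall': 0, 'Medium': 1, 'Short': 2}
```

The numeric labels carry no inherent order, but some models will treat them as ordered values.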

Lag Transform

Shifts a time series column by a specified number of units (lags).

Best Situations to Use

Analyze temporal dependencies or trends.
Generate lagged features for forecasting models.

Steps

Select a numeric column.
Open Transforms > ML > Lag Transform.
Specify Lag value (number of units to shift).
Click Submit.

A new column is added with values shifted by the specified lag.

Example: Sales data shifted by 2 months → first two cells empty, remaining cells contain lagged values
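
The shift can be sketched as follows. The function name lag and the sales figures are hypothetical, and None stands in for the empty cells.

```python
def lag(values, periods):
    """Shift values down by `periods` rows; the vacated leading rows become None."""
    if periods < 0:
        raise ValueError("periods must be non-negative")
    keep = max(len(values) - periods, 0)  # number of original values that survive
    return [None] * (len(values) - keep) + list(values[:keep])

sales = [100, 120, 90, 150, 130]  # hypothetical monthly sales
sales_lag_2 = lag(sales, periods=2)
print(sales_lag_2)  # [None, None, 100, 120, 90]
```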

Leave One Out Encoding

Encodes categorical variables based on a target column while reducing target leakage. For each row, it computes the mean of the target values of the other rows in the same category, excluding the row itself.

Best Situations to Use

Prepare categorical variables for classification models.
Avoid overfitting or bias in target-based encoding.

Steps

Select a categorical column.
Open Transforms > ML > Leave One Out Encoding.
Select an integer column as the target variable.
Click Submit.

A new column is added containing the mean target values for each category, excluding the current row.
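
A small Python sketch of the calculation follows. The function name leave_one_out_encode and the data are hypothetical, and returning None for a category that occurs only once is an assumption; the tool may use a different fallback.

```python
from collections import defaultdict

def leave_one_out_encode(categories, targets):
    """For each row, mean target of the other rows that share its category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    encoded = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            encoded.append((sums[c] - t) / (counts[c] - 1))
        else:
            encoded.append(None)  # assumed fallback: no other rows to average
    return encoded

city = ["A", "A", "A", "B", "B"]  # hypothetical categorical column
bought = [1, 0, 1, 1, 0]          # hypothetical integer target
encoded = leave_one_out_encode(city, bought)
print(encoded)  # [0.5, 1.0, 0.5, 0.0, 1.0]
```

Rows in the same category can receive different values because each row's own target is left out of its mean.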

One-Hot Encoding

Converts a categorical column into binary indicator columns, where each unique category is represented as a separate column with 1 if present and 0 otherwise.

Best Situations to Use

Encode categorical variables for ML models requiring numeric input.
Prepare dummy variables for regression or classification tasks.

Steps

Select a categorical column.
Open Transforms > ML > One-Hot Encoding.
Provide New Column Name(s).
Click Submit.

One binary column is created for each unique category.

Example: Column "Color" with ["Red","Blue","Green","Red"] → Color_Red, Color_Blue, Color_Green.
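
The Color example can be worked through in Python. The function name one_hot is hypothetical, and the prefix_category column naming simply mirrors the example above.

```python
def one_hot(values, prefix):
    """Return one 0/1 indicator column per unique category."""
    categories = list(dict.fromkeys(values))  # unique values, first-appearance order
    return {
        f"{prefix}_{c}": [1 if v == c else 0 for v in values]
        for c in categories
    }

encoded = one_hot(["Red", "Blue", "Green", "Red"], prefix="Color")
for name, column in encoded.items():
    print(name, column)
# Color_Red [1, 0, 0, 1]
# Color_Blue [0, 1, 0, 0]
# Color_Green [0, 0, 1, 0]
```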

Principal Component Analysis (PCA)

Reduces dimensionality by transforming numerical columns into orthogonal principal components.

Best Situations to Use

Reduce correlated features for modeling or visualization.
Identify patterns in high-dimensional datasets.

Steps

Select multiple numerical columns.
Open Transforms > ML > Principal Component Analysis (PCA).
Specify Output Features (number of principal components).
Click Submit.

New columns representing principal components are added.
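
As a rough illustration of what the output columns contain, the sketch below computes principal components with NumPy by centering the data and taking its singular value decomposition. The function name pca and the data are hypothetical, and whether the tool also scales columns before projecting is not stated here, so this sketch only centers them.

```python
import numpy as np

def pca(X, n_components):
    """Project centered data onto its first n_components principal axes."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each column (no scaling: an assumption)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # rows of Vt are the principal axes

X = [
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.9],
    [2.2, 2.9, 0.8],
    [1.9, 2.2, 1.1],
    [3.1, 3.0, 0.3],
]  # hypothetical numeric columns
components = pca(X, n_components=2)
print(components.shape)  # (5, 2)
```

The resulting columns are uncorrelated with each other and ordered by decreasing variance.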

Rolling Data

Applies rolling window computations on numeric columns, generating new features like Min, Max, or Mean over a moving window.

Best Situations to Use

Feature engineering for time series analysis.
Calculate moving averages or trends.

Steps

Select a numeric column.
Open Transforms > ML > Rolling Data.
Specify Window Size (≥2) and Method (Min, Max, Mean).
Click Submit.

New columns are added for each method. The leading cells, which do not yet have a full window of values, are null.
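
A minimal sketch of a trailing rolling window is shown here. The function name rolling and the data are hypothetical, and the sketch assumes the window ends at the current row, so the first (window size - 1) cells are None.

```python
def rolling(values, window, method):
    """Apply Min, Max, or Mean over a trailing window ending at each row."""
    if window < 2:
        raise ValueError("window size must be at least 2")
    funcs = {"Min": min, "Max": max, "Mean": lambda w: sum(w) / len(w)}
    fn = funcs[method]
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough rows yet for a full window
        else:
            out.append(fn(values[i + 1 - window : i + 1]))
    return out

temps = [3, 5, 4, 8, 6]  # hypothetical column
rolling_min = rolling(temps, 3, "Min")    # [None, None, 3, 4, 4]
rolling_max = rolling(temps, 3, "Max")    # [None, None, 5, 8, 8]
rolling_mean = rolling(temps, 3, "Mean")  # first full window: (3 + 5 + 4) / 3 = 4.0
```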

Singular Value Decomposition (SVD)

Performs matrix decomposition to reduce dimensionality or extract latent features from numeric columns.

Best Situations to Use

Feature extraction for dimensionality reduction.
Reduce noise in datasets while retaining essential information.

Steps

Select multiple numeric columns.
Open Transforms > ML > Singular Value Decomposition (SVD).
Specify Latent Factors.
Click Submit.

New columns corresponding to latent factors are added.
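
One way to picture the latent-factor columns is a truncated SVD computed with NumPy, sketched below. The function name truncated_svd and the data are hypothetical, and the tool's exact scaling of the output columns is not documented here. Unlike the usual PCA procedure, this sketch does not center the data first.

```python
import numpy as np

def truncated_svd(X, n_factors):
    """Return the first n_factors latent features (U * S) of X."""
    X = np.asarray(X, dtype=float)
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_factors] * S[:n_factors]  # scale each factor by its singular value

X = [
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [1.0, 1.0, 5.0],
    [0.0, 2.0, 4.0],
]  # hypothetical numeric columns
latent = truncated_svd(X, n_factors=2)
print(latent.shape)  # (4, 2)
```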

Target Encoding

Encodes categorical columns using the mean of a target numeric column per category.

Best Situations to Use

Encode categorical variables for regression or classification tasks.
Use with care: the encoding is derived from the target, so it can overfit, especially for categories with few rows.

Steps

Select a categorical column.
Open Transforms > ML > Target Encoding.
Specify the Target Column (numeric/integer).
Click Submit.

A new column with target-encoded values is added.

Example: Category "A" with target values [1,1,0] → mean = 2/3 ≈ 0.667.
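
The per-category mean can be sketched in Python as follows. The function name target_encode and the category "B" rows are hypothetical; category "A" reuses the target values from the example above. No smoothing is applied, which may differ from what the tool does.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean of the target over that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories], means

category = ["A", "A", "A", "B", "B"]
target = [1, 1, 0, 0, 1]
encoded, means = target_encode(category, target)
print(round(means["A"], 3))  # 0.667
print(means["B"])            # 0.5
```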

Target-based Quantile Encoding

Encodes categorical columns using quantiles of a numeric target variable for regression tasks.

Best Situations to Use

Encode categorical variables for continuous target regression.
Capture the distribution of target values per category.

Steps

Select a categorical column.
Open Transforms > ML > Target-based Quantile Encoding.
Specify a numeric target column.
Click Submit.

A new column with quantile-based encoded values is added.
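
The page does not state which quantile is used. The sketch below assumes the median (quantile 0.5) with linear interpolation and no smoothing; the function names quantile and quantile_encode, the parameter q, and the data are all hypothetical.

```python
from collections import defaultdict

def quantile(sorted_vals, q):
    """Quantile of a sorted list, interpolating linearly between neighbors."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def quantile_encode(categories, targets, q=0.5):
    """Replace each category with the q-quantile of its target values."""
    groups = defaultdict(list)
    for c, t in zip(categories, targets):
        groups[c].append(t)
    per_category = {c: quantile(sorted(v), q) for c, v in groups.items()}
    return [per_category[c] for c in categories], per_category

region = ["N", "N", "N", "S", "S"]  # hypothetical categorical column
price = [100, 900, 200, 50, 70]     # hypothetical numeric target
encoded, per_category = quantile_encode(region, price)
print(per_category)  # {'N': 200.0, 'S': 60.0}
```

For region "N" the median (200) is much lower than the mean (400), which shows how a quantile is less sensitive to an outlier such as 900 than mean-based target encoding.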

Weight of Evidence (WoE) Encoding

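Encodes categorical variables for binary classification. Each category is replaced by the logarithm of the ratio between its share of one target class and its share of the other, which measures how strongly the category separates the two classes.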

Best Situations to Use

Feature engineering for binary classification tasks.
Assess the relationship between categories and the target variable.

Steps

Select a categorical column.
Open Transforms > ML > Weight of Evidence Encoding.
Select a binary target column.
Click Submit.

A new column with WoE values for each category is added.
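
A minimal sketch of one common WoE formulation is given below: ln(share of events / share of non-events) per category. The sign convention (some definitions invert the ratio), the additive smoothing term that avoids division by zero, the function name woe_encode, and the data are all assumptions made for illustration.

```python
import math
from collections import Counter

def woe_encode(categories, targets, smoothing=0.5):
    """WoE per category: ln(share of events / share of non-events)."""
    events = Counter(c for c, t in zip(categories, targets) if t == 1)
    non_events = Counter(c for c, t in zip(categories, targets) if t == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())
    unique = set(categories)
    k = len(unique)
    woe = {}
    for c in unique:
        # Additive smoothing (assumed) keeps both shares strictly positive.
        p_event = (events[c] + smoothing) / (total_events + smoothing * k)
        p_non = (non_events[c] + smoothing) / (total_non_events + smoothing * k)
        woe[c] = math.log(p_event / p_non)
    return [woe[c] for c in categories], woe

segment = ["A", "A", "A", "A", "B", "B", "B", "B"]  # hypothetical column
churned = [1, 1, 1, 0, 1, 0, 0, 0]                  # hypothetical binary target
encoded, woe = woe_encode(segment, churned)
print(round(woe["A"], 3), round(woe["B"], 3))  # 0.847 -0.847
```

Under this convention a positive value means the category is over-represented among rows where the target is 1.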
