Machine Learning (ML) Transforms

The ML transforms provide operations commonly used in feature engineering, preprocessing, and time-series analysis for machine learning workflows.

ML transforms restructure raw data into a form better suited to training machine learning models. They are a core part of the feature engineering and data preprocessing workflow, helping ensure the data is clean, consistent, and compatible with the assumptions of the chosen algorithm.

Binarizer

Converts a numerical column into binary values based on a threshold.

  • Values ≤ threshold → 0

  • Values > threshold → 1

Best Situations to Use

  • Prepare binary features for classification models.

  • Transform continuous variables into categorical binary variables.

Steps

  1. Select a numerical column.

  2. Open Transforms > ML > Binarizer.

  3. Provide the Threshold value.

  4. Click Submit.
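The threshold rule above can be sketched in a few lines of plain Python. This is only an illustration of the rule as documented (values ≤ threshold become 0, values > threshold become 1); the function name `binarize` and the sample data are hypothetical and are not part of the product.

```python
def binarize(values, threshold):
    """Return 1 for values strictly above the threshold, otherwise 0."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical sample column and threshold.
ages = [12, 18, 25, 18.5, 40]
print(binarize(ages, threshold=18))  # [0, 0, 1, 1, 1]
```

Note that a value exactly equal to the threshold maps to 0, matching the rule stated above.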

Binning / Discretize Values

Converts continuous data into discrete bins or categories.

Best Situations to Use

  • Group continuous variables for histograms or categorical modeling.

  • Prepare features for classification or decision tree models.

Steps

  1. Select a column with continuous values.

  2. Open Transforms > ML > Binning / Discretize Values.

  3. Specify the number of bins.

  4. Click Submit.
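As a rough illustration of what binning does, the sketch below assigns each value to one of a fixed number of equal-width bins. The binning strategy used by the transform is not stated here, so equal-width binning is an assumption (quantile-based bins are another common choice), and the function name `equal_width_bins` is hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 (equal-width bins, assumed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would compute to index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical sample column.
prices = [0, 5, 10, 15, 20, 25, 30]
print(equal_width_bins(prices, n_bins=3))  # [0, 0, 1, 1, 2, 2, 2]
```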

Expanding Window Transform

Overview

Creates features by computing cumulative aggregates over an expanding window of numeric data: the value for each row is computed from all rows up to and including that row. Methods: Min, Max, Mean.

Best Situations to Use

  • Feature engineering for time series data.

  • Compute cumulative statistics across a column.

Steps

  1. Select a numeric column.

  2. Open Transforms > ML > Expanding Window Transform.

  3. Choose method(s): Min, Max, Mean.

  4. Click Submit.

New columns are added for each method applied (e.g., col1_Expanding_Min).
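A minimal sketch of an expanding-window computation follows, assuming the window for each row covers every row from the start of the column up to and including that row. The helper name `expanding` and the sample data are hypothetical; the `col1_Expanding_Max` and `col1_Expanding_Mean` column names simply extend the naming pattern shown above and may not match the product exactly.

```python
def expanding(values, method):
    """Aggregate values[0..i] for every position i using Min, Max, or Mean."""
    aggregations = {
        "Min": min,
        "Max": max,
        "Mean": lambda window: sum(window) / len(window),
    }
    aggregate = aggregations[method]
    return [aggregate(values[: i + 1]) for i in range(len(values))]


# Hypothetical sample column named col1.
col1 = [4, 2, 6, 8]
features = {
    f"col1_Expanding_{method}": expanding(col1, method)
    for method in ("Min", "Max", "Mean")
}
for name, column in features.items():
    print(name, column)
```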

Feature Agglomeration

Overview

Reduces dimensionality by clustering correlated features into representative columns.

Best Situations to Use

  • Reduce feature redundancy in datasets with many correlated numeric columns.

  • Simplify datasets for modeling or analysis.

Steps

  1. Select multiple numeric columns.

  2. Open Transforms > ML > Feature Agglomeration.

  3. Specify the number of clusters/samples.

  4. Click Submit.

Label Encoding

Converts categorical string columns into numeric labels, starting from 0.

Best Situations to Use

  • Encode categorical variables for machine learning models that require numeric inputs.

Steps

  1. Select a categorical column.

  2. Open Transforms > ML > Label Encoding.

  3. Click Submit.

Example: "Tall, Medium, Short, Tall" → 0, 1, 2, 0
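The example above suggests that labels are assigned in order of first appearance (Tall is 0, Medium is 1, Short is 2). The sketch below reproduces that behaviour under this assumption; some libraries assign labels in sorted order instead, which would give different numbers. The function name `label_encode` is hypothetical.

```python
def label_encode(values):
    """Map each distinct value to an integer, numbered from 0 in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


encoded, codes = label_encode(["Tall", "Medium", "Short", "Tall"])
print(encoded)  # [0, 1, 2, 0]
print(codes)    # {'Tall': 0, 'Medium': 1, 'Short': 2}
```

Because the integers imply an ordering that may not exist in the data, label encoding tends to suit tree-based models or genuinely ordinal categories better than linear models.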

Lag Transform

Shifts a time series column by a specified number of units (lags).

Best Situations to Use

  • Analyze temporal dependencies or trends.

  • Generate lagged features for forecasting models.

Steps

  1. Select a numeric column.

  2. Open Transforms > ML > Lag Transform.

  3. Specify Lag value (number of units to shift).

  4. Click Submit.

Example: Sales data shifted by 2 months → first two cells empty, remaining cells contain lagged values
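The shift described above can be sketched as follows. Empty cells are represented as `None` here, which is an assumption about how missing values appear; the function name `lag` and the sales figures are hypothetical.

```python
def lag(values, periods):
    """Shift values down by `periods` positions, filling the vacated cells with None."""
    if periods < 0:
        raise ValueError("periods must be zero or positive")
    padded = [None] * periods + list(values)
    # Truncate so the output has the same length as the input.
    return padded[: len(values)]


# Hypothetical monthly sales figures.
sales = [100, 120, 90, 150, 130]
print(lag(sales, periods=2))  # [None, None, 100, 120, 90]
```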

Leave One Out Encoding

Encodes a categorical variable using a target column while reducing target leakage: each row receives the mean of the target values for its category, computed with that row excluded.

Best Situations to Use

  • Prepare categorical variables for classification models.

  • Avoid overfitting or bias in target-based encoding.

Steps

  1. Select a categorical column.

  2. Open Transforms > ML > Leave One Out Encoding.

  3. Select an integer column as the target variable.

  4. Click Submit.

Example:

category   target   Result
A          1        0.5
B          0        0.5
A          1        0.5
B          1        0
A          0        1
B          0        0.5
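The example values can be reproduced with a short sketch: for each row, subtract that row's target from the category total and divide by the category count minus one. The function name `leave_one_out_encode` is hypothetical, and returning `None` for a category that occurs only once is an assumption, since that case is not described here.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row as the mean target of its category, excluding the row itself."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    encoded = []
    for category, target in zip(categories, targets):
        if counts[category] > 1:
            encoded.append((sums[category] - target) / (counts[category] - 1))
        else:
            # Assumption: a category seen only once has no other rows to average.
            encoded.append(None)
    return encoded


categories = ["A", "B", "A", "B", "A", "B"]
targets = [1, 0, 1, 1, 0, 0]
print(leave_one_out_encode(categories, targets))  # [0.5, 0.5, 0.5, 0.0, 1.0, 0.5]
```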



One-Hot Encoding

Converts a categorical column into binary indicator columns, where each unique category is represented as a separate column with 1 if present and 0 otherwise.

Best Situations to Use

  • Encode categorical variables for ML models requiring numeric input.

  • Prepare dummy variables for regression or classification tasks.

Steps

  1. Select a categorical column.

  2. Open Transforms > ML > One-Hot Encoding.

  3. Provide New Column Name(s).

  4. Click Submit.

Example: Column "Color" with ["Red","Blue","Green","Red"] → Color_Red, Color_Blue, Color_Green.

Color   Color_Red   Color_Blue   Color_Green
Red     1           0            0
Blue    0           1            0
Green   0           0            1
Red     1           0            0
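A minimal sketch of the encoding in the example above, assuming indicator columns are named by joining the source column name and the category with an underscore, as in `Color_Red`. The function name `one_hot_encode` is hypothetical.

```python
def one_hot_encode(values, prefix):
    """Return one 0/1 indicator column per distinct value, keyed as prefix_value."""
    # dict.fromkeys keeps the distinct values in order of first appearance.
    categories = list(dict.fromkeys(values))
    return {
        f"{prefix}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


columns = one_hot_encode(["Red", "Blue", "Green", "Red"], prefix="Color")
for name, column in columns.items():
    print(name, column)
# Color_Red [1, 0, 0, 1]
# Color_Blue [0, 1, 0, 0]
# Color_Green [0, 0, 1, 0]
```

Each printed list is one output column; reading across the three lists at the same position gives one row of the example.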

Principal Component Analysis (PCA)

Reduces dimensionality by transforming numerical columns into orthogonal principal components.

Best Situations to Use

  • Reduce correlated features for modeling or visualization.

  • Identify patterns in high-dimensional datasets.

Steps

  1. Select multiple numerical columns.

  2. Open Transforms > ML > Principal Component Analysis (PCA).

  3. Specify Output Features (number of principal components).

  4. Click Submit.

Rolling Data

Applies rolling window computations on numeric columns, generating new features like Min, Max, or Mean over a moving window.

Best Situations to Use

  • Feature engineering for time series analysis.

  • Calculate moving averages or trends.

Steps

  1. Select a numeric column.

  2. Open Transforms > ML > Rolling Data.

  3. Specify Window Size (≥2) and Method (Min, Max, Mean).

  4. Click Submit.
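The rolling computation can be sketched as below. How the transform treats the first rows, where a full window is not yet available, is not stated here; this sketch assumes they are left empty (`None`). The function name `rolling_window` and the sample data are hypothetical.

```python
def rolling_window(values, window, method):
    """Aggregate the last `window` values at each position using Min, Max, or Mean."""
    if window < 2:
        raise ValueError("window size must be at least 2")
    aggregations = {
        "Min": min,
        "Max": max,
        "Mean": lambda chunk: sum(chunk) / len(chunk),
    }
    aggregate = aggregations[method]
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            # Assumption: positions without a full window stay empty.
            result.append(None)
        else:
            result.append(aggregate(values[i + 1 - window : i + 1]))
    return result


# Hypothetical sample column.
readings = [3, 6, 9, 12, 15]
print(rolling_window(readings, window=3, method="Mean"))  # [None, None, 6.0, 9.0, 12.0]
print(rolling_window(readings, window=2, method="Min"))   # [None, 3, 6, 9, 12]
```

Unlike the expanding window, which grows with every row, the rolling window keeps a fixed size and drops the oldest value as it moves.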

Singular Value Decomposition (SVD)

Performs matrix decomposition to reduce dimensionality or extract latent features from numeric columns.

Best Situations to Use

  • Feature extraction for dimensionality reduction.

  • Reduce noise in datasets while retaining essential information.

Steps

  1. Select multiple numeric columns.

  2. Open Transforms > ML > Singular Value Decomposition (SVD).

  3. Specify Latent Factors.

  4. Click Submit.

Target Encoding

Encodes categorical columns using the mean of a target numeric column per category.

Best Situations to Use

  • Encode categorical variables for regression or classification tasks.

  • Prevent overfitting by using target statistics cautiously.

Steps

  1. Select a categorical column.

  2. Open Transforms > ML > Target Encoding.

  3. Specify the Target Column (numeric/integer).

  4. Click Submit.

Example: Category "A" with target values [1,1,0] → mean = 2/3 ≈ 0.667.
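The per-category mean can be sketched as follows. This is a plain, unsmoothed mean; whether the transform applies any smoothing or regularization is not stated here, so none is assumed. The function name `target_encode` and the category "B" rows are hypothetical.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target value observed for that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    return [means[category] for category in categories], means


categories = ["A", "A", "A", "B", "B"]
targets = [1, 1, 0, 0, 1]
encoded, means = target_encode(categories, targets)
print(round(means["A"], 3))  # 0.667
print(means["B"])            # 0.5
```

Because every row's own target contributes to its encoded value, this encoding can leak target information into the features; Leave One Out Encoding is one way to reduce that effect.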

Target-based Quantile Encoding

Encodes categorical columns using quantiles of a numeric target variable for regression tasks.

Best Situations to Use

  • Encode categorical variables for continuous target regression.

  • Capture the distribution of target values per category.

Steps

  1. Select a categorical column.

  2. Open Transforms > ML > Target-based Quantile Encoding.

  3. Specify a numeric target column.

  4. Click Submit.
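The idea can be sketched by replacing each category with a quantile of its target values. Which quantile the transform uses is not stated here, so the median (`q=0.5`) is an assumed default, and linear interpolation between neighbouring values is likewise an assumption. The function names `quantile` and `quantile_encode` and the sample data are hypothetical.

```python
from collections import defaultdict


def quantile(sorted_values, q):
    """Quantile of pre-sorted values using linear interpolation between neighbours."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def quantile_encode(categories, targets, q=0.5):
    """Replace each category with the q-th quantile of its target values."""
    groups = defaultdict(list)
    for category, target in zip(categories, targets):
        groups[category].append(target)
    per_category = {c: quantile(sorted(v), q) for c, v in groups.items()}
    return [per_category[c] for c in categories], per_category


categories = ["A", "A", "A", "B", "B", "B"]
targets = [10, 20, 90, 5, 7, 9]
encoded, per_category = quantile_encode(categories, targets)
print(per_category)  # {'A': 20.0, 'B': 7.0}
```

In this sample the mean for category A would be 40, pulled up by the single value 90, while the median stays at 20; this robustness to outliers is the usual reason to prefer a quantile over the mean.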

Weight of Evidence (WoE) Encoding

Encodes categorical variables for binary classification based on predictive power relative to the target variable.

Best Situations to Use

  • Feature engineering for binary classification tasks.

  • Assess the relationship between categories and the target variable.

Steps

  1. Select a categorical column.

  2. Open Transforms > ML > Weight of Evidence Encoding.

  3. Select a binary target column.

  4. Click Submit.
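A common definition of Weight of Evidence for a category is the natural log of the ratio between the category's share of all events (target = 1) and its share of all non-events (target = 0). The sketch below uses that definition; the exact formula, sign convention, and handling of empty groups in the transform are not stated here, so all three are assumptions. The function name `woe_encode` and the sample data are hypothetical.

```python
import math
from collections import Counter


def woe_encode(categories, targets):
    """Encode categories as ln(share of events / share of non-events) for a 0/1 target."""
    events = Counter(c for c, t in zip(categories, targets) if t == 1)
    non_events = Counter(c for c, t in zip(categories, targets) if t == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())

    woe = {}
    for category in dict.fromkeys(categories):
        if events[category] == 0 or non_events[category] == 0:
            # Assumption: no smoothing is applied, so such categories are rejected.
            raise ValueError(f"category {category!r} needs at least one event and one non-event")
        event_share = events[category] / total_events
        non_event_share = non_events[category] / total_non_events
        woe[category] = math.log(event_share / non_event_share)
    return [woe[c] for c in categories], woe


categories = ["A", "A", "A", "B", "B", "B"]
targets = [1, 1, 0, 0, 0, 1]
encoded, woe = woe_encode(categories, targets)
print(woe)
```

With this sign convention, a positive value means the category is over-represented among events and a negative value means it is over-represented among non-events. In the sample, A and B get values of equal magnitude and opposite sign.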
