ML
The Binarizer transform details are provided under this section.
Last updated
The Binarizer transform details are provided under this section.
Last updated
It converts the value of a numerical column to zero when the value in the column is less than or equal to the threshold value and one if the value in the column is greater than the threshold value.
Check out the given illustration on how to apply the Binarizer transform.
Steps to apply Binarizer transform:
Navigate to the Data Preparation landing page with the selected dataset.
Open the Transforms tab.
Select the Binarizer transform from the ML category of transforms.
The Binarizer dialog box opens.
Provide a Threshold value.
Click the Submit option.
The Dataset gets a new column with the 1 and 0 values by comparing the actual values with the set threshold limit.
Binning, or discretization, involves converting continuous data into distinct categories or values. This is commonly done to simplify data analysis, create histogram bins, or prepare data for certain machine-learning algorithms. Here are the steps to perform this transformation:
Check out the illustration on the Binning/ Dicretize Values transform.
Navigate to the Data Preparation landing page.
Select a column containing the continuous data you want to bin.
Open the Transforms section to get the list of available transforms.
Click the Binning/ Discretize Values transform method from the ML section.
The Binning/ Discretize Values dialog box opens.
Set the number of the Bins.
Click the Submit option.
The result will be displayed as a new column representing the binned or discretized values of the original continuous data. These steps help you to effectively transform continuous data into discrete categories for further analysis or use in machine learning algorithms.
Expanding Window Transform is a common technique used in time series analysis and machine learning for feature engineering. It involves creating new features based on rolling statistics or aggregates calculated over expanding windows of historical data. Here are the steps to perform this transformation:
Check out the illustration on the Expanding Window transform.
Navigate to the Data Preparation landing page.
Choose a column containing numeric (integer or float) data that you want to transform using the expanding window method.
Open the Transforms tab.
Choose the Expanding Window Transform option from the available transformations.
The Expanding Window Transform dialog box opens.
Select Method (Min, Max, Mean): Select the method you want to apply for calculation within the expanding window. Options typically include Minimum (Min), Maximum (Max), and Mean. Users can select multiple columns.
Execute the expanding window transformation with the chosen column and method(s) by clicking the Submit option.
The output will be generated as follows:
If multiple methods are selected, new columns will be created with names indicating the method used. For example, if three methods are selected for a column named 'col1', the resulting columns will be named 'col1_Expanding_Min', 'col1_Expanding_Max', and 'col1_Expanding_Mean'.
col1_Expanding_Min: Compares each value to the smallest value from the column and updates the result. The minimum value will always be the least value from the column.
col2_Expanding_Max: Compares each value to the first cell (smallest value) and updates it if a higher value is encountered.
col1_Expanding_Mean: Calculates the mean by adding each value to the first cell value and dividing by the number of elements encountered so far in the expanding window.
Feature Agglomeration is indeed used in machine learning and dimensionality reduction to combine correlated features into a smaller set of representative features. It's beneficial when dealing with datasets containing a large number of features, some of which may be redundant or highly correlated with each other.
Check out the illustration on the Feature Agglomeration transform.
Here are the steps to perform the transformation:
Navigate to the Data Preparation workspace.
Open the Transforms tab.
Select the Feature Agglomeration transform from the ML section.
The Feature Agglomeration dialog opens.
Choose multiple numerical columns from your dataset.
Set the samples.
Click the Submit option.
The output will contain the transformed features, where the number of resulting columns will be equal to the number of clusters specified or determined by the algorithm.
Each column will represent a cluster, a combination of the original features. The clusters are formed based on the similarity or correlation between features.
If the selected numerical columns are 3 and the sample size is 2, the resulting output will have 2 columns labeled cluster_1 and cluster_2, respectively, representing the two clusters obtained from the Feature Agglomeration transformation.
Label Encoding is a technique used to convert categorical columns into numerical ones, enabling them to be utilized by machine learning models that only accept numerical data. It's a crucial pre-processing step in many machine-learning projects.
Check out the illustration on the Label Encoding transform.
Here are the steps to perform Label Encoding:
Navigate to the Data Preparation workspace.
Select a column containing string or categorical data from the Data Grid display using the Data Preparation workspace.
Open the Transforms tab.
Choose the Label Encoding transform.
A new column is generated, replacing the categorical values with numerical ones.
These numerical values are typically assigned in ascending order starting from 0. Each unique category in the original column is mapped to a unique numerical value.
For example:
If a column contains the categories "Tall," "Medium," "Short," and "Tall," after applying Label Encoding, it will show the result as 0, 1, 2, and 0, respectively. Each unique category is assigned a distinct numerical value based on its position in the encoding scheme.
The lag transformation involves shifting or delaying a time series by a certain number of time units (lags). This transformation is commonly used in time series analysis to study patterns, trends, or dependencies over time.
Check out the illustration on the Lag Transform method.
Here are the steps to perform a lag transformation:
Navigate to the Data Preparation workspace.
Open the Transforms tab.
Select the Lag Transform from the ML category.
The Lag Transform dialog box opens.
Choose a column with numeric values.
Update the Lag parameter to specify the number of time units to shift or delay the time series. Provide a number to the Lag field. The Lag value should be 1 or more.
Click the Submit option to submit the transformation.
After applying the lag transformation, the result will be updated with a new column.
This new column represents the original data shifted by the specified lag.
The first few cells in the new column will be empty as they correspond to the lag period specified.
The subsequent cells will contain the values of the original time series data shifted accordingly.
For example, if we have simple time series data representing the monthly sales of a product over a year with a lag of 2, the first two cells in the new column will be empty, and the subsequent cells will contain the sales data shifted by two months.
Month
Sales
Sales_lag_2
Jan
100
Feb
120
Mar
90
100
April
60
120
May
178
90
June
298
60
The Leave One Out Encoding transform encodes categorical variables in a dataset based on the target variable while avoiding data leakage. It's useful for classification tasks where you want to encode categorical variables without introducing bias or overfitting to the training data.
Check out the illustration on the Leave One Out Encoding transform.
Here are the steps to perform the Leave One Out Encoding transformation:
Navigate to the Data Preparation workspace.
Select a string column for which the transformation is applied. This column should contain categorical variables.
Open the Transforms tab.
Choose the Leave One Out Encoding transformation from the ML section.
The Leave One Out Encoding dialog box appears.
Select an integer column representing the target value used to calculate the mean for category values. This column is usually associated with the target variable in your dataset.
Submit the transformation by using the Submit option.
After applying the Leave One Out Encoding transformation, the result will be displayed as a new column.
This new column will contain the mean values of the occurrences for each record in the selected categorical column, excluding the target value in that record.
This encoding method helps to encode categorical variables based on the target variable while avoiding data leakage, making it particularly useful for classification tasks where you want to encode categorical variables without introducing bias or overfitting to the training data. Refer to the following image as an example:
category
target
Result
A
1
0.5
B
0
0.5
A
1
0.5
B
1
0
A
0
1
B
0
0.5
One-Hot Encoding/ Convert Value to Column is a data preparation technique used to convert categorical variables into a binary format, making them suitable for machine learning algorithms that require numerical input. It creates binary columns for each category in the original data, where each column represents one category and has a value of 1 if the category is present in the original data and 0 otherwise.
Check out the illustration on applying the One Hot Encoding transform.
Here are the steps to perform One-Hot Encoding:
Navigate to the Data Preparation workspace.
Select Categorical Column: Choose the categorical column(s) from the dataset to be encoded. These columns typically contain string or categorical values.
Open the Transforms section.
Use the One-Hot Encoding transformation to convert the selected categorical column(s) into a binary format. Click the One-Hot Encoding transform.
Result Interpretation: The output will be a set of new binary columns, each representing a category in the original categorical column. For each row in the dataset, the value in the corresponding binary column will be 1 if the category is present in that row, and 0 otherwise.
Example: Suppose you have a dataset with a categorical column "Color" containing the following values: "Red", "Blue", "Green", and "Red".
Original Dataset:
Color
Red
Blue
Green
Red
After applying One-Hot Encoding:
Each row represents a category from the original column, and the presence of that category is indicated by a value of 1 in the corresponding binary column. For instance, the first row has "Red" in the original column, hence "Color_Red" is 1, while the others are 0. Likewise "Color_Blue" and "Color_Green" are displayed.
1
0
0
0
1
0
0
0
1
1
0
0
Principal Component Analysis (PCA) is a dimensionality reduction technique for identifying patterns in data. It involves expressing the data as a new set of orthogonal (uncorrelated) variables called principal components. PCA is widely used in various fields, such as data analysis, pattern recognition, and machine learning.
Check out the illustration on the Principal Component Analysis transform.
Here are the steps to perform Principal Component Analysis (PCA):
Navigate to the Data Preparation workspace.
Open the Transforms tab.
Select the Principal Component Analysis transform from the ML category.
The Principal Component Analysis dialog window opens.
Select multiple numerical columns by using the given checkboxes.
The selected columns are displayed separated by commas.
Output Features: Update output features by providing a number based on the number provided for this field, the result columns are inserted in the data set.
Click the Submit option.
Here's an illustration to explain the Principal Component Analysis:
Suppose we have a dataset with two numerical variables, "Height" and "Weight", and we want to perform PCA on this dataset.
Original Dataset:
170
65
165
60
180
70
160
55
After standardization:
0.44
0.50
-0.22
-0.50
1.33
1.00
-1.56
-1.00
Output features the result column(s) based on the provided update.
Please Note: The selected Output Feature for the chosen dataset is 1, therefore in the above given image one column has been inserted displaying the result values. Multiple columns can be added to the dataset if the Output Features field is set with more than one number.
The Rolling Data transform is used in time series analysis and feature engineering. It involves creating new features by applying transformations to rolling windows of the original data. These rolling windows move through the time series data, and at each step, summary statistics or other transformations are calculated within the window.
The newly created columns are appended to the dataset, providing additional insights into the trends and patterns within the data. The rolling window transform can be useful for time series analysis, identifying peaks, trends, or other statistical patterns over time. It can also be employed to smooth out fluctuations and better understand the underlying structure of the data.
Check out the illustration on the rolling data transform.
Here are the steps to perform the Rolling Data transform:
Navigate to the Data Preparation workspace.
Select a numeric column (int/float) from your dataset.
Open the Transforms tab.
Select the Rolling Data transform.
Update the Window size. Specify the size of the rolling window. This determines the number of consecutive data points included in each window. The window size should be a numeric value of 2 or larger number.
Select a Method from the given choices (Min, Max, Mean). Users can choose all the methods and apply them to the selected column.
After selecting the Window size and Method, click the Submit option to apply rolling window transformation.
Please Note: Window Size can be updated by any numeric values which must be 2 or larger.
The result columns will be added based on the no. of the selected methods while applying the Rolling Data. The following image displays result columns with mean, min, and max rolling data for the Bonus column.
Please Note: The first cell in each result column is null because there are no previous cells with which to calculate the summary statistic within the initial window. The no. of empty/ null cells in the result columns will be 1 digit less than the selected Window Size number.
The Singular Value Decomposition transform is a powerful linear algebra technique that decomposes a matrix into three other matrices, which can be useful for various tasks, including data compression, noise reduction, and feature extraction. In the context of transformations for data analysis, Singular Value Decomposition (SVD) can be used as a technique for dimensionality reduction or feature extraction. It works by breaking down a matrix into three constituent matrices, representing the original data in a lower-dimensional space.
Here are the steps to perform the Singular Value Decomposition transform:
Navigate to the Data Preparation workspace.
Select a column from the dataset.
Open the Transforms tab.
Select the Singular Value Decomposition transform using the ML category.
The Singular Value Decomposition window opens.
Select multiple numeric types of columns from the dataset using the drop-down menu.
Update the Latent Factors.
Click the Submit option.
The result should be based on the latent factor update size. For example, if it's 2 the result column will be 2.
Target Encoding, also known as Mean Encoding or Likelihood Encoding, is a method for encoding categorical variables based on the target variable(or another summary statistic) for each category. It replaces categorical values with the mean of the target variable for each category. This encoding method is widely used in predictive modeling tasks, especially in classification problems, to convert categorical variables into a numerical format that can be used as input to machine learning algorithms.
Here are the steps to perform the Target Encoding transform:
Navigate to the Data Preparation workspace.
Select a category (string) column type for the transformation.
Open the Transforms tab.
Select the Target Encoding transformation from the ML category.
The Target Encoding dialog box opens.
Select the Target Column using the drop-down option (it should be a numeric/integer column).
Click the Submit option.
The result will be displayed in a new column with the encoded mean values for each category value in the selected column will be displayed.
E.g.,
Category
Target
Result
A
1
0.5257
B
0
0.4247
A
1
0.5257
B
1
0.4247
A
0
0.5257
B
0
0.4247
Target-based Quantile Encoding is particularly useful for regression problems where the target variable is continuous. It helps encode categorical variables in a dataset based on the distribution of the target variable within each category, potentially improving the predictive performance of regression models.
Here are the steps to perform the Target-based Quantile Encoding transform:
Navigate to the Data Preparation workspace.
Select a string column from the dataset on which the Target-based Quantile Encoding can be applied.
Open the Transforms tab.
Select the Target-based Quantile Encoding transformation from the ML category.
The Target-based Quantile Encoding dialog box opens.
Select an integer (numeric) column from the dataset.
Click the Submit option.
The result will be a new encoded column for each value in the selected column.
E.g.,
A
1
0.875
B
0
0.125
A
1
0.875
B
0
0.125
A
1
0.875
B
0
0.125
The Weight of Evidence Encoding is used in binary classification problems to encode categorical variables based on their predictive power to the target variable. It measures the strength of the relationship between a categorical variable and the target variable by examining the distribution of the target variable across different categories.
Check out the illustration on the Weight of Evidence Encoding transform.
Here are the steps to perform the Weight of Evidence Encoding transform:
Navigate to the Data Preparation workspace.
Select a categorical column.
Click on the Transforms tab.
Select the Weight of Evidence Encoding transform from the Transforms tab.
The Weight of Evidence Encoding window opens.
Select a target column with Binary Variables (like true/false, 0/1). For example, the target_value is selected.
Click the Submit option.
The result will be a new column with the distribution of the target variable across different categories.