# ML

## Binarizer

The Binarizer transform converts the values of a numerical column to binary values: 0 when the value is less than or equal to the threshold, and 1 when the value is greater than the threshold.

{% hint style="success" %}
*Check out the given illustration on how to apply the Binarizer transform.*
{% endhint %}

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FMFInoyOhWnPG5CK2N5Pj%2FBianarize_ML.mp4?alt=media&token=7fb7979a-9dbc-4414-885f-41ccad77f0a5>" %}
Applying Binarizer Transform
{% endembed %}

Steps to apply Binarizer transform:

* Navigate to the Data Preparation landing page with the selected dataset.
* Open the ***Transforms*** tab.
* Select the ***Binarizer*** transform from the ML category of transforms.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FP9ctZgZF8Bf8PFHPtCHr%2Fimage.png?alt=media&#x26;token=19dd0092-dc81-4e5b-9d50-68a805705230" alt=""><figcaption></figcaption></figure>
* The ***Binarizer*** dialog box opens.
* Provide a ***Threshold*** value.
* Click the ***Submit*** option.&#x20;

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FO8p4pFlkbXhqZgCdHM8p%2Fimage.png?alt=media&#x26;token=46a267d2-59be-4a68-b218-f9d4d6b0a3eb" alt=""><figcaption></figcaption></figure>
* The dataset gets a new column containing 1 and 0 values, determined by comparing the actual values with the set threshold.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FE0WRUxwcuMerQP84jDhG%2Fimage.png?alt=media&#x26;token=be33e3d4-94ff-4c77-978a-052433c8fb52" alt=""><figcaption></figcaption></figure>
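
The comparison rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the platform's implementation; the function name `binarize` and the sample values are hypothetical.

```python
def binarize(values, threshold):
    """Return 1 for values strictly greater than the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical sample column and threshold.
scores = [12, 50, 50.5, 73, 8]
binary = binarize(scores, threshold=50)
print(binary)  # [0, 0, 1, 1, 0]
```

Note that a value equal to the threshold (50 here) maps to 0, matching the "less than or equal to" rule.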

## Binning/ Discretize Values

Binning, or discretization, involves converting continuous data into distinct categories or values. This is commonly done to simplify data analysis, create histogram bins, or prepare data for certain machine-learning algorithms. Here are the steps to perform this transformation:

{% hint style="success" %}
*Check out the illustration on the Binning/ Discretize Values transform.*
{% endhint %}

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FKdkiVEKAFBjisVIUOZEL%2FBinning_Discretize%20Values_ML.mp4?alt=media&token=5b018001-463f-4d60-977d-eba0d7c3b86e>" %}

* Navigate to the ***Data Preparation*** landing page.
* Select a column containing the continuous data you want to bin.
* Open the ***Transforms*** section to get the list of available transforms.
* Click the ***Binning/ Discretize Values*** transform method from the ML section.

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FE9aIvWcgjbI4rAZC0TqP%2Fimage.png?alt=media&#x26;token=94c2f985-c338-4c01-8ced-1cf93f971bc4" alt=""><figcaption></figcaption></figure>

* The ***Binning/ Discretize Values*** dialog box opens.
* Set the number of bins.
* Click the ***Submit*** option.

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FIFRtiWp8Yx6IYqdngV9V%2Fimage.png?alt=media&#x26;token=231826df-1025-4e5a-b49f-c8d815b23c53" alt=""><figcaption></figcaption></figure>

* The result will be displayed as a new column representing the binned or discretized values of the original continuous data. These steps help you to effectively transform continuous data into discrete categories for further analysis or use in machine learning algorithms.

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FFhbnXyDxFJZHLlu95u58%2Fimage.png?alt=media&#x26;token=6350686e-18de-4c45-adfa-c1c27ad35fe2" alt=""><figcaption><p><em><strong>Result column for the Binning/ Discretize Value</strong></em></p></figcaption></figure>
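
As a rough illustration of binning, the sketch below assigns each value to one of a fixed number of equal-width bins. Equal-width binning is an assumption made for this example; the platform may use a different strategy (for example, quantile-based bins). The function name `equal_width_bins` and the sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    # min() keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical continuous column.
ages = [18, 22, 35, 41, 58, 60]
binned = equal_width_bins(ages, n_bins=3)
print(binned)  # [0, 0, 1, 1, 2, 2]
```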

## Expanding Window Transform

Expanding Window Transform is a common technique used in time series analysis and machine learning for feature engineering. It creates new features from cumulative statistics or aggregates calculated over expanding windows of historical data, where each window spans from the first row up to the current row. Here are the steps to perform this transformation:

{% hint style="success" %}
*Check out the illustration on the **Expanding Window** transform.*
{% endhint %}

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FqomXYXdeIpWWKN6azInw%2FExpanding%20Window%20Transform_ML.mp4?alt=media&token=81b2fb69-6fea-4391-bf16-da2f0616531c>" %}

* Navigate to the ***Data Preparation*** landing page.
* Choose a column containing numeric (integer or float) data that you want to transform using the expanding window method.
* Open the ***Transforms*** tab.
* Choose the ***Expanding Window Transform*** option from the available transformations.

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FmtcDzHlsxN3JHfxnKsuo%2Fimage.png?alt=media&#x26;token=f9661e1a-0952-4813-86c6-a17b8371e195" alt=""><figcaption></figcaption></figure>

* The ***Expanding Window Transform*** dialog box opens.
* Select Method (Min, Max, Mean): Select the method to apply within the expanding window. Options typically include Minimum (Min), Maximum (Max), and Mean. Users can select multiple methods.
* Execute the expanding window transformation with the chosen column and method(s) by clicking the ***Submit*** option.&#x20;

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FBhGTrbmBdopzHFBa6phz%2Fimage.png?alt=media&#x26;token=a2838fd2-f609-4084-8b60-28df4e1a4744" alt=""><figcaption></figcaption></figure>
* The output will be generated as follows:
  * If multiple methods are selected, new columns will be created with names indicating the method used. For example, if three methods are selected for a column named 'col1', the resulting columns will be named 'col1\_Expanding\_Min', 'col1\_Expanding\_Max', and 'col1\_Expanding\_Mean'.&#x20;
    * col1\_Expanding\_Min: Holds the smallest value seen from the first row up to the current row. The value in the last row is therefore the minimum of the whole column.
    * col1\_Expanding\_Max: Holds the largest value seen from the first row up to the current row, and is updated whenever a higher value is encountered.
    * col1\_Expanding\_Mean: Holds the sum of all values from the first row up to the current row, divided by the number of values encountered so far in the expanding window.

      <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2Fh801TYWWvhXur9ocm7PM%2Fimage.png?alt=media&#x26;token=056a80f6-496b-4909-9240-e63dd1aa9e4b" alt=""><figcaption></figcaption></figure>
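
The three expanding statistics can be sketched in Python as running aggregates. This is a minimal illustration rather than the platform's implementation; the function name `expanding` and the sample values are hypothetical, while the output column names follow the naming pattern described above.

```python
from itertools import accumulate


def expanding(values, name):
    """Compute running min, max, and mean from the first row to each row."""
    running_sum = accumulate(values)
    return {
        f"{name}_Expanding_Min": list(accumulate(values, min)),
        f"{name}_Expanding_Max": list(accumulate(values, max)),
        f"{name}_Expanding_Mean": [
            total / count for count, total in enumerate(running_sum, start=1)
        ],
    }


# Hypothetical numeric column named col1.
result = expanding([4, 2, 6, 1], "col1")
print(result["col1_Expanding_Min"])   # [4, 2, 2, 1]
print(result["col1_Expanding_Max"])   # [4, 4, 6, 6]
print(result["col1_Expanding_Mean"])  # [4.0, 3.0, 4.0, 3.25]
```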

## Feature Agglomeration

***Feature Agglomeration*** is used in machine learning and dimensionality reduction to combine correlated features into a smaller set of representative features. It is beneficial for datasets containing a large number of features, some of which may be redundant or highly correlated with each other.

{% hint style="success" %}
*Check out the illustration on the Feature Agglomeration transform.*
{% endhint %}

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2F6IOKIeHK8NryDOXlSuRE%2FFeature%20Agglomeration_ML.mp4?alt=media&token=3bac71bb-0cc0-4ceb-b0e9-31be0e2acd04>" %}

Here are the steps to perform the transformation:

* Navigate to the ***Data Preparation*** workspace.
* Open the ***Transforms*** tab.
* Select the ***Feature Agglomeration*** transform from the ***ML*** section.&#x20;

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2F4kZ2zEonQ80bm1GdsGlx%2Fimage.png?alt=media&#x26;token=92211999-9a33-46f7-a1d1-0103bf71ab50" alt=""><figcaption></figcaption></figure>
* The ***Feature Agglomeration*** dialog opens.
* Choose multiple numerical columns from your dataset.&#x20;

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FAUfvhl994luYMaoVmzYz%2Fimage.png?alt=media&#x26;token=6841933b-1875-49a1-bb50-479c638c5b64" alt=""><figcaption></figcaption></figure>

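* Set the samples value. As the example below indicates, this value determines the number of clusters (output columns) produced.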
* Click the ***Submit*** option.&#x20;

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FLq5PQFDXt2b8zNkH5eDC%2Fimage.png?alt=media&#x26;token=5e489ee6-d420-433e-8fd1-4d9dce04c902" alt=""><figcaption></figcaption></figure>
* The output will contain the transformed features, where the number of resulting columns will be equal to the number of clusters specified or determined by the algorithm.&#x20;
  * Each column will represent a cluster, a combination of the original features. The clusters are formed based on the similarity or correlation between features.
  * If the selected numerical columns are 3 and the sample size is 2, the resulting output will have 2 columns labeled cluster\_1 and cluster\_2, respectively, representing the two clusters obtained from the Feature Agglomeration transformation.&#x20;

    <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2F1uMH2QVCDbHfdHQCD1TA%2Fimage.png?alt=media&#x26;token=745fd722-ef38-46d9-8784-aa94b0df9704" alt=""><figcaption><p><em><strong>Outcome of the Feature Agglomeration transform</strong></em></p></figcaption></figure>

## Label Encoding

***Label Encoding*** is a technique used to convert categorical columns into numerical ones, enabling them to be utilized by machine learning models that only accept numerical data. It's a crucial pre-processing step in many machine-learning projects.

{% hint style="success" %}
*Check out the illustration on the Label Encoding transform.*
{% endhint %}

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FWMByBybxTsofrR2v2QUN%2FLabel%20Encoding%20ML%20option.mp4?alt=media&token=ed900e12-2da7-4153-9293-dfbc9b4aeb17>" %}

Here are the steps to perform Label Encoding:

* Navigate to the ***Data Preparation*** workspace.
* Select a column containing string or categorical data from the Data Grid display.
* Open the ***Transforms*** tab.
* Choose the ***Label Encoding*** transform.&#x20;

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2Fen0qAsoxZNXIorRfGz1z%2Fimage.png?alt=media&#x26;token=e5fa97ce-045c-4ce5-acb1-7bb908c8d743" alt=""><figcaption></figcaption></figure>
* A new column is generated, replacing the categorical values with numerical ones.
  * These numerical values are typically assigned in ascending order starting from 0. Each unique category in the original column is mapped to a unique numerical value.&#x20;

For example:

If a column contains the categories "Tall," "Medium," "Short," and "Tall," after applying Label Encoding, it will show the result as 0, 1, 2, and 0, respectively. Each unique category is assigned a distinct numerical value based on its position in the encoding scheme.&#x20;

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FHwnC2vaG5qdxlSIOXdEE%2Fimage.png?alt=media&#x26;token=e8bc8963-f0fa-4ef9-ba40-37e53342d5d8" alt=""><figcaption></figcaption></figure>
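
The "Tall", "Medium", "Short" example can be reproduced with a short Python sketch. It assumes that codes are assigned in order of first appearance, which matches the example above; other tools (for example, scikit-learn's `LabelEncoder`) sort the categories before numbering them. The function name `label_encode` is hypothetical.

```python
def label_encode(values):
    """Map each unique category to an integer, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)  # next unused integer, starting from 0
        encoded.append(codes[value])
    return encoded, codes


encoded, mapping = label_encode(["Tall", "Medium", "Short", "Tall"])
print(encoded)  # [0, 1, 2, 0]
print(mapping)  # {'Tall': 0, 'Medium': 1, 'Short': 2}
```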

## Lag Transform

The lag transformation involves shifting or delaying a time series by a certain number of time units (lags). This transformation is commonly used in time series analysis to study patterns, trends, or dependencies over time.

{% hint style="success" %}
*Check out the illustration on the Lag Transform method.*
{% endhint %}

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FQH7i5hqO0wPpO9hguOnb%2FLag%20Transform.mp4?alt=media&token=b0c79b29-dc66-48c5-aa7e-1244fbd95b20>" %}

Here are the steps to perform a lag transformation:

* Navigate to the Data Preparation workspace.
* Open the ***Transforms*** tab.
* Select the ***Lag Transform*** from the ***ML*** category.&#x20;

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FDqdxGPsxr8CIztM00PVh%2Fimage.png?alt=media&#x26;token=6b92e0a4-6800-4d73-a874-51f6e54a3996" alt=""><figcaption></figcaption></figure>
* The ***Lag Transform*** dialog box opens.
* Choose a column with numeric values.&#x20;
* Update the Lag field with the number of time units by which to shift or delay the time series. The Lag value should be 1 or more.
* Click the ***Submit*** option to submit the transformation.&#x20;

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FOrZsvXjTdaQ7ldyZZeU6%2Fimage.png?alt=media&#x26;token=d8115195-c146-4d7e-bc3d-57eabed8bf19" alt=""><figcaption></figcaption></figure>
* After applying the lag transformation, the result will be updated with a new column.&#x20;
  * This new column represents the original data shifted by the specified lag.&#x20;
  * The first few cells in the new column will be empty as they correspond to the lag period specified.&#x20;
  * The subsequent cells will contain the values of the original time series data shifted accordingly.

For example, if we have simple time series data representing the monthly sales of a product over a year with a lag of 2, the first two cells in the new column will be empty, and the subsequent cells will contain the sales data shifted by two months.

| Month | Sales | Sales\_lag\_2 |
| ----- | ----- | ------------- |
| Jan   | 100   |               |
| Feb   | 120   |               |
| Mar   | 90    | 100           |
| April | 60    | 120           |
| May   | 178   | 90            |
| June  | 298   | 60            |

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FAzOA2vzK1bD6O1WvJdqp%2Fimage.png?alt=media&#x26;token=4fd98cd8-70f3-44bf-a3b4-98611de53e96" alt=""><figcaption><p><em><strong>Result column for the Lag transform</strong></em></p></figcaption></figure>
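
The sales table above can be reproduced with a small Python sketch of the shift. The function name `lag` is hypothetical, and empty cells are represented as `None` for illustration.

```python
def lag(values, periods):
    """Shift values down by `periods` rows, leaving the first rows empty."""
    if periods < 1:
        raise ValueError("periods must be 1 or more")
    # Pad the front with empty cells, then trim to the original length.
    return ([None] * periods + list(values))[: len(values)]


sales = [100, 120, 90, 60, 178, 298]  # Jan to June, as in the table above
sales_lag_2 = lag(sales, periods=2)
print(sales_lag_2)  # [None, None, 100, 120, 90, 60]
```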

## Leave One Out Encoding

The Leave One Out Encoding transform encodes categorical variables in a dataset based on the target variable while avoiding data leakage. It's useful for classification tasks where you want to encode categorical variables without introducing bias or overfitting to the training data.

{% hint style="success" %}
*Check out the illustration on the Leave One Out Encoding transform.*
{% endhint %}

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FJ58QXBytd1hmuiquXVgr%2FLeave%20One%20out%20Encoding.mp4?alt=media&token=e445b067-bfa3-42f9-b399-30d49685ae58>" %}
***Leave One Out Encoding***
{% endembed %}

Here are the steps to perform the ***Leave One Out Encoding*** transformation:

* Navigate to the Data Preparation workspace.
* Select a string column for which the transformation is applied. This column should contain categorical variables.
* Open the ***Transforms*** tab.
* Choose the ***Leave One Out Encoding*** transformation from the ***ML*** section.&#x20;

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2Fphx9P3LABmXA352GkD33%2Fimage.png?alt=media&#x26;token=57854129-a5a8-4bad-8880-0c75053259d6" alt=""><figcaption></figcaption></figure>

* The ***Leave One Out Encoding*** dialog box appears.
* Select an integer column representing the target value used to calculate the mean for category values. This column is usually associated with the target variable in your dataset.&#x20;
* Submit the transformation by using the ***Submit*** option.&#x20;

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FRsdRgywro922onoutIM9%2Fimage.png?alt=media&#x26;token=0178c799-bef4-4253-8960-e289b25bead9" alt=""><figcaption></figcaption></figure>
* After applying the ***Leave One Out Encoding*** transformation, the result will be displayed as a new column.
  * For each record, the new column contains the mean of the target values of all other records in the same category, excluding the target value of that record.
  * Because a record's own target value is left out of its encoding, the method limits data leakage and overfitting to the training data. Refer to the following example:

| category | target | Result                |
| -------- | ------ | --------------------- |
| A        | 1      | 0.5                   |
| B        | 0      | 0.5                   |
| A        | 1      | 0.5                   |
| B        | 1      | 0                     |
| A        | 0      | 1                     |
| B        | 0      | 0.5                   |

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FiOZ8p7SHfrvPUxsuK0gv%2Fimage.png?alt=media&#x26;token=4bf4994d-e7e5-4ed3-ba3d-3395799c4ca1" alt=""><figcaption><p><em><strong>Result column for the Leave One Out Encoding Transform</strong></em></p></figcaption></figure>
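
The Result column in the table above can be recomputed with the following Python sketch. It is an illustration of the leave-one-out rule only; the function name `leave_one_out_encode` is hypothetical, and returning `None` for a category with a single record is an assumption, since the source does not describe that case.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """For each row, average the targets of the other rows in its category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    encoded = []
    for category, target in zip(categories, targets):
        if counts[category] > 1:
            # Remove this row's own target before averaging.
            encoded.append((sums[category] - target) / (counts[category] - 1))
        else:
            # Assumption: no other rows exist to average for a lone category.
            encoded.append(None)
    return encoded


category = ["A", "B", "A", "B", "A", "B"]
target = [1, 0, 1, 1, 0, 0]
result = leave_one_out_encode(category, target)
print(result)  # [0.5, 0.5, 0.5, 0.0, 1.0, 0.5]
```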

## One Hot Encoding

***One-Hot Encoding***/ ***Convert Value to Column*** is a data preparation technique used to convert categorical variables into a binary format, making them suitable for machine learning algorithms that require numerical input. It creates binary columns for each category in the original data, where each column represents one category and has a value of 1 if the category is present in the original data and 0 otherwise.

{% hint style="success" %}
*Check out the illustration on applying the One Hot Encoding transform.*
{% endhint %}

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FhMzarV5e0teFjMkEuZWz%2FOne%20Hot%20Encoding.mp4?alt=media&token=fc88b3fe-380a-43e1-823b-e64869cc70d3>" %}

Here are the steps to perform One-Hot Encoding:

* Navigate to the Data Preparation workspace.
* **Select Categorical Column:** Choose the categorical column(s) from the dataset to be encoded. These columns typically contain string or categorical values.
* Open the ***Transforms*** section.
* Use the One-Hot Encoding transformation to convert the selected categorical column(s) into a binary format. Click the ***One-Hot Encoding*** transform.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FB69ZZSlb2T1uaSSygGQ4%2Fimage.png?alt=media&#x26;token=91b4525f-280e-4292-801d-b326dd70537d" alt=""><figcaption></figcaption></figure>
* Result Interpretation: The output will be a set of new binary columns, each representing a category in the original categorical column. For each row in the dataset, the value in the corresponding binary column will be 1 if the category is present in that row, and 0 otherwise.
  * Example: Suppose you have a dataset with a categorical column "Color" containing the following values: "Red", "Blue", "Green", and "Red".

    * Original Dataset:&#x20;

      | Color |
      | ----- |
      | Red   |
      | Blue  |
      | Green |
      | Red   |

    * After applying One-Hot Encoding:

      Each new column represents a category from the original column, and a value of 1 in a row indicates that the category is present in that row. For instance, the first row has "Red" in the original column, hence "Color\_Red" is 1, while the others are 0. The "Color\_Blue" and "Color\_Green" columns are populated in the same way.

<table><thead><tr><th width="247">Color_Red</th><th width="246">Color_Blue</th><th>Color_Green</th></tr></thead><tbody><tr><td>1</td><td>0</td><td>0</td></tr><tr><td>0</td><td>1</td><td>0</td></tr><tr><td>0</td><td>0</td><td>1</td></tr><tr><td>1</td><td>0</td><td>0</td></tr></tbody></table>

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FdJnjthOElDbcj889IYzH%2Fimage.png?alt=media&#x26;token=c34ea86d-3a38-43cc-9a37-f0a2da8f76d8" alt=""><figcaption><p><em><strong>Result columns after applying the One-Hot Encoding transform to the above mentioned Categorical Column</strong></em></p></figcaption></figure>
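
The "Color" example can be sketched in Python as follows. This is only an illustration; the function name `one_hot` is hypothetical, and the columns are created in order of first appearance to match the table above.

```python
def one_hot(values, prefix):
    """Create one binary column per unique category."""
    categories = list(dict.fromkeys(values))  # unique, first-appearance order
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


encoded = one_hot(["Red", "Blue", "Green", "Red"], prefix="Color")
for column, flags in encoded.items():
    print(column, flags)
# Color_Red [1, 0, 0, 1]
# Color_Blue [0, 1, 0, 0]
# Color_Green [0, 0, 1, 0]
```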

## Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique for identifying patterns in data. It involves expressing the data as a new set of orthogonal (uncorrelated) variables called principal components. PCA is widely used in various fields, such as data analysis, pattern recognition, and machine learning.

{% hint style="success" %}
*Check out the illustration on the Principal Component Analysis transform.*
{% endhint %}

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2Fg6WHowQkMt0AvKGaALCK%2FPCA%20Principal%20Component%20Ananlysis.mp4?alt=media&token=f328b9ae-c001-4271-9f21-69f9d9425cb0>" %}

Here are the steps to perform Principal Component Analysis (PCA):

* Navigate to the ***Data Preparation*** workspace.
* Open the ***Transforms*** tab.
* Select the ***Principal Component Analysis*** transform from the ML category.

<figure><img src="https://content.gitbook.com/content/Kg5pfnNkTs1b1YNYX7rD/blobs/l9jnFfH7s7DMW28WpFEb/image.png" alt=""><figcaption></figcaption></figure>

* The ***Principal Component Analysis*** dialog window opens.
* Select multiple numerical columns by using the given checkboxes.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FTgE6v1hvitaLEwnXfiOb%2Fimage.png?alt=media&#x26;token=36146583-4ca0-44c2-b552-c23bfdbbd746" alt=""><figcaption></figcaption></figure>
* The selected columns are displayed separated by commas.
* **Output Features**: Provide the number of output features. The transform inserts that many result columns into the dataset.
* Click the ***Submit*** option.

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2F3clS8h67Al2gfHujOAJ7%2Fimage.png?alt=media&#x26;token=d6c3af8c-6c39-422b-8adc-ab70a812673b" alt=""><figcaption></figcaption></figure>

The following example illustrates Principal Component Analysis:

* Suppose we have a dataset with two numerical variables, "Height" and "Weight", and we want to perform PCA on this dataset.
* Original Dataset:

<table><thead><tr><th width="346">Height</th><th>Weight</th></tr></thead><tbody><tr><td>170</td><td>65</td></tr><tr><td>165</td><td>60</td></tr><tr><td>180</td><td>70</td></tr><tr><td>160</td><td>55</td></tr></tbody></table>

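* After standardization (each value minus the column mean, divided by the column's population standard deviation, rounded to two decimals):

| Height | Weight |
| ------ | ------ |
| 0.17   | 0.45   |
| -0.51  | -0.45  |
| 1.52   | 1.34   |
| -1.18  | -1.34  |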

The number of result columns inserted depends on the value provided in the Output Features field.

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FyAZi4dkXsXWpQgDXaLfi%2Fimage.png?alt=media&#x26;token=ff5fc1b9-99af-4af7-871a-e1e507ca1775" alt=""><figcaption><p><em><strong>Output column after applying the Principal Component Analysis transform</strong></em></p></figcaption></figure>

{% hint style="info" %}
*<mark style="color:green;">Please Note:</mark> The Output Features value for the chosen dataset is 1; therefore, in the image above, one column has been inserted displaying the result values. Multiple columns are added to the dataset if the Output Features field is set to a number greater than 1.*
{% endhint %}
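
The Height and Weight example can be sketched with NumPy as shown below. This is a minimal illustration under stated assumptions: the columns are standardized with the population standard deviation before projection, and the principal directions are obtained through `numpy.linalg.svd`. The platform's exact preprocessing is not documented here, and the function names `standardize` and `pca` are hypothetical. The sign of a principal component is arbitrary, so only the shape of the output is shown.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it by its population standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


def pca(X, output_features):
    """Project standardized data onto its first principal components."""
    Z = standardize(X)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:output_features].T


data = [[170, 65], [165, 60], [180, 70], [160, 55]]  # Height, Weight
scores = pca(data, output_features=1)
print(scores.shape)  # (4, 1)
```

Setting `output_features=2` in this sketch would return two columns, mirroring the behavior of the Output Features field.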

## Rolling Data

The Rolling Data transform is used in time series analysis and feature engineering. It involves creating new features by applying transformations to rolling windows of the original data. These rolling windows move through the time series data, and at each step, summary statistics or other transformations are calculated within the window.

The newly created columns are appended to the dataset, providing additional insights into the trends and patterns within the data. The rolling window transform can be useful for time series analysis, identifying peaks, trends, or other statistical patterns over time. It can also be employed to smooth out fluctuations and better understand the underlying structure of the data.

{% hint style="success" %}
*Check out the illustration on the rolling data transform.*
{% endhint %}

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FHIumaewZD0w8Hpt9yOYM%2FRolling%20Data2.mp4?alt=media&token=3eaa376e-89bb-4e9a-9aa6-5cbd49cc5152>" %}

Here are the steps to perform the Rolling Data transform:

* Navigate to the ***Data Preparation*** workspace.
* Select a numeric column (int/float) from your dataset.
* Open the ***Transforms*** tab.
* Select the ***Rolling Data*** transform.&#x20;

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FdrGhcVZHLlb8UMtoXjDJ%2Fimage.png?alt=media&#x26;token=96c95511-3f55-4e88-a818-f22b374844bf" alt=""><figcaption></figcaption></figure>
* Update the ***Window size***. This determines the number of consecutive data points included in each window. The window size must be a numeric value of 2 or larger.
* Select a Method from the given choices (Min, Max, Mean). Users can choose all the methods and apply them to the selected column.

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FZ2jWB6td3ihvo5xPhiSM%2Fimage.png?alt=media&#x26;token=28c06b71-1c64-4bcf-bd4a-268901cc7f47" alt=""><figcaption></figcaption></figure>

* After selecting the Window size and Method, click the ***Submit*** option to apply rolling window transformation.&#x20;

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FKoEEUvSgDiJfdHtRIpoV%2Fimage.png?alt=media&#x26;token=c54d837e-d6f8-467a-991b-c8b45fef32e6" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
*<mark style="color:green;">Please Note:</mark> The Window Size accepts any numeric value of 2 or larger.*
{% endhint %}

* The result columns are added based on the number of methods selected while applying the Rolling Data transform. The following image displays result columns with mean, min, and max rolling data for the Bonus column.

<figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FoDIbCDTaUpEgtb3Aw8NV%2Fimage.png?alt=media&#x26;token=8fe09947-ccef-4913-a23f-8d8ee84ab3f8" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
*<mark style="color:green;">Please Note:</mark> The first cells in each result column are null because the window does not yet contain enough values to calculate the summary statistic. The number of empty/null cells in each result column is one less than the selected Window Size.*
{% endhint %}
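
The rolling calculation, including the null cells at the start, can be sketched in Python as follows. This is an illustration only; the function name `rolling` and the Bonus values are hypothetical, and null cells are represented as `None`.

```python
def rolling(values, window, func):
    """Apply func to each full window of `window` consecutive values."""
    if window < 2:
        raise ValueError("window size must be 2 or larger")
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            # Not enough values yet to fill the window.
            result.append(None)
        else:
            result.append(func(values[i + 1 - window : i + 1]))
    return result


def mean(window_values):
    return sum(window_values) / len(window_values)


bonus = [10, 20, 30, 40, 50]  # hypothetical Bonus column
print(rolling(bonus, 3, min))   # [None, None, 10, 20, 30]
print(rolling(bonus, 3, max))   # [None, None, 30, 40, 50]
print(rolling(bonus, 3, mean))  # [None, None, 20.0, 30.0, 40.0]
```

With a window size of 3, the first two cells are empty, which is one less than the window size.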

## Singular Value Decomposition

Singular Value Decomposition (SVD) is a linear algebra technique that decomposes a matrix into three constituent matrices. It is useful for tasks such as data compression, noise reduction, and feature extraction. As a data preparation transform, SVD is used for dimensionality reduction: it represents the original data in a lower-dimensional space.

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FULcjhtmmER6UdrYpE8Rk%2FSingular%20Value%20Decomposition.mp4?alt=media&token=b165af2d-5884-4b98-b70c-331517997fc3>" %}

Here are the steps to perform the ***Singular Value Decomposition*** transform:

* Navigate to the Data Preparation workspace.
* Select a column from the dataset.
* Open the ***Transforms*** tab.
* Select the ***Singular Value Decomposition*** transform using the ML category.&#x20;

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2F518djWBS6eR6cE1HqO5I%2Fimage.png?alt=media&#x26;token=4f7b34a1-6b7f-4397-bd41-a3721aa030e7" alt=""><figcaption></figcaption></figure>
* The ***Singular Value Decomposition*** window opens.
* Select multiple numeric columns from the dataset using the drop-down menu.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FuimVGKFDMWl152KAennw%2Fimage.png?alt=media&#x26;token=1f560595-4cb3-4c9b-bea2-ba7c1f6686dd" alt=""><figcaption></figcaption></figure>
* Update the ***Latent Factors***.
* Click the ***Submit*** option.&#x20;

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FyIWz5C5YYC2toR2yJzQe%2Fimage.png?alt=media&#x26;token=fbb18605-8a44-4115-bd51-c3c969bee35b" alt=""><figcaption></figcaption></figure>
* The number of result columns equals the ***Latent Factors*** value. For example, if the value is 2, two result columns are added.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FQzky21V53ak876wVL0he%2Fimage.png?alt=media&#x26;token=44573dce-48f7-42fd-b8d0-f35dc3be421a" alt=""><figcaption></figcaption></figure>
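
A reduced representation based on SVD can be sketched with NumPy as shown below. This is a minimal illustration, assuming the data is used as-is (no centering or scaling) and that the result columns are the left singular vectors scaled by their singular values; the platform may compute them differently. The function name `svd_features` and the sample matrix are hypothetical.

```python
import numpy as np


def svd_features(X, latent_factors):
    """Keep the first `latent_factors` components of X = U @ diag(S) @ Vt."""
    X = np.asarray(X, dtype=float)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Scale each retained left singular vector by its singular value.
    return U[:, :latent_factors] * S[:latent_factors]


# Hypothetical dataset with three numeric columns.
X = [[1, 2, 3], [4, 5, 6], [7, 8, 10], [2, 0, 1]]
features = svd_features(X, latent_factors=2)
print(features.shape)  # (4, 2)
```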

## Target Encoding

Target Encoding, also known as Mean Encoding or Likelihood Encoding, is a method for encoding categorical variables using the mean (or another summary statistic) of the target variable for each category. It replaces each categorical value with that statistic. This encoding method is widely used in predictive modeling tasks, especially classification problems, to convert categorical variables into a numerical format that machine learning algorithms can use as input.

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FzPcuCq8H3ZdEChuZArUf%2FTarget%20Encoding.mp4?alt=media&token=8cbe03c3-d0ac-481a-b8ca-30190f229917>" %}

Here are the steps to perform the ***Target Encoding*** transform:

* Navigate to the ***Data Preparation*** workspace.
* Select a category (string) column type for the transformation.
* Open the ***Transforms*** tab.
* Select the ***Target Encoding*** transformation from the ***ML*** category.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FlDHdvafjeYciguNZgHZY%2Fimage.png?alt=media&#x26;token=a472175f-96a2-4b61-a20d-ae683450b3f6" alt=""><figcaption></figcaption></figure>
* The ***Target Encoding*** dialog box opens.
* Select the ***Target Column*** using the drop-down option (it should be a numeric/integer column).
* Click the ***Submit*** option.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2Fv1dQHMiWlExeSLimlC9V%2Fimage.png?alt=media&#x26;token=cb53f41d-96ef-4694-9a04-eaec0f192985" alt=""><figcaption></figcaption></figure>
* The result is displayed in a new column that contains the encoded mean value for each category in the selected column.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FxRaDb7TWidQ0aahLCy4P%2Fimage.png?alt=media&#x26;token=ba4e1c2a-d806-4279-90fd-e22f7445afe0" alt=""><figcaption></figcaption></figure>

E.g.,

| Category | Target | Result |
| -------- | ------ | --------------------- |
| A        | 1      | 0.5257                |
| B        | 0      | 0.4247                |
| A        | 1      | 0.5257                |
| B        | 1      | 0.4247                |
| A        | 0      | 0.5257                |
| B        | 0      | 0.4247                |
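
The idea behind the table can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function names and the `min_samples_leaf` and `smoothing` parameters are hypothetical. It computes the plain per-category mean of the target, then a smoothed variant that blends each category mean with the overall mean using a sigmoid weight, which is one common way to keep small categories from getting extreme values.

```python
import math
from collections import defaultdict


def category_stats(categories, targets):
    """Return {category: (row count, mean of the target)}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: (counts[cat], sums[cat] / counts[cat]) for cat in counts}


def smoothed_target_encoding(categories, targets,
                             min_samples_leaf=20, smoothing=10.0):
    """Blend each category mean with the global mean.

    min_samples_leaf and smoothing are assumed, illustrative parameters.
    """
    prior = sum(targets) / len(targets)  # global mean of the target
    encoding = {}
    for cat, (n, mean) in category_stats(categories, targets).items():
        # The weight grows toward 1 as the category gets more rows.
        weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
        encoding[cat] = prior * (1.0 - weight) + mean * weight
    return encoding


# Same Category/Target values as the example table.
categories = ["A", "B", "A", "B", "A", "B"]
targets = [1, 0, 1, 1, 0, 0]

plain = {cat: mean for cat, (_, mean) in category_stats(categories, targets).items()}
smoothed = smoothed_target_encoding(categories, targets)

print(plain)     # per-category means of the target
print(smoothed)  # means pulled toward the global mean of 0.5
```

With this data the plain means are about 0.667 for A and 0.333 for B. Under the assumed parameters the smoothed values are about 0.5257 for A and 0.4743 for B. Smoothing of this kind is why encoded values can sit closer to the overall mean than the raw per-category means.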


## Target-based Quantile Encoding

Target-based Quantile Encoding is particularly useful for regression problems where the target variable is continuous. It helps encode categorical variables in a dataset based on the distribution of the target variable within each category, potentially improving the predictive performance of regression models.

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FeL3GkLTaobQuTzqvc7q4%2FTarget%20based%20Quantile%20Encoding.mp4?alt=media&token=a062de61-e07b-4db5-8661-b61d9636f4ba>" %}

Here are the steps to perform the Target-based Quantile Encoding transform:

* Navigate to the ***Data Preparation*** workspace.
* Select a string column from the dataset on which the ***Target-based Quantile Encoding*** can be applied.
* Open the ***Transforms*** tab.
* Select the ***Target-based Quantile Encoding*** transformation from the ***ML*** category.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FmCiXjeIrw0m3kOwjIB2H%2Fimage.png?alt=media&#x26;token=18cfd115-f6a0-4e81-b2f2-24fb119d3ec4" alt=""><figcaption></figcaption></figure>
* The ***Target-based Quantile Encoding*** dialog box opens.
* Select an ***integer*** (numeric) column from the dataset as the target column.
* Click the ***Submit*** option.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FiQskHoY4KBVg5fi5w0Fo%2Fimage.png?alt=media&#x26;token=f46ff2c6-c160-4e5c-a191-70ec520ac294" alt=""><figcaption></figcaption></figure>
* The result is a new column that contains the encoded value for each category in the selected column.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FGhSzREJCDvvcm9asEuqI%2Fimage.png?alt=media&#x26;token=c6a43f1e-e9b8-4af2-b2ea-b221bfe42b06" alt=""><figcaption></figcaption></figure>

E.g.,

| Category | Target | Result |
| -------- | ------ | ------ |
| A        | 1      | 0.875  |
| B        | 0      | 0.125  |
| A        | 1      | 0.875  |
| B        | 0      | 0.125  |
| A        | 1      | 0.875  |
| B        | 0      | 0.125  |
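
A minimal Python sketch of one way such an encoding can be computed is shown below. It is an assumption, not the documented implementation: it uses the median as the quantile and blends each category's median with the global median through a hypothetical smoothing weight `m`. The function name `quantile_encode` is also illustrative.

```python
from collections import defaultdict
from statistics import median


def quantile_encode(categories, targets, m=1.0):
    """Encode each row with a smoothed per-category median of the target.

    m is an assumed smoothing weight: larger values pull every category
    closer to the global median.
    """
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        groups[cat].append(y)

    prior = median(targets)  # global median of the target
    encoding = {
        cat: (len(values) * median(values) + m * prior) / (len(values) + m)
        for cat, values in groups.items()
    }
    return [encoding[cat] for cat in categories]


# Same Category/Target values as the example table.
categories = ["A", "B", "A", "B", "A", "B"]
targets = [1, 0, 1, 0, 1, 0]

encoded = quantile_encode(categories, targets)
print(encoded)  # [0.875, 0.125, 0.875, 0.125, 0.875, 0.125]
```

Here the global median is 0.5, category A has a median of 1 over 3 rows, and category B has a median of 0 over 3 rows, so the encoded values are (3 × 1 + 0.5) / 4 = 0.875 and (3 × 0 + 0.5) / 4 = 0.125.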

## Weight of Evidence Encoding

The Weight of Evidence Encoding is used in binary classification problems to encode categorical variables based on their predictive power with respect to the target variable. It measures the strength of the relationship between a categorical variable and the target variable by examining the distribution of the target variable across different categories.

{% hint style="success" %}
*Check out the illustration on the Weight of Evidence Encoding transform.*
{% endhint %}

{% embed url="<https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FF7wWfslQjHLvYYNBbOkW%2FWeight%20of%20Evidence%20Encoding.mp4?alt=media&token=c09270f1-7b35-4fa7-bda6-d6d3ce619fad>" %}
***Applying Weight of Evidence Encoding Transformation***
{% endembed %}

Here are the steps to perform the Weight of Evidence Encoding transform:

* Navigate to the ***Data Preparation*** workspace.
* Select a categorical column.
* Click on the ***Transforms*** tab.
* Select the ***Weight of Evidence Encoding*** transform from the ***ML*** category.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2F2DQFrgzHbdC8Gln8Ydx8%2Fimage.png?alt=media&#x26;token=619ddffd-f93b-46fc-8641-2c86909a9906" alt=""><figcaption></figcaption></figure>
* The ***Weight of Evidence Encoding*** window opens.
* Select a target column that contains binary values (such as true/false or 0/1). For example, the target\_value column is selected.
* Click the ***Submit*** option.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2FbRWuo5U6bbECJmcU5e8C%2Fimage.png?alt=media&#x26;token=8a645fa6-1d0c-46f8-b296-5fe24ce784f6" alt=""><figcaption></figcaption></figure>
* The result is a new column that contains the Weight of Evidence value for each category, which reflects the distribution of the target variable across the categories.

  <figure><img src="https://2657181281-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKg5pfnNkTs1b1YNYX7rD%2Fuploads%2Fm2WLw5M2nVpYv9kKCJUH%2Fimage.png?alt=media&#x26;token=2957178c-e61b-4b47-ab56-243bde9da38c" alt=""><figcaption></figcaption></figure>
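
The calculation behind Weight of Evidence can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the sample data, the function name, and the `regularization` parameter are hypothetical. It uses the convention WoE = ln(share of events in the category / share of non-events in the category); some references use the inverse ratio, which flips the sign.

```python
import math
from collections import Counter


def weight_of_evidence(categories, targets, regularization=0.0):
    """Return {category: WoE} for a 0/1 target.

    regularization is an assumed additive term; set it above 0 when a
    category has no events or no non-events, since log(0) is undefined.
    """
    total_events = sum(targets)
    total_non_events = len(targets) - total_events

    counts = Counter(categories)
    events = Counter()
    for cat, y in zip(categories, targets):
        events[cat] += y

    woe = {}
    for cat, n in counts.items():
        e = events[cat]
        non_e = n - e
        event_share = (e + regularization) / (total_events + 2 * regularization)
        non_event_share = (non_e + regularization) / (total_non_events + 2 * regularization)
        woe[cat] = math.log(event_share / non_event_share)
    return woe


# Hypothetical sample data: A has 2 events and 1 non-event, B the reverse.
categories = ["A", "A", "A", "B", "B", "B"]
targets = [1, 1, 0, 0, 0, 1]

woe = weight_of_evidence(categories, targets)
print(woe)  # A is positive (more events than non-events), B is negative
```

In this sample, category A holds 2 of the 3 events and 1 of the 3 non-events, so its value is ln(2), and category B gets ln(0.5). A positive value means the category is associated with the event more than the dataset as a whole is, and a negative value means the opposite.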
