Advanced Data Preparation Transforms

The Advanced Transforms section provides tools like Cluster & Edit, Expression Editor, Find Anomaly, and SQL Transform for complex data cleaning, enrichment, and anomaly detection.

The Advanced Transforms section in the Data Preparation framework provides tools to perform complex data cleaning, enrichment, and anomaly detection. Key transforms include Cluster & Edit, Expression Editor, Find Anomaly, and SQL Transform.

Cluster & Edit

The Cluster & Edit transform groups values based on phonetic similarity (using the Soundex algorithm) and allows bulk editing of similar values in a single step. It is particularly useful for standardizing names or text fields where slight variations exist.

Best Situations to Use

  • Standardizing text data, such as names or codes, with similar phonetic spelling.

  • Correcting inconsistent values in a dataset before analysis.

  • Reducing manual cleaning effort for large datasets.

Steps to Perform Cluster & Edit

  1. Select a column from the dataset.

  2. Navigate to the Transforms tab.

  3. Select Cluster & Edit from the Advanced category.

  4. The Cluster & Edit window opens.

  5. Select a Method (Soundex) for phonetic grouping.

  6. Review the Values Found column for grouped values.

  7. Specify the Replace Value for each selected value using the drop-down menu or search bar.

  8. Select the values to update using checkboxes.

  9. Click Submit.

Expression Editor

The Expression Editor allows users to create, edit, and execute custom formulas on the dataset. Users can update existing columns or create new columns based on computed expressions.

Best Situations to Use

  • Creating derived columns using custom logic.

  • Performing column-level calculations that are not supported by predefined transforms.

  • Dynamically updating datasets based on formulas or expressions.

Steps to Perform Expression Editor Transform

  1. Select a dataset in the Data Preparation workspace.

  2. Navigate to the Transforms tab and open Expression Editor.

  3. The editor displays:

    • Functions Column: Search and double-click functions to add to the formula space.

    • Columns Column: Lists all columns in the dataset.

    • Formula Space: Area to write and execute formulas.

  4. Consume the formula using either:

    • Update Column: Apply to an existing column.

    • Create New Column: Provide a name and apply the formula to generate a new column.

  5. Click Submit.

Find Anomaly

The Find Anomaly transform detects outliers in the dataset using the Isolation Forest algorithm. It flags unusual values for further analysis or cleaning.

Best Situations to Use

  • Identifying outliers in numeric datasets.

  • Performing quality checks to detect abnormal data points.

  • Preparing data for machine learning models by highlighting anomalies.

Steps to Perform Find Anomaly Transform

  1. Select a dataset in the Data Preparation framework.

  2. Navigate to the Transforms tab and select Find Anomaly from Advanced.

  3. Configure parameters:

    • Feature Columns: Columns to analyze.

    • Maximum Sample Size: Training sample size for Isolation Forest.

    • Contamination (%): Expected proportion of outliers (0–1).

    • Anomaly Flag Name: Name of the new column storing the flag (-1 = outlier, 1 = normal).

  4. Click Submit.

Result: A new column is added with anomaly flags.

Note: Estimators and seed values use default settings unless modified.

SQL Transform

The SQL Transform allows execution of custom Pandas SQL queries on the dataset for advanced transformations. Users can combine functions, columns, and conditions to create complex data manipulations.

Best Situations to Use

  • Performing complex transformations using SQL logic.

  • Filtering or aggregating data beyond the capabilities of built-in transforms.

  • Creating dataset subsets or derived metrics for analysis.

Steps to Perform SQL Transform

  1. Select a dataset in the Data Preparation workspace.

  2. Navigate to the Transforms tab and open SQL Transform under Advanced.

  3. The SQL Editor opens, displaying:

    • Functions Column: Available SQL functions with syntax examples.

    • Columns Column: Dataset columns for query construction.

  4. Construct a query by selecting functions and columns.

  5. Click Submit.

Note: Only Pandas SQL queries are supported. Function syntax and examples are available at the bottom of the editor for guidance.

Last updated