Anonymization Transforms

Anonymization transforms (Data Hashing, Masking, Variance, Hashing with Salt & Pepper) in Data Prep modify PII to secure sensitive data for analytics.

Anonymization is a data processing technique that removes or modifies personally identifiable information (PII) to protect privacy. The Data Preparation framework provides several anonymization transforms, including Data Hashing, Data Masking, Data Variance, and Hashing with Salt & Pepper, allowing users to secure sensitive information while maintaining data usability for analytics and machine learning.

Best Situations to Use

Protect sensitive data before sharing datasets.
Ensure privacy compliance (e.g., GDPR, HIPAA).
Generate anonymized datasets for testing or development without exposing real PII.
Maintain statistical properties while masking data for analysis.

Data Hashing

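Data Hashing replaces each value with a fixed-length hash computed by a one-way algorithm such as SHA-1, SHA-2, or MD5. It is widely used to anonymize identifiers, emails, or other sensitive fields. MD5 and SHA-1 are no longer considered collision-resistant, so prefer a SHA-2 variant where security matters.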

Best Situations to Use

Anonymizing unique identifiers like user IDs or emails.
Protecting PII for compliance or internal testing.
Ensuring consistent but non-reversible representations of data.

Steps

Select a dataset and the column to anonymize.
Navigate to Transforms > Anonymization > Data Hashing.
Choose a Hash Option. Each option label maps to the algorithm applied by the backend:
- SHA-1 → SHA-256
- SHA-2 → SHA-512
- Hash → MD5
- MD5 → MD5
(Optional) Set Hash Value (e.g., 256, 384, 512 for SHA-2).
Click Submit.

The selected column is replaced with hashed values.
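
The effect of this transform can be approximated outside the tool with Python's standard `hashlib`. The following is a minimal sketch, not the product's implementation: the `OPTION_TO_ALGORITHM` table copies the option mapping listed in the steps, while the function names, the UTF-8 encoding, and the pass-through of empty values are assumptions made for illustration.

```python
import hashlib

# Option label -> backend algorithm, copied from the steps above.
OPTION_TO_ALGORITHM = {
    "SHA-1": "sha256",
    "SHA-2": "sha512",
    "Hash": "md5",
    "MD5": "md5",
}


def hash_value(value, option="SHA-1"):
    """Return the hex digest of one cell value.

    Assumption: None is passed through unchanged and every other value
    is converted to text and encoded as UTF-8 before hashing.
    """
    if value is None:
        return None
    algorithm = OPTION_TO_ALGORITHM[option]
    return hashlib.new(algorithm, str(value).encode("utf-8")).hexdigest()


def hash_column(rows, column, option="SHA-1"):
    """Return copies of the rows with `column` replaced by its hash."""
    return [{**row, column: hash_value(row[column], option)} for row in rows]


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
hashed = hash_column(rows, "email", option="SHA-1")
```

Because the hash is deterministic, equal inputs produce equal outputs, so the hashed column can still be used for joins and counts.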

Data Masking

Data Masking hides original data with modified content, creating a structurally similar but inauthentic version of the data.

Best Situations to Use

Masking parts of sensitive strings (e.g., credit card numbers, phone numbers).
Protecting data while preserving format for testing or development.

Steps

Select a dataset and a column to mask.
Navigate to Transforms > Anonymization > Data Masking.
Specify Start Index and End Index for masking.
Click Submit.

The selected portion of the column is replaced with masked values.
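
As an illustration of index-based masking, here is a small Python sketch. It is written under assumptions the steps do not state: indices are zero-based, the End Index is exclusive, and the mask character is `*`. The function name `mask_value` is hypothetical.

```python
def mask_value(value, start, end, mask_char="*"):
    """Mask the characters of `value` in positions [start, end).

    Assumptions: zero-based indices, exclusive end, '*' as the mask
    character. Out-of-range indices are clamped to the string length.
    """
    if value is None:
        return None
    text = str(value)
    start = max(0, min(start, len(text)))
    end = max(start, min(end, len(text)))
    return text[:start] + mask_char * (end - start) + text[end:]


# Hide the first 12 digits of a card number and keep the last 4.
masked_card = mask_value("4111111111111111", 0, 12)

# Hide the middle of a phone number.
masked_phone = mask_value("555-867-5309", 4, 7)
```

The masked value keeps the original length and any characters outside the range, which is what preserves the format for testing.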

Data Variance

Data Variance adds random variations to numeric or date columns while preserving the overall distribution of the data.

Best Situations to Use

Introducing variability to numeric or date fields for privacy protection.
Creating synthetic datasets that mimic the original data distribution.

Steps for Numeric Columns

Select a numeric column.
Navigate to Transforms > Anonymization > Data Variance.
Select Value Type = Numeric, choose an Operator, and set a percentage variance.
Add optional comments.
Click Submit.

Steps for Date Columns

Select a date column.
Navigate to Transforms > Anonymization > Data Variance.
Select Value Type = Date, choose Start Date and End Date.
Add optional comments.
Click Submit.

Numeric columns are varied within the specified percentage; date columns are randomized within the selected range.
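
The two behaviours described above can be sketched in Python as follows. This is an illustrative approximation only: the operator values `"+"`, `"-"`, and `"+/-"` are hypothetical stand-ins for the Operator setting, the uniform distribution is an assumption, and the function names are invented for this example.

```python
import random
from datetime import date, timedelta


def vary_numeric(value, percent, operator="+/-", rng=None):
    """Scale `value` by a random factor within `percent` percent.

    Hypothetical operators: "+" only increases the factor, "-" only
    decreases it, and "+/-" allows both directions.
    """
    if rng is None:
        rng = random.Random()
    if operator == "+":
        low, high = 0.0, percent
    elif operator == "-":
        low, high = -percent, 0.0
    else:
        low, high = -percent, percent
    return value * (1 + rng.uniform(low, high) / 100)


def vary_date(start, end, rng=None):
    """Return a random date between `start` and `end`, inclusive."""
    if rng is None:
        rng = random.Random()
    span_days = (end - start).days
    return start + timedelta(days=rng.randint(0, span_days))


rng = random.Random(42)  # seeded so the example is repeatable
salaries = [52000, 61000, 75000]
varied_salaries = [vary_numeric(s, 10, rng=rng) for s in salaries]
random_dates = [vary_date(date(2020, 1, 1), date(2020, 12, 31), rng=rng) for _ in range(3)]
```

With a 10 percent variance, each salary stays within 10 percent of its original value, so aggregate statistics shift only slightly while individual values no longer match the source.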

Hashing Anonymization (Salt & Pepper Technique)

This technique protects sensitive data by combining each value with a salt (taken from a selected column) and a user-provided pepper before hashing, which makes the resulting hashes harder to reverse with precomputed lookup tables.

Best Situations to Use

Protecting sensitive columns in financial, health, or personal datasets.
Creating anonymized datasets for development or testing.

Steps

Select a dataset and a column to anonymize.
Navigate to Transforms > Anonymization > Hashing Anonymization (Salt & Pepper).
Provide a value in Set Values (pepper).
Select the column in Set Fields (salt).
Choose a Hash Option (SHA-1, SHA-2, Hash, MD5).
Click Submit.

Result: The target column is anonymized using the selected hash algorithm.

Notes:

The first user-provided value acts as the pepper.
Selected column values act as the salt.
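
To make the roles of salt and pepper concrete, here is a minimal Python sketch using the standard `hashlib`. It is an assumption-laden illustration rather than the product's algorithm: the order in which salt, value, and pepper are combined, the `|` separator, the SHA-256 default, and the function names are all hypothetical.

```python
import hashlib


def salted_peppered_hash(value, salt, pepper, algorithm="sha256"):
    """Hash `value` together with a per-row salt and a shared pepper.

    Assumption: the three parts are joined as salt|value|pepper and
    encoded as UTF-8; the real tool may combine them differently.
    """
    material = f"{salt}|{value}|{pepper}".encode("utf-8")
    return hashlib.new(algorithm, material).hexdigest()


def anonymize_column(rows, target, salt_column, pepper, algorithm="sha256"):
    """Return copies of the rows with `target` replaced by its hash.

    The value in `salt_column` is the salt for each row and `pepper`
    is the single user-provided secret applied to every row.
    """
    return [
        {**row, target: salted_peppered_hash(row[target], row[salt_column], pepper, algorithm)}
        for row in rows
    ]


rows = [
    {"customer_id": "C001", "ssn": "123-45-6789"},
    {"customer_id": "C002", "ssn": "123-45-6789"},
]
anonymized = anonymize_column(rows, target="ssn", salt_column="customer_id", pepper="example-pepper")
```

In this sketch the two rows share the same `ssn` but have different salts, so their hashes differ. Whether that is desirable depends on whether the anonymized column must still support matching equal values across rows.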
