Anonymization Transforms
Anonymization transforms (Data Hashing, Masking, Variance, Hashing with Salt & Pepper) in Data Prep modify PII to secure sensitive data for analytics.
Anonymization is a data processing technique that removes or modifies personally identifiable information (PII) to protect privacy. The Data Preparation framework provides several anonymization transforms, including Data Hashing, Data Masking, Data Variance, and Hashing with Salt & Pepper, allowing users to secure sensitive information while maintaining data usability for analytics and machine learning.
Best Situations to Use
Protect sensitive data before sharing datasets.
Ensure privacy compliance (e.g., GDPR, HIPAA).
Generate anonymized datasets for testing or development without exposing real PII.
Maintain statistical properties while masking data for analysis.
Data Hashing
Data Hashing converts raw data into a fixed-length hash value using algorithms such as SHA-1, SHA-2, or MD5. It is widely used to securely anonymize identifiers, emails, or other sensitive fields.
Best Situations to Use
Anonymizing unique identifiers like user IDs or emails.
Protecting PII for compliance or internal testing.
Ensuring consistent but non-reversible representations of data.
Steps
Select a dataset and the column to anonymize.
Navigate to Transforms > Anonymization > Data Hashing.
Choose a Hash Option. Each option label maps to the following backend algorithm:
SHA-1 → SHA-256
SHA-2 → SHA-512
Hash → MD5
MD5 → MD5
(Optional) Set Hash Value, the digest length in bits (e.g., 256, 384, or 512 for SHA-2).
Click Submit.
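To illustrate what this transform does to a column, here is a minimal Python sketch using the standard hashlib module. The HASH_OPTIONS dictionary and the hash_value and hash_column helpers are hypothetical names chosen for this example; the label-to-algorithm mapping follows the option list above, and the optional Hash Value setting is not modeled. It is not the Data Preparation framework's actual implementation.

```python
import hashlib

# Hypothetical mapping from the UI's Hash Option labels to hashlib
# algorithm names, following the option list in the steps above.
HASH_OPTIONS = {
    "SHA-1": "sha256",
    "SHA-2": "sha512",
    "Hash": "md5",
    "MD5": "md5",
}


def hash_value(value, option="SHA-1"):
    """Return the hex digest of str(value) for the given Hash Option label."""
    digest = hashlib.new(HASH_OPTIONS[option])
    digest.update(str(value).encode("utf-8"))
    return digest.hexdigest()


def hash_column(values, option="SHA-1"):
    """Hash every value in a column (modeled here as a plain list)."""
    return [hash_value(v, option) for v in values]


emails = ["ada@example.com", "bob@example.com", "ada@example.com"]
hashed = hash_column(emails)

# Equal inputs give equal hashes, so joins and group-bys on the
# hashed column still work even though the raw emails are gone.
print(hashed[0] == hashed[2])
print(hashed[0] == hashed[1])
```

Because the hash is deterministic, the same email always maps to the same digest, which is what makes the representation consistent across datasets.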
Data Masking
Data Masking replaces original data with modified content, creating a structurally similar but inauthentic version of the data.
Best Situations to Use
Masking parts of sensitive strings (e.g., credit card numbers, phone numbers).
Protecting data while preserving format for testing or development.
Steps
Select a dataset and a column to mask.
Navigate to Transforms > Anonymization > Data Masking.
Specify the Start Index and End Index of the character range to mask.
Click Submit.
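The effect of masking a character range can be sketched in Python as follows. The mask_value helper is hypothetical, and three details are assumptions made for this example rather than documented behavior: indexes are zero-based, the End Index is exclusive, and the mask character is an asterisk.

```python
def mask_value(value, start, end, mask_char="*"):
    """Replace the characters at positions [start, end) with mask_char.

    Assumptions: zero-based indexes, exclusive end, '*' as the mask.
    Out-of-range indexes are clamped to the length of the string.
    """
    text = str(value)
    start = max(0, min(start, len(text)))
    end = max(start, min(end, len(text)))
    return text[:start] + mask_char * (end - start) + text[end:]


card = mask_value("4111111111111111", 0, 12)
phone = mask_value("555-123-4567", 4, 7)

print(card)   # ************1111
print(phone)  # 555-***-4567
```

The masked value keeps its original length and any unmasked separators, which is why the format stays usable for testing.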
Data Variance
Data Variance adds random variations to numeric or date columns while preserving the overall distribution of the data.
Best Situations to Use
Introducing variability to numeric or date fields for privacy protection.
Creating synthetic datasets that mimic the original data distribution.
Steps for Numeric Columns
Select a numeric column.
Navigate to Transforms > Anonymization > Data Variance.
Select Value Type = Numeric, choose an Operator, and set a percentage variance.
Add optional comments.
Click Submit.
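As a rough illustration of numeric variance, the Python sketch below shifts each value by a random fraction of itself, bounded by the chosen percentage. The vary_numeric helper and its operator values ("+", "-", "+/-") are hypothetical; the operators offered by the transform and the exact random distribution it uses are not specified here, so treat both as assumptions.

```python
import random


def vary_numeric(values, percent, operator="+/-", seed=None):
    """Shift each value by a random amount of at most `percent` percent.

    operator "+" only increases values, "-" only decreases them,
    and "+/-" picks a direction at random for each value.
    """
    if operator not in ("+", "-", "+/-"):
        raise ValueError(f"unsupported operator: {operator!r}")
    rng = random.Random(seed)
    limit = percent / 100.0
    varied = []
    for value in values:
        fraction = rng.uniform(0, limit)
        if operator == "-":
            fraction = -fraction
        elif operator == "+/-":
            fraction *= rng.choice((-1, 1))
        varied.append(value * (1 + fraction))
    return varied


salaries = [50000, 72000, 91000]
noisy = vary_numeric(salaries, percent=10, seed=42)
print(noisy)
```

With a small percentage, individual values no longer match the originals, while aggregates such as the mean stay close to those of the source column.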
Steps for Date Columns
Select a date column.
Navigate to Transforms > Anonymization > Data Variance.
Select Value Type = Date, then choose a Start Date and an End Date.
Add optional comments.
Click Submit.
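The steps above do not state how the Start Date and End Date are applied. One plausible reading, assumed here, is that each date is replaced by a random date inside that range. The vary_dates helper below is a hypothetical Python sketch of that reading, not the transform's confirmed behavior.

```python
import random
from datetime import date, timedelta


def vary_dates(values, start_date, end_date, seed=None):
    """Replace each date with a random date in [start_date, end_date].

    Assumption: the range bounds are inclusive and every original
    date is redrawn independently of its original value.
    """
    if end_date < start_date:
        raise ValueError("end_date must not be earlier than start_date")
    rng = random.Random(seed)
    span_days = (end_date - start_date).days
    return [
        start_date + timedelta(days=rng.randint(0, span_days))
        for _ in values
    ]


birthdays = [date(1990, 5, 17), date(1985, 11, 2), date(2001, 1, 30)]
start, end = date(1980, 1, 1), date(2005, 12, 31)
shifted = vary_dates(birthdays, start, end, seed=7)
print(shifted)
```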
Hashing Anonymization (Salt & Pepper Technique)
This technique protects sensitive data by combining each value with a salt (the values of a selected column) and a user-provided pepper before hashing it.
Best Situations to Use
Protecting sensitive columns in financial, health, or personal datasets.
Creating anonymized datasets for development or testing.
Steps
Select a dataset and a column to anonymize.
Navigate to Transforms > Anonymization > Hashing Anonymization (Salt & Pepper).
Provide a value in Set Values (pepper).
Select the column in Set Fields (salt).
Choose a Hash Option (SHA-1, SHA-2, Hash, MD5).
Click Submit.
Result: The target column is anonymized using the selected hash algorithm.
Notes:
The first user-provided value acts as the pepper.
Selected column values act as the salt.
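The roles of salt and pepper can be sketched in Python with the standard hashlib module. The salt_pepper_hash and anonymize_rows helpers are hypothetical, and the order in which salt, value, and pepper are concatenated is an assumption; the transform's actual combination scheme is not documented here.

```python
import hashlib


def salt_pepper_hash(value, salt, pepper, algorithm="sha256"):
    """Hash value together with a per-row salt and a shared pepper.

    Assumption: the parts are joined as salt + value + pepper.
    """
    digest = hashlib.new(algorithm)
    digest.update(f"{salt}{value}{pepper}".encode("utf-8"))
    return digest.hexdigest()


def anonymize_rows(rows, target, salt_field, pepper, algorithm="sha256"):
    """Return copies of the rows with the target column hashed.

    The salt for each row is that row's value in `salt_field`;
    the pepper is the single user-provided value.
    """
    result = []
    for row in rows:
        new_row = dict(row)
        new_row[target] = salt_pepper_hash(
            row[target], row[salt_field], pepper, algorithm
        )
        result.append(new_row)
    return result


rows = [
    {"customer_id": "C001", "ssn": "123-45-6789"},
    {"customer_id": "C002", "ssn": "123-45-6789"},
]
anonymized = anonymize_rows(
    rows, target="ssn", salt_field="customer_id", pepper="example-pepper"
)

# Identical target values hash differently because the salts differ.
print(anonymized[0]["ssn"] == anonymized[1]["ssn"])
```

Because the salt varies per row and the pepper is kept outside the dataset, an attacker cannot match the hashed column against a precomputed table of common values.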