Data Quality

Data Quality measures the accuracy, completeness, and consistency of your data, ensuring that all subsequent analytics, reports, and models built on the platform are reliable and trustworthy.

What is Data Quality?

Data Quality refers to the state of your data in a platform: how accurate, complete, consistent, and reliable it is for its intended purpose. High-quality data is foundational to a platform's value. It underpins effective decision-making and operational efficiency, builds trust in the data being used, and ensures that all subsequent analytics, reports, and models are trustworthy and can be confidently used to drive business decisions.

The Key Dimensions of Data Quality

Data quality is not a single concept but a combination of several key dimensions. The platform evaluates data against these criteria to ensure its fitness for purpose.

  • Accuracy: This dimension ensures that the data correctly reflects the real-world events or objects it is intended to represent. For example, a customer record's address must match their actual physical location.

  • Completeness: This refers to the absence of missing information. A complete dataset has all its required fields populated, with no null or empty values that could hinder analysis or automated processes.

  • Consistency: This guarantees that data is uniform and logically coherent across different systems or within the same dataset. For instance, a customer's name should be spelled identically across the sales, support, and billing systems.

  • Uniqueness: This ensures that each record or entity in a dataset is represented only once. Eliminating duplicate entries is vital to prevent miscalculations in reports and to ensure a single source of truth.

  • Timeliness: This dimension measures how up-to-date the data is. Timely data is available when needed for decision-making, ensuring that insights are based on the most current information.

  • Structure: This focuses on whether the data is in the correct format and conforms to predefined rules. For example, a date field must adhere to a specific format like YYYY-MM-DD, and a numerical field should only contain valid numbers.

A robust data quality framework within a platform enables organizations to continuously monitor these dimensions, proactively identify issues, and remediate them. This disciplined approach builds a strong foundation for reliable and effective data-driven operations.
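
To make these dimensions concrete, the following sketch shows how a few of them might be checked programmatically. It is an illustration only, not platform code; the dataset and column names are hypothetical.

```python
import pandas as pd

# Hypothetical customer dataset used only for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "c@x.com"],
    "signup_date": ["2024-01-15", "2024-02-30", "2024-03-01", "2024-04-10"],
})

# Completeness: required fields should contain no null values.
completeness = df["email"].notna().mean() * 100

# Uniqueness: each customer_id should appear exactly once.
uniqueness = (~df["customer_id"].duplicated()).mean() * 100

# Structure: signup_date must parse as a real YYYY-MM-DD date
# ("2024-02-30" fails because February has no 30th day).
parsed = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")
structure = parsed.notna().mean() * 100

print(f"Completeness: {completeness:.0f}%")  # 75%
print(f"Uniqueness:   {uniqueness:.0f}%")    # 75%
print(f"Structure:    {structure:.0f}%")     # 75%
```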

The Data Quality module provides the tools to create and monitor data quality rules across your datasets.

How to Create Data Quality Flows

This guide outlines the process for building and deploying Data Quality flows.

Step 1: Create a Data Connector

First, you must establish a connection to your data source. Supported connectors include:

  • MySQL

  • MSSQL

  • Oracle

  • PostgreSQL

  • ClickHouse

Ensure the desired connector is successfully created and configured.
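
Connectors are created through the platform UI rather than in code, but each one conceptually wraps standard connection details. As a rough illustration, a PostgreSQL connector corresponds to a connection like the one below (host, credentials, and database name are placeholders):

```python
from sqlalchemy import create_engine, text

# Placeholder credentials; in the platform these are entered in the
# connector creation form, not written in code. The postgresql://
# dialect requires a driver such as psycopg2 to be installed.
engine = create_engine("postgresql://user:password@db-host:5432/sales_db")

# Quick smoke test that the connection works before building
# data quality rules on top of it.
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())
```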

Step 2: Access the Data Quality Rule Creation Interface

  1. Navigate to your newly created connector.

  2. Click the context menu (three dots) next to the connector's name.

  3. Select Create Data Quality. This will open a new page that displays the connector and a list of available tables.

Step 3: Configure Data Quality Rules

  1. From the list of tables, select the table for which you wish to create rules.

  2. Click the Rule + button. A new drawer will appear.

  3. Select the desired column from the list. The selected column will be displayed in the rule configuration panel.

  4. Enter a descriptive Rule Name (limited to a maximum of 20 characters).

  5. Select a Rule Category from the provided options. This is used to logically group your rules.
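
The exact schema of a saved rule is internal to the platform, but conceptually each rule ties a column to a category, a rule key, and a threshold (set later in Step 6). The representation below is a hypothetical illustration; all field names are assumptions:

```python
# Hypothetical structure of a configured rule; field names are
# illustrative, not the platform's actual schema.
rule = {
    "rule_name": "email_not_null",   # max 20 characters
    "table": "customers",
    "column": "email",
    "category": "Completeness",      # one of the categories in Step 4
    "rule_key": "Null Check",        # one of the rule keys in Step 4
    "threshold": 95,                 # percent of records that must pass
}

assert len(rule["rule_name"]) <= 20, "Rule names are limited to 20 characters"
```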

Step 4: Rule Categories and Rule Keys

The platform provides a variety of rule categories and keys to define your data quality checks.

Rule Categories

  • Accuracy: Ensures data correctly represents real-world values.

  • Uniqueness: Guarantees that a record or field intended to be unique contains no duplicates.

  • Consistency: Ensures data is logically coherent across systems or within the same dataset.

  • Completeness: Verifies that all required data fields are present and populated.

  • Timeliness: Ensures data is up-to-date and available when needed.

  • Structure: Confirms that data is in the correct format.

Rule Keys

  • Null Check: Applicable to all data types.

  • Empty Check: Applicable to string data types.

  • Gender Check: Applicable to string data types with expected values of "Male" or "Female."

  • GreaterThan Check (GT): Applicable to numeric data types.

  • GreaterThanOrEqual Check (GTEQ): Applicable to numeric data types.

  • LessThan Check (LT): Applicable to numeric data types.

  • LessThanOrEqual Check (LTEQ): Applicable to numeric data types.

  • Not Equal Check (NE): Applicable to numeric data types.

  • Unique Check: Applicable to all data types.

  • Alphabetical Check: Applicable to string data types.

  • Date Format Check (MM/DD/YY): Applicable to date data types.

  • Date Format Check (YYYY-MM-DD): Applicable to date data types.

  • Length Check: Applicable to string data types.
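
To illustrate what these rule keys evaluate, here is a rough sketch of how a few of them could be implemented with pandas. These are illustrative interpretations, not the platform's actual implementation:

```python
import pandas as pd

def null_check(series: pd.Series) -> pd.Series:
    """Null Check: True where a value is present (all data types)."""
    return series.notna()

def gt_check(series: pd.Series, value: float) -> pd.Series:
    """GreaterThan Check (GT): True where the numeric value exceeds `value`."""
    return pd.to_numeric(series, errors="coerce") > value

def date_format_check(series: pd.Series) -> pd.Series:
    """Date Format Check (YYYY-MM-DD): True where the value parses in that format."""
    return pd.to_datetime(series, format="%Y-%m-%d", errors="coerce").notna()

def length_check(series: pd.Series, max_len: int) -> pd.Series:
    """Length Check: True where the string is at most `max_len` characters."""
    return series.astype(str).str.len() <= max_len

ages = pd.Series([25, 17, None, 40])
print(gt_check(ages, 18).tolist())  # [True, False, False, True]
```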

Step 5: Automate Rule Generation

The Data Agent feature can automate rule creation. After selecting a table, click the Data Agent button to automatically generate a set of rules for every column in that table.
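
The Data Agent's internals are not documented here, but automated rule generation of this kind can be approximated by inspecting each column's data type and proposing the rule keys that apply to it. The sketch below is a simplified, type-driven illustration, not the actual Data Agent logic:

```python
import pandas as pd

def suggest_rules(df: pd.DataFrame) -> dict:
    """Propose applicable rule keys per column based on its dtype.

    Simplified illustration; the real Data Agent may rely on the
    LLM Service and catalog metadata instead.
    """
    suggestions = {}
    for col in df.columns:
        rules = ["Null Check", "Unique Check"]  # applicable to all types
        if pd.api.types.is_numeric_dtype(df[col]):
            rules += ["GreaterThan Check (GT)", "LessThan Check (LT)"]
        elif pd.api.types.is_datetime64_any_dtype(df[col]):
            rules += ["Date Format Check (YYYY-MM-DD)"]
        else:  # treat remaining object/string columns as strings
            rules += ["Empty Check", "Length Check", "Alphabetical Check"]
        suggestions[col] = rules
    return suggestions

df = pd.DataFrame({"age": [25, 40], "name": ["Ann", "Bob"]})
print(suggest_rules(df))
```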

Step 6: Set Rule Parameters and Save

  1. Select the desired Rule Key from the keys described in Step 4.

  2. If the rule key requires a value (e.g., GreaterThan Check, Length Check), enter the value you wish to compare against.

  3. Enter a Threshold Value (from 1 to 100). This value represents the percentage of records that must pass the rule for the check to be considered successful.

  4. Click Save to save the configured rules.
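
The threshold semantics can be illustrated directly: a check succeeds when the percentage of records satisfying the rule meets or exceeds the threshold. A minimal sketch, assuming the comparison is inclusive (the platform may differ):

```python
def rule_passes(passed_records: int, total_records: int, threshold: int) -> bool:
    """A check succeeds when the pass rate meets the threshold (1-100)."""
    pass_rate = passed_records / total_records * 100
    return pass_rate >= threshold

# 950 of 1,000 records satisfy the rule -> 95% pass rate.
print(rule_passes(950, 1000, threshold=95))  # True
print(rule_passes(950, 1000, threshold=99))  # False
```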

Step 7: Schedule and Save

  1. Once the rules are saved, click Schedule and Save.

  2. Set the desired schedule for when the data quality flow should run. The scheduled time will be saved, and the job will execute automatically.
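
The schedule format is set in the UI and is not specified here. If the scheduler accepts cron-style expressions, a nightly run could be expressed as in this sketch (the expression and the croniter library are illustrative assumptions, not the platform's scheduler):

```python
from datetime import datetime
from croniter import croniter  # third-party; pip install croniter

# "Every day at 02:00" as a cron expression -- an assumed format,
# since the platform's scheduler is configured through the UI.
schedule = "0 2 * * *"

itr = croniter(schedule, datetime(2024, 5, 1, 12, 0))
print(itr.get_next(datetime))  # 2024-05-02 02:00:00
```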

Data Quality Result Check

After a data quality flow has run, you can view the results.

Step 1: Access the Results Page

You can view the results in one of two ways:

  • From the same page where you created the rule, select View Result.

  • From the left navigation panel, choose Data Quality under the Data Center.

Step 2: View Results

  1. Select the Connector Name for which the rules were created. The page will only list connectors that support the Data Quality feature.

  2. Below the connector, the tables that have data quality rules applied are listed, along with the number of rules on each.

  3. Select the Table for which you want to view the results.

  4. The five most recent results are displayed both graphically and in table format, showing key metrics such as Total Records, Passed Records, and Failed Records.
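
These metrics follow directly from the per-record outcomes of each run. As a rough illustration of how the last five results might be summarized (the data and field names below are made up):

```python
# Hypothetical results for the five most recent runs of one rule.
runs = [
    {"run": 1, "total": 1000, "passed": 990},
    {"run": 2, "total": 1005, "passed": 970},
    {"run": 3, "total": 1010, "passed": 1010},
    {"run": 4, "total": 1010, "passed": 880},
    {"run": 5, "total": 1020, "passed": 1001},
]

for r in runs:
    failed = r["total"] - r["passed"]
    rate = r["passed"] / r["total"] * 100
    print(f"Run {r['run']}: Total={r['total']}  Passed={r['passed']}  "
          f"Failed={failed}  ({rate:.1f}% pass rate)")
```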

Please note: To ensure accurate rule generation, the connector must be successfully crawled and updated to the Catalog. If the LLM Service is run before the connector is updated, it may generate rules that do not correspond to the actual columns in the selected table.
