DBMS Selection for Data Mart Implementation

  • You need ACID transactions, schema enforcement, and time travel/versioning (for example, on files stored in Amazon S3; see the sketch after this list).

  • You want direct BI consumption (e.g., dashboards or external tools) from the same storage without moving data to a separate warehouse.

  • You want unified governance, lineage, and access control across structured and unstructured data.

  • You need faster query performance on top of S3 through optimized formats (such as Parquet with Delta Lake or Hudi).

  • Another option is Snowflake, which is described in more detail below.
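As a minimal sketch of the first and last points, the PySpark example below writes a Delta table to S3 as transactional commits and then reads an earlier version back via time travel. It assumes the Delta Lake and S3 (hadoop-aws) dependencies are already configured; the bucket path and sample data are hypothetical.

```python
# Minimal PySpark sketch: ACID writes and time travel with Delta Lake on S3.
# Assumes the Delta Lake jars and S3 credentials are configured; the bucket
# path below is hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://my-data-mart/silver/orders"  # hypothetical bucket/prefix

# Initial load: an ACID, schema-enforced write in Delta format (version 0).
orders = spark.createDataFrame(
    [(1, "IN", 120.0), (2, "US", 80.5)],
    ["order_id", "country_code", "amount"],
)
orders.write.format("delta").mode("overwrite").save(path)

# Incremental append: another transactional commit (version 1).
spark.createDataFrame([(3, "DE", 42.0)], orders.schema) \
     .write.format("delta").mode("append").save(path)

# Time travel: read the table as of the first commit (version 0).
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```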

Typical Use Cases:

  • Unified data storage for analytics, AI, and reporting

  • Real-time or incremental updates with transactional consistency

  • Building semantic data layers (for dashboards, agents, or KPIs)

Snowflake provides a unified Data Lakehouse platform that combines the scalability of a data lake with the performance and governance features of a data warehouse. It supports both structured and semi-structured data (such as JSON, Parquet, and Avro) and enables SQL-based analytics, data sharing, and AI/ML integrations within a single architecture. Snowflake also offers seamless integration with cloud storage (like AWS S3).
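As a rough illustration of this SQL-based access pattern, the sketch below queries a Snowflake-backed gold layer from Python using the snowflake-connector-python package. The account, warehouse, database, schema, and table names are all hypothetical.

```python
# Hedged sketch: querying a Snowflake-backed data mart from Python.
# All connection parameters and object names below are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # hypothetical account identifier
    user="ANALYTICS_USER",
    password="********",
    warehouse="ANALYTICS_WH",
    database="SALES_MART",
    schema="GOLD",
)

try:
    cur = conn.cursor()
    # Standard SQL over the curated (gold) tables.
    cur.execute(
        """
        SELECT country_code, SUM(amount) AS total_sales
        FROM fact_sales
        GROUP BY country_code
        ORDER BY total_sales DESC
        """
    )
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```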

Define Data Mart Architectural Approach

This decision primarily depends on your data processing complexity, the goals of your data mart, and the scale of your data.

We can decide whether to go with the Medallion (three-layer) Architecture or a simpler 2-Layer Architecture.

Medallion Architecture

The Medallion Architecture (Bronze-Silver-Gold Architecture) is typically used in data lakehouses or large-scale data platforms where there’s a need for complex transformations, cleaning, and enrichment at each stage of the data pipeline. It organizes the data into three layers:

  • Bronze Layer: Raw data, uncleaned, and often in its original format (e.g., logs, raw events, transactional data). If we have data from multiple projects, we can separate them with schemas.

  • Silver Layer: Cleaned, enriched data — typically with basic transformations applied. Below are a few data quality actions that can be applied before data is promoted to the silver layer (a PySpark sketch of several of these steps follows this list).

    • Selective Data Transfer: Transfer only the required fields and tables from the bronze to the silver layer.

    • Field Case Consistency: Standardize the casing of fields such as country, name, and city across tables using a consistent naming convention.

    • Datetime Formatting: Convert all datetime fields into a consistent format and a consistent datatype, for example YYYY-MM-DD HH:MM:SS.

    • Duplicate Handling: Identify duplicate records and remove or flag them.

    • Filter Out Irrelevant Data: Filter out data that is not relevant for analysis, e.g., remove data prior to 2010 or load data only for a specific country (based on the customer’s requirements).

    • Flatten XML/JSON Fields: Flatten the required nested XML/JSON fields.

    • Country Code Unification: Standardize country codes, category codes, etc.

    • Null Value Handling: Replace blank or empty values with NULL for data consistency.

    • Follow proper naming conventions and keep them consistent across the database (either CamelCase or snake_case with _ as a separator).

      • Example tables: customer, order_detail

      • Example columns: customer_id, order_detail_id, order_name

    • Never use reserved keywords as table/column names.

      • Example: datetime, date, time, type, status, interval, etc.

  • Gold Layer: Curated, high-quality data ready for consumption by business analysts or data scientists (e.g., aggregated data, enriched with business logic). This layer should be designed to improve the performance of analytics.

    • Join Optimization: Simplify and optimize join conditions to enhance query performance in the Gold layer.

    • Denormalization: Denormalize wherever required, but remember to update the fact table whenever the corresponding master data changes.

    • Data Model: Decide on the data model (e.g., a star or snowflake schema) to be used in this layer.
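The PySpark sketch below illustrates a few of the silver-layer cleanup steps and a gold-layer aggregation described above. The table names, column names, and the 2010 cut-off are illustrative assumptions, not a prescribed implementation.

```python
# Hedged PySpark sketch of the bronze -> silver -> gold flow described above.
# Table names, column names, and the 2010 cut-off are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: raw data as ingested from the source systems.
bronze_orders = spark.table("bronze.orders")
bronze_customers = spark.table("bronze.customers")

# Silver: selective transfer, case consistency, datetime formatting,
# duplicate handling, year filter, and NULL handling.
silver_orders = (
    bronze_orders
    .select("order_id", "customer_id", "country", "order_ts", "amount")  # selective transfer
    .withColumn("country", F.upper(F.trim("country")))                   # case consistency
    .withColumn("order_ts",
                F.to_timestamp("order_ts", "yyyy-MM-dd HH:mm:ss"))       # datetime formatting
    .dropDuplicates(["order_id"])                                        # duplicate handling
    .filter(F.year("order_ts") >= 2010)                                  # filter out irrelevant years
    .withColumn("country", F.when(F.col("country") == "", None)
                             .otherwise(F.col("country")))               # blanks -> NULL
)
silver_orders.write.mode("overwrite").saveAsTable("silver.order_detail")

silver_customers = (
    bronze_customers.select("customer_id", "segment").dropDuplicates(["customer_id"])
)

# Gold: business-ready aggregate, denormalized with customer attributes.
gold_sales = (
    silver_orders
    .join(silver_customers, "customer_id")
    .groupBy("country", "segment", F.year("order_ts").alias("order_year"))
    .agg(F.sum("amount").alias("total_sales"),
         F.countDistinct("customer_id").alias("customer_count"))
)
gold_sales.write.mode("overwrite").saveAsTable("gold.sales_by_country_segment")
```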

When to Use Medallion Architecture

  • Complex ETL Pipelines: If your data mart needs to process complex data transformations (e.g., aggregations, enrichment, joins from multiple sources), Medallion architecture helps with step-by-step processing, allowing for cleaner data at each stage.

  • Multiple Data Sources: If you're dealing with a variety of data sources (structured, semi-structured, or unstructured), having these intermediary stages allows for better data integration and organization.

  • Large-Scale Data: This architecture is useful when you're processing large volumes of data, because each layer can be optimized for performance and transformations (e.g., partitioning and indexing).

  • Data Quality and Governance: If you require high-quality, trusted datasets that need to be curated over time, Medallion is ideal, as it allows for traceable data lineage from raw to clean to business-ready data.

  • Analytics/BI/ML Ready: If the end goal is providing datasets for different types of downstream consumers — such as BI tools, data science models, or analytics dashboards — this architecture ensures clean, enriched, and trustworthy datasets in the final gold layer.

  • Incremental Updates: If you require incremental data updates, the three layers help you manage data processing in smaller, manageable chunks (e.g., raw data can be ingested at high velocity, while more complex transformations can be applied in the silver and gold layers).

Two-Layer Architecture

A 2-layer architecture typically involves just Raw (staging) data and Processed (curated) data. It’s simpler and usually faster to implement, but it's best suited for situations where data processing requirements aren’t too complex. The two layers generally consist of:

  • Raw Layer (Staging): The first layer is where you store raw data, either directly from your source systems or from a real-time ingestion pipeline. This data is not yet cleaned or processed.

  • Processed Layer: The second layer holds the curated and transformed data. This is where your cleaned, aggregated, or business-ready data resides.

When to Use a 2-Layer Architecture

  • Simple Data Processing: If your data mart doesn’t require complex data transformation, this simpler approach can work well. You might just need basic cleaning or simple transformations (e.g., filtering out bad data, correcting formats).

  • Faster Time to Insight: A 2-layer architecture can be more efficient if you need to get data into the hands of users (analysts, business users) quickly and don’t need intermediate steps like enrichment.

  • Smaller or Less Complex Datasets: When dealing with smaller datasets or data that doesn’t require complex joins, aggregations, or transformations, a 2-layer architecture can work just fine.

  • Less Data Processing Complexity: If you’re primarily doing simple ETL operations or loading a set of predefined datasets into the data mart, you don’t need the complexity of Medallion.

  • Low Data Governance Overhead: If governance isn’t critical, or you don’t need to track lineage through several transformation stages, a 2-layer architecture can be more straightforward and quicker to implement.

  • Real-Time Data Processing: A 2-layer approach can sometimes be better suited for streaming data, where you want the raw data ingested quickly and then processed/consumed in near-real time.

Data Lake Layer Mapping

The template below helps in creating a mapping between layers.

Performance Tuning in the Data Mart

Below are a few things to note for enhanced speed, efficiency, and scalability of data retrieval, particularly for large datasets.

Indexing

Indexes allow the database to quickly locate rows without needing to scan the entire table, reducing query time.

Add indexes on the following (a minimal sketch follows this list):

  • Foreign Key Relationships: Index the foreign key columns. When you join tables based on these relationships, indexed columns speed up the search process.

  • Common Join Columns: If certain columns are used frequently in joins, indexing those columns can significantly boost performance.

  • Frequently Filtered Columns: Index columns that are often used in WHERE clauses for filtering, especially if they have a high cardinality (lots of distinct values).

  • Aggregated Columns: Columns that are commonly used in GROUP BY or aggregate functions (like COUNT, SUM, AVG) should be indexed if they significantly reduce query time.

  • Partitioned Tables: If the DataMart includes partitioned tables, indexing can improve query performance on specific partitions.

  • Avoid Over-Indexing: Don’t index every column—too many indexes can slow down write operations and increase storage costs.
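A minimal sketch of targeted indexing, assuming a PostgreSQL-backed data mart accessed through the psycopg2 driver; the table and column names are hypothetical.

```python
# Hedged sketch: creating targeted indexes in a PostgreSQL-backed data mart.
# Table and column names are hypothetical.
import psycopg2

ddl_statements = [
    # Foreign key / common join column on the fact table.
    "CREATE INDEX IF NOT EXISTS idx_fact_sales_customer_id "
    "ON fact_sales (customer_id);",
    # Frequently filtered, high-cardinality column.
    "CREATE INDEX IF NOT EXISTS idx_fact_sales_order_date "
    "ON fact_sales (order_date);",
    # Column commonly used in GROUP BY aggregations.
    "CREATE INDEX IF NOT EXISTS idx_fact_sales_country_code "
    "ON fact_sales (country_code);",
]

conn = psycopg2.connect("dbname=sales_mart user=etl_user")  # hypothetical DSN
try:
    with conn, conn.cursor() as cur:
        for ddl in ddl_statements:
            cur.execute(ddl)  # each index is created only if it does not exist
finally:
    conn.close()
```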

Partitioning

Partitioning is a technique used to divide large tables into smaller, more manageable pieces called partitions.

Partitioning is typically used in large data marts where tables are massive, and queries tend to operate on subsets of data, such as specific time periods, regions, or other business dimensions.

Different ways of partitioning

  • Range Partitioning: Data is partitioned based on a range of values in a column (usually time-related, such as dates or fiscal periods). Example: partition a sales data table by year or month (e.g., sales data for 2023, sales data for 2024); a sketch of this follows the list.

  • List Partitioning: Data is partitioned based on specific dimension values in a column. Example: Partitioning a customer table by region (e.g., North America, Europe, Asia). It's ideal for categorical data with distinct values (e.g., product categories, geographic regions).

  • Hash Partitioning: Data is partitioned using a hash function that distributes rows across a predefined number of partitions. The data is evenly distributed, regardless of the values in the partitioning column. Hash partitioning could be applied to a customer ID column, where each customer is mapped to a partition based on the hash of their ID. It's ideal for distributing data evenly across partitions when there is no natural range or list.
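A rough PySpark/Delta sketch of partitioning a sales table by year and month so that period-based queries only scan the matching partitions. The database, table, and column names are assumptions.

```python
# Hedged sketch: writing a sales table partitioned by year and month,
# so queries on a specific period only scan the matching partitions.
# Database, table, and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

sales = (
    spark.table("silver.order_detail")
    .withColumn("order_year", F.year("order_ts"))
    .withColumn("order_month", F.month("order_ts"))
)

(
    sales.write
    .format("delta")                      # or "parquet" if Delta is not configured
    .mode("overwrite")
    .partitionBy("order_year", "order_month")
    .saveAsTable("gold.fact_sales")
)

# A filter on the partition columns is pruned to the relevant partitions only.
q = spark.table("gold.fact_sales").filter(
    (F.col("order_year") == 2024) & (F.col("order_month") == 3)
)
q.explain()  # the plan shows partition filters instead of a full scan
```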


Note:

  • Don’t partition small tables.

  • Select the proper partitioning key.

Materialized Views

A materialized view is a stored query result that can be refreshed periodically, often containing pre-aggregated data or complex calculations.

Instead of recalculating complex queries on the fly, a materialized view stores the results, speeding up access for end-users. This is particularly useful for large and complex reporting queries.
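A minimal sketch of a pre-aggregated materialized view, again assuming a PostgreSQL-backed mart accessed through psycopg2; the object names are hypothetical. The refresh statement would typically be scheduled, for example after each nightly load.

```python
# Hedged sketch: a pre-aggregated materialized view in a PostgreSQL-backed
# data mart, refreshed periodically (object names are hypothetical).
import psycopg2

create_mv = """
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_monthly_sales AS
SELECT country_code,
       date_trunc('month', order_date) AS order_month,
       SUM(amount)                     AS total_sales,
       COUNT(DISTINCT customer_id)     AS customer_count
FROM fact_sales
GROUP BY country_code, date_trunc('month', order_date);
"""

refresh_mv = "REFRESH MATERIALIZED VIEW mv_monthly_sales;"

conn = psycopg2.connect("dbname=sales_mart user=etl_user")  # hypothetical DSN
try:
    with conn, conn.cursor() as cur:
        cur.execute(create_mv)
        cur.execute(refresh_mv)  # run on a schedule (e.g., nightly) after loads
finally:
    conn.close()
```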

Data Pruning (Archiving Old Data)

Data pruning means removing or archiving outdated data that is no longer frequently accessed.

Reducing the size of the tables improves performance because there is less data to query and index. This is often done in conjunction with partitioning.
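One possible pruning approach, sketched below under the same PostgreSQL/psycopg2 assumptions: copy rows older than a cut-off into an archive table and delete them from the active fact table in a single transaction. The table names and cut-off date are hypothetical.

```python
# Hedged sketch: archiving rows older than a cut-off and removing them from
# the active fact table in one transaction (names and cut-off are assumptions).
import psycopg2

archive_sql = """
INSERT INTO fact_sales_archive
SELECT * FROM fact_sales WHERE order_date < %(cutoff)s;
"""
prune_sql = "DELETE FROM fact_sales WHERE order_date < %(cutoff)s;"

conn = psycopg2.connect("dbname=sales_mart user=etl_user")  # hypothetical DSN
try:
    with conn, conn.cursor() as cur:   # both statements commit together
        params = {"cutoff": "2010-01-01"}
        cur.execute(archive_sql, params)
        cur.execute(prune_sql, params)
finally:
    conn.close()
```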

Sharding in MongoDB

Sharding can help improve scalability and performance when dealing with large data sets. It distributes the data across multiple MongoDB instances (shards), each handling a subset of the data. Choose a good shard key that distributes the data evenly and avoids concentrating too many requests on a single shard.
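A minimal PyMongo sketch of sharding a collection on a hashed key. It assumes a sharded cluster reachable through a mongos router; the database, collection, and key names are hypothetical.

```python
# Hedged sketch: enabling sharding and choosing a hashed shard key with PyMongo.
# Assumes a sharded cluster reachable through a mongos router; the database,
# collection, and key below are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # hypothetical router address

# Allow the database to be sharded, then shard the collection on a hashed key
# so reads and writes spread evenly across shards.
client.admin.command("enableSharding", "sales_mart")
client.admin.command(
    "shardCollection",
    "sales_mart.orders",
    key={"customer_id": "hashed"},
)
```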
