Solution Architecture

Data Collection and Integration

Source Identification & Ingestion

A high-level discovery of data sources was conducted across all relevant domains. The identified data was ingested into the Bronze layer for raw storage.

Core Business Entities Identified

  • Products and product categories

  • Inventory locations (stores and warehouses)

  • Suppliers

  • Customers

  • Payment methods

  • Coupons

  • Stock movement

  • Sales and sale items

  • Transactions

  • Loyalty transactions

  • Shopping carts and cart items

  • Sale fulfilment

  • Customer product reviews

Architecture Approach: Medallion Design

A Medallion Architecture was adopted, organizing data into three layers—Bronze, Silver, and Gold.

Bronze Layer – Amazon S3

  • Chosen for its flexibility, scalability, and cost-effectiveness.

  • Allows ingestion of raw data in its native format without needing immediate schema alignment.

  • Ideal for supporting multiple customers with varying application database structures.

Silver Layer – PostgreSQL

  • Acts as the standardized, cleaned, and unified layer.

  • Designed with a well-defined schema to represent essential retail entities, regardless of the raw source formats.

  • Holds data that has been validated and transformed from the Bronze layer (a schema sketch follows this list).
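
As an illustration of what one standardized Silver-layer table might look like, the sketch below creates a hypothetical customers table with psycopg2. The silver schema, column names, and connection details are assumptions for illustration only, not the project's actual design.

import os

import psycopg2

# Hypothetical standardized customers table in the Silver layer (illustrative only)
DDL = """
CREATE SCHEMA IF NOT EXISTS silver;
CREATE TABLE IF NOT EXISTS silver.customers (
    customer_id   BIGINT PRIMARY KEY,
    full_name     TEXT NOT NULL,
    email         TEXT,
    created_at    TIMESTAMP NOT NULL,
    source_system TEXT NOT NULL,           -- which raw source the record came from
    loaded_at     TIMESTAMP DEFAULT NOW()  -- when the record reached the Silver layer
);
"""

with psycopg2.connect(
    host=os.environ["DB_HOST"],
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
) as conn, conn.cursor() as cur:
    cur.execute(DDL)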

Gold Layer – PostgreSQL

  • Contains fully curated and optimized data for analytics.

  • Ensures relational integrity across critical domains like customers, sales, and transactions.

  • Serves as the primary source for dashboards, reports, and business insights.

A comprehensive data mapping workbook was developed to define and document all entities and their transformations across layers.

Data Load to Bronze Layer (PostgreSQL to S3)

Data was loaded from the source database (PostgreSQL) to the destination, Amazon S3.

The data ingestion consists of three main parts, sketched in the example after this list:

  1. Data Extraction: Connects to PostgreSQL and executes custom SQL queries

  2. Data Processing: Converts query results to Pandas DataFrames

  3. Data Upload: Saves data as Parquet files to S3 with an organized folder structure
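
The following minimal sketch shows the three steps end to end. The sales query, the example date, and the key layout are assumptions for illustration; connection and bucket details come from the environment variables described under Prerequisites.

import io
import os

import boto3
import pandas as pd
import psycopg2

# 1. Data Extraction: connect to PostgreSQL and run a custom SQL query (hypothetical example)
conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    port=os.environ.get("DB_PORT", "5432"),
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

# 2. Data Processing: load the query results into a Pandas DataFrame
df = pd.read_sql("SELECT * FROM sales WHERE sale_date = %s", conn, params=["2024-01-15"])
conn.close()

# 3. Data Upload: serialize as Parquet (via pyarrow) and upload to the partitioned S3 location
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)
boto3.client("s3").put_object(
    Bucket=os.environ["S3_BUCKET_NAME"],
    Key=f"{os.environ['S3_PREFIX']}/sales/insertion_date=2024-01-15/sales.parquet",
    Body=buffer.getvalue(),
)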

Key Considerations

  • Flexible Date Ranges: Supports single date, date range, or relative date filtering (see the sketch after this list)

  • Organized S3 Storage: Uses a partitioned folder structure with insertion dates

  • Comprehensive Logging: Built-in logging for monitoring and debugging

  • Error Handling: Robust exception handling with proper cleanup

  • Parquet Format: Efficient columnar storage format for analytics

  • Environment-Based Configuration: Uses environment variables for security
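
As a sketch of how the date filtering and the partitioned folder structure could fit together (the helper names and the DAY_OFFSET interpretation below are assumptions, not the script's exact implementation):

import os
from datetime import date, timedelta

def resolve_date_window() -> tuple[date, date]:
    # Single date or explicit range via START_DATE / END_DATE
    start, end = os.getenv("START_DATE"), os.getenv("END_DATE")
    if start:
        return date.fromisoformat(start), date.fromisoformat(end or start)
    # Otherwise a relative date: DAY_OFFSET days back from today
    offset = int(os.getenv("DAY_OFFSET", "1"))
    target = date.today() - timedelta(days=offset)
    return target, target

def bronze_key(entity: str, insertion_date: date) -> str:
    # Partitioned folder structure keyed by entity and insertion date
    return f"{os.environ['S3_PREFIX']}/{entity}/insertion_date={insertion_date.isoformat()}/{entity}.parquet"

With DAY_OFFSET=1, for example, the window resolves to yesterday, and bronze_key("sales", ...) yields a key such as your-folder-prefix/sales/insertion_date=2024-01-15/sales.parquet.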

Prerequisites

Required Python Packages

pip install psycopg2-binary pandas boto3 pyarrow

Environment Variables Setup

Create or set the following environment variables; a small pre-flight check sketch follows these lists:

Database Configuration

  • DB_HOST=your-postgresql-host

  • DB_PORT=5432

  • DB_NAME=your-database-name

  • DB_USER=your-username

  • DB_PASSWORD=your-password

AWS Configuration

  • AWS_ACCESS_KEY_ID=your-aws-access-key

  • AWS_SECRET_ACCESS_KEY=your-aws-secret-key

  • S3_BUCKET_NAME=your-s3-bucket

  • S3_PREFIX=your-folder-prefix

Optional: Date Configuration

  • START_DATE=2024-01-01

  • END_DATE=2024-01-31

  • DAY_OFFSET=1
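
A simple pre-flight check, assuming the variable names above, can catch missing configuration before the pipeline runs. This is a sketch rather than part of the actual script.

import os

REQUIRED = [
    "DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD",
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET_NAME",
]

# Fail fast with a clear message if any required variable is unset
missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")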
