Solution Architecture

Data Collection and Integration

Source Identification & Ingestion

A high-level discovery of data sources was conducted across all relevant domains. The identified data was ingested into the Bronze layer for raw storage.

Core Business Entities Identified

  • Products and product categories

  • Inventory locations (stores and warehouses)

  • Suppliers

  • Customers

  • Payment methods

  • Coupons

  • Stock movement

  • Sales and sale items

  • Transactions

  • Loyalty transactions

  • Shopping carts and cart items

  • Sale fulfilment

  • Customer product reviews

Architecture Approach: Medallion Design

A Medallion Architecture was adopted, organizing data into three layers—Bronze, Silver, and Gold.

Bronze Layer – Amazon S3

  • Chosen for its flexibility, scalability, and cost-effectiveness.

  • Allows ingestion of raw data in its native format without needing immediate schema alignment.

  • Ideal for supporting multiple customers with varying application database structures.

Silver Layer – PostgreSQL

  • Acts as the standardized, cleaned, and unified layer.

  • Designed with a well-defined schema to represent essential retail entities, regardless of the raw source formats.

  • Holds data that has been validated and transformed from the Bronze layer (a schema sketch follows this list).
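
As an illustration of what one standardized Silver-layer table might look like, the sketch below creates a hypothetical customers table with psycopg2. The silver schema, column names, and connection details are assumptions for illustration only, not the project's actual design.

import os

import psycopg2

# Hypothetical standardized customers table in the Silver layer (illustrative only)
DDL = """
CREATE SCHEMA IF NOT EXISTS silver;
CREATE TABLE IF NOT EXISTS silver.customers (
    customer_id   BIGINT PRIMARY KEY,
    full_name     TEXT NOT NULL,
    email         TEXT,
    created_at    TIMESTAMP NOT NULL,
    source_system TEXT NOT NULL,           -- which raw source the record came from
    loaded_at     TIMESTAMP DEFAULT NOW()  -- when the record reached the Silver layer
);
"""

with psycopg2.connect(
    host=os.environ["DB_HOST"],
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
) as conn, conn.cursor() as cur:
    cur.execute(DDL)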

Gold Layer – PostgreSQL

  • Contains fully curated and optimized data for analytics.

  • Ensures relational integrity across critical domains like customers, sales, and transactions.

  • Serves as the primary source for dashboards, reports, and business insights.

A comprehensive data mapping workbook was developed to define and document all entities and their transformations across layers.

Data Load to Bronze Layer (PostgreSQL to S3)

Data was loaded from the source database (PostgreSQL) to the destination, Amazon S3.

The data ingestion consists of three main parts, sketched in the example after this list:

  1. Data Extraction: Connects to PostgreSQL and executes custom SQL queries

  2. Data Processing: Converts query results to Pandas DataFrames

  3. Data Upload: Saves data as Parquet files to S3 with an organized folder structure
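
The following minimal sketch shows the three steps end to end. The sales query, the example date, and the key layout are assumptions for illustration; connection and bucket details come from the environment variables described under Prerequisites.

import io
import os

import boto3
import pandas as pd
import psycopg2

# 1. Data Extraction: connect to PostgreSQL and run a custom SQL query (hypothetical example)
conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    port=os.environ.get("DB_PORT", "5432"),
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

# 2. Data Processing: load the query results into a Pandas DataFrame
df = pd.read_sql("SELECT * FROM sales WHERE sale_date = %s", conn, params=["2024-01-15"])
conn.close()

# 3. Data Upload: serialize as Parquet (via pyarrow) and upload to the partitioned S3 location
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)
boto3.client("s3").put_object(
    Bucket=os.environ["S3_BUCKET_NAME"],
    Key=f"{os.environ['S3_PREFIX']}/sales/insertion_date=2024-01-15/sales.parquet",
    Body=buffer.getvalue(),
)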

Key Considerations

  • Flexible Date Ranges: Supports single date, date range, or relative date filtering (see the sketch after this list)

  • Organized S3 Storage: Uses a partitioned folder structure with insertion dates

  • Comprehensive Logging: Built-in logging for monitoring and debugging

  • Error Handling: Robust exception handling with proper cleanup

  • Parquet Format: Efficient columnar storage format for analytics

  • Environment-Based Configuration: Uses environment variables for security
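
As a sketch of how the date filtering and the partitioned folder structure could fit together (the helper names and the DAY_OFFSET interpretation below are assumptions, not the script's exact implementation):

import os
from datetime import date, timedelta

def resolve_date_window() -> tuple[date, date]:
    # Single date or explicit range via START_DATE / END_DATE
    start, end = os.getenv("START_DATE"), os.getenv("END_DATE")
    if start:
        return date.fromisoformat(start), date.fromisoformat(end or start)
    # Otherwise a relative date: DAY_OFFSET days back from today
    offset = int(os.getenv("DAY_OFFSET", "1"))
    target = date.today() - timedelta(days=offset)
    return target, target

def bronze_key(entity: str, insertion_date: date) -> str:
    # Partitioned folder structure keyed by entity and insertion date
    return f"{os.environ['S3_PREFIX']}/{entity}/insertion_date={insertion_date.isoformat()}/{entity}.parquet"

With DAY_OFFSET=1, for example, the window resolves to yesterday, and bronze_key("sales", ...) yields a key such as your-folder-prefix/sales/insertion_date=2024-01-15/sales.parquet.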

Prerequisites

Required Python Packages

pip install psycopg2-binary pandas boto3 pyarrow

Environment Variables Setup

Create or set the following environment variables; a small pre-flight check sketch follows these lists:

Database Configuration

  • DB_HOST=your-postgresql-host

  • DB_PORT=5432

  • DB_NAME=your-database-name

  • DB_USER=your-username

  • DB_PASSWORD=your-password

AWS Configuration

  • AWS_ACCESS_KEY_ID=your-aws-access-key

  • AWS_SECRET_ACCESS_KEY=your-aws-secret-key

  • S3_BUCKET_NAME=your-s3-bucket

  • S3_PREFIX=your-folder-prefix

Optional: Date Configuration

  • START_DATE=2024-01-01

  • END_DATE=2024-01-31

  • DAY_OFFSET=1
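
A simple pre-flight check, assuming the variable names above, can catch missing configuration before the pipeline runs. This is a sketch rather than part of the actual script.

import os

REQUIRED = [
    "DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD",
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET_NAME",
]

# Fail fast with a clear message if any required variable is unset
missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")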
