Solution Architecture
Data Collection and Integration
Source Identification & Ingestion
A high-level discovery of data sources was conducted across all relevant domains. The identified data was ingested into the Bronze layer for raw storage.
Core Business Entities Identified
Products and product categories
Inventory locations (stores and warehouses)
Suppliers
Customers
Payment methods
Coupons
Stock movement
Sales and sale items
Transactions
Loyalty transactions
Shopping carts and cart items
Sale fulfilment
Customer product reviews
Architecture Approach: Medallion Design
A Medallion Architecture was adopted, organizing data into three layers—Bronze, Silver, and Gold.
Bronze Layer – Amazon S3
Chosen for its flexibility, scalability, and cost-effectiveness.
Allows ingestion of raw data in its native format without needing immediate schema alignment.
Ideal for supporting multiple customers with varying application database structures.
Silver Layer – PostgreSQL
Acts as the standardized, cleaned, and unified layer.
Designed with a well-defined schema to represent essential retail entities, regardless of the raw source formats.
Assumes data has already been validated and transformed as it moves up from the Bronze layer.
Gold Layer – PostgreSQL
Contains fully curated and optimized data for analytics.
Ensures relational integrity across critical domains like customers, sales, and transactions.
Serves as the primary source for dashboards, reports, and business insights.
A comprehensive data mapping workbook was developed to define and document all entities and their transformations across layers.
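In summary, data flows through the layers as follows:

Source application databases (PostgreSQL)
→ Bronze: Amazon S3 (raw data as landed from sources)
→ Silver: PostgreSQL (cleaned, standardized retail entities)
→ Gold: PostgreSQL (curated, analytics-ready)
→ Dashboards, reports, and business insights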

Data Load to Bronze Layer (PostgreSQL to S3)
Data was loaded from the source PostgreSQL database to the destination, Amazon S3.
The data ingestion consists of three main parts, illustrated in the sketch that follows this list:
Data Extraction: Connects to PostgreSQL and executes custom SQL queries
Data Processing: Converts query results to Pandas DataFrames
Data Upload: Saves data as Parquet files to S3 with an organized folder structure
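A minimal sketch of these three steps is shown below, using the packages listed under Prerequisites. The function name, the per-table query argument, and the exact S3 key layout are illustrative assumptions rather than the production code.

import io
import os

import boto3
import pandas as pd
import psycopg2

def extract_table_to_bronze(table_name, query, run_date):
    # 1. Data Extraction: connect to PostgreSQL and run the custom SQL query
    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],
        port=os.environ.get("DB_PORT", "5432"),
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    try:
        # 2. Data Processing: pandas turns the query result into a DataFrame
        df = pd.read_sql_query(query, conn)
    finally:
        conn.close()  # release the connection even if the query fails

    # Serialize the DataFrame to Parquet in memory (pyarrow is the engine)
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)

    # 3. Data Upload: write to a partitioned, insertion-dated S3 key
    key = f"{os.environ['S3_PREFIX']}/{table_name}/insertion_date={run_date}/{table_name}.parquet"
    boto3.client("s3").put_object(
        Bucket=os.environ["S3_BUCKET_NAME"],
        Key=key,
        Body=buffer.getvalue(),
    )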
Key Considerations
Flexible Date Ranges: Supports single date, date range, or relative date filtering (see the sketch after this list)
Organized S3 Storage: Uses a partitioned folder structure with insertion dates
Comprehensive Logging: Built-in logging for monitoring and debugging
Error Handling: Robust exception handling with proper cleanup
Parquet Format: Efficient columnar storage format for analytics
Environment-Based Configuration: Uses environment variables for security
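The sketch below shows one way the three date modes could resolve to a SQL filter. The updated_at column and the precedence order (explicit range, then single date, then relative offset) are assumptions for illustration, not documented behavior.

import os
from datetime import date, timedelta

def build_date_filter():
    # Explicit range takes precedence: START_DATE and END_DATE together
    start = os.environ.get("START_DATE")
    end = os.environ.get("END_DATE")
    if start and end:
        return f"updated_at::date BETWEEN '{start}' AND '{end}'"
    # A lone START_DATE means a single-day load
    if start:
        return f"updated_at::date = '{start}'"
    # Otherwise fall back to a relative offset, e.g. DAY_OFFSET=1 for yesterday
    offset = int(os.environ.get("DAY_OFFSET", "1"))
    target = date.today() - timedelta(days=offset)
    return f"updated_at::date = '{target.isoformat()}'"

In production code, the dates should be passed as bound query parameters rather than interpolated into the SQL string.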
Prerequisites
Required Python Packages
pip install psycopg2-binary pandas boto3 pyarrow
Environment Variables Setup
Create/set the following environment variables:
Database Configuration
DB_HOST=your-postgresql-host
DB_PORT=5432
DB_NAME=your-database-name
DB_USER=your-username
DB_PASSWORD=your-password
AWS Configuration
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key
S3_BUCKET_NAME=your-s3-bucket
S3_PREFIX=your-folder-prefix
Optional: Date Configuration
START_DATE=2024-01-01
END_DATE=2024-01-31
DAY_OFFSET=1
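As a sketch, the job can fail fast at startup if any required variable is unset (the required list below mirrors the variables above; the optional date variables are deliberately excluded):

import os

REQUIRED_VARS = [
    "DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD",
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
    "S3_BUCKET_NAME", "S3_PREFIX",
]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")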