Data Load to Silver Layer (S3 to PostgreSQL)
Overview
These jobs load data from S3 (Bronze layer) into PostgreSQL (Silver layer), with automatic partition discovery and comprehensive audit logging.
Prerequisites
Python Dependencies
```bash
pip install boto3 pandas pyarrow sqlalchemy psycopg2-binary
```
Required Environment Variables
AWS Configuration
```bash
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key
S3_BUCKET_NAME=your-s3-bucket
S3_PREFIX=your-folder-prefix
```
Database Configuration
```bash
DB_HOST=your-postgresql-host
DB_PORT=5432
DB_NAME=your-database-name
DB_USER=your-username
DB_PASSWORD=your-password
DB_SCHEMA=public  # Target schema (optional, defaults to public)
```
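Note that DB_SCHEMA is the only optional variable: it falls back to public when unset. A minimal sketch of how the job might read these settings (the variable names match the blocks above; the fallback behaviour is the only logic shown):

```python
import os

# Required settings fail fast with a KeyError if missing.
db_host = os.environ["DB_HOST"]
db_name = os.environ["DB_NAME"]

# DB_SCHEMA is optional and defaults to "public", as noted above.
schema = os.environ.get("DB_SCHEMA", "public")
```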
Optional: Table Filtering
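Runs can be limited to a subset of tables; the troubleshooting section below refers to this option as filter_tables. The exact mechanism is deployment-specific. If it is exposed through the environment, it might look like this (the variable name FILTER_TABLES and the table names are hypothetical examples):

```bash
FILTER_TABLES=customers,orders,payments
```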
Audit Tables Setup
Create these tables in your PostgreSQL database:
- Master audit table
- Detailed audit table
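The exact schema is not specified here, so the DDL below is a hedged sketch: the table and column names (etl_audit_master, etl_audit_detail, job_id, rows_loaded, and so on) are assumptions chosen to support the behaviour this document describes (row counts, timestamps, per-table success/failure). Adjust them to your actual implementation.

```sql
-- Master audit table: one row per pipeline run (sketch; names are assumptions)
CREATE TABLE etl_audit_master (
    job_id      BIGSERIAL   PRIMARY KEY,
    job_name    TEXT        NOT NULL,
    started_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    finished_at TIMESTAMPTZ,
    status      TEXT        NOT NULL DEFAULT 'RUNNING'  -- RUNNING / SUCCESS / FAILED
);

-- Detailed audit table: one row per table per run (sketch; names are assumptions)
CREATE TABLE etl_audit_detail (
    detail_id     BIGSERIAL   PRIMARY KEY,
    job_id        BIGINT      NOT NULL REFERENCES etl_audit_master (job_id),
    table_name    TEXT        NOT NULL,
    s3_partition  TEXT,
    rows_loaded   BIGINT,
    status        TEXT        NOT NULL,                 -- SUCCESS / FAILED
    error_message TEXT,
    loaded_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);
```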
Expected S3 Structure
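The layout below is inferred from the pipeline behaviour described in this document (one folder per table, date partitions inside each table folder, Parquet files inside each partition); the exact folder and partition naming in your bucket may differ:

```
s3://your-s3-bucket/your-folder-prefix/
├── table_one/
│   ├── 2024-01-01/
│   │   ├── part-00000.parquet
│   │   └── part-00001.parquet
│   └── 2024-01-02/
│       └── part-00000.parquet
└── table_two/
    └── 2024-01-02/
        └── part-00000.parquet
```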
Key Features
- Auto-Discovery: Finds the latest partitions automatically (a sketch follows this list)
- Column Mapping: Matches DataFrame columns to database tables
- Error Handling: Continues processing other tables if one fails
- Audit Trail: Tracks all operations with row counts and timestamps
- Flexible Filtering: Process all or specific tables
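How the latest partition might be discovered: a minimal sketch, assuming date partitions sort lexicographically (true for ISO-style names such as 2024-01-02). The helper name and the use of S3's delimiter listing are assumptions, not the actual implementation.

```python
import os

import boto3

s3 = boto3.client("s3")
BUCKET = os.environ["S3_BUCKET_NAME"]

def latest_partition(table_prefix: str) -> str:
    """Return the most recent date-partition 'folder' under a table prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    partitions = []
    # Delimiter="/" makes S3 return immediate "subfolders" as CommonPrefixes.
    for page in paginator.paginate(Bucket=BUCKET, Prefix=table_prefix, Delimiter="/"):
        partitions += [p["Prefix"] for p in page.get("CommonPrefixes", [])]
    if not partitions:
        raise ValueError(f"No partitions found under {table_prefix}")
    # Lexicographic max equals the latest date for ISO-style partition names.
    return max(partitions)
```

For example, `latest_partition("your-folder-prefix/orders/")` would return something like `"your-folder-prefix/orders/2024-01-02/"`.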
Pipeline Behaviour
1. Connects to S3 and discovers table folders
2. Finds the latest date partition for each table
3. Reads all Parquet files from that partition
4. Combines files into a single DataFrame
5. Maps columns to the target database table
6. Inserts data with audit logging
7. Records success/failure for each table
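Putting these steps together, here is a minimal end-to-end sketch, not the production implementation: all function and variable names are assumptions, audit logging is reduced to prints, and the partition layout from the S3 structure section above is assumed.

```python
import io
import os

import boto3
import pandas as pd
from sqlalchemy import create_engine

s3 = boto3.client("s3")
BUCKET = os.environ["S3_BUCKET_NAME"]
PREFIX = os.environ["S3_PREFIX"].rstrip("/") + "/"
SCHEMA = os.environ.get("DB_SCHEMA", "public")
engine = create_engine(
    f"postgresql+psycopg2://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ.get('DB_PORT', '5432')}"
    f"/{os.environ['DB_NAME']}"
)

def subfolders(parent: str) -> list[str]:
    """Immediate 'subfolders' under an S3 prefix, via delimiter listing."""
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket=BUCKET, Prefix=parent, Delimiter="/"
    )
    return [cp["Prefix"] for page in pages for cp in page.get("CommonPrefixes", [])]

# Step 1: discover table folders.
for table_prefix in subfolders(PREFIX):
    table = table_prefix.rstrip("/").split("/")[-1]
    try:
        # Step 2: latest date partition (lexicographic max works for ISO dates).
        partition = max(subfolders(table_prefix))

        # Steps 3-4: read every Parquet file in the partition into one DataFrame.
        frames = []
        pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=partition)
        for page in pages:
            for obj in page.get("Contents", []):
                if obj["Key"].endswith(".parquet"):
                    body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                    frames.append(pd.read_parquet(io.BytesIO(body)))
        if not frames:
            raise ValueError("No dataframes found")
        df = pd.concat(frames, ignore_index=True)

        # Step 5: map columns by keeping only those present in the target table.
        target_cols = pd.read_sql(f"SELECT * FROM {SCHEMA}.{table} LIMIT 0", engine).columns
        df = df[[c for c in df.columns if c in target_cols]]

        # Steps 6-7: insert and record the outcome; a failure is caught below,
        # so the loop continues with the next table.
        df.to_sql(table, engine, schema=SCHEMA, if_exists="append", index=False)
        print(f"SUCCESS {table}: {len(df)} rows from {partition}")
    except Exception as exc:
        print(f"FAILED  {table}: {exc}")
```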
Quick Troubleshooting
"No dataframes found": Check S3 path and partition structure
Column errors: Ensure target tables exist with matching columns
Connection failures: Verify environment variables and network access
Memory issues: Process fewer tables at once using filter_tables
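One hedged approach to the memory issue, assuming the job exposes a filter_tables option as described above (the run_pipeline name and the table names are hypothetical):

```python
def run_pipeline(filter_tables: list[str]) -> None:
    """Placeholder for the job's entry point; the real interface may differ."""
    print("Would load:", ", ".join(filter_tables))

all_tables = ["customers", "orders", "payments", "shipments"]  # example names

# Process two tables per run to bound peak memory usage.
for i in range(0, len(all_tables), 2):
    run_pipeline(filter_tables=all_tables[i : i + 2])
```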
Monitoring
Query the audit tables to confirm that runs completed and to spot table-level failures. Two example queries follow: one for recent job status and one for table-level details.
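These queries assume the hedged audit-table sketch above (etl_audit_master / etl_audit_detail); adjust the names to your actual schema:

```sql
-- Recent job status
SELECT job_id, job_name, started_at, finished_at, status
FROM etl_audit_master
ORDER BY started_at DESC
LIMIT 20;

-- Table-level details
SELECT d.job_id, d.table_name, d.rows_loaded, d.status, d.error_message
FROM etl_audit_detail d
JOIN etl_audit_master m USING (job_id)
WHERE m.started_at > now() - interval '1 day'
ORDER BY d.loaded_at DESC;
```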
Implementation Checklist
1. Set up environment variables
2. Create audit tables in PostgreSQL
3. Verify S3 data structure matches the expected format
4. Create target tables in PostgreSQL
5. Test with a small dataset first
6. Set up monitoring and alerting
7. Schedule for production runs