Data Load to Silver Layer (S3 to PostgreSQL)

Overview

The jobs load data from S3 (Bronze layer) to PostgreSQL (Silver layer) with automatic partition discovery and comprehensive audit logging.

Prerequisites

Python Dependencies

pip install boto3 pandas pyarrow sqlalchemy psycopg2-binary

Required Environment Variables

AWS Configuration

  • AWS_ACCESS_KEY_ID=your-aws-access-key

  • AWS_SECRET_ACCESS_KEY=your-aws-secret-key

  • S3_BUCKET_NAME=your-s3-bucket

  • S3_PREFIX=your-folder-prefix

Database Configuration

  • DB_HOST=your-postgresql-host

  • DB_PORT=5432

  • DB_NAME=your-database-name

  • DB_USER=your-username

  • DB_PASSWORD=your-password

  • DB_SCHEMA=public # Target schema (optional, defaults to public)

Optional: Table Filtering
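To load only a subset of tables, the job can be restricted with a table filter (the filter_tables option referenced under Quick Troubleshooting). How the filter is supplied depends on how the job is invoked; a hypothetical environment-variable form (the variable name and format are assumptions, adjust to your job's configuration):

  • FILTER_TABLES=orders,customers # Comma-separated list of tables to process; if unset, all discovered tables are loaded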

Audit Tables Setup

Create these tables in your PostgreSQL database:

Master audit table
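The exact schema is not prescribed here. A minimal sketch, assuming a job-level table named load_audit_master (the table and column names are illustrative; adapt them to your own conventions):

CREATE TABLE IF NOT EXISTS load_audit_master (
    job_run_id        BIGSERIAL PRIMARY KEY,
    job_name          TEXT        NOT NULL,
    started_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    finished_at       TIMESTAMPTZ,
    status            TEXT        NOT NULL,   -- e.g. RUNNING / SUCCESS / FAILED
    tables_processed  INTEGER     DEFAULT 0,
    total_rows_loaded BIGINT      DEFAULT 0,
    error_message     TEXT
);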

Detailed audit table
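Again only a sketch, assuming a per-table detail table named load_audit_detail that references the master record:

CREATE TABLE IF NOT EXISTS load_audit_detail (
    detail_id     BIGSERIAL PRIMARY KEY,
    job_run_id    BIGINT      NOT NULL REFERENCES load_audit_master (job_run_id),
    table_name    TEXT        NOT NULL,
    s3_partition  TEXT,                       -- latest partition that was read
    rows_loaded   BIGINT      DEFAULT 0,
    status        TEXT        NOT NULL,       -- SUCCESS / FAILED for this table
    error_message TEXT,
    loaded_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);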

Expected S3 Structure
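The loader expects one folder per table under the configured prefix, with date partitions inside each table folder. An illustrative layout (bucket, prefix, table names, and the exact partition naming are assumptions; adjust to your data):

s3://your-s3-bucket/your-folder-prefix/
    orders/
        2024-01-14/
            part-0000.parquet
        2024-01-15/              <-- latest partition; this one is loaded
            part-0000.parquet
            part-0001.parquet
    customers/
        2024-01-15/
            part-0000.parquet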

Key Features

  • Auto-Discovery: Finds the latest partitions automatically

  • Column Mapping: Matches DataFrame columns to database table columns (see the sketch after this list)

  • Error Handling: Continues processing other tables if one fails

  • Audit Trail: Tracks all operations with row counts and timestamps

  • Flexible Filtering: Process all tables or only specific ones
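As an illustration of the column-mapping feature, a minimal sketch using SQLAlchemy's inspector. The helper name align_columns and the choice to silently drop DataFrame columns that have no match in the target table are assumptions, not a description of the exact implementation:

import pandas as pd
from sqlalchemy import inspect


def align_columns(df: pd.DataFrame, engine, table_name: str, schema: str = "public") -> pd.DataFrame:
    """Keep only the DataFrame columns that also exist in the target table."""
    inspector = inspect(engine)
    table_columns = {col["name"] for col in inspector.get_columns(table_name, schema=schema)}
    return df[[c for c in df.columns if c in table_columns]]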

Pipeline Behaviour

For each run, the job performs the following steps (a code sketch follows the list):

  • Connects to S3 and discovers table folders

  • Finds the latest date partition for each table

  • Reads all Parquet files from that partition

  • Combines files into a single DataFrame

  • Maps columns to the target database table

  • Inserts data with audit logging

  • Records success/failure for each table
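A simplified sketch of this flow, assuming boto3, pandas/pyarrow, and SQLAlchemy from the prerequisites. The partition layout, the column-matching step, and the use of print in place of the audit-table writes are simplifications for illustration; this is not the production job itself:

import io
import os

import boto3
import pandas as pd
from sqlalchemy import create_engine, inspect

s3 = boto3.client("s3")  # picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
bucket = os.environ["S3_BUCKET_NAME"]
prefix = os.environ["S3_PREFIX"].rstrip("/") + "/"

engine = create_engine(
    f"postgresql+psycopg2://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ.get('DB_PORT', '5432')}/{os.environ['DB_NAME']}"
)
schema = os.environ.get("DB_SCHEMA", "public")

# 1. Discover table folders directly under the prefix
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
table_prefixes = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]

for table_prefix in table_prefixes:
    table_name = table_prefix.rstrip("/").split("/")[-1]
    try:
        # 2. Pick the latest date partition (lexicographic max works for ISO dates)
        parts = s3.list_objects_v2(Bucket=bucket, Prefix=table_prefix, Delimiter="/")
        latest = max(p["Prefix"] for p in parts.get("CommonPrefixes", []))

        # 3-4. Read every Parquet file in that partition and combine them
        frames = []
        for obj in s3.list_objects_v2(Bucket=bucket, Prefix=latest).get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                frames.append(pd.read_parquet(io.BytesIO(body)))
        if not frames:
            raise ValueError("No dataframes found")
        df = pd.concat(frames, ignore_index=True)

        # 5. Keep only the columns that exist in the target table (column mapping)
        table_cols = {c["name"] for c in inspect(engine).get_columns(table_name, schema=schema)}
        df = df[[c for c in df.columns if c in table_cols]]

        # 6. Insert into the Silver table (audit-table writes omitted for brevity)
        df.to_sql(table_name, engine, schema=schema, if_exists="append", index=False)
        print(f"{table_name}: loaded {len(df)} rows from {latest}")
    except Exception as exc:
        # 7. Record the failure and continue with the remaining tables
        print(f"{table_name}: FAILED ({exc})")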

Quick Troubleshooting

  • "No dataframes found": Check S3 path and partition structure

  • Column errors: Ensure target tables exist with matching columns

  • Connection failures: Verify environment variables and network access

  • Memory issues: Process fewer tables at once using filter_tables

Monitoring

Query the audit tables to confirm whether the jobs completed successfully.

-- Recent job status
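An example query, assuming the illustrative load_audit_master table from the Audit Tables Setup section; adjust table and column names to your actual audit schema:

SELECT job_run_id, job_name, status, started_at, finished_at,
       tables_processed, total_rows_loaded
FROM load_audit_master
ORDER BY started_at DESC
LIMIT 10;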

-- Table-level details
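Likewise, assuming the illustrative load_audit_detail table:

SELECT d.table_name, d.status, d.rows_loaded, d.s3_partition, d.error_message, d.loaded_at
FROM load_audit_detail d
JOIN load_audit_master m ON m.job_run_id = d.job_run_id
WHERE m.started_at > now() - interval '1 day'
ORDER BY d.loaded_at DESC;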

Implementation Checklist

  • Set up environment variables

  • Create audit tables in PostgreSQL

  • Verify S3 data structure matches expected format

  • Create target tables in PostgreSQL

  • Test with a small dataset first

  • Set up monitoring and alerting

  • Schedule for production runs
