Data Load to Silver Layer (S3 to PostgreSQL)

Overview

The jobs load data from S3 (Bronze layer) to PostgreSQL (Silver layer) with automatic partition discovery and comprehensive audit logging.

Prerequisites

Python Dependencies

pip install boto3 pandas pyarrow sqlalchemy psycopg2-binary

Required Environment Variables

AWS Configuration

  • AWS_ACCESS_KEY_ID=your-aws-access-key

  • AWS_SECRET_ACCESS_KEY=your-aws-secret-key

  • S3_BUCKET_NAME=your-s3-bucket

  • S3_PREFIX=your-folder-prefix

Database Configuration

  • DB_HOST=your-postgresql-host

  • DB_PORT=5432

  • DB_NAME=your-database-name

  • DB_USER=your-username

  • DB_PASSWORD=your-password

  • DB_SCHEMA=public # Target schema (optional, defaults to public)

Optional: Table Filtering
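To load only a subset of tables, the job can be restricted with a table filter (the filter_tables option referenced under Quick Troubleshooting). How the filter is supplied depends on how the job is invoked; a hypothetical environment-variable form (the variable name and format are assumptions, adjust to your job's configuration):

  • FILTER_TABLES=orders,customers # Comma-separated list of tables to process; if unset, all discovered tables are loaded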

Audit Tables Setup

Create these tables in your PostgreSQL database:

Master audit table
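The exact schema is not prescribed here. A minimal sketch, assuming a job-level table named load_audit_master (the table and column names are illustrative; adapt them to your own conventions):

CREATE TABLE IF NOT EXISTS load_audit_master (
    job_run_id        BIGSERIAL PRIMARY KEY,
    job_name          TEXT        NOT NULL,
    started_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    finished_at       TIMESTAMPTZ,
    status            TEXT        NOT NULL,   -- e.g. RUNNING / SUCCESS / FAILED
    tables_processed  INTEGER     DEFAULT 0,
    total_rows_loaded BIGINT      DEFAULT 0,
    error_message     TEXT
);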

Detailed audit table
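Again only a sketch, assuming a per-table detail table named load_audit_detail that references the master record:

CREATE TABLE IF NOT EXISTS load_audit_detail (
    detail_id     BIGSERIAL PRIMARY KEY,
    job_run_id    BIGINT      NOT NULL REFERENCES load_audit_master (job_run_id),
    table_name    TEXT        NOT NULL,
    s3_partition  TEXT,                       -- latest partition that was read
    rows_loaded   BIGINT      DEFAULT 0,
    status        TEXT        NOT NULL,       -- SUCCESS / FAILED for this table
    error_message TEXT,
    loaded_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);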

Expected S3 Structure
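The loader expects one folder per table under the configured prefix, with date partitions inside each table folder. An illustrative layout (bucket, prefix, table names, and the exact partition naming are assumptions; adjust to your data):

s3://your-s3-bucket/your-folder-prefix/
    orders/
        2024-01-14/
            part-0000.parquet
        2024-01-15/              <-- latest partition; this one is loaded
            part-0000.parquet
            part-0001.parquet
    customers/
        2024-01-15/
            part-0000.parquet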

Key Features

  • Auto-Discovery: Finds the latest partitions automatically

  • Column Mapping: Matches DataFrame columns to database table columns (see the sketch after this list)

  • Error Handling: Continues processing other tables if one fails

  • Audit Trail: Tracks all operations with row counts and timestamps

  • Flexible Filtering: Process all tables or only specific ones
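As an illustration of the column-mapping feature, a minimal sketch using SQLAlchemy's inspector. The helper name align_columns and the choice to silently drop DataFrame columns that have no match in the target table are assumptions, not a description of the exact implementation:

import pandas as pd
from sqlalchemy import inspect


def align_columns(df: pd.DataFrame, engine, table_name: str, schema: str = "public") -> pd.DataFrame:
    """Keep only the DataFrame columns that also exist in the target table."""
    inspector = inspect(engine)
    table_columns = {col["name"] for col in inspector.get_columns(table_name, schema=schema)}
    return df[[c for c in df.columns if c in table_columns]]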

Pipeline Behaviour

For each run, the job performs the following steps (a code sketch follows the list):

  • Connects to S3 and discovers table folders

  • Finds the latest date partition for each table

  • Reads all Parquet files from that partition

  • Combines files into a single DataFrame

  • Maps columns to the target database table

  • Inserts data with audit logging

  • Records success/failure for each table
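A simplified sketch of this flow, assuming boto3, pandas/pyarrow, and SQLAlchemy from the prerequisites. The partition layout, the column-matching step, and the use of print in place of the audit-table writes are simplifications for illustration; this is not the production job itself:

import io
import os

import boto3
import pandas as pd
from sqlalchemy import create_engine, inspect

s3 = boto3.client("s3")  # picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
bucket = os.environ["S3_BUCKET_NAME"]
prefix = os.environ["S3_PREFIX"].rstrip("/") + "/"

engine = create_engine(
    f"postgresql+psycopg2://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ.get('DB_PORT', '5432')}/{os.environ['DB_NAME']}"
)
schema = os.environ.get("DB_SCHEMA", "public")

# 1. Discover table folders directly under the prefix
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
table_prefixes = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]

for table_prefix in table_prefixes:
    table_name = table_prefix.rstrip("/").split("/")[-1]
    try:
        # 2. Pick the latest date partition (lexicographic max works for ISO dates)
        parts = s3.list_objects_v2(Bucket=bucket, Prefix=table_prefix, Delimiter="/")
        latest = max(p["Prefix"] for p in parts.get("CommonPrefixes", []))

        # 3-4. Read every Parquet file in that partition and combine them
        frames = []
        for obj in s3.list_objects_v2(Bucket=bucket, Prefix=latest).get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                frames.append(pd.read_parquet(io.BytesIO(body)))
        if not frames:
            raise ValueError("No dataframes found")
        df = pd.concat(frames, ignore_index=True)

        # 5. Keep only the columns that exist in the target table (column mapping)
        table_cols = {c["name"] for c in inspect(engine).get_columns(table_name, schema=schema)}
        df = df[[c for c in df.columns if c in table_cols]]

        # 6. Insert into the Silver table (audit-table writes omitted for brevity)
        df.to_sql(table_name, engine, schema=schema, if_exists="append", index=False)
        print(f"{table_name}: loaded {len(df)} rows from {latest}")
    except Exception as exc:
        # 7. Record the failure and continue with the remaining tables
        print(f"{table_name}: FAILED ({exc})")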

Quick Troubleshooting

  • "No dataframes found": Check S3 path and partition structure

  • Column errors: Ensure target tables exist with matching columns

  • Connection failures: Verify environment variables and network access

  • Memory issues: Process fewer tables at once using filter_tables

Monitoring

Query the audit tables to confirm whether the jobs completed successfully.

-- Recent job status
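An example query, assuming the illustrative load_audit_master table from the Audit Tables Setup section; adjust table and column names to your actual audit schema:

SELECT job_run_id, job_name, status, started_at, finished_at,
       tables_processed, total_rows_loaded
FROM load_audit_master
ORDER BY started_at DESC
LIMIT 10;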

-- Table-level details
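Likewise, assuming the illustrative load_audit_detail table:

SELECT d.table_name, d.status, d.rows_loaded, d.s3_partition, d.error_message, d.loaded_at
FROM load_audit_detail d
JOIN load_audit_master m ON m.job_run_id = d.job_run_id
WHERE m.started_at > now() - interval '1 day'
ORDER BY d.loaded_at DESC;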

Implementation Checklist

  • Set up environment variables

  • Create audit tables in PostgreSQL

  • Verify S3 data structure matches expected format

  • Create target tables in PostgreSQL

  • Test with a small dataset first

  • Set up monitoring and alerting

  • Schedule for production runs
