Deployment Considerations
Scheduling
Use the scheduling option in the pipeline/Jobs configuration to run loads on a fixed cadence
Consider time zones for daily batch processing
Implement overlap handling for late-arriving data
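As a sketch of the overlap handling above, assuming the job tracks its last successful load time (the variable names and the two-hour overlap are illustrative assumptions, not values from this pipeline):

from datetime import datetime, timedelta, timezone

# Illustrative value: in practice the last successful end time would come
# from the audit/log tables described later in this document.
last_successful_end = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
OVERLAP = timedelta(hours=2)  # re-read a small buffer to pick up late-arriving data

# Start the new window before the previous end time so late rows are re-read;
# downstream inserts should therefore be idempotent (e.g. deduplicate on a key).
window_start = last_successful_end - OVERLAP
window_end = datetime.now(timezone.utc)
print(f"Extract records with event_time >= {window_start} and < {window_end}")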
Environment Management
Maintain separate configurations for dev/staging/prod
Monitoring and Alerting
Monitor ClickHouse storage costs and partition usage.
Track data freshness (lag in hours/days) per source.
Automate alerting for missing data (e.g., no CDRs in the last 1 hour).
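A minimal freshness check, assuming a ClickHouse source table cdr_records with an event_time column (both names are assumptions to adapt to your schema):

from datetime import datetime
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="your-clickhouse-host", port=18123,
    username="your-username", password="your-password",
)

# Hypothetical source table and timestamp column.
latest = client.query("SELECT max(event_time) FROM cdr_records").result_rows[0][0]

if latest is None:
    print("ALERT: cdr_records is empty")
else:
    # Assumes event_time is returned as a naive UTC datetime.
    lag_hours = (datetime.utcnow() - latest).total_seconds() / 3600
    if lag_hours > 1:
        print(f"ALERT: no CDRs in the last 1 hour (latest event: {latest})")

Running the same query per source table gives the per-source freshness lag in hours or days.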
Prerequisites
Python Dependencies
pip install clickhouse-connect
# Database Configuration
DB_HOST=your-clickhouse-host
DB_PORT=18123
DB_NAME=your-database-name
DB_USER=your-username
DB_PASSWORD=your-password
DB_SCHEMA=public # Target schema (optional, defaults to public)
# Optional: Table Filtering
FILTER_TABLES=table1, table2, table3 # Comma-separated list
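A sketch of how the pipeline might read this configuration; using python-dotenv here is an assumption, plain os.environ works the same way:

import os
from dotenv import load_dotenv  # pip install python-dotenv (optional helper)

load_dotenv()  # loads the .env values shown above, if a .env file is present

db_config = {
    "host": os.environ["DB_HOST"],
    "port": int(os.environ.get("DB_PORT", "8123")),
    "database": os.environ["DB_NAME"],
    "username": os.environ["DB_USER"],
    "password": os.environ["DB_PASSWORD"],
}
schema = os.environ.get("DB_SCHEMA", "public")

# Optional table filter: an empty value means "process every discovered table".
filter_tables = [t.strip() for t in os.environ.get("FILTER_TABLES", "").split(",") if t.strip()]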
Audit Tables Setup
Create these tables in your PostgreSQL database:
CREATE SCHEMA silver_etl_master;
-- ABDR TABLE--
Key Features
Customer master reference table for telecom/CRM systems.
Tracking customer lifecycle (activation → termination).
Filtering by account type, status, or KYC verification.
Supporting joins with invoices, subscriptions, usage records, etc.
Historical analysis based on created_date and timestamp.
Pipeline Behaviour
Connects to ClickHouse DB and discovers table folders
Finds the latest date partition for each table
Reads all Parquet files from that partition
Combines files into a single DataFrame
Maps columns to the target database table
Inserts data with audit logging
Records success/failure for each table
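These steps could be sketched roughly as follows; the folder layout, the partition naming, and the record_audit helper are assumptions for illustration, not the pipeline's actual code:

import glob
import os

import clickhouse_connect
import pandas as pd

DATA_ROOT = "/data/abdr"  # assumed layout: <DATA_ROOT>/<table_name>/<YYYY-MM-DD>/*.parquet

client = clickhouse_connect.get_client(
    host=os.environ["DB_HOST"],
    port=int(os.environ.get("DB_PORT", "8123")),
    username=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)

def record_audit(table, partition, status, rows=0, error=None):
    # Placeholder for writing a row to the audit tables described earlier.
    print(table, partition, status, rows, error)

for table_dir in sorted(glob.glob(os.path.join(DATA_ROOT, "*"))):
    table_name = os.path.basename(table_dir)
    try:
        # Latest date partition (YYYY-MM-DD folder names sort lexically by date).
        latest = sorted(os.listdir(table_dir))[-1]
        files = glob.glob(os.path.join(table_dir, latest, "*.parquet"))

        # Combine all Parquet files from that partition into a single DataFrame.
        df = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)

        # Map columns: keep only those that exist in the target table.
        target_cols = [row[0] for row in client.query(f"DESCRIBE TABLE {table_name}").result_rows]
        df = df[[c for c in df.columns if c in target_cols]]

        client.insert_df(table_name, df)
        record_audit(table_name, latest, status="SUCCESS", rows=len(df))
    except Exception as exc:
        record_audit(table_name, None, status="FAILED", error=str(exc))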
Quick Troubleshooting
Column errors: Ensure target tables exist with matching columns
Connection failures: Verify environment variables and network access
Memory issues: Read and insert the data in batches instead of loading a whole partition at once (see the sketch below).
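One possible batched approach, streaming each Parquet file in record batches (the file and table names are illustrative):

import clickhouse_connect
import pyarrow.parquet as pq

client = clickhouse_connect.get_client(
    host="your-clickhouse-host", port=18123,
    username="your-username", password="your-password",
    database="your-database-name",
)

parquet_file = pq.ParquetFile("part-0000.parquet")  # illustrative file name

# Read and insert ~100k rows at a time instead of materialising the whole file.
for batch in parquet_file.iter_batches(batch_size=100_000):
    client.insert_df("abdr_table", batch.to_pandas())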
Monitoring
Review the audit tables after each run to confirm which jobs succeeded and which failed.
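A quick check could look like the following, assuming the audit tables live in the PostgreSQL schema created above; the table and column names (etl_audit_log, run_timestamp, status, error_message) are assumptions, not the actual schema:

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host="your-postgres-host", dbname="your-database-name",
    user="your-username", password="your-password",
)

# Hypothetical audit table and columns; adapt to the tables created in silver_etl_master.
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT table_name, run_timestamp, error_message
        FROM silver_etl_master.etl_audit_log
        WHERE status = 'FAILED'
          AND run_timestamp >= now() - interval '1 day'
    """)
    for table_name, run_ts, error in cur.fetchall():
        print(f"{run_ts}  {table_name}: {error}")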
Implementation Checklist
Set up environment variables
Create audit tables (see Audit Tables Setup)
Create target tables in ClickHouse
Test with a small dataset first
Set up monitoring and alerting
Schedule for production runs
Data Load to Gold layer (ClickHouse to ClickHouse)
The data load from the ABDR layer to the Aggregation layer is managed through parameterized ETL jobs and pipelines. These jobs are designed to follow a controlled and auditable execution process.
To facilitate this, we maintain a reporting_logs table.
This table records each successful aggregation load sent to the aggregate tables, including the start_date and end_date of the aggregated range.
End_date is calculated as:
The Aggregate layer was designed with a focus on optimizing reporting performance. Entities in the gold layer were organized into dimension tables, fact tables, and aggregate tables to support efficient analytical queries.
Features
Maintains a log table to track ETL runs (success/failure, reason, date ranges).
Finds the last successful aggregation end date from logs, so it knows where to resume.
Aggregates raw data into daily summaries by usage type and location.
Inserts results into the destination table.
Logs the run outcome (success, failure, or no data).
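Putting those features together, the aggregation job could be sketched as below; every table and column name (reporting_logs, abdr_usage, usage_daily_agg, usage_type, location, usage_amount) is an assumption used for illustration only:

import os
from datetime import date, timedelta

import clickhouse_connect

client = clickhouse_connect.get_client(
    host=os.environ["DB_HOST"],
    port=int(os.environ.get("DB_PORT", "8123")),
    username=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)

# 1. Resume point: last successful aggregation end date from the log table.
last_end = client.query(
    "SELECT max(end_date) FROM reporting_logs WHERE status = 'SUCCESS'"
).result_rows[0][0] or date(2024, 1, 1)          # fallback for the very first run
start_date, end_date = last_end, date.today() - timedelta(days=1)

if start_date >= end_date:
    status, reason = "NO_DATA", "nothing new to aggregate"
else:
    try:
        # 2. Aggregate raw usage into daily summaries by usage type and location.
        client.command(f"""
            INSERT INTO usage_daily_agg (usage_date, usage_type, location, total_usage, event_count)
            SELECT toDate(event_time), usage_type, location, sum(usage_amount), count()
            FROM abdr_usage
            WHERE toDate(event_time) > '{start_date}' AND toDate(event_time) <= '{end_date}'
            GROUP BY toDate(event_time), usage_type, location
        """)
        status, reason = "SUCCESS", ""
    except Exception as exc:
        status, reason = "FAILED", str(exc)

# 3. Log the run outcome together with the date range it covered.
client.insert(
    "reporting_logs",
    [[start_date, end_date, status, reason]],
    column_names=["start_date", "end_date", "status", "reason"],
)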