Deployment Considerations

Scheduling

  • Use the scheduling option available in the pipeline/Jobs configuration

  • Consider time zones for daily batch processing

  • Implement overlap handling for late-arriving data (see the sketch below)
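
As an illustration only, a daily run can be pinned to an explicit time zone and can re-read a small overlap window so that late-arriving records are picked up on the next run. The time zone and overlap size below are assumptions, not values taken from the pipeline.

from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Hypothetical settings -- adjust to the actual schedule and expected data latency.
BATCH_TIMEZONE = ZoneInfo("Asia/Kolkata")   # time zone the daily batch is aligned to (assumed)
OVERLAP_HOURS = 3                           # extra hours re-read to catch late-arriving data

def daily_processing_window(run_time=None):
    """Return the (start, end) window for a daily run, extended backwards
    by OVERLAP_HOURS so late-arriving records are reprocessed."""
    now = run_time or datetime.now(BATCH_TIMEZONE)
    end = now.replace(hour=0, minute=0, second=0, microsecond=0)   # midnight today
    start = end - timedelta(days=1, hours=OVERLAP_HOURS)
    return start, end

start, end = daily_processing_window()
print(f"Processing data from {start} to {end}")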

Environment Management

  • Maintain separate configurations for dev/staging/prod
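
A minimal sketch of one way to do this, assuming a single APP_ENV variable selects a per-environment .env file (the variable name and file names are assumptions, not the pipeline's actual mechanism):

import os

APP_ENV = os.getenv("APP_ENV", "dev")   # dev | staging | prod (assumed convention)
ENV_FILE = f".env.{APP_ENV}"            # e.g. .env.dev, .env.staging, .env.prod

def load_env_file(path):
    """Minimal loader: read KEY=VALUE lines into os.environ, skipping comments."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_env_file(ENV_FILE)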

Monitoring and Alerting

  • Monitor ClickHouse storage costs and partition usage.

  • Track data freshness (lag in hours/days) per source.

  • Automate alerting for missing data (e.g., no CDRs received in the last hour), as sketched below.
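
A freshness check of this kind could be run from a small Python job using clickhouse-connect. The table and column names (cdr_records, event_time) are assumptions; swap in the real CDR table.

import os
import clickhouse_connect

CDR_TABLE = "cdr_records"           # hypothetical table name
EVENT_TIME_COLUMN = "event_time"    # hypothetical timestamp column

client = clickhouse_connect.get_client(
    host=os.environ["DB_HOST"],
    port=int(os.environ["DB_PORT"]),
    username=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)

# Count CDRs received in the last hour; zero means the alert should fire.
recent_rows = client.query(
    f"SELECT count() FROM {CDR_TABLE} "
    f"WHERE {EVENT_TIME_COLUMN} >= now() - INTERVAL 1 HOUR"
).result_rows[0][0]

if recent_rows == 0:
    print("ALERT: no CDRs received in the last hour")   # hook into your alerting tool here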

Prerequisites

Python Dependencies

pip install clickhouse-connect

Database Configuration

  • DB_HOST=your-clickhouse-host

  • DB_PORT=18123

  • DB_NAME=your-database-name

  • DB_USER=your-username

  • DB_PASSWORD=your-password

  • DB_SCHEMA=public # Target schema (optional, defaults to public)

Optional: Table Filtering

  • FILTER_TABLES=table1,table2,table3 # Comma-separated list
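
A sketch of how these variables might be read in Python; the helper name is made up, and the defaults simply mirror the comments above.

import os

def load_db_config():
    """Read the connection settings documented above from environment variables."""
    config = {
        "host": os.environ["DB_HOST"],
        "port": int(os.environ.get("DB_PORT", "18123")),
        "database": os.environ["DB_NAME"],
        "user": os.environ["DB_USER"],
        "password": os.environ["DB_PASSWORD"],
        "schema": os.environ.get("DB_SCHEMA", "public"),   # optional, defaults to public
    }
    # Optional comma-separated allow-list of tables; an empty value means "all tables".
    raw_filter = os.environ.get("FILTER_TABLES", "")
    config["filter_tables"] = [t.strip() for t in raw_filter.split(",") if t.strip()]
    return config

config = load_db_config()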

Audit Tables Setup

Create these tables in your PostgreSQL database:

CREATE SCHEMA silver_etl_master;

-- ABDR TABLE --

Key Features

  • Serves as the customer master reference table for telecom/CRM systems.

  • Tracks the customer lifecycle (activation → termination).

  • Supports filtering by account type, status, or KYC verification.

  • Supports joins with invoices, subscriptions, usage records, etc.

  • Enables historical analysis based on created_date and timestamp.
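
For illustration, a typical lookup against this reference table might filter on status and KYC verification. The table and column names below (abdr_customer_master, customer_id, kyc_verified, ...) are assumptions, since the actual DDL is not shown here, and the snippet assumes the table is reachable through the same clickhouse-connect client used elsewhere in this document.

import os
import clickhouse_connect

client = clickhouse_connect.get_client(
    host=os.environ["DB_HOST"], port=int(os.environ["DB_PORT"]),
    username=os.environ["DB_USER"], password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)

# Hypothetical table and column names, reflecting the features listed above.
active_verified = client.query(
    "SELECT customer_id, account_type, status, created_date "
    "FROM abdr_customer_master "
    "WHERE status = 'ACTIVE' AND kyc_verified = 1 "
    "ORDER BY created_date DESC"
).result_rows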

Pipeline Behaviour

  • Connects to ClickHouse DB and discovers table folders

  • Finds the latest date partition for each table

  • Reads all Parquet files from that partition

  • Combines files into a single DataFrame

  • Maps columns to the target database table

  • Inserts data with audit logging

  • Records success/failure for each table
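
A rough sketch of these steps is shown below. The landing-path layout, the LANDING_PATH variable, and the use of pandas/pyarrow for Parquet reading are assumptions (only clickhouse-connect is listed in the prerequisites); error handling and audit logging are omitted for brevity.

import os
from pathlib import Path

import pandas as pd
import clickhouse_connect

# Assumed layout: <LANDING_PATH>/<table_name>/<YYYY-MM-DD>/*.parquet
LANDING_PATH = Path(os.environ.get("LANDING_PATH", "/data/landing"))   # hypothetical

client = clickhouse_connect.get_client(
    host=os.environ["DB_HOST"],
    port=int(os.environ["DB_PORT"]),
    username=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)

for table_dir in sorted(p for p in LANDING_PATH.iterdir() if p.is_dir()):
    # Latest date partition = greatest folder name (YYYY-MM-DD sorts correctly as text).
    partitions = sorted(p for p in table_dir.iterdir() if p.is_dir())
    if not partitions:
        continue
    latest = partitions[-1]

    # Read and combine every Parquet file in the latest partition.
    frames = [pd.read_parquet(f) for f in sorted(latest.glob("*.parquet"))]
    if not frames:
        continue
    df = pd.concat(frames, ignore_index=True)

    # Keep only the columns that exist in the target table, then insert.
    target_columns = [
        row[0] for row in client.query(f"DESCRIBE TABLE {table_dir.name}").result_rows
    ]
    df = df[[c for c in df.columns if c in target_columns]]
    client.insert_df(table_dir.name, df)
    print(f"{table_dir.name}: inserted {len(df)} rows from partition {latest.name}")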

Quick Troubleshooting

  • Column errors: Ensure target tables exist with matching columns

  • Connection failures: Verify environment variables and network access

  • Memory issues: read and insert data in batches instead of loading everything at once (see the sketch below).
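
For the memory point above, one option is to stream large Parquet files in fixed-size batches. This sketch assumes pyarrow is available and that client is a clickhouse-connect client created as in the earlier examples.

import pyarrow.parquet as pq

BATCH_ROWS = 100_000   # tune to the memory available on the worker

def insert_parquet_in_batches(client, table_name, parquet_path):
    """Stream a large Parquet file into the target table without loading it all at once."""
    parquet_file = pq.ParquetFile(parquet_path)
    for batch in parquet_file.iter_batches(batch_size=BATCH_ROWS):
        client.insert_df(table_name, batch.to_pandas())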

Monitoring

Monitoring the audit tables shows whether each job run completed successfully, as in the example below.
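
The audit table and column names in this check are hypothetical placeholders for whatever was created under silver_etl_master, and the snippet assumes the audit tables are reachable from the same ClickHouse connection; if they live in PostgreSQL (as in the Audit Tables Setup section), run the equivalent query with a PostgreSQL client instead.

import os
import clickhouse_connect

client = clickhouse_connect.get_client(
    host=os.environ["DB_HOST"], port=int(os.environ["DB_PORT"]),
    username=os.environ["DB_USER"], password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)

# Hypothetical audit table and columns -- use the names from your own setup.
failures = client.query(
    "SELECT table_name, status, error_message, run_started_at "
    "FROM silver_etl_master.etl_audit_log "
    "WHERE run_started_at >= now() - INTERVAL 1 DAY AND status != 'SUCCESS' "
    "ORDER BY run_started_at DESC"
).result_rows

for table_name, status, error_message, started_at in failures:
    print(f"{started_at} {table_name}: {status} - {error_message}")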

Implementation Checklist

  • Set up environment variables

  • Create the audit tables (see Audit Tables Setup)

  • Create target tables in ClickHouse

  • Test with a small dataset first

  • Set up monitoring and alerting

  • Schedule for production runs

Data Load to Gold layer (ClickHouse to ClickHouse)

The data load from the ABDR layer to the Aggregation layer is managed through parameterized ETL jobs and pipelines. These jobs are designed to follow a controlled and auditable execution process.

To facilitate this, we maintain a reporting_logs table.

This table records every successful aggregation run sent to the aggregate tables, including the start_date and end_date of the processed window.

end_date is calculated as:

The Aggregate layer was designed with a focus on optimizing reporting performance. Entities in the gold layer were organized into dimension tables, fact tables, and aggregate tables to support efficient analytical queries.

Features

  • Maintains a log table to track ETL runs (success/failure, reason, date ranges).

  • Finds the last successful aggregation end date from logs, so it knows where to resume.

  • Aggregates raw data into daily summaries by usage type and location.

  • Inserts results into the destination table.

  • Logs the run outcome (success, failure, or no data).
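
A sketch of how such a job might tie the steps above together. Every table, column, and job name below is a placeholder (the real reporting_logs schema and source/destination tables are not shown in this document), and the date arithmetic is only one plausible interpretation of how end_date could be derived.

import os
from datetime import date, timedelta

import clickhouse_connect

# All table and column names below are hypothetical placeholders, not the real schema.
LOG_TABLE = "reporting_logs"
SOURCE_TABLE = "abdr.usage_records"
DEST_TABLE = "gold.daily_usage_summary"
JOB_NAME = "daily_usage_aggregation"

client = clickhouse_connect.get_client(
    host=os.environ["DB_HOST"],
    port=int(os.environ["DB_PORT"]),
    username=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)

# 1. Find the last successful end_date in the log table so the job resumes from there.
last_end = client.query(
    f"SELECT maxOrNull(end_date) FROM {LOG_TABLE} "
    f"WHERE job_name = '{JOB_NAME}' AND status = 'SUCCESS'"
).result_rows[0][0]
start_date = last_end or date(2024, 1, 1)        # fallback for the very first run (assumed)
end_date = date.today() - timedelta(days=1)      # aggregate up to yesterday (assumed)

try:
    # 2. Check whether the window contains any raw data at all.
    row_count = client.query(
        f"SELECT count() FROM {SOURCE_TABLE} "
        f"WHERE toDate(event_time) > '{start_date}' AND toDate(event_time) <= '{end_date}'"
    ).result_rows[0][0]
    if row_count == 0:
        status = "NO_DATA"
    else:
        # 3. Aggregate into daily summaries by usage type and location, insert into destination.
        client.command(
            f"INSERT INTO {DEST_TABLE} "
            f"SELECT toDate(event_time) AS usage_date, usage_type, location, "
            f"sum(usage_amount) AS total_usage, count() AS record_count "
            f"FROM {SOURCE_TABLE} "
            f"WHERE toDate(event_time) > '{start_date}' AND toDate(event_time) <= '{end_date}' "
            f"GROUP BY usage_date, usage_type, location"
        )
        status = "SUCCESS"
except Exception:
    status = "FAILURE"                           # the failure reason would also be logged

# 4. Log the run outcome together with the processed date range.
client.command(
    f"INSERT INTO {LOG_TABLE} (job_name, start_date, end_date, status) "
    f"VALUES ('{JOB_NAME}', '{start_date}', '{end_date}', '{status}')"
)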
