Data Load & Architecture

Data Collection and Integration

Source Identification & Ingestion

A high-level discovery of data sources was conducted across all relevant domains. The identified data was ingested into the Bronze layer for raw storage.

Core Business Entities Identified

  • Subscribers & Accounts (customers, prepaid, postpaid)

  • Products & Bundles (voice, data, add-ons, VAS)

  • Recharge & Payment Transactions

  • Invoices & Billing Cycles

  • Orders & Service Requests

  • Usage Records (CDRs, SMS, Data, Roaming)

  • Network Elements & Events

  • Campaigns & Offers

  • Loyalty & Rewards Transactions

  • Customer Complaints & Feedback

Architecture Approach: Medallion Design

Bronze Layer (Raw) – ClickHouse

  • Stores ingested raw data in its native format (CSV, JSON, Parquet, Avro, log files).

  • Supports schema-on-read, enabling flexibility for multiple telecom source applications.

  • Ideal for long-term storage of subscriber usage records, recharge events, CRM dumps, OSS/BSS logs, and network KPIs.
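
For JSON-style sources, a minimal sketch of what schema-on-read could look like in the Bronze layer, assuming a hypothetical bronze.raw_events table (all names and columns below are illustrative, not the actual schema):

import clickhouse_connect

# A raw Bronze table keeps each payload as an untyped string and interprets
# its structure only at query time (schema-on-read).
client = clickhouse_connect.get_client(host='your-clickhouse-host', port=18123)

client.command("""
CREATE TABLE IF NOT EXISTS bronze.raw_events (
    ingestion_date Date,
    source_system  String,
    raw_payload    String
) ENGINE = MergeTree()
PARTITION BY ingestion_date
ORDER BY (source_system, ingestion_date)
""")

# The schema is applied at read time with JSON functions.
msisdns = client.query(
    "SELECT JSONExtractString(raw_payload, 'msisdn') AS msisdn "
    "FROM bronze.raw_events WHERE source_system = 'crm'"
).result_rows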

Silver Layer – ClickHouse

  • Stores cleaned, standardized, and enriched data.

  • Designed with well-defined schemas representing essential telecom entities.

  • Performs transformations and business logic.

  • Handles data validation, enrichment, de-duplication, and reference mapping.
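
As an illustration only, a Bronze-to-Silver promotion could be expressed as an INSERT ... SELECT that de-duplicates and standardizes records; the table and column names below are hypothetical:

import clickhouse_connect

client = clickhouse_connect.get_client(host='your-clickhouse-host', port=18123)

# Hypothetical Bronze-to-Silver promotion: de-duplicate, standardize,
# and cast records into the well-defined Silver schema.
client.command("""
INSERT INTO silver.subscribers (subscriber_id, msisdn, created_at)
SELECT DISTINCT
    toString(subscriber_id),
    lower(trimBoth(msisdn)),
    toDateTime(created_at)
FROM bronze.raw_subscribers
WHERE subscriber_id IS NOT NULL
""")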

Gold Layer – ClickHouse

  • Stores fully curated and optimized datasets for advanced analytics and BI.

  • Ensures relational integrity across customer, subscription, product, usage, and payment domains.

  • Serves as the primary source for dashboards, reports, machine learning models, and real-time business insights.
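
For example, a dashboard or model feature could read directly from a curated Gold table; the sketch below assumes a hypothetical gold.daily_revenue table:

import clickhouse_connect

client = clickhouse_connect.get_client(host='your-clickhouse-host', port=18123)

# Hypothetical Gold-layer query feeding a revenue dashboard; query_df
# returns a pandas DataFrame ready for BI or feature engineering.
revenue = client.query_df("""
SELECT report_date, region, sum(recharge_amount) AS total_recharge
FROM gold.daily_revenue
WHERE report_date >= today() - 30
GROUP BY report_date, region
ORDER BY report_date
""")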

A comprehensive data mapping workbook was developed to define and document all entities and their transformations across layers.

Data Load to Bronze Layer (ClickHouse)

Data is ingested from multiple heterogeneous telecom systems into the Bronze Layer of ClickHouse. The ingestion supports both batch (ETL/ELT pipelines) and real-time streaming (Kafka) methods, depending on source type and business requirement.

Data Sources

  • CRM Systems – customer, account, orders

  • Billing & Revenue Management Systems – invoices, payments, recharge history

  • Recharge & Payment – recharge events, payment transactions

  • OSS/BSS Systems – network faults, element status

  • CDRs (Call Detail Records) & Usage Data – voice, SMS, data, roaming usage (CSV, Parquet, Avro)

  • Campaign Systems – campaigns, offers

Data Ingestion Workflow

Data Extraction

  • Connects to source systems via JDBC, APIs, file ingestion, or Kafka topics.

  • Executes custom SQL queries (for structured DB sources) or stream listeners (for real-time logs/events).
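
A minimal sketch of the two extraction paths, assuming a SQLAlchemy-reachable source database and a Kafka topic; the DSN, topic name, and query are placeholders, and sqlalchemy and kafka-python are additional dependencies not listed in the prerequisites:

import pandas as pd
from sqlalchemy import create_engine
from kafka import KafkaConsumer

# Batch path: pull structured records from a source system with a custom SQL query.
engine = create_engine("postgresql://user:password@crm-host:5432/crm")  # placeholder DSN
orders = pd.read_sql("SELECT * FROM orders WHERE updated_at >= '2024-01-01'", engine)

# Streaming path: listen for real-time events (e.g., recharges) on a Kafka topic.
consumer = KafkaConsumer("recharge-events", bootstrap_servers="kafka-host:9092")
raw_events = []
for message in consumer:              # blocks and yields messages as they arrive
    raw_events.append(message.value)  # raw bytes, forwarded to the Bronze load step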

Data Processing

  • Converts extracted data into pandas DataFrames or direct ClickHouse insert batches.

  • Applies lightweight processing (e.g., timestamp normalization, schema alignment, metadata tagging).
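
A sketch of this lightweight processing step on a pandas DataFrame; the column names, rename mapping, and tags are illustrative:

import pandas as pd

def prepare_for_bronze(df: pd.DataFrame, source_system: str) -> pd.DataFrame:
    # Timestamp normalization: parse source timestamps into UTC datetimes.
    df["event_time"] = pd.to_datetime(df["event_time"], utc=True, errors="coerce")
    # Schema alignment: map source-specific column names onto the common layout.
    df = df.rename(columns={"msisdn_no": "msisdn"})
    # Metadata tagging: record provenance and the ingestion date used for partitioning.
    df["source_system"] = source_system
    df["ingestion_date"] = pd.Timestamp.now(tz="UTC").date()
    return df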

Data Load into ClickHouse (Bronze Layer)

  • Loads raw data into ClickHouse Bronze tables (partitioned by date, region, or system).

  • Retains the native structure of source systems for traceability.
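
A minimal load sketch with clickhouse-connect, assuming a hypothetical bronze.recharge_events table partitioned by ingestion_date:

import pandas as pd
import clickhouse_connect

# Processed batch from the previous step (columns are illustrative).
batch = pd.DataFrame({
    "msisdn": ["254700000001"],
    "recharge_amount": [100.0],
    "source_system": ["recharge"],
    "ingestion_date": [pd.Timestamp("2024-01-01").date()],
})

client = clickhouse_connect.get_client(
    host="your-clickhouse-host", port=18123,
    username="your-username", password="your-password",
)

# insert_df maps DataFrame columns to table columns by name; the target
# table is assumed to be partitioned by ingestion_date.
client.insert_df("bronze.recharge_events", batch)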

Key Considerations

  • Flexible Date Ranges – supports full load, incremental load, or real-time streaming ingestion.

  • Partitioned Storage in ClickHouse – tables partitioned by ingestion_date, region, or service_type for efficient query pruning.

  • Comprehensive Logging – ingestion jobs write logs to monitoring tables (etl_job_log) for observability.

  • Error Handling & Retry Mechanism – failed batches are logged, quarantined, and retried (see the sketch after this list).

  • Columnar Storage (Parquet / Native ClickHouse format) – efficient for telecom-scale datasets (billions of CDRs, transactions).

  • Environment-Based Configuration – uses secrets managers or environment variables for DB/API credentials.

  • Real-time + Batch – hybrid ingestion supports both daily batch loads (e.g., invoices) and real-time streams (e.g., CDRs, recharges).
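
A sketch of the logging and retry behaviour described above, assuming an etl_job_log table with job_name, status, rows_loaded, error_message, and logged_at columns (the exact schema and retry policy may differ):

import time
from datetime import datetime
import clickhouse_connect

client = clickhouse_connect.get_client(host="your-clickhouse-host", port=18123)

def load_with_retry(load_batch, job_name, max_attempts=3):
    # load_batch is a hypothetical callable that loads one batch and returns the row count.
    log_columns = ["job_name", "status", "rows_loaded", "error_message", "logged_at"]
    for attempt in range(1, max_attempts + 1):
        try:
            rows = load_batch()
            client.insert("etl_job_log",
                          [[job_name, "SUCCESS", rows, "", datetime.now()]],
                          column_names=log_columns)
            return rows
        except Exception as exc:
            client.insert("etl_job_log",
                          [[job_name, "FAILED", 0, str(exc), datetime.now()]],
                          column_names=log_columns)
            time.sleep(30 * attempt)  # back off before the next attempt
    # After the final failed attempt the batch would be moved to a quarantine
    # location for inspection and manual replay.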

Prerequisites

Required Python Packages

pip install clickhouse-connect pandas

Environment Variables Setup

Create/set the following environment variables:

# Database Configuration

  • DB_HOST=your-clickhouse-host

  • DB_PORT=18123

  • DB_NAME=your-database-name

  • DB_USER=your-username

  • DB_PASSWORD=your-password

# Optional: Date Configuration

  • START_DATE=2023-01-01

  • END_DATE=2025-08-31

  • DAY_OFFSET=1
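
A sketch of how the optional date variables could be resolved in the pipeline, with DAY_OFFSET used as a fallback window for incremental loads (the fallback convention is illustrative):

import os
from datetime import date, timedelta

# An explicit START_DATE / END_DATE takes precedence; otherwise fall back to
# a rolling window of DAY_OFFSET days ending yesterday.
day_offset = int(os.getenv("DAY_OFFSET", "1"))
default_end = date.today() - timedelta(days=1)
default_start = default_end - timedelta(days=day_offset - 1)

start_date = os.getenv("START_DATE", default_start.isoformat())
end_date = os.getenv("END_DATE", default_end.isoformat())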

Implementation Steps

Step 1: Database Connection Setup

The pipeline uses clickhouse-connect to connect to ClickHouse. Please ensure that:

  • Your database is accessible from your execution environment

  • The database has the required tables and views

  • The user has SELECT permissions on the target tables
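
A connection sketch using the environment variables above; the sanity checks at the end are illustrative, and bronze.recharge_events is a hypothetical table:

import os
import clickhouse_connect

client = clickhouse_connect.get_client(
    host=os.environ["DB_HOST"],
    port=int(os.getenv("DB_PORT", "18123")),
    database=os.environ["DB_NAME"],
    username=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

# Basic sanity checks: connectivity and read access to an expected table.
assert client.command("SELECT 1") == 1
client.query("SELECT 1 FROM bronze.recharge_events LIMIT 1")  # hypothetical table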

Step 2: Query Configuration

Define your queries in the queries list within the main() function:
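
A sketch of what this list could look like; the simplified main() signature, the source table names, and the extract_and_load helper are placeholders:

def main(start_date, end_date):
    # Each entry is (label, query); the label becomes the file prefix for the
    # extract, and the date placeholders bound the incremental window.
    queries = [
        ("recharge_events",
         f"SELECT * FROM src.recharges "
         f"WHERE recharge_date BETWEEN '{start_date}' AND '{end_date}'"),
        ("invoices",
         f"SELECT * FROM src.invoices "
         f"WHERE invoice_date BETWEEN '{start_date}' AND '{end_date}'"),
    ]
    for label, query in queries:
        extract_and_load(label, query)  # hypothetical per-query extract/load helper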

Important Query Guidelines:

  • Use f-string formatting with {start_date} and {end_date} placeholders

  • Include proper date filtering for incremental loads

  • Use meaningful labels that will become S3 file prefixes

  • Ensure queries are optimized for performance

  • Handle NULL values appropriately in transformations

Step 3: Date Parameter Configuration

Date Range:

Single Date:

main(..., start_date='2024-01-01') # end_date defaults to start_date
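
For an explicit range, both parameters would presumably be passed the same way (the values below are illustrative):

main(..., start_date='2024-01-01', end_date='2024-03-31') # loads the full range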
