Data Load & Architecture
Data Collection and Integration
Source Identification & Ingestion
A high-level discovery of data sources was conducted across all relevant domains. The identified data was ingested into the Bronze layer for raw storage.
Core Business Entities Identified
Subscribers & Accounts (customers, prepaid, postpaid)
Products & Bundles (voice, data, add-ons, VAS)
Recharge & Payment Transactions
Invoices & Billing Cycles
Orders & Service Requests
Usage Records (CDRs, SMS, Data, Roaming)
Network Elements & Events
Campaigns & Offers
Loyalty & Rewards Transactions
Customer Complaints & Feedback
Architecture Approach: Medallion Design

Bronze (Raw) Layer – ClickHouse
Stores ingested raw data in its native format (CSV, JSON, Parquet, Avro, log files).
Supports schema-on-read, enabling flexibility for multiple telecom source applications.
Ideal for long-term storage of subscriber usage records, recharge events, CRM dumps, OSS/BSS logs, and network KPIs.
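As a minimal sketch, a Bronze table for raw recharge events could be declared as follows; the table name, columns, and partitioning key are illustrative assumptions, not the actual schema, and the connection placeholders mirror the Prerequisites section.
import clickhouse_connect
client = clickhouse_connect.get_client(host='your-clickhouse-host', username='your-username', password='your-password')
# Hypothetical Bronze table: the original record is kept as a string for schema-on-read
client.command("""
    CREATE TABLE IF NOT EXISTS bronze_recharge_events_raw
    (
        ingestion_date Date,
        source_system  String,
        raw_payload    String
    )
    ENGINE = MergeTree
    PARTITION BY ingestion_date
    ORDER BY (ingestion_date, source_system)
""")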

Silver Layer – ClickHouse
Stores cleaned, standardized, and enriched data.
Designed with well-defined schemas representing essential telecom entities.
Applies transformations and business rules.
Handles data validation, enrichment, de-duplication, and reference mapping.
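For example, a de-duplicating load from Bronze into a Silver subscriber table could be sketched as below; the bronze_crm_subscribers and silver_subscribers tables and their columns are assumptions for illustration.
import clickhouse_connect
client = clickhouse_connect.get_client(host='your-clickhouse-host', username='your-username', password='your-password')
# Keep the latest record per subscriber (de-duplication via argMax) and standardize the status field
client.command("""
    INSERT INTO silver_subscribers
    SELECT
        subscriber_id,
        argMax(msisdn, updated_at)        AS msisdn,
        upper(argMax(status, updated_at)) AS status,
        max(updated_at)                   AS updated_at
    FROM bronze_crm_subscribers
    GROUP BY subscriber_id
""")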
Gold Layer – ClickHouse
Stores fully curated and optimized datasets for advanced analytics and BI.
Ensures relational integrity across customer, subscription, product, usage, and payment domains.
Serves as the primary source for dashboards, reports, machine learning models, and real-time insights.
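As an illustrative sketch, a curated Gold dataset could aggregate Silver-layer recharges into a daily summary; the table and column names below are assumptions.
import clickhouse_connect
client = clickhouse_connect.get_client(host='your-clickhouse-host', username='your-username', password='your-password')
# Daily recharge revenue per region, ready for dashboards and reports
client.command("""
    INSERT INTO gold_daily_recharge_summary
    SELECT
        toDate(recharge_ts) AS recharge_date,
        region,
        count()             AS recharge_count,
        sum(amount)         AS total_amount
    FROM silver_recharges
    GROUP BY recharge_date, region
""")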
A comprehensive data mapping workbook was developed to define and document all entities and their transformations across layers.
Data Load to Bronze Layer (ClickHouse)
Data is ingested from multiple heterogeneous telecom systems into the Bronze Layer of ClickHouse. The ingestion supports both batch (ETL/ELT pipelines) and real-time streaming (Kafka) methods, depending on source type and business requirement.
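A minimal sketch of the real-time path is shown below, assuming kafka-python for the consumer; the topic name, KAFKA_BROKERS variable, and Bronze table/columns are illustrative assumptions, not the actual configuration.
import json, os
import clickhouse_connect
from kafka import KafkaConsumer   # assumed streaming client
consumer = KafkaConsumer('cdr-events',                                   # hypothetical topic
                         bootstrap_servers=os.environ['KAFKA_BROKERS'],  # hypothetical variable
                         value_deserializer=lambda v: json.loads(v))
ch = clickhouse_connect.get_client(host=os.environ['DB_HOST'],
                                   port=int(os.environ.get('DB_PORT', '18123')),
                                   username=os.environ['DB_USER'],
                                   password=os.environ['DB_PASSWORD'])
batch = []
for msg in consumer:
    batch.append([msg.value.get('msisdn'), json.dumps(msg.value)])       # keep the raw payload as-is
    if len(batch) >= 1000:                                               # flush in micro-batches
        ch.insert('bronze_cdr_events_raw', batch,
                  column_names=['msisdn', 'raw_payload'])
        batch.clear()
The batch (ETL/ELT) path is sketched step by step in the Data Ingestion Workflow section below.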
Data Sources
CRM Systems – customer, account, orders
Billing & Revenue Management Systems – invoices, payments, recharge history
Recharge & Payment – recharge events, payment transactions
OSS/BSS Systems – network faults, element status
CDRs (Call Detail Records) & Usage Data – voice, SMS, data, roaming usage (CSV, Parquet, Avro)
Campaign Systems – campaigns, offers
Data Ingestion Workflow
Data Extraction
Connects to source systems via JDBC, APIs, file ingestion, or Kafka topics.
Executes custom SQL queries (for structured DB sources) or runs stream listeners (for real-time logs/events).
Data Processing
Converts extracted data into Pandas DataFrames or direct ClickHouse insert batches.
Applies lightweight processing (e.g., timestamp normalization, schema alignment, metadata tagging).
Data Load into ClickHouse (Bronze Layer)
Loads raw data into ClickHouse Bronze tables (partitioned by date, region, or system).
Retains the native structure of source systems for traceability.
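A minimal end-to-end sketch of this extract, process, and load flow for a batch source follows; SQLAlchemy and the CRM_DB_URL variable are assumptions for the source connection, and the table/column names are illustrative.
import os
import pandas as pd
import clickhouse_connect
from sqlalchemy import create_engine   # assumed driver for the source system
source = create_engine(os.environ['CRM_DB_URL'])   # hypothetical source connection string
ch = clickhouse_connect.get_client(host=os.environ['DB_HOST'],
                                   port=int(os.environ.get('DB_PORT', '18123')),
                                   username=os.environ['DB_USER'],
                                   password=os.environ['DB_PASSWORD'],
                                   database=os.environ['DB_NAME'])
# 1. Extraction: custom SQL with a date filter for incremental loads
start_date = os.environ.get('START_DATE', '2023-01-01')
df = pd.read_sql(f"SELECT * FROM crm_orders WHERE updated_at >= '{start_date}'", source)
# 2. Lightweight processing: timestamp normalization and metadata tagging
df['updated_at'] = pd.to_datetime(df['updated_at'], utc=True)
df['ingestion_date'] = pd.Timestamp.now(tz='UTC').date()
df['source_system'] = 'CRM'
# 3. Load: the raw structure is retained in the partitioned Bronze table
ch.insert_df('bronze_crm_orders', df)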
Key Considerations
Flexible Date Ranges – supports full load, incremental load, or real-time streaming ingestion.
Partitioned Storage in ClickHouse – tables partitioned by ingestion_date, region, or service_type for efficient query pruning.
Comprehensive Logging – ingestion jobs write logs to monitoring tables (etl_job_log) for observability.
Error Handling & Retry Mechanism – failed batches are logged, quarantined, and retried.
Columnar Storage (Parquet / Native ClickHouse format) – efficient for telecom-scale datasets (billions of CDRs, transactions).
Environment-Based Configuration – uses secrets managers or environment variables for DB/API credentials.
Real-time + Batch – hybrid ingestion supports both daily batch loads (e.g., invoices) and real-time streams (e.g., CDRs, recharges).
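The logging and retry behaviour could be sketched as follows; the etl_job_log column names and the backoff policy shown here are assumptions.
import time
def load_with_retry(ch, table, df, job_name, max_retries=3):
    # ch is a clickhouse-connect client; every attempt is logged to etl_job_log
    for attempt in range(1, max_retries + 1):
        try:
            ch.insert_df(table, df)
            status = 'SUCCESS'
        except Exception as exc:
            status = f'FAILED: {exc}'
        ch.insert('etl_job_log',
                  [[job_name, table, len(df), attempt, status]],
                  column_names=['job_name', 'target_table', 'row_count', 'attempt', 'status'])
        if status == 'SUCCESS':
            return
        time.sleep(30 * attempt)   # simple backoff before retrying
    # after max_retries the failed batch would be quarantined for manual review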
Prerequisites
Required Python Packages
pip install clickhouse-connect
Environment Variables Setup
Create/set the following environment variables:
# Database Configuration
DB_HOST=your-clickhouse-host
DB_PORT=18123
DB_NAME=your-database-name
DB_USER=your-username
DB_PASSWORD=your-password
# Optional: Date Configuration
START_DATE=2023-01-01
END_DATE=2025-08-31
DAY_OFFSET=1
Implementation Steps
Step 1: Database Connection Setup
The pipeline uses clickhouse-connect to connect to ClickHouse. Please ensure:
Your database is accessible from your execution environment
The database has the required tables and views
The user has SELECT permissions on the target tables
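A minimal sketch of the connection setup using the environment variables from the Prerequisites section:
import os
import clickhouse_connect
client = clickhouse_connect.get_client(
    host=os.environ['DB_HOST'],
    port=int(os.environ.get('DB_PORT', '18123')),
    username=os.environ['DB_USER'],
    password=os.environ['DB_PASSWORD'],
    database=os.environ['DB_NAME'],
)
print(client.command('SELECT version()'))   # simple connectivity check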
Step 2: Query Configuration
Define your queries in the queries list within the main() function:
Important Query Guidelines:
Use f-string formatting with {start_date} and {end_date} placeholders
Include proper date filtering for incremental loads
Use meaningful labels that will become S3 file prefixes
Ensure queries are optimized for performance
Handle NULL values appropriately in transformations
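An illustrative queries list following these guidelines; the main() signature, labels, and table/column names are assumptions about the surrounding pipeline.
def main(start_date, end_date=None, **kwargs):
    end_date = end_date or start_date
    queries = [
        ('crm_orders',                                  # label, used as the file/table prefix
         f"""SELECT order_id, subscriber_id, coalesce(status, 'UNKNOWN') AS status, updated_at
             FROM crm_orders
             WHERE updated_at BETWEEN '{start_date}' AND '{end_date}'"""),
        ('recharge_events',
         f"""SELECT recharge_id, msisdn, amount, recharge_ts
             FROM recharges
             WHERE toDate(recharge_ts) BETWEEN '{start_date}' AND '{end_date}'"""),
    ]
    ...   # each (label, sql) pair is then extracted and loaded as in the workflow above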
Step 3: Date Parameter Configuration
Single Date:
main(..., start_date='2024-01-01') # end_date defaults to start_date
Date Range:
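The call below is a sketch inferred from the single-date example, assuming main() also accepts an end_date argument:
main(..., start_date='2024-01-01', end_date='2024-01-31') # both bounds passed explicitly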