Data Load & Architecture

Data Collection and Integration

Source Identification & Ingestion

A high-level discovery of data sources was conducted across all relevant domains. The identified data was ingested into the Bronze layer for raw storage.

Core Business Entities Identified

  • Subscribers & Accounts (customers, prepaid, postpaid)

  • Products & Bundles (voice, data, add-ons, VAS)

  • Recharge & Payment Transactions

  • Invoices & Billing Cycles

  • Orders & Service Requests

  • Usage Records (CDRs, SMS, Data, Roaming)

  • Network Elements & Events

  • Campaigns & Offers

  • Loyalty & Rewards Transactions

  • Customer Complaints & Feedback

Architecture Approach: Medallion Design

Bronze Layer (Raw) – ClickHouse

  • Stores ingested raw data in its native format (CSV, JSON, Parquet, Avro, log files).

  • Supports schema-on-read, enabling flexibility for multiple telecom source applications.

  • Ideal for long-term storage of subscriber usage records, recharge events, CRM dumps, OSS/BSS logs, and network KPIs.
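
For JSON-style sources, a minimal sketch of what schema-on-read could look like in the Bronze layer, assuming a hypothetical bronze.raw_events table (all names and columns below are illustrative, not the actual schema):

import clickhouse_connect

# A raw Bronze table keeps each payload as an untyped string and interprets
# its structure only at query time (schema-on-read).
client = clickhouse_connect.get_client(host='your-clickhouse-host', port=18123)

client.command("""
CREATE TABLE IF NOT EXISTS bronze.raw_events (
    ingestion_date Date,
    source_system  String,
    raw_payload    String
) ENGINE = MergeTree()
PARTITION BY ingestion_date
ORDER BY (source_system, ingestion_date)
""")

# The schema is applied at read time with JSON functions.
msisdns = client.query(
    "SELECT JSONExtractString(raw_payload, 'msisdn') AS msisdn "
    "FROM bronze.raw_events WHERE source_system = 'crm'"
).result_rows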

Silver Layer – ClickHouse

  • Stores cleaned, standardized, and enriched data.

  • Designed with well-defined schemas representing essential telecom entities.

  • Performs transformations and business logic.

  • Handles data validation, enrichment, de-duplication, and reference mapping.
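
As an illustration only, a Bronze-to-Silver promotion could be expressed as an INSERT ... SELECT that de-duplicates and standardizes records; the table and column names below are hypothetical:

import clickhouse_connect

client = clickhouse_connect.get_client(host='your-clickhouse-host', port=18123)

# Hypothetical Bronze-to-Silver promotion: de-duplicate, standardize,
# and cast records into the well-defined Silver schema.
client.command("""
INSERT INTO silver.subscribers (subscriber_id, msisdn, created_at)
SELECT DISTINCT
    toString(subscriber_id),
    lower(trimBoth(msisdn)),
    toDateTime(created_at)
FROM bronze.raw_subscribers
WHERE subscriber_id IS NOT NULL
""")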

Gold Layer – ClickHouse

  • Stores fully curated and optimized datasets for advanced analytics and BI.

  • Ensures relational integrity across customer, subscription, product, usage, and payment domains.

  • Serves as the primary source for dashboards, reports, machine learning models, and real-time business insights.
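
For example, a dashboard or model feature could read directly from a curated Gold table; the sketch below assumes a hypothetical gold.daily_revenue table:

import clickhouse_connect

client = clickhouse_connect.get_client(host='your-clickhouse-host', port=18123)

# Hypothetical Gold-layer query feeding a revenue dashboard; query_df
# returns a pandas DataFrame ready for BI or feature engineering.
revenue = client.query_df("""
SELECT report_date, region, sum(recharge_amount) AS total_recharge
FROM gold.daily_revenue
WHERE report_date >= today() - 30
GROUP BY report_date, region
ORDER BY report_date
""")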

A comprehensive data mapping workbook was developed to define and document all entities and their transformations across layers.

Data Load to Bronze Layer (ClickHouse)

Data is ingested from multiple heterogeneous telecom systems into the Bronze Layer of ClickHouse. The ingestion supports both batch (ETL/ELT pipelines) and real-time streaming (Kafka) methods, depending on source type and business requirement.

Data Sources

  • CRM Systems – customer, account, orders

  • Billing & Revenue Management Systems – invoices, payments, recharge history

  • Recharge & Payment – recharge events, payment transactions

  • OSS/BSS Systems – network faults, element status

  • CDRs (Call Detail Records) & Usage Data – voice, SMS, data, roaming usage (CSV, Parquet, Avro)

  • Campaign Systems – campaigns, offers

Data Ingestion Workflow

Data Extraction

  • Connects to source systems via JDBC, APIs, file ingestion, or Kafka topics.

  • Executes custom SQL queries (for structured DB sources) or stream listeners (for real-time logs/events).
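
A minimal sketch of the two extraction paths, assuming a SQLAlchemy-reachable source database and a Kafka topic; the DSN, topic name, and query are placeholders, and sqlalchemy and kafka-python are additional dependencies not listed in the prerequisites:

import pandas as pd
from sqlalchemy import create_engine
from kafka import KafkaConsumer

# Batch path: pull structured records from a source system with a custom SQL query.
engine = create_engine("postgresql://user:password@crm-host:5432/crm")  # placeholder DSN
orders = pd.read_sql("SELECT * FROM orders WHERE updated_at >= '2024-01-01'", engine)

# Streaming path: listen for real-time events (e.g., recharges) on a Kafka topic.
consumer = KafkaConsumer("recharge-events", bootstrap_servers="kafka-host:9092")
raw_events = []
for message in consumer:              # blocks and yields messages as they arrive
    raw_events.append(message.value)  # raw bytes, forwarded to the Bronze load step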

Data Processing

  • Converts extracted data into pandas DataFrames or direct ClickHouse insert batches.

  • Applies lightweight processing (e.g., timestamp normalization, schema alignment, metadata tagging).
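
A sketch of this lightweight processing step on a pandas DataFrame; the column names, rename mapping, and tags are illustrative:

import pandas as pd

def prepare_for_bronze(df: pd.DataFrame, source_system: str) -> pd.DataFrame:
    # Timestamp normalization: parse source timestamps into UTC datetimes.
    df["event_time"] = pd.to_datetime(df["event_time"], utc=True, errors="coerce")
    # Schema alignment: map source-specific column names onto the common layout.
    df = df.rename(columns={"msisdn_no": "msisdn"})
    # Metadata tagging: record provenance and the ingestion date used for partitioning.
    df["source_system"] = source_system
    df["ingestion_date"] = pd.Timestamp.now(tz="UTC").date()
    return df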

Data Load into ClickHouse (Bronze Layer)

  • Loads raw data into ClickHouse Bronze tables (partitioned by date, region, or system).

  • Retains the native structure of source systems for traceability.
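
A minimal load sketch with clickhouse-connect, assuming a hypothetical bronze.recharge_events table partitioned by ingestion_date:

import pandas as pd
import clickhouse_connect

# Processed batch from the previous step (columns are illustrative).
batch = pd.DataFrame({
    "msisdn": ["254700000001"],
    "recharge_amount": [100.0],
    "source_system": ["recharge"],
    "ingestion_date": [pd.Timestamp("2024-01-01").date()],
})

client = clickhouse_connect.get_client(
    host="your-clickhouse-host", port=18123,
    username="your-username", password="your-password",
)

# insert_df maps DataFrame columns to table columns by name; the target
# table is assumed to be partitioned by ingestion_date.
client.insert_df("bronze.recharge_events", batch)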

Key Considerations

  • Flexible Date Ranges – supports full load, incremental load, or real-time streaming ingestion.

  • Partitioned Storage in ClickHouse – tables partitioned by ingestion_date, region, or service_type for efficient query pruning.

  • Comprehensive Logging – ingestion jobs write logs to monitoring tables (etl_job_log) for observability.

  • Error Handling & Retry Mechanism – failed batches are logged, quarantined, and retried (see the sketch after this list).

  • Columnar Storage (Parquet / Native ClickHouse format) – efficient for telecom-scale datasets (billions of CDRs, transactions).

  • Environment-Based Configuration – uses secrets managers or environment variables for DB/API credentials.

  • Real-time + Batch – hybrid ingestion supports both daily batch loads (e.g., invoices) and real-time streams (e.g., CDRs, recharges).
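
A sketch of the logging and retry behaviour described above, assuming an etl_job_log table with job_name, status, rows_loaded, error_message, and logged_at columns (the exact schema and retry policy may differ):

import time
from datetime import datetime
import clickhouse_connect

client = clickhouse_connect.get_client(host="your-clickhouse-host", port=18123)

def load_with_retry(load_batch, job_name, max_attempts=3):
    # load_batch is a hypothetical callable that loads one batch and returns the row count.
    log_columns = ["job_name", "status", "rows_loaded", "error_message", "logged_at"]
    for attempt in range(1, max_attempts + 1):
        try:
            rows = load_batch()
            client.insert("etl_job_log",
                          [[job_name, "SUCCESS", rows, "", datetime.now()]],
                          column_names=log_columns)
            return rows
        except Exception as exc:
            client.insert("etl_job_log",
                          [[job_name, "FAILED", 0, str(exc), datetime.now()]],
                          column_names=log_columns)
            time.sleep(30 * attempt)  # back off before the next attempt
    # After the final failed attempt the batch would be moved to a quarantine
    # location for inspection and manual replay.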

Prerequisites

Required Python Packages

pip install clickhouse-connect pandas

Environment Variables Setup

Create/set the following environment variables:

# Database Configuration

  • DB_HOST=your-clickhouse-host

  • DB_PORT=18123

  • DB_NAME=your-database-name

  • DB_USER=your-username

  • DB_PASSWORD=your-password

# Optional: Date Configuration

  • START_DATE=2023-01-01

  • END_DATE=2025-08-31

  • DAY_OFFSET=1
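
A sketch of how the optional date variables could be resolved in the pipeline, with DAY_OFFSET used as a fallback window for incremental loads (the fallback convention is illustrative):

import os
from datetime import date, timedelta

# An explicit START_DATE / END_DATE takes precedence; otherwise fall back to
# a rolling window of DAY_OFFSET days ending yesterday.
day_offset = int(os.getenv("DAY_OFFSET", "1"))
default_end = date.today() - timedelta(days=1)
default_start = default_end - timedelta(days=day_offset - 1)

start_date = os.getenv("START_DATE", default_start.isoformat())
end_date = os.getenv("END_DATE", default_end.isoformat())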

Implementation Steps

Step 1: Database Connection Setup

The pipeline uses clickhouse-connect to connect to ClickHouse. Please ensure that:

  • Your database is accessible from your execution environment

  • The database has the required tables and views

  • The user has SELECT permissions on the target tables
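
A connection sketch using the environment variables above; the sanity checks at the end are illustrative, and bronze.recharge_events is a hypothetical table:

import os
import clickhouse_connect

client = clickhouse_connect.get_client(
    host=os.environ["DB_HOST"],
    port=int(os.getenv("DB_PORT", "18123")),
    database=os.environ["DB_NAME"],
    username=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

# Basic sanity checks: connectivity and read access to an expected table.
assert client.command("SELECT 1") == 1
client.query("SELECT 1 FROM bronze.recharge_events LIMIT 1")  # hypothetical table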

Step 2: Query Configuration

Define your queries in the queries list within the main() function:
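
A sketch of what this list could look like; the simplified main() signature, the source table names, and the extract_and_load helper are placeholders:

def main(start_date, end_date):
    # Each entry is (label, query); the label becomes the file prefix for the
    # extract, and the date placeholders bound the incremental window.
    queries = [
        ("recharge_events",
         f"SELECT * FROM src.recharges "
         f"WHERE recharge_date BETWEEN '{start_date}' AND '{end_date}'"),
        ("invoices",
         f"SELECT * FROM src.invoices "
         f"WHERE invoice_date BETWEEN '{start_date}' AND '{end_date}'"),
    ]
    for label, query in queries:
        extract_and_load(label, query)  # hypothetical per-query extract/load helper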

Important Query Guidelines:

  • Use f-string formatting with {start_date} and {end_date} placeholders

  • Include proper date filtering for incremental loads

  • Use meaningful labels that will become S3 file prefixes

  • Ensure queries are optimized for performance

  • Handle NULL values appropriately in transformations

Step 3: Date Parameter Configuration

Date Range:

Single Date:

main(..., start_date='2024-01-01') # end_date defaults to start_date
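
For an explicit range, both parameters would presumably be passed the same way (the values below are illustrative):

main(..., start_date='2024-01-01', end_date='2024-03-31') # loads the full range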
