Deployment Considerations
Scheduling
Use the scheduling option in the pipeline/Jobs configuration to run loads on a fixed cadence
Consider time zones for daily batch processing
Implement overlap handling for late-arriving data
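As a sketch of the overlap handling above, assuming the job tracks its last successful load time (the variable names and the two-hour overlap are illustrative assumptions, not values from this pipeline):

from datetime import datetime, timedelta, timezone

# Illustrative value: in practice the last successful end time would come
# from the audit/log tables described later in this document.
last_successful_end = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
OVERLAP = timedelta(hours=2)  # re-read a small buffer to pick up late-arriving data

# Start the new window before the previous end time so late rows are re-read;
# downstream inserts should therefore be idempotent (e.g. deduplicate on a key).
window_start = last_successful_end - OVERLAP
window_end = datetime.now(timezone.utc)
print(f"Extract records with event_time >= {window_start} and < {window_end}")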
Environment Management
Maintain separate configurations for dev/staging/prod
Monitoring and Alerting
Monitor ClickHouse storage costs and partition usage.
Track data freshness (lag in hours/days) per source.
Automate alerting for missing data (e.g., no CDRs in the last 1 hour).
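A minimal freshness check, assuming a ClickHouse source table cdr_records with an event_time column (both names are assumptions to adapt to your schema):

from datetime import datetime
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="your-clickhouse-host", port=18123,
    username="your-username", password="your-password",
)

# Hypothetical source table and timestamp column.
latest = client.query("SELECT max(event_time) FROM cdr_records").result_rows[0][0]

if latest is None:
    print("ALERT: cdr_records is empty")
else:
    # Assumes event_time is returned as a naive UTC datetime.
    lag_hours = (datetime.utcnow() - latest).total_seconds() / 3600
    if lag_hours > 1:
        print(f"ALERT: no CDRs in the last 1 hour (latest event: {latest})")

Running the same query per source table gives the per-source freshness lag in hours or days.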
Prerequisites
Python Dependencies
pip install clickhouse-connect
# Database Configuration
DB_HOST=your-clickhouse-host
DB_PORT=18123
DB_NAME=your-database-name
DB_USER=your-username
DB_PASSWORD=your-password
DB_SCHEMA=public # Target schema (optional, defaults to public)
# Optional: Table Filtering
FILTER_TABLES=table1, table2, table3 # Comma-separated list
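A sketch of how the pipeline might read this configuration; using python-dotenv here is an assumption, plain os.environ works the same way:

import os
from dotenv import load_dotenv  # pip install python-dotenv (optional helper)

load_dotenv()  # loads the .env values shown above, if a .env file is present

db_config = {
    "host": os.environ["DB_HOST"],
    "port": int(os.environ.get("DB_PORT", "8123")),
    "database": os.environ["DB_NAME"],
    "username": os.environ["DB_USER"],
    "password": os.environ["DB_PASSWORD"],
}
schema = os.environ.get("DB_SCHEMA", "public")

# Optional table filter: an empty value means "process every discovered table".
filter_tables = [t.strip() for t in os.environ.get("FILTER_TABLES", "").split(",") if t.strip()]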
Audit Tables Setup
Create these tables in your PostgreSQL database:
CREATE SCHEMA silver_etl_master;
-- ABDR TABLE--
Key Features
Customer master reference table for telecom/CRM systems.
Tracking customer lifecycle (activation → termination).
Filtering by account type, status, or KYC verification.
Supporting joins with invoices, subscriptions, usage records, etc.
Historical analysis based on created_date and timestamp.
Pipeline Behaviour
Connects to ClickHouse DB and discovers table folders
Finds the latest date partition for each table
Reads all Parquet files from that partition
Combines files into a single DataFrame
Maps columns to the target database table
Inserts data with audit logging
Records success/failure for each table
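These steps could be sketched roughly as follows; the folder layout, the partition naming, and the record_audit helper are assumptions for illustration, not the pipeline's actual code:

import glob
import os

import clickhouse_connect
import pandas as pd

DATA_ROOT = "/data/abdr"  # assumed layout: <DATA_ROOT>/<table_name>/<YYYY-MM-DD>/*.parquet

client = clickhouse_connect.get_client(
    host=os.environ["DB_HOST"],
    port=int(os.environ.get("DB_PORT", "8123")),
    username=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)

def record_audit(table, partition, status, rows=0, error=None):
    # Placeholder for writing a row to the audit tables described earlier.
    print(table, partition, status, rows, error)

for table_dir in sorted(glob.glob(os.path.join(DATA_ROOT, "*"))):
    table_name = os.path.basename(table_dir)
    try:
        # Latest date partition (YYYY-MM-DD folder names sort lexically by date).
        latest = sorted(os.listdir(table_dir))[-1]
        files = glob.glob(os.path.join(table_dir, latest, "*.parquet"))

        # Combine all Parquet files from that partition into a single DataFrame.
        df = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)

        # Map columns: keep only those that exist in the target table.
        target_cols = [row[0] for row in client.query(f"DESCRIBE TABLE {table_name}").result_rows]
        df = df[[c for c in df.columns if c in target_cols]]

        client.insert_df(table_name, df)
        record_audit(table_name, latest, status="SUCCESS", rows=len(df))
    except Exception as exc:
        record_audit(table_name, None, status="FAILED", error=str(exc))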
Quick Troubleshooting
Column errors: Ensure target tables exist with matching columns
Connection failures: Verify environment variables and network access
Memory issues: Read and insert the data in batches instead of loading a whole partition at once (see the sketch below).
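One possible batched approach, streaming each Parquet file in record batches (the file and table names are illustrative):

import clickhouse_connect
import pyarrow.parquet as pq

client = clickhouse_connect.get_client(
    host="your-clickhouse-host", port=18123,
    username="your-username", password="your-password",
    database="your-database-name",
)

parquet_file = pq.ParquetFile("part-0000.parquet")  # illustrative file name

# Read and insert ~100k rows at a time instead of materialising the whole file.
for batch in parquet_file.iter_batches(batch_size=100_000):
    client.insert_df("abdr_table", batch.to_pandas())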
Monitoring
Review the audit tables after each run to confirm which jobs succeeded and which failed.
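A quick check could look like the following, assuming the audit tables live in the PostgreSQL schema created above; the table and column names (etl_audit_log, run_timestamp, status, error_message) are assumptions, not the actual schema:

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host="your-postgres-host", dbname="your-database-name",
    user="your-username", password="your-password",
)

# Hypothetical audit table and columns; adapt to the tables created in silver_etl_master.
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT table_name, run_timestamp, error_message
        FROM silver_etl_master.etl_audit_log
        WHERE status = 'FAILED'
          AND run_timestamp >= now() - interval '1 day'
    """)
    for table_name, run_ts, error in cur.fetchall():
        print(f"{run_ts}  {table_name}: {error}")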
Implementation Checklist
Set up environment variables
Create audit tables (see Audit Tables Setup)
Create target tables in ClickHouse
Test with a small dataset first
Set up monitoring and alerting
Schedule for production runs
Data Load to Gold layer (ClickHouse to ClickHouse)
The data load from the ABDR layer to the Aggregation layer is managed through parameterized ETL jobs and pipelines. These jobs are designed to follow a controlled and auditable execution process.
To facilitate this, we maintain a reporting_logs table.
This table records each successful aggregation load sent to the aggregate tables, including the start_date and end_date of the aggregated range.
End_date is calculated as:
The Aggregate layer was designed with a focus on optimizing reporting performance. Entities in the gold layer were organized into dimension tables, fact tables, and aggregate tables to support efficient analytical queries.
Features
Maintains a log table to track ETL runs (success/failure, reason, date ranges).
Finds the last successful aggregation end date from logs, so it knows where to resume.
Aggregates raw data into daily summaries by usage type and location.
Inserts results into the destination table.
Logs the run outcome (success, failure, or no data).
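Putting those features together, the aggregation job could be sketched as below; every table and column name (reporting_logs, abdr_usage, usage_daily_agg, usage_type, location, usage_amount) is an assumption used for illustration only:

import os
from datetime import date, timedelta

import clickhouse_connect

client = clickhouse_connect.get_client(
    host=os.environ["DB_HOST"],
    port=int(os.environ.get("DB_PORT", "8123")),
    username=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)

# 1. Resume point: last successful aggregation end date from the log table.
last_end = client.query(
    "SELECT max(end_date) FROM reporting_logs WHERE status = 'SUCCESS'"
).result_rows[0][0] or date(2024, 1, 1)          # fallback for the very first run
start_date, end_date = last_end, date.today() - timedelta(days=1)

if start_date >= end_date:
    status, reason = "NO_DATA", "nothing new to aggregate"
else:
    try:
        # 2. Aggregate raw usage into daily summaries by usage type and location.
        client.command(f"""
            INSERT INTO usage_daily_agg (usage_date, usage_type, location, total_usage, event_count)
            SELECT toDate(event_time), usage_type, location, sum(usage_amount), count()
            FROM abdr_usage
            WHERE toDate(event_time) > '{start_date}' AND toDate(event_time) <= '{end_date}'
            GROUP BY toDate(event_time), usage_type, location
        """)
        status, reason = "SUCCESS", ""
    except Exception as exc:
        status, reason = "FAILED", str(exc)

# 3. Log the run outcome together with the date range it covered.
client.insert(
    "reporting_logs",
    [[start_date, end_date, status, reason]],
    column_names=["start_date", "end_date", "status", "reason"],
)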