Best Practices for Ingesting Real-Time Data and Optimizing Data Processing in a Data Lake
Managing Small Files in Real-Time Ingestion
The Problem: Excessive Small Files
When streaming data (e.g., from Kafka, Kinesis, or Spark Structured Streaming) is written directly to a data lake, each micro-batch or task often creates its own file. Over time, this leads to millions of small files, which severely impact performance and stability.
Why Small Files Are a Problem
Performance degradation: Query engines must open and read thousands of files, and the per-file open, seek, and request overhead dwarfs the cost of scanning the same data from a few large files.
Metadata overhead: Each file is a separate object in S3/ADLS, so listings and per-request charges grow with the file count, increasing both latency and cost.
Job instability: Listing or scanning millions of files can exhaust driver memory or cause timeouts.
Best Practice: File Compaction
Schedule periodic compaction jobs (e.g., hourly or daily) to merge small files into larger, optimally sized ones. This reduces I/O overhead, improves read performance, and accelerates downstream data processing.
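A minimal compaction sketch in PySpark is shown below, assuming Parquet data on S3; the paths, partition layout, and target file count are hypothetical and should be sized so output files land roughly in the 128 MB to 1 GB range. Table formats such as Delta Lake (OPTIMIZE) and Apache Iceberg (the rewrite_data_files procedure) ship built-in compaction that handles the rewrite atomically.

```python
# Minimal partition compaction sketch (PySpark).
# Assumptions: Parquet files on S3, a hypothetical dt-partitioned layout,
# and an illustrative target file count.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-compaction").getOrCreate()

source = "s3://my-lake/events/dt=2025-11-07/"             # partition full of small files
staging = "s3://my-lake/events_compacted/dt=2025-11-07/"  # rewrite target
num_output_files = 16  # roughly total partition size / desired file size

# Read the many small files written by the streaming job...
df = spark.read.parquet(source)

# ...and rewrite them as a few large files in a staging location.
(df.repartition(num_output_files)
   .write
   .mode("overwrite")
   .parquet(staging))

# A follow-up step swaps the staging output into place; with Delta Lake or
# Iceberg, the built-in compaction commands make this rewrite atomic.
```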
Unified vs. Modular Data Processing Jobs
Unified (Single) Job Approach
Multiple aggregations or transformations are performed within one Spark job or pipeline, often from the same source DataFrame (a minimal sketch follows the benefits list below).
Benefits
Avoids repeated reads: The raw data is read once; multiple outputs share the same in-memory dataset, significantly reducing I/O.
Optimizer reuse: Spark’s Catalyst optimizer can reuse common subplans (e.g., joins, filters), minimizing shuffle, CPU, and memory use.
Reduced storage overhead: Intermediate data doesn’t need to be written back to S3/ADLS between stages.
Consistency and atomicity: All aggregations are based on the same snapshot of data, ensuring consistent results across metrics.
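To make these benefits concrete, here is a minimal sketch of the unified approach, assuming a hypothetical events dataset with country, product_id, amount, and user_id columns; one read, cached in memory, feeds several aggregations and writes.

```python
# Unified-job sketch: read once, cache, produce multiple outputs.
# Paths, schema, and KPI names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("unified-aggregations").getOrCreate()

# Read the raw data once and keep it in memory for all downstream aggregations.
events = spark.read.parquet("s3://my-lake/events/dt=2025-11-07/").cache()

# Multiple outputs share the same in-memory dataset.
events_by_country = events.groupBy("country").agg(F.count("*").alias("event_count"))
revenue_by_product = events.groupBy("product_id").agg(F.sum("amount").alias("revenue"))
daily_active_users = events.select("user_id").distinct().count()

events_by_country.write.mode("overwrite").parquet("s3://my-lake/kpis/events_by_country/")
revenue_by_product.write.mode("overwrite").parquet("s3://my-lake/kpis/revenue_by_product/")
print(f"daily active users: {daily_active_users}")

events.unpersist()
```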
Modular Job Approach
Each aggregation or transformation is implemented as a separate job, writing intermediate results between stages (see the sketch after the list below).
When to Use Modular Jobs
Resource constraints: Large unified jobs may require more memory or shuffle capacity than available.
Data reuse: Intermediate outputs can serve multiple downstream teams or pipelines.
Parallel development: Different teams can own, deploy, and evolve their respective modules independently.
Operational isolation: Easier to monitor, debug, and rerun individual stages if failures occur.
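For contrast, here is a minimal sketch of the modular approach over the same hypothetical dataset: stage 1 persists an enriched intermediate result, and stage 2, which would normally run as its own application, reads it back. The event_type and country columns and all paths are assumptions.

```python
# Modular-job sketch: stages communicate through persisted intermediate data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("modular-stages").getOrCreate()

# --- Stage 1: enrich the raw events and persist an intermediate dataset ---
events = spark.read.parquet("s3://my-lake/events/dt=2025-11-07/")
enriched = events.withColumn("is_purchase", F.col("event_type") == "purchase")
enriched.write.mode("overwrite").parquet("s3://my-lake/intermediate/enriched/dt=2025-11-07/")

# --- Stage 2: aggregate from the intermediate output ---
# (Shown in one script for brevity; in practice this is a separate job that can
#  be owned, scheduled, monitored, and rerun independently.)
enriched_input = spark.read.parquet("s3://my-lake/intermediate/enriched/dt=2025-11-07/")
purchases_by_country = (enriched_input
                        .filter(F.col("is_purchase"))
                        .groupBy("country")
                        .agg(F.count("*").alias("purchase_count")))
purchases_by_country.write.mode("overwrite").parquet("s3://my-lake/kpis/purchases_by_country/")
```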
Collecting Data, Statistics, and Operational Metrics
When running a unified Spark job that performs multiple aggregations in one DAG, collecting inline data statistics and operational metrics provides strong observability, reliability, and optimization benefits.
Key Advantages
Early detection of data quality issues: Track record counts, null ratios, schema drift, and min/max values in real time. Example: If the daily record count drops 90% or a column type changes, trigger an automated alert before the issue propagates.
Reduced I/O and overhead: Gather all metrics directly from in-memory data without separate validation jobs or re-reads.
Single source of truth for monitoring: Centralizing runtime metrics (input volume, skew, processing time) in one job simplifies dashboards and alerting.
Performance optimization: Use statistics (e.g., partition size distribution, record counts) to detect skew and fine-tune partitioning, shuffle, and parallelism for future runs.
Better resource utilization insights: Correlate Spark metrics (CPU, memory) with data-level metrics to identify performance bottlenecks and scaling needs.
Faster debugging and root-cause analysis: Inline logs of record counts, schema summaries, and aggregation outputs simplify failure diagnosis without re-running multiple jobs.
Improved lineage and auditing: Capture key metadata per run:
Input/output record counts
Aggregation outputs
Job version and timestamps
Example: Run ID 2025-11-07 processed 8.2M records and produced 5 KPIs with no data loss.
Anomaly detection and alerting: Integrate metrics with Prometheus, Datadog, or CloudWatch to detect deviations (e.g., >20% drop in row count, schema change, performance regression).
Continuous optimization and trend analysis: Aggregate metrics across runs to track long-term trends (data growth rate, column cardinality, storage footprint) for proactive tuning and capacity planning.
Examples of Inline Metrics to Capture
Data Volume: Total rows, rows per partition, bytes read/written
Data Quality: Null %, distinct counts, min/max values
Performance: Execution time per stage, shuffle size, spill counts
Skew & Distribution: Records per key, partition size variance
Operational: Input/output timestamps, job ID
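To tie the advantages and example metrics above to code, here is a minimal sketch that computes a few of these metrics in a single aggregation pass, assuming the events DataFrame from the unified-job sketch earlier; the metric names and the alerting check are illustrative, and in practice the resulting dictionary would be logged per run or pushed to Prometheus, Datadog, or CloudWatch.

```python
# Inline metrics sketch: row count plus per-column null counts in one pass.
from pyspark.sql import functions as F

def collect_basic_metrics(df, run_id):
    """Compute data-volume and data-quality metrics without a separate validation job."""
    agg_exprs = [F.count(F.lit(1)).alias("row_count")]
    for c in df.columns:
        agg_exprs.append(
            F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(f"null_count_{c}")
        )
    stats = df.agg(*agg_exprs).collect()[0].asDict()
    stats["run_id"] = run_id
    return stats

metrics = collect_basic_metrics(events, run_id="2025-11-07")

# Illustrative check: fail fast (or alert) when input volume collapses.
if metrics["row_count"] == 0:
    raise ValueError(f"Run {metrics['run_id']}: zero input records, aborting before KPIs are overwritten.")
```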