Pinot Reader

The Pinot Reader is typically used in scenarios where high-speed analytical queries need to be integrated with batch or streaming workflows running inside the BDB Platform.

The Pinot Reader component enables BDB Data Pipeline users to read data directly from Apache Pinot, a real-time distributed OLAP datastore designed for low-latency analytics. It supports high-performance analytical queries and enables seamless ingestion of real-time and offline datasets for processing, modeling, and operational workflows.

Its flexible configuration options make it suitable for a wide range of advanced analytics, BI, and ML use cases where fast query response times are essential.

Overview

Apache Pinot provides fast OLAP-style queries on real-time and offline datasets. The Pinot Reader component integrates seamlessly with Pinot’s query engine, enabling:

  • Extraction of real-time analytical datasets

  • Execution of SQL queries directly within the pipeline

  • Combining Pinot results with other data sources

  • Triggering downstream processing based on Pinot metrics

When added to the pipeline, the Pinot Reader acts as the entry (Reader) component for ingesting data from a Pinot cluster.
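The reader's interaction with the cluster can be pictured with a small sketch. Apache Pinot exposes a REST SQL endpoint that accepts a JSON payload of the form {"sql": "..."}; the helper function below is illustrative, not the component's internal API:

```python
import json

# Illustrative helper: build the HTTP request a reader could send to Pinot's
# SQL REST endpoint. The payload shape {"sql": "..."} follows Apache Pinot's
# REST API; the function name and structure here are hypothetical.
def build_sql_request(host: str, port: int, sql: str) -> tuple[str, str]:
    url = f"http://{host}:{port}/sql"    # controller-side query endpoint
    payload = json.dumps({"sql": sql})   # Pinot expects {"sql": "..."}
    return url, payload

url, body = build_sql_request(
    "pinot-controller.mycompany.com", 9000,
    "SELECT country, COUNT(*) FROM pageviews GROUP BY country",
)
```

The same pattern applies regardless of which table or query the pipeline is configured with; only the SQL string changes.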

Component Placement

You can add the Pinot Reader from:

Data Engineering → Data Pipeline → Components → Readers

After you drag the component onto the pipeline canvas, select it to open the configuration panel, which has two tabs:

  • Basic Information

  • Meta Information

The accompanying screenshot shows the Meta Information tab.

Basic Information Tab

This tab contains general component metadata such as:

  • Component Name

  • Description (optional)

These fields help users identify and document the component within complex pipelines.

Meta Information Tab

The Meta Information tab contains all fields required to connect to a Pinot controller and execute queries.

The tab includes the following fields:

Required Fields

  • Pinot Host*: Hostname or IP address of the Pinot Controller node, e.g., pinot-controller.mycompany.com.

  • Pinot Port*: Port on which the Pinot Controller query API is exposed. The default Pinot Controller port is 9000.

  • Pinot Table*: Name of the Pinot table to query, e.g., pageviews, sales_offline, or clickstream_rt.

Optional Fields

  • Fetch Size: Number of records to fetch per batch. Defaults to 1000; adjust based on dataset volume and performance requirements.

Query Section

  • Pinot Query: SQL query executed against the specified Pinot table. Supports Pinot-compatible SQL syntax.
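For example, either of the following query strings could be supplied in the Pinot Query field (the table and column names below are illustrative; the reader accepts Pinot-compatible SQL with or without a filter):

```python
# Illustrative values for the Pinot Query field; pageviews and its columns
# are example names, not part of the product.
full_scan = "SELECT user_id, page, view_time FROM pageviews LIMIT 1000"
filtered = (
    "SELECT user_id, page, view_time FROM pageviews "
    "WHERE view_time > 1700000000000 LIMIT 1000"
)
```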

Query Execution Behavior

During pipeline execution:

  1. The Pinot Reader establishes a connection with the configured Pinot Controller.

  2. It executes the user-defined query under Pinot Query.

  3. Results are fetched in batches defined by Fetch Size.

  4. The resulting dataset is passed to downstream components for processing or transformation.

If the query returns no records, the reader outputs an empty dataset.
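The batching behavior in steps 3 and 4 can be sketched as follows. This is a client-side illustration under stated assumptions, not the component's actual implementation; `fetch_size` mirrors the Fetch Size field:

```python
from typing import Iterable, Iterator

def batches(rows: Iterable[dict], fetch_size: int = 1000) -> Iterator[list[dict]]:
    """Yield rows in batches of at most fetch_size records (illustrative)."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == fetch_size:
            yield batch
            batch = []
    if batch:          # trailing partial batch
        yield batch
    # An empty result set yields no batches, matching the reader's
    # behavior of passing an empty dataset downstream.

result = list(batches([{"id": i} for i in range(5)], fetch_size=2))
```

With five input rows and a fetch size of 2, the sketch produces two full batches and one trailing batch of a single record.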

Note: The Pinot Reader supports SQL queries with or without a filter.

Common Use Cases

Real-Time Analytics Extraction

Execute queries on Pinot’s real-time tables for dashboards, anomaly detection, or event-driven decisions.

Joining Real-Time and Historical Data

Use Pinot data in combination with Data Lake, databases, or streaming sources within the pipeline.

Preprocessing for ML Pipelines

Fetch derived analytics features from Pinot to feed into machine learning training workflows.

Operational Monitoring

Extract metrics for:

  • Latency measurements

  • Request patterns

  • Event counts

  • Usage statistics

Best Practices

Pinot Host & Port

  • Always connect to the Pinot Controller, not the Broker or Server node directly.

  • Ensure network access rules allow secure API communication.

Fetch Size Optimization

  • Use smaller fetch sizes for latency-sensitive pipelines.

  • Increase fetch size for large scans to improve throughput.

Efficient Query Design

  • Prefer SELECT column subsets over SELECT *.

  • Use time-based predicates to reduce scan volume.

  • Avoid complex joins (Pinot is optimized for low-latency aggregations and filtering).
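The guidelines above can be combined in a small sketch of a query builder that selects only a column subset and bounds the scan with a time predicate (all names here are hypothetical; the epoch-millisecond timestamps assume a millisecond-granularity time column):

```python
def build_scan_query(table: str, columns: list[str],
                     time_col: str, start_ms: int, end_ms: int,
                     limit: int = 1000) -> str:
    # Select only the needed columns and apply a time-based predicate,
    # per the best practices above. No joins: Pinot is optimized for
    # low-latency filtering and aggregation.
    cols = ", ".join(columns)
    return (f"SELECT {cols} FROM {table} "
            f"WHERE {time_col} >= {start_ms} AND {time_col} < {end_ms} "
            f"LIMIT {limit}")

q = build_scan_query("clickstream_rt", ["user_id", "event"],
                     "event_time", 1700000000000, 1700003600000)
```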

Monitoring & Debugging

  • Validate connectivity via Pinot’s Swagger UI before configuring the pipeline.

  • Review Pinot logs if the pipeline reports query execution errors.
