Cassandra Reader

The Cassandra Reader is commonly used for operational analytics, real-time data integration, customer activity workflows, and ML feature extraction from NoSQL stores.

The Cassandra Reader component allows BDB Data Pipeline users to read data from Apache Cassandra, a distributed NoSQL database designed for high availability, linear scalability, and low-latency read/write operations. This component enables pipelines to connect to a Cassandra cluster, run CQL queries, and ingest the retrieved data into downstream processing stages.

Overview

Cassandra’s wide-column storage model is designed for fast, scalable access to large datasets. The Cassandra Reader brings this capability into the BDB Data Pipeline by providing:

  • Secure, configurable connection to Cassandra clusters

  • Ability to query specific keyspaces and tables

  • Support for user-defined CQL queries

  • Seamless ingestion of Cassandra data into the BDB workflow

The component acts as a Reader, meaning it introduces external data into the pipeline for further transformation, enrichment, or storage.

Component Placement

The Cassandra Reader is available under:

Data Engineering → Pipelines → Components → Readers

Drag the component onto the canvas and select it to open the configuration panel, which contains two tabs:

  • Basic Information

  • Meta Information

Basic Information Tab

This tab includes general metadata, such as:

  • Component Name

  • Description (optional)

These fields allow easy identification of the reader in multi-step pipelines.

Meta Information Tab

The Meta Information tab holds all connection and query configurations required to read from a Cassandra database. While the exact UI layout may vary slightly across versions, the following fields are standard.

Cassandra Connection Fields

| Field | Description |
|---|---|
| Host* | Hostname or IP address of a Cassandra contact point (for example, a seed node). Any reachable node in the cluster can serve this role. |
| Port* | The port for Cassandra connections. Default is 9042 (native transport port). |
| Username* | Cassandra authentication username (required for clusters with authentication enabled). |
| Password* | Authentication password corresponding to the provided username. |
| Keyspace* | The Cassandra keyspace from which data should be queried. |
| Table Name* | The specific table within the keyspace to read from. |
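
As an illustration of how these required fields fit together, the sketch below models them as a small Python structure and reports which ones are still empty. The class name `CassandraReaderConfig`, its attribute names, and the sample values are hypothetical and are not part of the BDB product; only the default port 9042 comes from the table above.

```python
from dataclasses import dataclass, fields
from typing import List

DEFAULT_NATIVE_PORT = 9042  # Cassandra native transport port, per the Port field


@dataclass
class CassandraReaderConfig:
    """Hypothetical container mirroring the required Meta Information fields."""
    host: str = ""
    port: int = DEFAULT_NATIVE_PORT
    username: str = ""
    password: str = ""
    keyspace: str = ""
    table_name: str = ""

    def missing_fields(self) -> List[str]:
        """Return the names of required fields that were left empty."""
        return [f.name for f in fields(self) if getattr(self, f.name) in ("", None)]


# Sample values are placeholders; table_name is deliberately omitted.
config = CassandraReaderConfig(
    host="10.0.0.5",
    username="pipeline_user",
    password="change-me",
    keyspace="analytics",
)
print(config.missing_fields())  # ['table_name']
```

A check like this, run before a pipeline is activated, would surface incomplete configuration earlier than a failed connection attempt.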

Optional Fields

| Field | Description |
|---|---|
| Consistency Level | Defines the consistency level for read operations such as ONE, QUORUM, LOCAL_QUORUM, ALL. Defaults to cluster settings. |
| Fetch Size | Sets the fetch size for query pagination. Helps optimize memory usage and performance for large datasets. |
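
To show why Fetch Size affects memory usage, the following sketch simulates paged retrieval in plain Python. It does not use a Cassandra driver; the function `fetch_in_pages` and the generated rows are stand-ins invented for this example. The point is that only one page of rows is held in memory at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fetch_in_pages(rows: Iterable[dict], fetch_size: int) -> Iterator[List[dict]]:
    """Yield rows in pages of at most fetch_size rows each."""
    if fetch_size <= 0:
        raise ValueError("fetch_size must be positive")
    iterator = iter(rows)
    while True:
        page = list(islice(iterator, fetch_size))
        if not page:
            return
        yield page


# Stand-in for a table read: 10 rows produced lazily.
simulated_rows = ({"id": i} for i in range(10))
page_sizes = [len(page) for page in fetch_in_pages(simulated_rows, fetch_size=4)]
print(page_sizes)  # [4, 4, 2]
```

A smaller fetch size lowers peak memory per page at the cost of more round trips; a larger one does the reverse.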

Query Field

| Field | Description |
|---|---|
| CQL Query | A CQL (Cassandra Query Language) SELECT query used to retrieve data. If empty, a default SELECT * is executed for the specified table. |

Execution Behavior

When the pipeline runs:

  1. The Cassandra Reader initializes a secure session with the Cassandra cluster using the host, port, and authentication details.

  2. The component either:

    • Executes a provided CQL Query, or

    • Executes SELECT * FROM <keyspace>.<table> if no query is provided.

  3. Data is fetched in batches (pagination handled by the Cassandra driver).

  4. Results are passed to the next component as structured pipeline output.

If authentication or cluster connection fails, the pipeline marks the component as Failed, logs the error, and stops further execution.
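
The query-selection rule in step 2 can be sketched as a small function. This is an illustration of the documented behavior, not the component's actual implementation; `resolve_query` and the sample keyspace and table names are assumptions made for the example.

```python
def resolve_query(keyspace: str, table: str, cql_query: str = "") -> str:
    """Return the user-supplied CQL if present, else the default full-table SELECT."""
    query = cql_query.strip()
    if query:
        return query
    return f"SELECT * FROM {keyspace}.{table}"


# No query configured: fall back to reading the whole table.
print(resolve_query("analytics", "user_events"))
# SELECT * FROM analytics.user_events

# Query configured: it is used as written.
custom = "SELECT user_id, event_time FROM analytics.user_events WHERE user_id = 42"
print(resolve_query("analytics", "user_events", custom))
```

Because the fallback reads the entire table, supplying an explicit query is usually the safer choice for large tables.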

Supported Query Types

The Cassandra Reader supports:

  • SELECT queries

  • WHERE clause filtering

  • ALLOW FILTERING (when explicitly included in the query)

  • Partition key–based querying for performance optimization
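
A quick way to review a query before adding it to the CQL Query field is to check which of the features above it uses. The helper below is a naive text check written for illustration; `describe_query` is a hypothetical name, it does not parse CQL, and keywords inside string literals would be misread.

```python
import re


def describe_query(cql: str) -> dict:
    """Report whether a CQL string is a SELECT and which optional clauses it uses."""
    # Collapse whitespace, drop a trailing semicolon, and compare case-insensitively.
    normalized = " ".join(cql.split()).rstrip("; ").upper()
    return {
        "is_select": normalized.startswith("SELECT "),
        "has_where": re.search(r"\bWHERE\b", normalized) is not None,
        "allow_filtering": normalized.endswith("ALLOW FILTERING"),
    }


by_key = describe_query("SELECT * FROM analytics.user_events WHERE user_id = 42;")
filtered = describe_query(
    "SELECT * FROM analytics.user_events WHERE country = 'DE' ALLOW FILTERING"
)
not_a_read = describe_query("DELETE FROM analytics.user_events WHERE user_id = 42")

print(by_key)
print(filtered)
print(not_a_read)
```

A query flagged with `allow_filtering` deserves a second look, since it may scan far more data than it returns.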

Common Use Cases

Operational Analytics Pipelines

Extract time-series or user-activity records stored in Cassandra for BI dashboards.

ML Feature Engineering

Fetch customer behavior attributes or device telemetry to feed predictive models.

Event Sourcing Pipelines

Ingest Cassandra-stored events into streaming or batch dataflows.

Data Synchronization

Pull data from Cassandra into a data lake, warehouse, or other downstream systems.

Best Practices

Cluster Connectivity

  • Provide multiple contact points for high availability.

  • Ensure the BDB Pipeline environment has network access to the Cassandra cluster.

Query Optimization

  • Query using partition keys whenever possible.

  • Avoid ALLOW FILTERING unless it is strictly necessary; it can force Cassandra to scan many partitions and degrade performance.

Security

  • Store credentials securely and rotate regularly.

  • Use secure connection settings (SSL/TLS) if enabled in the cluster.

Performance Tuning

  • Adjust Fetch Size for large tables.

  • Choose a Consistency Level that balances read latency against how up to date the returned data must be.
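
The latency trade-off behind Consistency Level comes down to how many replicas must answer each read. The sketch below works through that arithmetic using the standard Cassandra quorum formula, floor(RF / 2) + 1; the function name `replicas_required` is hypothetical. For QUORUM in a multi-datacenter cluster, the replication factor is the sum across datacenters, and for LOCAL_QUORUM it is that of the local datacenter only.

```python
def replicas_required(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must answer a read at the given consistency level."""
    level = consistency_level.upper()
    if level == "ONE":
        return 1
    if level in ("QUORUM", "LOCAL_QUORUM"):
        # A quorum is a majority of replicas: floor(RF / 2) + 1.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


for level in ("ONE", "QUORUM", "ALL"):
    print(level, replicas_required(level, replication_factor=3))
# ONE 1
# QUORUM 2
# ALL 3
```

With a replication factor of 3, QUORUM tolerates one unavailable replica, while ALL fails the read if any replica is down.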

Error Handling & Troubleshooting

| Issue | Possible Cause | Recommendation |
|---|---|---|
| Connection failure | Incorrect host/port or network blocks | Verify Cassandra host accessibility and firewall settings. |
| Authentication error | Wrong username/password | Confirm credentials and cluster authentication mode. |
| Query timeout | Large unbounded queries | Add filters, reduce fetch size, or query by partition key. |
| Empty results | Query filters too restrictive | Validate the query against Cassandra directly. |
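
For the connection failure case, a basic TCP reachability check run from the pipeline environment can help separate network problems from Cassandra-level problems. The sketch below uses only the Python standard library; `can_reach` is a hypothetical helper, and a successful result confirms only that the port accepts connections, not that authentication or queries will succeed.

```python
import socket


def can_reach(host: str, port: int = 9042, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `can_reach("<your-cassandra-host>")` with the value from the Host field returns False when a firewall, wrong port, or unreachable host is the cause.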
