Cassandra Reader
The Cassandra Reader component allows BDB Data Pipeline users to read data from Apache Cassandra, a distributed NoSQL database designed for high availability, linear scalability, and low-latency read/write operations. This component enables pipelines to connect to a Cassandra cluster, run CQL queries, and ingest the retrieved data into downstream processing stages.
The Cassandra Reader is commonly used for operational analytics, real-time data integration, customer activity workflows, and ML feature extraction from NoSQL stores.
Overview
Cassandra’s wide-column, partitioned storage model is optimized for fast, scalable access to large datasets. The Cassandra Reader integrates this capability directly into the BDB Data Pipeline by providing:
Secure, configurable connection to Cassandra clusters
Ability to query specific keyspaces and tables
Support for user-defined CQL queries
Seamless ingestion of Cassandra data into the BDB workflow
The component acts as a Reader, meaning it introduces external data into the pipeline for further transformation, enrichment, or storage.
Component Placement
The Cassandra Reader is available under:
Data Engineering → Pipelines → Components → Readers
Dragging it onto the canvas and selecting it opens the configuration panel containing:
Basic Information
Meta Information
Basic Information Tab
This tab includes general metadata, such as:
Component Name
Description (optional)
These fields allow easy identification of the reader in multi-step pipelines.
Meta Information Tab
The Meta Information tab holds all connection and query configurations required to read from a Cassandra database. While the exact UI layout may vary slightly across versions, the following fields are standard.
Cassandra Connection Fields
Host*
The hostname or IP address of a Cassandra contact point (seed node). Any reachable cluster node can serve as the contact point; the driver typically discovers the remaining nodes from it.
Port*
The port for Cassandra connections. Default is 9042 (native transport port).
Username*
Cassandra authentication username (required for clusters with authentication enabled).
Password*
Authentication password corresponding to the provided username.
Keyspace*
The Cassandra keyspace from which data should be queried.
Table Name*
The specific table within the keyspace to read from.
Optional Fields
Consistency Level
Defines the consistency level for read operations, for example ONE, QUORUM, LOCAL_QUORUM, or ALL. If left unset, the driver's default consistency level applies.
Fetch Size
Sets the number of rows retrieved per page when query results are paginated. Tuning it helps balance memory usage and performance on large datasets.
Query Field
CQL Query
A CQL (Cassandra Query Language) SELECT query used to retrieve data. If empty, a default SELECT * is executed for the specified table.
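The fields above can be pictured as a small configuration record. The following is a minimal sketch, not the component's actual implementation: the class name CassandraReaderConfig, its attribute names, and the validation rules are assumptions made for illustration. Only the field list and the default port 9042 come from this page.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical container mirroring the Meta Information fields.
# The names are illustrative and are not the component's internal API.
@dataclass
class CassandraReaderConfig:
    host: str
    username: str
    password: str
    keyspace: str
    table: str
    port: int = 9042                          # native transport port
    consistency_level: Optional[str] = None   # e.g. "LOCAL_QUORUM"
    fetch_size: Optional[int] = None          # rows per page
    cql_query: Optional[str] = None           # empty means "read the whole table"

    def validate(self) -> None:
        """Raise ValueError if a required (*) field is empty or a value is out of range."""
        for name in ("host", "username", "password", "keyspace", "table"):
            if not str(getattr(self, name)).strip():
                raise ValueError(f"{name} is required")
        if not 1 <= self.port <= 65535:
            raise ValueError("port must be between 1 and 65535")
        if self.fetch_size is not None and self.fetch_size <= 0:
            raise ValueError("fetch_size must be a positive integer")
```

Calling validate() before a run surfaces a missing required field early, rather than as a connection error at execution time.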
Execution Behavior
When the pipeline runs:
The Cassandra Reader initializes a secure session with the Cassandra cluster using the host, port, and authentication details.
The component either:
Executes a provided CQL Query, or
Executes SELECT * FROM <keyspace>.<table> if no query is provided.
Data is fetched in batches (pagination handled by the Cassandra driver).
Results are passed to the next component as structured pipeline output.
If authentication or cluster connection fails, the pipeline marks the component as Failed, logs the error, and stops further execution.
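The run sequence above can be approximated outside the pipeline with the open-source DataStax Python driver (cassandra-driver). This is a sketch under assumptions: the function names resolve_query and read_rows are hypothetical, and the component's internal implementation is not documented here and may differ. The driver import sits inside the function so the helper for choosing the query can be used without the driver installed.

```python
from typing import Iterator, Optional


def resolve_query(keyspace: str, table: str, cql_query: Optional[str] = None) -> str:
    """Return the user-supplied CQL query, or a default full-table SELECT."""
    if cql_query and cql_query.strip():
        return cql_query.strip()
    return f"SELECT * FROM {keyspace}.{table}"


def read_rows(host: str, port: int, username: str, password: str,
              keyspace: str, table: str,
              cql_query: Optional[str] = None,
              fetch_size: int = 1000,
              consistency_level: Optional[str] = None) -> Iterator[dict]:
    """Yield rows as dicts; the driver fetches further pages on demand."""
    # Imported lazily so this module loads even without cassandra-driver.
    from cassandra import ConsistencyLevel
    from cassandra.auth import PlainTextAuthProvider
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement, dict_factory

    auth = PlainTextAuthProvider(username=username, password=password)
    cluster = Cluster([host], port=port, auth_provider=auth)
    try:
        session = cluster.connect()
        session.row_factory = dict_factory
        # None leaves the session's default consistency level in effect.
        level = getattr(ConsistencyLevel, consistency_level) if consistency_level else None
        statement = SimpleStatement(
            resolve_query(keyspace, table, cql_query),
            fetch_size=fetch_size,
            consistency_level=level,
        )
        # Iterating the result set transparently requests the next page.
        for row in session.execute(statement):
            yield row
    finally:
        cluster.shutdown()
```

Because read_rows is a generator, rows stream to the caller one page at a time instead of being loaded into memory all at once.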
Supported Query Types
The Cassandra Reader supports:
SELECT queries
WHERE clause filtering
ALLOW FILTERING (when explicitly included in the query)
Partition key–based querying for performance optimization
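To make the supported query types concrete, the sketch below contrasts a partition-key query with one that needs ALLOW FILTERING. The table shop.user_events, its columns, and the assumption that user_id is an integer partition key with event_time as a clustering column are all hypothetical. The helper uses_allow_filtering is likewise illustrative and is not part of the component.

```python
import re

# Hypothetical table: shop.user_events, partitioned by user_id and
# clustered by event_time. All names are illustrative only.
PARTITION_KEY_QUERY = (
    "SELECT user_id, event_time, event_type "
    "FROM shop.user_events "
    "WHERE user_id = 42 AND event_time >= '2024-01-01'"
)

# Filtering on a non-key column cannot be served from a single partition,
# so Cassandra rejects it unless ALLOW FILTERING is written explicitly.
FILTERING_QUERY = (
    "SELECT user_id, event_type "
    "FROM shop.user_events "
    "WHERE event_type = 'login' ALLOW FILTERING"
)

_ALLOW_FILTERING = re.compile(r"\bALLOW\s+FILTERING\b", re.IGNORECASE)


def uses_allow_filtering(query: str) -> bool:
    """Return True if the CQL text contains an ALLOW FILTERING clause."""
    return bool(_ALLOW_FILTERING.search(query))
```

A check like this could be run over queries before they are saved in the CQL Query field, to flag those likely to scan many partitions.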
Common Use Cases
Operational Analytics Pipelines
Extract time-series or user-activity records stored in Cassandra for BI dashboards.
ML Feature Engineering
Fetch customer behavior attributes or device telemetry to feed predictive models.
Event Sourcing Pipelines
Ingest Cassandra-stored events into streaming or batch dataflows.
Data Synchronization
Pull data from Cassandra into a data lake, warehouse, or other downstream systems.
Best Practices
Cluster Connectivity
Provide multiple contact points for high availability.
Ensure the BDB Pipeline environment has network access to the Cassandra cluster.
Query Optimization
Query using partition keys whenever possible.
Avoid ALLOW FILTERING unless necessary due to performance implications.
Security
Store credentials securely and rotate regularly.
Use secure connection settings (SSL/TLS) if enabled in the cluster.
Performance Tuning
Adjust Fetch Size for large tables.
Use appropriate Consistency Level based on latency vs. accuracy requirements.
Error Handling & Troubleshooting
Connection failure
Possible cause: Incorrect host/port or network blocks.
Resolution: Verify Cassandra host accessibility and firewall settings.
Authentication error
Possible cause: Wrong username/password.
Resolution: Confirm credentials and cluster authentication mode.
Query timeout
Possible cause: Large unbounded queries.
Resolution: Add filters, reduce fetch size, or query by partition key.
Empty results
Possible cause: Query filters too restrictive.
Resolution: Validate the query against Cassandra directly.