Pandas Query Component
The Pandas Query Component allows users to filter and transform datasets by applying pandas queries on input events. This component is particularly useful when working with tabular data streams, enabling SQL-like filtering, grouping, and conditional logic directly within a pipeline.
Key Capabilities
Apply pandas queries to filter, sort, and transform data.
Supports both batch and real-time execution modes.
Works seamlessly with ingestion components, readers, or shared events.
Compare SQL queries with equivalent pandas queries for ease of transition.
Configuration Overview
All Pandas Query configurations are organized into the following sections:
Basic Information
Meta Information
Resource Configuration
Steps to Configure the Pandas Query Component
Add the Component
Drag and drop the Pandas Query Component into the Workflow Editor.
Connect Events
Create two events (input and output).
Connect:
Input Event → feeds data from ingestion, reader, or shared events.
Output Event → receives the filtered/transformed data.
Access Properties
Click the Pandas Query Component to open its configuration tabs.
Basic Information Tab
Invocation Type – Select the running mode:
Real-Time
Batch
Deployment Type – Pre-selected; indicates component deployment mode.
Container Image Version – Pre-selected docker image version.
Failover Event – Select a fallback event in case of failure.
Batch Size – Enter the maximum number of records per execution cycle (minimum = 10).
Meta Information Tab
Pandas Query – Enter a pandas query string to filter the input event data.
Table Name – Provide the table name (represents the in-event dataset).
Example:
df[(df.gender == 'Female') & (df.department == 'Sales')]
Here,
df
is the table name containing data from the input event.The query filters rows where
gender = 'Female'
anddepartment = 'Sales'
.
Saving the Component Configuration
Click Save Component in Storage to save configuration settings.
A notification message will confirm successful update.
Sample Queries
The table below shows equivalent SQL queries and their pandas query syntax:
SELECT id FROM airports WHERE ident = 'KLAX'
airports[airports.ident == 'KLAX'].id
SELECT * FROM airport_freq WHERE airport_ident = 'KLAX' ORDER BY type
airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type')
SELECT * FROM airports WHERE iso_region = 'US-CA' AND type = 'seaplane_base'
airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]
SELECT type, COUNT(*) FROM airports WHERE iso_country = 'US' GROUP BY type HAVING COUNT(*) > 1000 ORDER BY COUNT(*) DESC
airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False)
Example Use Cases
Filter employee data by department and gender.
Extract all airport records for a given region and type.
Perform group-level filtering, such as selecting categories with more than 1000 records.