Pandas Query Component

The Pandas Query Component allows users to filter and transform datasets by applying pandas queries on input events. This component is particularly useful when working with tabular data streams, enabling SQL-like filtering, grouping, and conditional logic directly within a pipeline.

Key Capabilities

  • Apply pandas queries to filter, sort, and transform data.

  • Supports both batch and real-time execution modes.

  • Works seamlessly with ingestion components, readers, or shared events.

  • Compare SQL queries with equivalent pandas queries for ease of transition.

Configuration Overview

All Pandas Query configurations are organized into the following sections:

  • Basic Information

  • Meta Information

  • Resource Configuration

Steps to Configure the Pandas Query Component

  1. Add the Component

    • Drag and drop the Pandas Query Component into the Workflow Editor.

  2. Connect Events

    • Create two events (input and output).

    • Connect:

      • Input Event → feeds data from ingestion, reader, or shared events.

      • Output Event → receives the filtered/transformed data.

  3. Access Properties

    • Click the Pandas Query Component to open its configuration tabs.

Basic Information Tab

  • Invocation Type – Select the running mode:

    • Real-Time

    • Batch

  • Deployment Type – Pre-selected; indicates component deployment mode.

  • Container Image Version – Pre-selected docker image version.

  • Failover Event – Select a fallback event in case of failure.

  • Batch Size – Enter the maximum number of records per execution cycle (minimum = 10).

Meta Information Tab

  • Pandas Query – Enter a pandas query string to filter the input event data.

  • Table Name – Provide the table name (represents the in-event dataset).

Example:

df[(df.gender == 'Female') & (df.department == 'Sales')]
  • Here, df is the table name containing data from the input event.

  • The query filters rows where gender = 'Female' and department = 'Sales'.

Saving the Component Configuration

  1. Click Save Component in Storage to save configuration settings.

  2. A notification message will confirm successful update.

Sample Queries

The table below shows equivalent SQL queries and their pandas query syntax:

SQL Query
Pandas Query

SELECT id FROM airports WHERE ident = 'KLAX'

airports[airports.ident == 'KLAX'].id

SELECT * FROM airport_freq WHERE airport_ident = 'KLAX' ORDER BY type

airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type')

SELECT * FROM airports WHERE iso_region = 'US-CA' AND type = 'seaplane_base'

airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]

SELECT type, COUNT(*) FROM airports WHERE iso_country = 'US' GROUP BY type HAVING COUNT(*) > 1000 ORDER BY COUNT(*) DESC

airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False)

Example Use Cases

  • Filter employee data by department and gender.

  • Extract all airport records for a given region and type.

  • Perform group-level filtering, such as selecting categories with more than 1000 records.