MongoDB Reader

The MongoDB Reader component reads data from a specified collection in a MongoDB database. It supports filtering with Spark SQL queries, column-level selection, and connections to sharded clusters. Authentication uses the username/password or connection string mechanisms provided by MongoDB.

Configuration Sections

The MongoDB Reader component configurations are organized into the following sections:

  • Basic Information

  • Meta Information

  • Resource Configuration

  • Connection Validation

Accessing the Component Properties

  1. Drag and drop the MongoDB Reader component into the Workflow Editor.

  2. Click on the component to open the configuration panel.

  3. Configure parameters across the tabs described below.

Basic Information Tab

| Parameter | Description | Example | Required |
|---|---|---|---|
| Invocation Type | Defines how the reader runs. Options: Real-Time or Batch. | Batch | Yes |
| Deployment Type | Displays the deployment type for the component (pre-selected). | Kubernetes | No |
| Container Image Version | Displays the container image version for the component (pre-selected). | v1.2.3 | No |
| Failover Event | Select a failover event from the drop-down menu. | retry_event | Optional |
| Batch Size | Maximum number of records to be processed in a single execution cycle. | 5000 | Yes |

Note: When Batch is selected, a grace period option appears to allow the component to shut down gracefully after a defined time window.
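
For intuition, Batch Size is roughly analogous to the driver-level notion of a cursor batch. A minimal sketch using pymongo (illustrative only; the component manages batching internally, and the connection values are placeholders):

```python
# Illustrative only: the component manages batching internally.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo_user:password@192.168.1.100:27017")
collection = client["analytics_db"]["employee_table"]

# batch_size caps how many documents the server returns per round trip,
# roughly analogous to the Batch Size setting above.
for doc in collection.find({}, batch_size=5000):
    ...  # per-record processing happens here
```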

Meta Information Tab

Connection Settings

| Parameter | Description | Example | Required |
|---|---|---|---|
| Connection Type | Options: Standard, SRV, Connection String. | Standard | Yes |
| Host IP Address | Hostname or IP address of the MongoDB server (Standard only). | 192.168.1.100 | Yes |
| Port | MongoDB port (Standard only). | 27017 | Yes |
| Username | Username for authentication. | mongo_user | Yes |
| Password | Password for authentication. | ******** | Yes |
| Database Name | Target database name. | analytics_db | Yes |
| Collection Name | Target collection name. | employee_table | Yes |
| Connection String | MongoDB URI (appears only for the Connection String type). | mongodb+srv://cluster0… | Yes (Connection String type only) |
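
For reference, the three connection types map onto the standard MongoDB URI forms. A minimal sketch (host names, credentials, and the cluster address are placeholders):

```python
from urllib.parse import quote_plus

user, pwd = "mongo_user", quote_plus("s3cret!")  # escape special characters

# Standard: explicit host and port.
standard_uri = f"mongodb://{user}:{pwd}@192.168.1.100:27017/analytics_db"

# SRV: a DNS seed list; the driver resolves hosts from SRV records.
srv_uri = f"mongodb+srv://{user}:{pwd}@cluster0.example.net/analytics_db"

# Connection String: a full URI pasted verbatim, including extra options.
connection_string = standard_uri + "?replicaSet=rs0&authSource=admin"
```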

Query and Schema Settings

| Parameter | Description | Example | Required |
|---|---|---|---|
| Partition Column | Unique numeric column used for partitioning reads. | _id | Optional |
| Query | Spark SQL query. Use the collection name as the table name. | SELECT * FROM employee_table WHERE department='HR' | Optional |
| Limit | Restricts the number of records retrieved from the collection. | 1000 | Optional |
| Schema File Name | Upload a Spark schema in JSON format. | employee_schema.json | Optional |
| Cluster Sharded | Enable if reading from a sharded MongoDB cluster. | true | Optional |
| Additional Parameters | Additional connection options. | replicaSet=rs0&authSource=admin | Optional |
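
To see how these settings fit together, here is a minimal sketch assuming the MongoDB Spark Connector 10.x (the platform's internal connector and option names may differ; all values are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mongodb-reader-sketch").getOrCreate()

df = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb://mongo_user:password@192.168.1.100:27017")
    .option("database", "analytics_db")
    .option("collection", "employee_table")
    # Partition Column: split the read across executors on a unique field.
    .option("partitioner", "com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner")
    .option("partitioner.options.partition.field", "_id")
    .load()
)

# Query: register the collection under its own name, then filter with Spark SQL.
df.createOrReplaceTempView("employee_table")
result = spark.sql("SELECT * FROM employee_table WHERE department = 'HR' LIMIT 1000")
```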

SSL Configuration

| Parameter | Description | Example | Required |
|---|---|---|---|
| Enable SSL | Enable SSL for secure MongoDB connections. Requires certificates. | true | Optional |
| Certificate Folder | Select the folder containing uploaded SSL certificates. | mongo_certs | Conditional (when SSL is enabled) |

Required SSL files (uploaded via Admin Settings → Certificate Upload):

  • Certificate file (.pem)

  • Key file (.key)
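
Once these files are uploaded, a client connects with TLS options along these lines (a pymongo sketch; paths and file names are placeholders, and pymongo expects the certificate and private key combined in one PEM file):

```python
from pymongo import MongoClient

client = MongoClient(
    "mongodb://mongo_user:password@192.168.1.100:27017/analytics_db",
    tls=True,
    tlsCertificateKeyFile="mongo_certs/client.pem",  # client certificate + private key
    tlsCAFile="mongo_certs/ca.pem",                  # CA that signed the server certificate
)
```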

Column Filtering

You can select specific columns to read instead of retrieving the entire collection:

  • Column Filter Section: Choose columns, assign alias names, and specify column types.

  • Upload File: Upload schema files (CSV or JSON, < 2 MB).

  • Download Data (Schema): Download the schema structure in JSON format.
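
In Spark terms, column filtering with aliases and types amounts to a projection. A minimal sketch, where df is the DataFrame produced by the reader and the column names are illustrative:

```python
from pyspark.sql.functions import col

filtered = df.select(
    col("emp_name").alias("employee_name"),            # alias: rename the column
    col("salary").cast("double").alias("salary_usd"),  # override the column type
    col("department"),                                  # keep as-is
)
```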

Example Spark SQL Query

```sql
SELECT department, AVG(salary) AS average_salary
FROM employee_table
GROUP BY department;
```

In this example, employee_table corresponds to the Collection Name specified in the Meta Information tab.
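
The same aggregate can also be expressed with the DataFrame API. A sketch, again assuming df is the DataFrame read from the collection:

```python
from pyspark.sql.functions import avg

average_salary = df.groupBy("department").agg(avg("salary").alias("average_salary"))
```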

Example Configuration: Standard Connection

```
Connection Type: Standard
Host IP Address: 192.168.1.100
Port: 27017
Username: mongo_user
Password: ********
Database Name: analytics_db
Collection Name: employee_table
Query: SELECT * FROM employee_table WHERE status = 'active'
Cluster Sharded: false
Enable SSL: true
Certificate Folder: mongo_certs
```
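
Collapsed into a single URI, this configuration would look roughly as follows (a sketch; the exact TLS query parameter can vary by driver version):

```python
from urllib.parse import quote_plus

password = quote_plus("********")  # placeholder; escape real passwords
uri = f"mongodb://mongo_user:{password}@192.168.1.100:27017/analytics_db?tls=true"
```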

Notes

  • Use Partition Column when working with large collections to improve performance.

  • For sharded clusters, ensure the target cluster is properly configured and accessible.

  • Schema uploads enforce structure and are recommended for production workloads.
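
The Spark schema JSON expected by the Schema File Name field can be generated from a StructType. A minimal sketch with illustrative field names:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("department", StringType(), nullable=True),
    StructField("salary", DoubleType(), nullable=True),
])

# Serialize to the JSON form Spark understands, e.g. employee_schema.json.
with open("employee_schema.json", "w") as f:
    f.write(schema.json())
```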