MongoDB Reader
The MongoDB Reader component reads data from a specified collection in a MongoDB database. It supports filtering with Spark SQL queries, column-level filtering, and connections to sharded clusters. Authentication is supported via username/password credentials or via options embedded in a MongoDB connection string.
Configuration Sections
The MongoDB Reader configuration is organized into the following sections:
- Basic Information
- Meta Information
- Resource Configuration
- Connection Validation
Accessing the Component Properties
1. Drag and drop the MongoDB Reader component into the Workflow Editor.
2. Click on the component to open the configuration panel.
3. Configure parameters across the tabs described below.
Basic Information Tab
| Field | Description | Example | Required |
| --- | --- | --- | --- |
| Invocation Type | Defines how the reader runs. Options: Real-Time or Batch. | Batch | Yes |
| Deployment Type | Displays the deployment type for the component (pre-selected). | Kubernetes | No |
| Container Image Version | Displays the container image version for the component (pre-selected). | v1.2.3 | No |
| Failover Event | Select a failover event from the drop-down menu. | retry_event | Optional |
| Batch Size | Maximum number of records to be processed in a single execution cycle. | 5000 | Yes |
Note: When Batch is selected, a grace period option appears to allow the component to shut down gracefully after a defined time window.
Meta Information Tab
Connection Settings
| Field | Description | Example | Required |
| --- | --- | --- | --- |
| Connection Type | Options: Standard, SRV, Connection String (the URIs each type produces are sketched below the table). | Standard | Yes |
| Host IP Address | Hostname or IP address of MongoDB (Standard only). | 192.168.1.100 | Yes |
| Port | MongoDB port (Standard only). | 27017 | Yes |
| Username | Username for authentication. | mongo_user | Yes |
| Password | Password for authentication. | ******** | Yes |
| Database Name | Target database name. | analytics_db | Yes |
| Collection Name | Target collection name. | employee_table | Yes |
| Connection String | MongoDB URI (appears only for the Connection String type). | mongodb+srv://cluster0… | Yes (Connection String type only) |
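The Standard and SRV types are shorthand for building a MongoDB URI from the fields above, while the Connection String type takes the URI verbatim. A minimal sketch of that mapping in Python; the cluster hostname is a placeholder, and this illustrates URI construction only, not the component's internals:

```python
from urllib.parse import quote_plus

# Hypothetical values mirroring the connection fields above.
user, password = "mongo_user", "s3cret/pass"
host, port, database = "192.168.1.100", 27017, "analytics_db"

# Standard type: host and port are supplied explicitly.
standard_uri = f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"

# SRV type: a DNS seed-list address with no port (placeholder hostname).
srv_uri = f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@cluster0.example.net/{database}"

print(standard_uri)
print(srv_uri)
```

Percent-encoding the credentials with quote_plus keeps characters such as "/" or "@" in passwords from breaking the URI.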
Query and Schema Settings
| Field | Description | Example | Required |
| --- | --- | --- | --- |
| Partition Column | Unique numeric column used for partitioning reads. | _id | Optional |
| Query | Spark SQL query. Use the collection name as the table name. | SELECT * FROM employee_table WHERE department='HR' | Optional |
| Limit | Restricts the number of records retrieved from the collection. | 1000 | Optional |
| Schema File Name | Upload a Spark schema in JSON format (an example appears below the table). | employee_schema.json | Optional |
| Cluster Sharded | Enable if reading from a sharded MongoDB cluster. | true | Optional |
| Additional Parameters | Additional connection options. | replicaSet=rs0&authSource=admin | Optional |
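The schema file uses Spark's standard StructType JSON layout. A minimal sketch of producing one with PySpark; the field names are hypothetical and should match the collection being read:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical fields for the employee_table example.
schema = StructType([
    StructField("_id", StringType(), True),
    StructField("department", StringType(), True),
    StructField("salary", DoubleType(), True),
])

# StructType.json() emits the {"type": "struct", "fields": [...]} document
# that Spark can later parse back with StructType.fromJson().
with open("employee_schema.json", "w") as f:
    f.write(schema.json())
```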
SSL Configuration
| Field | Description | Example | Required |
| --- | --- | --- | --- |
| Enable SSL | Enable SSL for secure MongoDB connections. Requires certificates. | true | Optional |
| Certificate Folder | Select the folder containing uploaded SSL certificates. | mongo_certs | Conditional (required when Enable SSL is true) |
Required SSL files (uploaded via Admin Settings → Certificate Upload):
- Certificate file (.pem)
- Key file (.key)
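For reference, this is roughly how a TLS connection looks from a Python client, assuming pymongo is installed; the paths are hypothetical, and note that pymongo expects the client certificate and private key combined into a single PEM file:

```python
from pymongo import MongoClient

client = MongoClient(
    "mongodb://192.168.1.100:27017/",
    tls=True,
    tlsCAFile="mongo_certs/ca.pem",                  # hypothetical CA bundle
    tlsCertificateKeyFile="mongo_certs/client.pem",  # combined cert + key
)
print(client.admin.command("ping"))
```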
Column Filtering
You can select specific columns to read instead of retrieving the entire collection:
- Column Filter Section: Choose columns, assign alias names, and specify column types (a Spark equivalent is sketched after this list).
- Upload File: Upload schema files (CSV or JSON, < 2 MB).
- Download Data (Schema): Download the schema structure in JSON format.
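Column filtering amounts to a projection over the collection. A minimal PySpark sketch, assuming a DataFrame df has already been loaded from the collection; the column names and types are hypothetical:

```python
from pyspark.sql.functions import col

# Keep two columns, renaming and casting one of them, which mirrors
# the alias and column-type options in the Column Filter section.
filtered = df.select(
    col("department"),
    col("salary").cast("double").alias("base_salary"),
)
```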
Example Spark SQL Query
```sql
SELECT department, AVG(salary) AS average_salary
FROM employee_table
GROUP BY department;
```
In this example, employee_table corresponds to the Collection Name specified in the Meta Information tab.
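One way to reproduce this behavior in plain PySpark is to register the loaded collection as a temporary view named after the collection, so the query can reference it as a table. A minimal sketch, assuming spark is an active SparkSession and df holds the collection:

```python
# Expose the collection under its collection name for Spark SQL.
df.createOrReplaceTempView("employee_table")

result = spark.sql("""
    SELECT department, AVG(salary) AS average_salary
    FROM employee_table
    GROUP BY department
""")
result.show()
```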
Example Configuration: Standard Connection
- Connection Type: Standard
- Host IP Address: 192.168.1.100
- Port: 27017
- Username: mongo_user
- Password: ********
- Database Name: analytics_db
- Collection Name: employee_table
- Query: SELECT * FROM employee_table WHERE status = 'active'
- Cluster Sharded: false
- Enable SSL: true
- Certificate Folder: mongo_certs
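A rough PySpark equivalent of this configuration, assuming the MongoDB Spark Connector 10.x (format name "mongodb") is available; the option names below belong to that connector and are not necessarily what the component uses internally:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mongodb-reader-example").getOrCreate()

# URI built from the Standard-type fields above; tls=true reflects Enable SSL.
# Certificate trust is configured on the Spark side (e.g. a JVM truststore),
# not in the URI itself. The password is redacted.
uri = "mongodb://mongo_user:REDACTED@192.168.1.100:27017/?tls=true"

df = (
    spark.read.format("mongodb")
    .option("connection.uri", uri)
    .option("database", "analytics_db")
    .option("collection", "employee_table")
    .load()
)

df.createOrReplaceTempView("employee_table")
spark.sql("SELECT * FROM employee_table WHERE status = 'active'").show()
```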
Notes
- Use Partition Column when working with large collections to improve performance (see the sketch after these notes).
- For sharded clusters, ensure the target cluster is properly configured and accessible.
- Schema uploads enforce structure and are recommended for production workloads.
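With the MongoDB Spark Connector 10.x, a partition column corresponds to the partitioner's partition field, which lets Spark scan the collection in parallel. A minimal sketch under that assumption (the option names are the connector's, and spark is an active SparkSession):

```python
# Split the read on _id so each Spark task scans one slice of the collection.
df = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb://192.168.1.100:27017")
    .option("database", "analytics_db")
    .option("collection", "employee_table")
    .option("partitioner",
            "com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner")
    .option("partitioner.options.partition.field", "_id")
    .option("partitioner.options.partition.size", "64")  # target MB per partition
    .load()
)
```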