MongoDB Reader
Use the MongoDB Reader task to read data from a MongoDB collection into the pipeline. This guide explains how to configure the Meta Information tab—connection types, authentication, schema options, and query entry—plus validation, performance, and troubleshooting.
Prerequisites
Network access to the MongoDB deployment (firewall rules/security groups allow inbound traffic from the pipeline runtime).
A MongoDB user with read privileges on the target database/collection (e.g., the read role).
If using TLS/SSL, the necessary CA certificates must be trusted on the pipeline runtime.
(Optional) A Spark schema JSON file if you want to enforce a schema rather than infer it.
Quick Start (Summary)
Drag MongoDB Reader → open Meta Information.
Choose Connection Type (Standard, SRV, or Connection String).
Enter endpoint & credentials; set Database Name.
(Optional) Upload Schema JSON.
Provide an Aggregation Pipeline in Query.
Validate with a small $limit, then finalize downstream mapping.
Configuration Overview
On the canvas, drag the MongoDB Reader to the workspace and select it. The Meta Information tab opens by default. Configure the fields described below.
Connection Type
Choose one of the following:
Standard: Use when you connect via host/IP and port to a standalone server, a replica set, or a sharded cluster through mongos.
SRV: Use when you have a DNS SRV record (typical for MongoDB Atlas). The URI starts with mongodb+srv:// and the port is discovered automatically.
Connection String: Use when you want to supply the entire MongoDB URI (including advanced options) yourself.
Meta Information — Field Reference
Required fields are marked with (*). Some fields appear only for specific Connection Type selections.
| Field | Required | Applies to | Example | Notes |
| --- | --- | --- | --- | --- |
| Connection Type | * | All | Standard / SRV / Connection String | Select the mode of connection. |
| Port | * | Standard only | 27017 | Default is 27017. Hidden for SRV/Connection String. |
| Host IP Address | * | Standard only | 10.10.20.5 | For a replica set or sharded cluster via mongos, enter the appropriate host. |
| Username | * | Standard, SRV | bdb_reader | Supply a user with read access. |
| Password | * | Standard, SRV | •••••••• | Stored securely by the platform’s connection manager. |
| Database Name | * | Standard, SRV | sales | Database to read from (also used for auth if authSource isn’t specified). |
| Connection String | * | Connection String only | mongodb://bdb_reader:***@10.10.20.5:27017/sales?authSource=admin&tls=true | Full MongoDB URI including options. Username/password may be embedded. |
| Additional Parameters | | Standard, SRV | authSource=admin&tls=true&retryReads=true | Passed through to the driver. Use & to separate key-value pairs. |
| Cluster Shared | | Standard, SRV | Enabled | Turn on when connecting to a sharded cluster (via mongos). |
| Schema File Name | | All | Upload JSON file | Optional Spark schema to control types (see example below). |
| Query | | All | Aggregation pipeline JSON | MongoDB Aggregation Pipeline used to select/shape data (see examples below). |
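The Additional Parameters value is an ordinary URI query string, so it can be sanity-checked with standard-library parsing before you paste it into the field (the parameter names here are examples, not an exhaustive list of driver options):

```python
from urllib.parse import parse_qs

raw = "authSource=admin&tls=true&retryReads=true"
# strict_parsing=True raises ValueError on malformed key-value pairs.
parsed = parse_qs(raw, strict_parsing=True)
print(parsed)
# {'authSource': ['admin'], 'tls': ['true'], 'retryReads': ['true']}
```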
Accepted Formats & Examples
Standard
Host IP Address: 10.10.20.5
Port: 27017
Username/Password: bdb_reader / *****
Database Name: sales
Additional Parameters: authSource=admin&tls=true&retryReads=true
Cluster Shared: enable if you connect through mongos to a sharded cluster.
Resulting driver URI (conceptual):
mongodb://bdb_reader:*****@10.10.20.5:27017/sales?authSource=admin&tls=true&retryReads=true
SRV (e.g., MongoDB Atlas)
Connection Type: SRV
Host (SRV domain): cluster0.abcde.mongodb.net (the UI may display “Host IP Address”; enter the SRV hostname)
Username/Password: as provided
Database Name:
sales
Port: not required (SRV discovers it)
Additional Parameters (optional): retryReads=true
Equivalent URI:
mongodb+srv://bdb_reader:*****@cluster0.abcde.mongodb.net/sales?retryReads=true
Full Connection String
Paste the full URI:
mongodb://bdb_reader:*****@10.10.20.5:27017/sales?authSource=admin&tls=true&readPreference=secondaryPreferred&retryReads=true
Use this mode if you need advanced options (e.g., replicaSet, readPreference, compressors, connectTimeoutMS, appName, directConnection).
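A common failure in this mode is special characters (@, :, /) in the password; per the MongoDB URI format these must be percent-encoded. A quick stdlib check of a pasted URI (a sketch using a hypothetical URI; SRV URIs split the same way):

```python
from urllib.parse import urlsplit, parse_qs, quote_plus

# Percent-encode a password before embedding it in the URI.
print(quote_plus("p@ss/word"))   # p%40ss%2Fword

uri = ("mongodb://bdb_reader:p%40ss%2Fword@10.10.20.5:27017/sales"
       "?authSource=admin&tls=true")
parts = urlsplit(uri)
print(parts.hostname, parts.port)           # 10.10.20.5 27017
print(parse_qs(parts.query)["authSource"])  # ['admin']
```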
Schema (Optional) — Spark JSON Example
Upload a JSON file to enforce the read schema and avoid on‑the‑fly inference:
{
"type": "struct",
"fields": [
{"name": "_id", "type": "string", "nullable": false},
{"name": "customerId", "type": "string", "nullable": true},
{"name": "orderDate", "type": "timestamp", "nullable": true},
{"name": "subtotal", "type": "double", "nullable": true},
{"name": "tax", "type": "double", "nullable": true},
{"name": "discount", "type": "double", "nullable": true},
{"name": "tags", "type": {"type": "array", "elementType": "string", "containsNull": true}, "nullable": true}
]
}
Tip: If your documents include ObjectId or Date types, map them to string/timestamp as shown, or supply a compatible logical type per your runtime’s MongoDB/Spark connector.
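Before uploading, you can sanity-check the file with a short script. This is a generic structural check, not the platform's validator; it only verifies the top-level struct/fields shape shown above:

```python
import json

def check_spark_schema(text):
    """Minimal structural check for a Spark JSON schema (not exhaustive)."""
    schema = json.loads(text)
    assert schema.get("type") == "struct", "top-level type must be 'struct'"
    for field in schema["fields"]:
        # Every field entry needs at least name, type, and nullable.
        assert {"name", "type", "nullable"} <= field.keys(), f"incomplete field: {field}"
    return [f["name"] for f in schema["fields"]]

doc = '{"type": "struct", "fields": [{"name": "_id", "type": "string", "nullable": false}]}'
print(check_spark_schema(doc))  # ['_id']
```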
Query — Aggregation Pipeline (JSON)
Enter a valid MongoDB Aggregation Pipeline (JSON array). Use double quotes for keys/strings and Extended JSON for dates/ObjectIds.
Example: monthly revenue per customer over 2024
[
{ "$match": {
"status": "Shipped",
"orderDate": { "$gte": { "$date": "2024-01-01T00:00:00Z" } }
}},
{ "$project": {
"_id": 0,
"customerId": 1,
"orderDate": 1,
"orderTotal": { "$add": ["$subtotal", "$tax", { "$multiply": [-1, "$discount"] }] },
"orderMonth": { "$dateToString": { "format": "%Y-%m", "date": "$orderDate" } }
}},
{ "$group": {
"_id": { "customerId": "$customerId", "orderMonth": "$orderMonth" },
"orders": { "$sum": 1 },
"amount": { "$sum": "$orderTotal" }
}},
{ "$sort": { "_id.orderMonth": 1, "_id.customerId": 1 }},
{ "$limit": 100000 }
]
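Because the Query field expects strict JSON (double-quoted keys, Extended JSON for dates and ObjectIds), a quick pre-check catches most paste errors. A sketch that verifies the text is a JSON array of single-key stage documents:

```python
import json

def check_pipeline(text):
    """Verify the query text is a JSON array of {"$stage": ...} documents."""
    pipeline = json.loads(text)  # raises ValueError on invalid JSON
    assert isinstance(pipeline, list), "pipeline must be a JSON array"
    for stage in pipeline:
        assert isinstance(stage, dict) and len(stage) == 1, f"bad stage: {stage}"
        (name,) = stage
        assert name.startswith("$"), f"stage name must start with '$': {name}"
    return [next(iter(s)) for s in pipeline]

text = '[{"$match": {"status": "Shipped"}}, {"$limit": 100}]'
print(check_pipeline(text))  # ['$match', '$limit']
```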
Example: quick preview (first 100 documents)
[
{ "$match": { "_id": { "$exists": true } } },
{ "$limit": 100 }
]
Validation Checklist
Connectivity: (If available) click Test Connection; otherwise, run a small Preview with $limit.
Permissions: Confirm at least one row returns; errors like “not authorized on <db>” indicate missing privileges.
Schema: If you uploaded a schema, verify field types and nullability match your data; adjust before full runs.
Performance: Ensure your pipeline’s $match aligns with existing indexes (e.g., on orderDate, status).
Data correctness: Row counts and aggregates (totals) match expectations from a direct MongoDB shell query.
Security & Governance
Use a least‑privilege user (read‑only) scoped to the required database/collections.
Prefer TLS/SSL (tls=true) for data in transit; verify certificate chains are trusted.
Avoid embedding passwords directly in pipeline text where possible; use the platform’s credential store/connection manager.
Keep audit logs for changes to connection details and query text.
Performance & Scaling
Filter early: push down $match to minimize data transfer.
Project only needed fields: use $project to drop large arrays/blobs.
Read preference: If replica sets are used for analytics, consider readPreference=secondaryPreferred to offload primaries (Connection String mode).
Sharded clusters: Enable Cluster Shared when reading via mongos; include a shard key in $match where possible for targeted routing.
Batching: Use a reasonable $limit during development; remove or increase it for production.
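The batching advice can be applied mechanically: keep one base pipeline and append a small $limit only for development runs, dropping it for production. A hypothetical helper (the function name and default sample size are illustrative):

```python
def dev_pipeline(stages, sample_size=100):
    """Return a copy of the pipeline capped with $limit for development runs."""
    return stages + [{"$limit": sample_size}]

base = [{"$match": {"status": "Shipped"}},
        {"$project": {"_id": 0, "customerId": 1}}]
print(dev_pipeline(base)[-1])  # {'$limit': 100}
```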