MongoDB Reader

Use the MongoDB Reader task to read data from a MongoDB collection into the pipeline. This guide explains how to configure the Meta Information tab—connection types, authentication, schema options, and query entry—plus validation, performance, and troubleshooting.

Prerequisites

  • Network access to the MongoDB deployment (firewall rules or security groups must allow traffic from the pipeline runtime to the MongoDB port).

  • A MongoDB user with read privileges on the target database/collection (e.g., read role).

  • If using TLS/SSL, the necessary CA certificates must be trusted on the pipeline runtime.

  • (Optional) A Spark schema JSON file if you want to enforce a schema rather than infer it.

Tip: Start with a narrow job (filter early, limit rows) to confirm connectivity and permissions before running a full load.
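Before configuring the task, the network prerequisite can be checked from the pipeline runtime host with a plain TCP probe. The following is a minimal sketch using only the Python standard library; the function name can_reach is hypothetical, and a successful probe confirms only that the port is reachable, not that TLS or authentication will succeed.

```python
import socket


def can_reach(host: str, port: int = 27017, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, can_reach("10.10.20.5", 27017) should return True when the firewall rules are in place. Run it on the same host or container that executes the pipeline, since reachability from a workstation says nothing about the runtime.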

Quick Start (Summary)

  1. Drag MongoDB Reader → open Meta Information.

  2. Choose Connection Type (Standard, SRV, or Connection String).

  3. Enter endpoint & credentials; set Database Name.

  4. (Optional) Upload Schema JSON.

  5. Provide an Aggregation Pipeline in Query.

  6. Validate with a small $limit, then finalize downstream mapping.

Configuration Overview

On the canvas, drag the MongoDB Reader to the workspace and select it. The Meta Information tab opens by default. Configure the fields described below.

Connection Type

Choose one of the following:

  1. Standard: Use when you connect via host/IP and port to a standalone server, a replica set, or a sharded cluster through mongos.

  2. SRV: Use when you have a DNS SRV record (typical for MongoDB Atlas). The URI starts with mongodb+srv:// and the port is discovered automatically.

  3. Connection String: Use when you want to supply the entire MongoDB URI (including advanced options) yourself.

Meta Information — Field Reference

Required fields are marked with (*). Some fields appear only for specific Connection Type selections.

| Field | Required | Appears For | Example | Notes |
|---|---|---|---|---|
| Connection Type | * | All | Standard / SRV / Connection String | Select the mode of connection. |
| Port | * | Standard only | 27017 | Default is 27017. Hidden for SRV/Connection String. |
| Host IP Address | * | Standard, SRV | 10.10.20.5 | For a replica set or a sharded cluster via mongos, enter the appropriate host. For SRV, enter the SRV hostname. |
| Username | * | Standard, SRV | bdb_reader | Supply a user with read access. |
| Password | * | Standard, SRV | •••••••• | Stored securely by the platform’s connection manager. |
| Database Name | * | Standard, SRV | sales | Database to read from (also used for auth if authSource isn’t specified). |
| Connection String | * | Connection String only | mongodb://bdb_reader:***@10.10.20.5:27017/sales?authSource=admin&tls=true | Full MongoDB URI including options. Username/password may be embedded. |
| Additional Parameters | | Standard, SRV | authSource=admin&tls=true&retryReads=true | Passed through to the driver. Use & to separate key‑value pairs. |
| Cluster Shared | | Standard, SRV | Enabled | Turn on when connecting to a sharded cluster (via mongos). |
| Schema File Name | | All | Upload JSON file | Optional Spark schema to control types (see example below). |
| Query | | All | Aggregation pipeline JSON | MongoDB Aggregation Pipeline used to select/shape data (see examples below). |

Please note: The Meta Information tab configures connectivity and global read behavior. The target collection is typically specified in a subsequent tab (e.g., Data Configuration). If your UI places the collection here, enter it exactly as <database>.<collection> or provide the database in this tab and the collection later.

Accepted Formats & Examples

Standard

  • Host IP Address: 10.10.20.5

  • Port: 27017

  • Username/Password: bdb_reader / *****

  • Database Name: sales

  • Additional Parameters: authSource=admin&tls=true&retryReads=true

  • Cluster Shared: enable if you connect through mongos to a sharded cluster.

Resulting driver URI (conceptual):

mongodb://bdb_reader:*****@10.10.20.5:27017/sales?authSource=admin&tls=true&retryReads=true
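To make the mapping from the Standard fields to the URI concrete, here is a minimal Python sketch of how such a URI could be assembled. The function name build_standard_uri is hypothetical and this is not the platform's actual implementation; it only illustrates the conceptual result and the need to percent-encode credentials that contain reserved characters.

```python
from urllib.parse import quote_plus


def build_standard_uri(host, port, username, password, database,
                       additional_params=""):
    """Assemble a mongodb:// URI from the Standard-mode fields."""
    # Percent-encode credentials so characters such as '@', ':' or '/'
    # cannot be mistaken for URI delimiters.
    user = quote_plus(username)
    pwd = quote_plus(password)
    uri = f"mongodb://{user}:{pwd}@{host}:{port}/{database}"
    # Additional Parameters is already an '&'-separated key=value string.
    params = additional_params.strip().lstrip("?&")
    if params:
        uri += "?" + params
    return uri


print(build_standard_uri("10.10.20.5", 27017, "bdb_reader", "p@ss",
                         "sales", "authSource=admin&tls=true"))
```

A password such as p@ss is emitted as p%40ss, which keeps the host portion of the URI unambiguous.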

SRV (e.g., MongoDB Atlas)

  • Connection Type: SRV

  • Host (SRV domain): cluster0.abcde.mongodb.net (The UI may display “Host IP Address”; enter the SRV hostname.)

  • Username/Password: as provided

  • Database Name: sales

  • Port: not required (SRV discovers it)

  • Additional Parameters (optional): retryReads=true

Equivalent URI:

mongodb+srv://bdb_reader:*****@cluster0.abcde.mongodb.net/sales?retryReads=true

Full Connection String

Paste the full URI:

mongodb://bdb_reader:*****@10.10.20.5:27017/sales?authSource=admin&tls=true&readPreference=secondaryPreferred&retryReads=true

Use this mode if you need advanced options (e.g., replicaSet, readPreference, compressors, connectTimeoutMS, appName, directConnection).
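When pasting a long URI, it can help to break it into its parts and confirm that the host, database, and options are what you intended. The sketch below uses only the Python standard library; describe_uri is a hypothetical helper, and it assumes a single host in the URI (a comma-separated replica set host list is not handled).

```python
from urllib.parse import parse_qsl, unquote_plus, urlsplit


def describe_uri(uri):
    """Split a single-host MongoDB URI into its main components."""
    parts = urlsplit(uri)
    if parts.scheme not in ("mongodb", "mongodb+srv"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,  # None for mongodb+srv:// URIs
        "username": unquote_plus(parts.username) if parts.username else None,
        "database": parts.path.lstrip("/") or None,
        "options": dict(parse_qsl(parts.query)),
    }


info = describe_uri(
    "mongodb://bdb_reader:secret@10.10.20.5:27017/sales"
    "?authSource=admin&tls=true&readPreference=secondaryPreferred"
)
print(info["host"], info["port"], info["database"], info["options"])
```

The password is deliberately left out of the returned dictionary so the result can be printed or logged.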

Schema (Optional) — Spark JSON Example

Upload a JSON file to enforce the read schema and avoid on‑the‑fly inference:

{
  "type": "struct",
  "fields": [
    {"name": "_id", "type": "string", "nullable": false},
    {"name": "customerId", "type": "string", "nullable": true},
    {"name": "orderDate", "type": "timestamp", "nullable": true},
    {"name": "subtotal", "type": "double", "nullable": true},
    {"name": "tax", "type": "double", "nullable": true},
    {"name": "discount", "type": "double", "nullable": true},
    {"name": "tags", "type": {"type": "array", "elementType": "string", "containsNull": true}, "nullable": true}
  ]
}
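Before uploading, a schema file can be sanity-checked locally. The following is a minimal sketch, assuming the file follows the struct layout shown above with name, type, and nullable on every field; load_schema_fields is a hypothetical helper and it does not verify that the type names are valid Spark types.

```python
import json


def load_schema_fields(schema_text):
    """Parse Spark schema JSON and return {field name: type} after basic checks."""
    schema = json.loads(schema_text)
    if schema.get("type") != "struct" or not isinstance(schema.get("fields"), list):
        raise ValueError("top level must be a struct with a 'fields' list")
    fields = {}
    for field in schema["fields"]:
        # Every field in the example above carries these three keys.
        missing = {"name", "type", "nullable"} - field.keys()
        if missing:
            raise ValueError(
                f"field {field.get('name', '?')!r} is missing {sorted(missing)}"
            )
        if field["name"] in fields:
            raise ValueError(f"duplicate field name {field['name']!r}")
        fields[field["name"]] = field["type"]
    return fields
```

Calling it with the contents of the schema file raises a ValueError on malformed input, so problems surface before the pipeline run rather than during it.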

Query — Aggregation Pipeline (JSON)

Enter a valid MongoDB Aggregation Pipeline (JSON array). Use double quotes for keys/strings and Extended JSON for dates/ObjectIds.

Example: monthly revenue per customer from January 2024 onward

[
  { "$match": {
      "status": "Shipped",
      "orderDate": { "$gte": { "$date": "2024-01-01T00:00:00Z" } }
    }},
  { "$project": {
      "_id": 0,
      "customerId": 1,
      "orderDate": 1,
      "orderTotal": { "$add": ["$subtotal", "$tax", { "$multiply": [-1, "$discount"] }] },
      "orderMonth": { "$dateToString": { "format": "%Y-%m", "date": "$orderDate" } }
    }},
  { "$group": {
      "_id": { "customerId": "$customerId", "orderMonth": "$orderMonth" },
      "orders": { "$sum": 1 },
      "amount": { "$sum": "$orderTotal" }
    }},
  { "$sort": { "_id.orderMonth": 1, "_id.customerId": 1 }},
  { "$limit": 100000 }
]
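If the date window changes between runs, the $match stage can be generated rather than hand-edited. The sketch below is one possible approach in Python; the helper names are hypothetical, and it assumes timezone-aware datetime values and the same status and orderDate fields as the example above. It adds an exclusive upper bound so the window can be closed, for instance to cover exactly one calendar year.

```python
import json
from datetime import datetime, timezone


def ext_json_date(dt):
    """Render a timezone-aware datetime as an Extended JSON $date value in UTC."""
    return {"$date": dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")}


def shipped_orders_match(start, end):
    """Build a $match stage for shipped orders with start <= orderDate < end."""
    return {
        "$match": {
            "status": "Shipped",
            "orderDate": {
                "$gte": ext_json_date(start),
                "$lt": ext_json_date(end),
            },
        }
    }


stage = shipped_orders_match(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2025, 1, 1, tzinfo=timezone.utc),
)
# json.dumps produces double-quoted JSON suitable for the Query field.
print(json.dumps([stage], indent=2))
```

The printed stage can replace the first stage of the pipeline; the remaining stages stay unchanged.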

Example: quick preview (first 100 documents)

[
  { "$match": { "_id": { "$exists": true } } },
  { "$limit": 100 }
]
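A preview variant of any pipeline can also be derived mechanically. This is a small sketch, assuming the pipeline is held as a Python list of stage dictionaries; with_preview_limit is a hypothetical helper. Note that a trailing $limit caps the output only, so earlier stages such as $group still process the full input.

```python
import copy


def with_preview_limit(pipeline, n=100):
    """Return a copy of the pipeline whose output is capped at n documents."""
    preview = copy.deepcopy(pipeline)  # leave the caller's pipeline untouched
    if preview and "$limit" in preview[-1]:
        # Tighten an existing trailing $limit, never loosen it.
        preview[-1]["$limit"] = min(preview[-1]["$limit"], n)
    else:
        preview.append({"$limit": n})
    return preview


print(with_preview_limit([{"$match": {"status": "Shipped"}}]))
```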

Please Note: Click the Save Task In Storage icon to save the configuration for the dragged reader task.

Validation Checklist

  1. Connectivity: (If available) click Test Connection; otherwise, run a small Preview with $limit.

  2. Permissions: Confirm the preview returns at least one row; errors like "not authorized on <db>" indicate missing privileges.

  3. Schema: If you uploaded a schema, verify field types and nullability match your data; adjust before full runs.

  4. Performance: Ensure your pipeline $match aligns with existing indexes (e.g., on orderDate, status).

  5. Data correctness: Confirm that row counts and aggregates (totals) match the results of the same query run directly in the MongoDB shell.
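The schema check in step 3 can be partly automated against a few sample documents. The following is a minimal sketch, assuming the schema has been loaded into a Python dictionary in the struct layout used by the schema file; check_document is a hypothetical helper that looks only at nullability and field names, not at value types.

```python
def check_document(schema, doc):
    """Return a list of problems found when comparing one document to the schema."""
    problems = []
    declared = {field["name"]: field for field in schema["fields"]}
    for name, field in declared.items():
        # A missing key and an explicit null are treated the same way here.
        if doc.get(name) is None and not field["nullable"]:
            problems.append(f"{name}: null or missing but declared non-nullable")
    for name in doc:
        if name not in declared:
            problems.append(f"{name}: present in data but not in schema")
    return problems
```

An empty list means the sample document is compatible on these two points; any entries indicate fields to fix in the schema or filter in the pipeline before a full run.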

Security & Governance

  • Use a least‑privilege user (read‑only) scoped to the required database/collections.

  • Prefer TLS/SSL (tls=true) for data in transit; verify certificate chains are trusted.

  • Avoid embedding passwords directly in pipeline text where possible; use the platform’s credential store/connection manager.

  • Keep audit logs for changes to connection details and query text.
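When connection details do end up in logs or tickets, the password should be masked first. Here is a minimal sketch using a regular expression; redact_uri is a hypothetical helper, and it assumes the credentials in the URI are percent-encoded as MongoDB URIs require.

```python
import re

# Matches "<scheme>://<user>:<password>@" at the start of the URI.
_CREDENTIALS = re.compile(r"^(mongodb(?:\+srv)?://[^:/@]+):[^@/]*@")


def redact_uri(uri):
    """Replace the password portion of a MongoDB URI with asterisks."""
    return _CREDENTIALS.sub(r"\1:*****@", uri, count=1)


print(redact_uri("mongodb://bdb_reader:s3cret@10.10.20.5:27017/sales"))
```

URIs without embedded credentials pass through unchanged.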

Performance & Scaling

  • Filter early: push down $match to minimize data transfer.

  • Project only needed fields: $project to drop large arrays/blobs.

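  • Read preference: If replica set secondaries are used for analytics, consider readPreference=secondaryPreferred to offload the primary. Set it in the connection string (Connection String mode) or, for Standard/SRV, in Additional Parameters. Reads from secondaries can lag slightly behind the primary.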

  • Sharded clusters: Enable Cluster Shared when reading via mongos; include a shard key in $match where possible for targeted routing.

  • Row limits: Use a small $limit during development; raise or remove it for production runs.
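Some of these guidelines can be checked automatically before a pipeline is saved. The sketch below is an illustrative lint pass, not a platform feature; lint_pipeline is a hypothetical helper that assumes the pipeline is a Python list in which each stage is a dictionary with exactly one operator key.

```python
def lint_pipeline(pipeline):
    """Return advisory warnings for an aggregation pipeline (list of stage dicts)."""
    warnings = []
    # Each stage is a single-key dict, so the first key is the stage operator.
    stages = [next(iter(stage)) for stage in pipeline]
    if not stages or stages[0] != "$match":
        warnings.append("first stage is not $match; filter early to reduce data transfer")
    if "$project" not in stages:
        warnings.append("no $project stage; consider dropping fields you do not need")
    if "$limit" not in stages:
        warnings.append("no $limit stage; add one while developing")
    return warnings


print(lint_pipeline([{"$group": {"_id": "$customerId"}}]))
```

The warnings are advisory only: a pipeline without $project or $limit can be perfectly valid for a production load.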