ES Reader

Configure the Elasticsearch (ES) Reader task to ingest documents from an Elasticsearch cluster into your data job (Spark) for downstream transformations and analytics.

Prerequisites

  • Network access from the job’s compute environment to the ES endpoint (HTTP/HTTPS).

  • An ES user with read privileges on the target index/alias.

  • The host, port, and (if applicable) credentials.

  • Knowledge of the target index/alias and any document ID you intend to read directly.

Tip: Start with a small sample (use a narrow time window or a LIMIT in your Spark SQL Query) to validate connectivity and schema before running full loads.
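For example, a minimal smoke-test query, assuming the reader registers the loaded documents as a logical table named <table> (as in the examples later on this page):

-- Smoke test only: SELECT * with a LIMIT is fine for sampling, but avoid SELECT * in production loads
SELECT *
FROM <table>
LIMIT 100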

Quick Start

  1. Drag the ES Reader task to the workspace and open it. The Meta Information tab opens by default.

  2. Enter the Host IP Address (or hostname) and Port (default 9200 for HTTP; Elastic Cloud often uses 443).

  3. (Optional) Provide Username/Password if your cluster requires Basic Auth.

  4. Supply Resource Type (see note below) and Index ID if you need a specific document; otherwise, leave Index ID blank to read sets of documents.

  5. (Optional) Enable Is Date Rich True if your dataset contains time fields and you want enhanced date handling.

  6. (Optional) Provide a Spark SQL Query to filter/project/join after load (a minimal sketch follows these steps).

  7. Click Save Task In Storage to persist the configuration.
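
A minimal Query for step 6 (a sketch, assuming Is Date Rich True is enabled and the index has a timestamp field; event_ts is a hypothetical name):

-- With Is Date Rich True enabled, ES date fields are parsed so they can be compared as Spark timestamps
SELECT _id, event_ts
FROM <table>
WHERE event_ts >= TIMESTAMP '2025-09-01 00:00:00'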


Meta Information — Field Reference

Field | Required | Example | Description / Best Practices
Host IP Address | Yes | 10.10.20.15 or es-prod.company.com | ES endpoint (IP or DNS). Use the load balancer or a coordinating node address, not a data node IP that may rotate.
Port | Yes | 9200 (HTTP) / 443 (Elastic Cloud) | Ensure firewall/security groups allow egress from your compute to this port.
Index ID | No | AWn3…Zq | Document ID for reading a single document. Leave blank to read multiple documents from an index/alias via the reader’s default selection and/or your Query.
Resource Type | No | _doc | Logical document type. On ES 7+/8, use _doc or omit it; mapping types are deprecated in 7.x and removed in 8.x.
Is Date Rich True | No | Enabled | Turns on enhanced date handling in the reader (parses/normalizes ES date fields for Spark operations). Keep time zones explicit in downstream logic.
Username | No | es_reader | Basic Auth user with read privileges on the target index/alias.
Password | No | •••••••• | Store via the platform’s credential store; avoid plain text.
Query | No | Spark SQL | Optional post-load Spark SQL (filters, projections, joins). When used, Index ID is typically not needed.

Usage Patterns

Read a specific document by ID

  • Set Host, Port, Username/Password (if required).

  • Set Resource Type (e.g., _doc).

  • Set Index ID (document _id).

  • Leave the Query blank (or use it only for post‑processing once the single document is registered; a sketch follows).
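
If you do use the Query for post-processing, a minimal sketch (status and updated_at are hypothetical field names):

-- Project just the fields you need from the single registered document
SELECT _id, status, updated_at
FROM <table>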

Read a set of documents (typical)

  • Configure Host, Port, and credentials.

  • Ensure the reader is pointed at the desired index/alias (via the index/pattern field in the Data Configuration tab).

  • Leave Index ID blank.

  • Use Query (Spark SQL) to filter/project the dataset inside Spark.

Query (Spark SQL) — Examples

Filter by recent time window (assuming the reader registers a logical table named <table>):

SELECT _id, user, event_type, event_ts
FROM <table>
WHERE event_ts >= TIMESTAMP '2025-09-01 00:00:00'

Project only needed fields:

SELECT _id, order_id, customer_id, total_amount
FROM <table>

Join with a dimension already registered in the job:

SELECT f._id, f.order_id, d.customer_name, f.total_amount
FROM <table> AS f
JOIN dim_customers AS d
  ON f.customer_id = d.customer_id
WHERE f.event_ts >= CURRENT_TIMESTAMP() - INTERVAL '7' DAY

Pushdown: Depending on the connector, simple predicates may be pushed down to ES. Still, prefer index‑time partitioning (e.g., date‑based indices with aliases) to limit scanned data.
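
A sketch of the difference (event_ts is a hypothetical timestamp field; whether the first form is actually pushed down depends on your connector):

-- Pushdown-friendly: the raw column is compared to a constant
SELECT _id, event_ts
FROM <table>
WHERE event_ts >= TIMESTAMP '2025-09-01 00:00:00'

-- Harder to push down: wrapping the column in a function usually forces evaluation in Spark
-- WHERE date_format(event_ts, 'yyyy-MM-dd') >= '2025-09-01'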

Validation Checklist

  1. Connectivity: Test with a small time window or a known document ID.

  2. Auth: Verify credentials (401 indicates failed authentication, e.g., bad credentials or realm; 403 indicates the authenticated user lacks read privileges on the index).

  3. Index/alias: Confirm the reader points to the expected index pattern/alias.

  4. Schema & dates: If Is Date Rich True is enabled, validate parsed timestamps and time zone behavior on a sample.

  5. Row counts: Cross‑check a small sample vs. Kibana or an ES DSL query for sanity.
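
For the row-count check (item 5), a minimal count over a bounded window that you can compare against Kibana or the ES _count API (event_ts is a hypothetical field):

SELECT COUNT(*) AS row_count
FROM <table>
WHERE event_ts >= TIMESTAMP '2025-09-01 00:00:00'
  AND event_ts <  TIMESTAMP '2025-09-02 00:00:00'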

Performance & Scaling

  • Index layout: Prefer date‑partitioned indices queried through a moving alias or an index pattern (e.g., logs-2025-*), then narrow further with Query.

  • Column pruning: Select only required fields in the Query; avoid SELECT * in production.

  • Parallelism: If your connector exposes slicing/scroll or batch‑size options, increase them for large scans; otherwise, scale Spark executors and use time‑based filters to balance the workload.

  • Avoid hotspots: Filters and aggregations on keyword (unanalyzed) fields are cheaper in ES than on analyzed text fields.

  • Mapping hygiene: Ensure numeric/date fields are mapped as such in ES to avoid expensive conversions.

Security & Governance

  • Use TLS/HTTPS; trust the cluster CA (or import custom certs on the runtime).

  • Prefer least‑privilege accounts (read‑only role on required indices/aliases).

  • Store secrets in the platform’s credential store; avoid embedding credentials in plain text.

  • Audit changes to Host, Port, and Query fields.

Save & Next Steps

  • Click Save Task In Storage to persist your ES Reader configuration.

  • Connect downstream transforms and loads; schedule the data job.

  • For ongoing ingestion, align index/alias strategy (e.g., roll daily indices and a moving alias like logs-recent) with your pipeline’s incremental windows.
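
One way to express the incremental window in the Query (a sketch; the literal stands in for the previous run’s high‑water mark, which your scheduler or job logic would supply):

-- Each scheduled run reads only documents newer than the last successful run's upper bound
SELECT _id, user_id, event_type, event_ts
FROM <table>
WHERE event_ts >= TIMESTAMP '2025-09-01 00:00:00'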

Example Configurations

Single document by ID (troubleshooting lookup)

  • Host: es-prod.company.com

  • Port: 9200

  • Resource Type: _doc

  • Index ID: AWn3…Zq

  • Query: (blank)

Recent events from an alias (last 7 days)

  • Host: es-prod.company.com

  • Port: 9200

  • (Index/alias configured in Data Configuration as events-recent)

  • Query:

SELECT _id, user_id, event_type, event_ts
FROM <table>
WHERE event_ts >= CURRENT_TIMESTAMP() - INTERVAL '7' DAY