ES Reader
Configure the Elasticsearch (ES) Reader task to ingest documents from an Elasticsearch cluster into your data job (Spark) for downstream transformations and analytics.
Prerequisites
Network access from the job’s compute environment to the ES endpoint (HTTP/HTTPS).
An ES user with read privileges on the target index/alias.
The host, port, and (if applicable) credentials.
Knowledge of the target index/alias and any document ID you intend to read directly.
Quick Start
Drag the ES Reader task to the workspace and open it. The Meta Information tab opens by default.
Enter the Host IP Address (or hostname) and Port (default 9200 for HTTP; Elastic Cloud often uses 443).
(Optional) Provide Username/Password if your cluster requires Basic Auth.
Supply Resource Type (see note below) and Index ID if you need a specific document; otherwise, leave Index ID blank to read sets of documents.
(Optional) Enable Is Date Rich True if your dataset contains time fields and you want enhanced date handling.
(Optional) Provide a Spark SQL Query to filter/project/join after load.
Click Save Task In Storage to persist the configuration.
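Before saving, it can help to sanity-check the endpoint and credentials you are about to enter. A minimal sketch (host, port, user, and password below are placeholders, not values from any real cluster) of the base URL and Basic Auth header an ES client would send:

```python
import base64

def es_request_parts(host: str, port: int, user: str = "", password: str = "", https: bool = True):
    """Build the base URL and optional Basic Auth header for an ES endpoint."""
    scheme = "https" if https else "http"
    url = f"{scheme}://{host}:{port}"
    headers = {}
    if user:
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        headers["Authorization"] = f"Basic {token}"
    return url, headers

url, headers = es_request_parts("es-prod.company.com", 443, "es_reader", "secret")
print(url)  # https://es-prod.company.com:443
```

Point this URL at the cluster root with any HTTP client; a 200 response with the cluster banner confirms host, port, and auth before you fill in the task fields.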
Meta Information — Field Reference
| Field | Required | Example | Description |
|---|---|---|---|
| Host IP Address | Yes | 10.10.20.15 or es-prod.company.com | ES endpoint (IP or DNS). Use the load balancer or coordinator node address, not a data node IP that may rotate. |
| Port | Yes | 9200 (HTTP) / 443 (Elastic Cloud) | Ensure firewall/security groups allow egress from your compute to this port. |
| Index ID | No | AWn3…Zq | Document ID to read a single document. Leave blank to read multiple documents from an index/alias via the reader's default selection and/or your Query. |
| Resource Type | No | _doc | Logical document type. On ES 7+/8, use _doc or omit it (mapping types are deprecated/removed). |
| Is Date Rich True | No | Enabled | Turns on enhanced date handling in the reader (parses/normalizes ES date fields for Spark operations). Keep time zones explicit in downstream logic. |
| Username | No | es_reader | Basic Auth user with read privileges on the target index/alias. |
| Password | No | •••••••• | Store via the platform's credential store; avoid plain text. |
| Query | No | Spark SQL | Optional post-load Spark SQL (filters, projections, joins). When used, Index ID is typically not needed. |
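For teams templating task definitions, the fields above can be captured as a small config object. The key names below simply mirror the table and are illustrative; the platform's actual persisted format may differ:

```python
import json

# Illustrative only: keys mirror the Meta Information fields above;
# the platform's stored representation may differ.
meta = {
    "hostIpAddress": "es-prod.company.com",
    "port": 9200,
    "indexId": "",                      # blank -> read a set of documents
    "resourceType": "_doc",
    "isDateRichTrue": True,
    "username": "es_reader",
    "password": "${vault:es_reader}",   # credential-store reference, never plain text
    "query": "SELECT _id, event_ts FROM <table>",
}
print(json.dumps(meta, indent=2))
```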
Usage Patterns
Read a specific document by ID
Set Host, Port, Username/Password (if required).
Set Resource Type (e.g., _doc).
Set Index ID (the document _id).
Leave the Query blank (or use it only for post-processing once the single document is registered).
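For reference, reading one document by ID corresponds to ES's GET-document API. A sketch of the path involved (the index name and document ID below are illustrative placeholders):

```python
from urllib.parse import quote

def doc_path(index: str, doc_id: str, resource_type: str = "_doc") -> str:
    """Path for ES's GET-document API: /<index>/_doc/<id> on ES 7+/8."""
    return f"/{quote(index, safe='')}/{resource_type}/{quote(doc_id, safe='')}"

print(doc_path("events", "doc-123"))  # /events/_doc/doc-123
```

URL-encoding the index and ID matters when IDs contain slashes or spaces, which is common with externally generated IDs.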
Read a set of documents (typical)
Configure Host, Port, and credentials.
Ensure the reader is pointed to the desired index/alias (via your platform’s index/pattern field in another tab).
Leave Index ID blank.
Use Query (Spark SQL) to filter/project the dataset inside Spark.
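The platform hides the actual scan, but the equivalent pattern in plain PySpark typically goes through the elasticsearch-hadoop connector ("org.elasticsearch.spark.sql"). A hedged sketch of the option map (the option keys are the connector's; host, index, and credentials are placeholders, and the spark.read call is left as a comment since it needs a live cluster):

```python
def es_reader_options(host, port, user=None, password=None,
                      index="events-recent", dsl_query=None):
    """Options for the elasticsearch-hadoop Spark connector."""
    opts = {"es.nodes": host, "es.port": str(port), "es.resource": index}
    if user:
        opts["es.net.http.auth.user"] = user
        opts["es.net.http.auth.pass"] = password or ""
    if dsl_query:
        opts["es.query"] = dsl_query  # ES DSL pushed down at scan time
    return opts

opts = es_reader_options("es-prod.company.com", 9200, "es_reader", "secret")
# df = spark.read.format("org.elasticsearch.spark.sql").options(**opts).load()
# df.createOrReplaceTempView("events")  # then run your Spark SQL Query against it
print(opts["es.resource"])  # events-recent
```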
Query (Spark SQL) — Examples
Filter by recent time window (assuming the reader registers a logical table named <table>):
SELECT _id, user, event_type, event_ts
FROM <table>
WHERE event_ts >= TIMESTAMP '2025-09-01 00:00:00'
Project only needed fields
SELECT _id, order_id, customer_id, total_amount
FROM <table>
Join with a dimension already registered in the job
SELECT f._id, f.order_id, d.customer_name, f.total_amount
FROM <table> AS f
JOIN dim_customers AS d
ON f.customer_id = d.customer_id
WHERE f.event_ts >= CURRENT_TIMESTAMP() - INTERVAL '7' DAY
Validation Checklist
Connectivity: Test with a small time window or a known document ID.
Auth: Verify credentials (401 means authentication failed; 403 means the user lacks read privileges on the index, or the realm/role mapping is wrong).
Index/alias: Confirm the reader points to the expected index pattern/alias.
Schema & dates: If Is Date Rich True is enabled, validate parsed timestamps and time zone behavior on a sample.
Row counts: Cross‑check a small sample vs. Kibana or an ES DSL query for sanity.
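When validating date handling, keep time zones explicit. A small sketch for spot-checking timestamps on a sample (assumes ES emits ISO-8601 strings, its default date format; the sample value is illustrative):

```python
from datetime import datetime, timezone

def parse_es_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp as ES commonly emits them, keeping the zone explicit."""
    # fromisoformat accepts "+00:00" but not a bare "Z" before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

ts = parse_es_ts("2025-09-01T12:30:00Z")
print(ts.tzinfo == timezone.utc)  # True
```

A timezone-aware result on every sampled row is a quick signal that the reader's date handling and your downstream logic agree.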
Performance & Scaling
Index layout: Prefer date-partitioned indices and query via aliases (e.g., logs-2025-*), then narrow with Query.
Column pruning: Select only the required fields in the Query; avoid SELECT * in production.
Parallelism: If your connector exposes slicing/scroll or batch-size options, increase them for large scans; otherwise, scale Spark executors and use time-based filters to balance the workload.
Avoid hotspots: Queries against unanalyzed keyword fields filter and aggregate more efficiently than text fields.
Mapping hygiene: Ensure numeric/date fields are mapped as such in ES to avoid expensive conversions.
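If your connector accepts an ES DSL query (as in the es.query pattern above), pushing the time window down to ES avoids scanning whole indices before Spark filters. A sketch that builds such a filter (the field name event_ts is illustrative):

```python
import json

def time_window_dsl(field: str, gte: str, lte: str = "now") -> str:
    """ES DSL range filter, serialized for a connector's query option."""
    return json.dumps({"query": {"range": {field: {"gte": gte, "lte": lte}}}})

print(time_window_dsl("event_ts", "now-7d"))
```

ES date math like now-7d keeps the window relative to run time, which pairs well with scheduled incremental jobs.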
Security & Governance
Use TLS/HTTPS; trust the cluster CA (or import custom certs on the runtime).
Prefer least‑privilege accounts (read‑only role on required indices/aliases).
Store secrets in the platform’s credential store; avoid embedding credentials in plain text.
Audit changes to Host, Port, and Query fields.
Save & Next Steps
Click Save Task In Storage to persist your ES Reader configuration.
Connect downstream transforms and loads; schedule the data job.
For ongoing ingestion, align your index/alias strategy (e.g., roll daily indices behind a moving alias like logs-recent) with your pipeline's incremental windows.
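As a sketch of the rolling-window idea, the daily indices a moving alias would cover can be enumerated like this (the logs- prefix and date format are assumptions about your naming convention):

```python
from datetime import date, timedelta

def daily_indices(prefix: str, end: date, days: int):
    """Names of the last `days` daily indices, oldest first (e.g. logs-2025-09-03)."""
    return [f"{prefix}-{(end - timedelta(days=i)).isoformat()}"
            for i in range(days - 1, -1, -1)]

print(daily_indices("logs", date(2025, 9, 3), 3))
# ['logs-2025-09-01', 'logs-2025-09-02', 'logs-2025-09-03']
```

Keeping the alias membership and the job's incremental window derived from the same day count prevents gaps or double-reads at rollover.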
Example Configurations
Single document by ID (troubleshooting lookup)
Host: es-prod.company.com • Port: 9200
Resource Type: _doc
Index ID: AWn3…Zq
Query: (blank)
Recent events from an alias (last 7 days)
Host: es-prod.company.com • Port: 9200
(Index/alias configured in Data Configuration as events-recent)
Query:
SELECT _id, user_id, event_type, event_ts
FROM <table>
WHERE event_ts >= CURRENT_TIMESTAMP() - INTERVAL '7' DAY