S3 Reader
Configure the S3 Reader task to ingest objects from Amazon S3 into Spark for downstream transformations and analytics.
Prerequisites
An AWS account with access to the target S3 bucket/prefix.
Valid AWS credentials (Access Key and Secret Key) or an assumed role mechanism supported by your runtime.
Network egress from the compute environment to S3 endpoints (or VPC endpoints if private).
Knowledge of the object layout (bucket, prefix, file type).
Quick Start
Drag the S3 Reader task to the workspace and open it. The Meta Information tab opens by default.
Select file settings based on whether SNS Monitor is disabled or enabled (see below).
Provide credentials, target table/object, file type, and optional limit or query.
Optionally set Partition Columns to increase read parallelism in Spark.
Click Save Task In Storage to persist the configuration.
Configuration — SNS Monitor Disabled
Use this mode for on-demand or scheduled reads directly from S3 paths.
| Field | Required | Example | Description | Notes |
| --- | --- | --- | --- | --- |
| Bucket Name | * | my-data-lake | Name of the S3 bucket. | Do not include the s3:// prefix here. |
| Region | * | us-east-1 | AWS region that hosts the bucket. | Must match the bucket’s region. |
| Access Key | * | AKIA...XXXX | AWS Access Key ID. | Use a least-privilege IAM user/role. |
| Secret Key | * | wJalrXUtnFEMI/K7MDENG/bPxRfiCY.../XXXX | AWS Secret Access Key. | Store securely in the platform’s credential store. |
| Table | * | raw/customers/2025/*.parquet | Target object name or logical table to read. | Supports prefixes and wildcards where applicable. |
| File Type | * | CSV / JSON / PARQUET / AVRO / XML | Type of input files. | Determines additional options (see “File Type Options”). |
| Limit |  | 100000 | Maximum records to read. | Use for sampling or smoke tests; remove for full loads. |
| Query |  | Spark SQL | Optional SQL to filter/join after load. | Useful for projections, filters, and joins with other registered views. |
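To make the field mapping concrete, the following PySpark sketch shows roughly how these settings translate to a plain Spark read. It is an illustration, not the task’s internal implementation: the bucket, path, and placeholder credentials come from the examples above, and the s3a settings assume the Hadoop S3A connector is available in the runtime.

from pyspark.sql import SparkSession

# Access Key, Secret Key, and Region map to S3A connector settings.
# (In practice, prefer the platform's credential store or role-based access.)
spark = (
    SparkSession.builder
    .appName("s3-reader-sketch")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .config("spark.hadoop.fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")
    .getOrCreate()
)

# Bucket Name + Table form the read path; File Type selects the reader (Parquet here).
df = spark.read.parquet("s3a://my-data-lake/raw/customers/2025/*.parquet")

# Limit caps the number of records, useful for sampling or smoke tests.
sample = df.limit(100000)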
Configuration — SNS Monitor Enabled
Enable this mode when your job is triggered by S3 events published to Amazon SNS. In this configuration the reader uses the event context to locate new/changed objects.
| Field | Required | Example | Description | Notes |
| --- | --- | --- | --- | --- |
| Access Key | * | AKIA...XXXX | AWS Access Key ID. | Use credentials that can read the bucket and subscribe to the SNS topic (if applicable). |
| Secret Key | * | wJalrXUtnFEMI/… | AWS Secret Access Key. | Use secure credential storage. |
| Table | * | raw/events | Logical object/table to read. | Often corresponds to an S3 prefix derived from the event payload. |
| File Type | * | CSV / JSON / PARQUET / AVRO / XML | Type of input files. | Determines additional options (see “File Type Options”). |
| Limit |  | 5000 | Optional cap on records. | Helpful for debugging event-driven runs. |
| Query |  | Spark SQL | Optional SQL to filter/join after load. | Join against other sources already registered in the job context. |
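For context on how the event payload can resolve to an object path, here is a small, hedged Python sketch that parses the standard S3 event notification JSON carried inside an SNS message. The field names follow the documented S3/SNS formats; the example bucket and key are placeholders, and the task itself performs this resolution for you.

import json
from urllib.parse import unquote_plus

def object_path_from_sns(sns_body: str) -> str:
    """Derive an s3a:// path from an S3 event notification delivered via SNS."""
    envelope = json.loads(sns_body)
    event = json.loads(envelope["Message"])            # S3 event JSON is nested inside the SNS envelope
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])  # object keys arrive URL-encoded
    return f"s3a://{bucket}/{key}"

# e.g., a notification for raw/events/2025/01/15/part-0001.json in my-data-lake
# resolves to s3a://my-data-lake/raw/events/2025/01/15/part-0001.json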
Partition Columns
Provide a unique key column to partition the loaded DataFrame in Spark for parallel processing (e.g., `customer_id`, `event_time`).
Prefer evenly distributed numeric or timestamp columns.
Start with 8–32 partitions and adjust based on cluster cores and data size.
Partitioning happens on the Spark side; it does not alter or index data in S3.
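As a rough Spark-side illustration of what a partition column does (the column name, partition count, and path below are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3a://my-data-lake/raw/customers/2025/")  # data loaded by the reader

# Repartition on a well-distributed key so downstream stages run in parallel;
# this reshuffles data inside Spark only and does not touch the objects in S3.
df = df.repartition(16, "customer_id")

df.rdd.getNumPartitions()                          # 16
df.groupBy(spark_partition_id()).count().show()    # quick skew check: rows per partition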
File Type Options
Additional fields appear after selecting a File Type. Configure as follows:
CSV
Header — Enable to treat the first row as column headers.
Infer Schema — Enable automatic type inference from sample rows.
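In plain Spark, these two switches correspond to the standard CSV reader options shown below (the path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", True)        # first row supplies column names
    .option("inferSchema", True)   # sample rows to infer column types
    .csv("s3a://my-data-lake/raw/customers/")
)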
JSON
Multiline — Enable if JSON documents span multiple lines.
Charset — Specify character encoding (e.g., UTF-8).
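The equivalent plain-Spark JSON reader options are sketched below; the Charset field likely corresponds to Spark’s encoding option, and the path is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("multiLine", True)     # documents may span several lines
    .option("encoding", "UTF-8")   # character set of the input files
    .json("s3a://my-data-lake/raw/events/")
)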
PARQUET
No additional fields. Parquet is columnar and typically offers the best performance.
AVRO
Compression — Choose Deflate or Snappy.
Compression Level — Available for Deflate (0–9).
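A minimal read sketch, assuming the external spark-avro package (org.apache.spark:spark-avro) is on the classpath. Note that in plain Spark the codec and level keys shown are applied when writing Avro output; they are listed only to show how Deflate/Snappy and the level map onto Spark configuration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading Avro requires the spark-avro package.
df = spark.read.format("avro").load("s3a://my-data-lake/raw/events/")

# Write-side equivalents of the Compression and Compression Level fields.
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")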
XML
Infer Schema — Enable automatic type inference for elements.
Path — Full S3 path or relative object path to the XML file(s).
Root Tag — The root XML element name.
Row Tags — Element(s) that represent individual rows/records.
Join Row Tags — Enable to join multiple row tags into a single record.
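One common way to read XML into Spark is the spark-xml package (com.databricks:spark-xml); the sketch below is a hedged approximation of how these fields map onto it. The row tag and path are placeholders, and Join Row Tags is a task-level behaviour rather than a spark-xml option.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .format("xml")
    .option("rowTag", "record")    # element that marks one row/record
    .option("inferSchema", True)   # infer element types automatically
    .load("s3a://my-data-lake/raw/catalog/items.xml")
)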
Query Examples (Spark SQL)
If you specify a Query, it runs over the dataset loaded by the S3 Reader. Replace `<table>` with the value you provided in Table.
Filter & project:
SELECT order_id, customer_id, amount
FROM <table>
WHERE order_date >= DATE '2025-01-01'
Join with an already-registered view (e.g., `dim_customers` from another reader task):
SELECT f.order_id, d.customer_name, f.amount
FROM <table> AS f
JOIN dim_customers AS d
ON f.customer_id = d.customer_id
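Conceptually, the Query executes against the loaded data once it is registered as a view under the Table name. A hedged PySpark equivalent, with `orders` standing in for `<table>` and the path as a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3a://my-data-lake/raw/orders/")  # data loaded by the reader

# Expose the DataFrame to Spark SQL, then run the configured Query against it.
df.createOrReplaceTempView("orders")
filtered = spark.sql("""
    SELECT order_id, customer_id, amount
    FROM orders
    WHERE order_date >= DATE '2025-01-01'
""")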
Validation Checklist
Verify credentials and bucket permissions by loading a small sample (use Limit).
Confirm that the selected File Type and options align with the actual files.
If using Query, test it on a small sample and validate row counts.
Check partitioning effectiveness by inspecting task parallelism and skew.
Performance & Scaling
Prefer Parquet or Avro for large datasets; avoid CSV for long-term storage.
Use filters in Query to prune data early (e.g., time windows).
Consolidate small files where possible; Spark performs better with fewer, larger files.
Use Partition Columns to increase parallelism and reduce skew.
Security & Governance
Use least-privilege IAM policies (read-only access to required prefixes).
Rotate access keys regularly or prefer role-based access where available.
Encrypt data at rest (SSE-S3 or SSE-KMS) and in transit (HTTPS).
Avoid embedding secrets in plain text; use the platform’s credential store.
Save & Next Steps
Click Save Task In Storage to persist the S3 Reader configuration.
Proceed to downstream transforms and loads; schedule the job.
For event-driven pipelines, verify SNS topic/subscription and message format.