S3 Reader

Configure the S3 Reader task to ingest objects from Amazon S3 into Spark for downstream transformations and analytics.

Prerequisites

  • An AWS account with access to the target S3 bucket/prefix.

  • Valid AWS credentials (Access Key and Secret Key) or an assumed role mechanism supported by your runtime.

  • Network egress from the compute environment to S3 endpoints (or VPC endpoints if private).

  • Knowledge of the object layout (bucket, prefix, file type).

Quick Start

  • Drag the S3 Reader task to the workspace and open it. The Meta Information tab opens by default.

  • Select file settings based on whether SNS Monitor is disabled or enabled (see below).

  • Provide credentials, target table/object, file type, and optional limit or query.

  • Optionally set Partition Columns to increase read parallelism in Spark.

  • Click Save Task In Storage to persist the configuration.

Configuration — SNS Monitor Disabled

Use this mode for on-demand or scheduled reads directly from S3 paths.

| Field | Required | Example | Description | Notes |
| --- | --- | --- | --- | --- |
| Bucket Name | * | my-data-lake | Name of the S3 bucket. | Do not include the s3:// prefix here. |
| Region | * | us-east-1 | AWS region that hosts the bucket. | Must match the bucket's region. |
| Access Key | * | AKIA...XXXX | AWS Access Key ID. | Use a least-privilege IAM user/role. |
| Secret Key | * | wJalrXUtnFEMI/K7MDENG/bPxRfiCY.../XXXX | AWS Secret Access Key. | Store securely in the platform's credential store. |
| Table | * | raw/customers/2025/*.parquet | Target object name or logical table to read. | Supports prefixes and wildcards where applicable. |
| File Type | * | CSV / JSON / PARQUET / AVRO / XML | Type of input files. | Determines additional options (see "File Type Options"). |
| Limit |  | 100000 | Maximum records to read. | Use for sampling or smoke tests; remove for full loads. |
| Query |  | Spark SQL | Optional SQL to filter/join after load. | Useful for projections, filters, and joins with other registered views. |
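
For orientation, the sketch below shows roughly how these fields map onto a plain PySpark read over s3a. It is illustrative only, not the task's internals: the path, credentials, and config keys are placeholders or assumptions (the s3a settings require hadoop-aws on the classpath, and fs.s3a.endpoint.region is only available on recent versions).

# Illustrative PySpark equivalent of the fields above (not the task's internals).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-reader-sketch").getOrCreate()

# Access Key, Secret Key, Region: passed to the Hadoop s3a connector.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "AKIA...XXXX")
hconf.set("fs.s3a.secret.key", "<secret from the credential store>")
hconf.set("fs.s3a.endpoint.region", "us-east-1")

# Bucket Name + Table form the read path; File Type selects the reader.
df = spark.read.parquet("s3a://my-data-lake/raw/customers/2025/")

# Limit and Query are applied after the load.
df.limit(100000).createOrReplaceTempView("customers_raw")
spark.sql("SELECT * FROM customers_raw").show(10)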

Configuration — SNS Monitor Enabled

Enable this mode when your job is triggered by S3 events published to Amazon SNS. In this configuration the reader uses the event context to locate new/changed objects.

| Field | Required | Example | Description | Notes |
| --- | --- | --- | --- | --- |
| Access Key | * | AKIA...XXXX | AWS Access Key ID. | Use credentials that can read the bucket and subscribe to the SNS topic (if applicable). |
| Secret Key | * | wJalrXUtnFEMI/… | AWS Secret Access Key. | Use secure credential storage. |
| Table | * | raw/events | Logical object/table to read. | Often corresponds to an S3 prefix derived from the event payload. |
| File Type | * | CSV / JSON / PARQUET / AVRO / XML | Type of input files. | Determines additional options (see "File Type Options"). |
| Limit |  | 5000 | Optional cap on records. | Helpful for debugging event-driven runs. |
| Query |  | Spark SQL | Optional SQL to filter/join after load. | Join against other sources already registered in the job context. |
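
As background, the sketch below shows how bucket and object key are typically recovered from an S3 event notification that arrives wrapped in an SNS message. The task performs this step internally; the function name here is purely illustrative.

# Illustrative only: unpacking an S3 event notification delivered via SNS.
import json

def objects_from_sns_message(sns_message_body: str):
    # The SNS envelope carries the S3 event notification as a JSON string in "Message".
    envelope = json.loads(sns_message_body)
    event = json.loads(envelope["Message"])
    # Each record names the bucket and the object key that changed.
    return [
        (rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
        for rec in event.get("Records", [])
    ]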

Partition Columns

Provide a unique key column to partition the loaded DataFrame in Spark for parallel processing (e.g., `customer_id`, `event_time`).

  • Prefer evenly distributed numeric or timestamp columns.

  • Start with 8–32 partitions and adjust based on cluster cores and data size.

  • Partitioning happens on the Spark side; it does not alter or index data in S3.
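
A minimal PySpark sketch of partitioning on a key column, assuming the reader has already produced a DataFrame `df`; the column name and partition count are examples.

# Example: hash-partition the loaded DataFrame on customer_id into 16 partitions.
df = df.repartition(16, "customer_id")

# Confirm the resulting parallelism before heavy downstream work.
print(df.rdd.getNumPartitions())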

File Type Options

Additional fields appear after selecting a File Type. Configure as follows:

CSV

  • Header — Enable to treat the first row as column headers.

  • Infer Schema — Enable automatic type inference from sample rows.
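
In plain PySpark terms these two options correspond roughly to the following (the path is illustrative):

# CSV read with header handling and schema inference.
df = (spark.read
      .option("header", True)        # treat the first row as column names
      .option("inferSchema", True)   # sample rows to infer column types
      .csv("s3a://my-data-lake/raw/customers/2025/"))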

JSON

  • Multiline — Enable if JSON documents span multiple lines.

  • Charset — Specify character encoding (e.g., UTF-8).
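
A rough PySpark equivalent, with an illustrative path:

# JSON read for multi-line documents with an explicit character set.
df = (spark.read
      .option("multiLine", True)
      .option("encoding", "UTF-8")
      .json("s3a://my-data-lake/raw/events/"))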

PARQUET

  • No additional fields. Parquet is columnar and typically offers the best performance.

AVRO

  • Compression — Choose Deflate or Snappy.

  • Compression Level — Available for Deflate (0–9).
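
Assuming these fields map onto Spark's built-in Avro support (the spark-avro module must be on the classpath), the equivalent settings look roughly like this; the codec and level configs mainly matter when writing, since Spark reads compressed Avro blocks transparently.

# Avro read; compression settings shown for completeness (they apply when writing Avro).
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")
df = spark.read.format("avro").load("s3a://my-data-lake/raw/avro/")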

XML

  • Infer Schema — Enable automatic type inference for elements.

  • Path — Full S3 path or relative object path to the XML file(s).

  • Root Tag — The root XML element name.

  • Row Tags — Element(s) that represent individual rows/records.

  • Join Row Tags — Enable to join multiple row tags into a single record.
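
If you were reading the same files directly in Spark, the external spark-xml package covers most of these options; this is an assumption about equivalent tooling, not the task's implementation, and the platform-specific Join Row Tags behavior has no direct counterpart here.

# Illustrative read with the spark-xml package (com.databricks:spark-xml on the classpath).
df = (spark.read
      .format("xml")
      .option("rowTag", "record")    # element that represents one row
      .option("inferSchema", True)   # infer element types
      .load("s3a://my-data-lake/raw/xml/orders.xml"))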

Query Examples (Spark SQL)

If you specify a Query, it runs over the dataset loaded by the S3 Reader. Replace `<table>` with the value you provided in Table.

Filter & project:

SELECT order_id, customer_id, amount
FROM <table>
WHERE order_date >= DATE '2025-01-01'

Join with an already-registered view (e.g., `dim_customers` from another reader task):

SELECT f.order_id, d.customer_name, f.amount
FROM <table> AS f
JOIN dim_customers AS d
  ON f.customer_id = d.customer_id
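
For context, the join above assumes the second dataset is already registered as a view in the job's Spark session. If you were wiring this up in Spark yourself it would look roughly like the sketch below, where `orders_df` and `dim_customers_df` are hypothetical DataFrames produced by two reader tasks.

# Hypothetical registration step; the platform normally registers each reader's output for you.
orders_df.createOrReplaceTempView("orders")               # stands in for <table>
dim_customers_df.createOrReplaceTempView("dim_customers")
result = spark.sql("""
    SELECT f.order_id, d.customer_name, f.amount
    FROM orders AS f
    JOIN dim_customers AS d ON f.customer_id = d.customer_id
""")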

Validation Checklist

  • Verify credentials and bucket permissions by loading a small sample (use Limit).

  • Confirm that the selected File Type and options align with the actual files.

  • If using Query, test it on a small sample and validate row counts.

  • Check partitioning effectiveness by inspecting task parallelism and skew.
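
A few quick PySpark checks that cover the items above, assuming the loaded DataFrame is available as `df`:

from pyspark.sql.functions import spark_partition_id

# Small-sample smoke test (mirrors the Limit field).
print("sample rows:", df.limit(1000).count())

# Parallelism and skew: partition count, then rows per partition.
print("partitions:", df.rdd.getNumPartitions())
df.groupBy(spark_partition_id().alias("pid")).count().show()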

Performance & Scaling

  • Prefer Parquet or Avro for large datasets; avoid CSV for long-term storage.

  • Use filters in Query to prune data early (e.g., time windows).

  • Consolidate small files where possible; Spark performs better with fewer, larger files (see the compaction sketch after this list).

  • Use Partition Columns to increase parallelism and reduce skew.
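
As an illustration of the small-files point, compaction is often a one-off Spark job along these lines; paths and the partition count are placeholders.

# Read many small objects, rewrite them as fewer, larger Parquet files.
small = spark.read.json("s3a://my-data-lake/raw/events/")
small.repartition(32).write.mode("overwrite").parquet("s3a://my-data-lake/curated/events/")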

Security & Governance

  • Use least-privilege IAM policies (read-only access to required prefixes).

  • Rotate access keys regularly or prefer role-based access where available.

  • Encrypt data at rest (SSE-S3 or SSE-KMS) and in transit (HTTPS).

  • Avoid embedding secrets in plain text; use the platform’s credential store.

Save & Next Steps

  • Click Save Task In Storage to persist the S3 Reader configuration.

  • Proceed to downstream transforms and loads; schedule the job.

  • For event-driven pipelines, verify SNS topic/subscription and message format.