S3 Reader

Configure the S3 Reader task to ingest objects from Amazon S3 into Spark for downstream transformations and analytics.

Prerequisites

  • An AWS account with access to the target S3 bucket/prefix.

  • Valid AWS credentials (Access Key and Secret Key) or an assumed role mechanism supported by your runtime.

  • Network egress from the compute environment to S3 endpoints (or VPC endpoints if private).

  • Knowledge of the object layout (bucket, prefix, file type).

Quick Start

  • Drag the S3 Reader task to the workspace and open it. The Meta Information tab opens by default.

  • Select file settings based on whether SNS Monitor is disabled or enabled (see below).

  • Provide credentials, target table/object, file type, and optional limit or query.

  • Optionally set Partition Columns to increase read parallelism in Spark.

  • Click Save Task In Storage to persist the configuration.

Configuration — SNS Monitor Disabled

Use this mode for on-demand or scheduled reads directly from S3 paths.

| Field | Required | Example | Description | Notes |
| --- | --- | --- | --- | --- |
| Bucket Name | * | my-data-lake | Name of the S3 bucket. | Do not include the s3:// prefix here. |
| Region | * | us-east-1 | AWS region that hosts the bucket. | Must match the bucket's region. |
| Access Key | * | AKIA...XXXX | AWS Access Key ID. | Use a least-privilege IAM user/role. |
| Secret Key | * | wJalrXUtnFEMI/K7MDENG/bPxRfiCY.../XXXX | AWS Secret Access Key. | Store securely in the platform's credential store. |
| Table | * | raw/customers/2025/*.parquet | Target object name or logical table to read. | Supports prefixes and wildcards where applicable. |
| File Type | * | CSV / JSON / PARQUET / AVRO / XML | Type of input files. | Determines additional options (see "File Type Options"). |
| Limit |  | 100000 | Maximum records to read. | Use for sampling or smoke tests; remove for full loads. |
| Query |  | Spark SQL | Optional SQL to filter/join after load. | Useful for projections, filters, and joins with other registered views. |
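
For orientation, the sketch below shows roughly how these fields map onto a plain PySpark read over s3a. It is illustrative only, not the task's internals: the path, credentials, and config keys are placeholders or assumptions (the s3a settings require hadoop-aws on the classpath, and fs.s3a.endpoint.region is only available on recent versions).

# Illustrative PySpark equivalent of the fields above (not the task's internals).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-reader-sketch").getOrCreate()

# Access Key, Secret Key, Region: passed to the Hadoop s3a connector.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "AKIA...XXXX")
hconf.set("fs.s3a.secret.key", "<secret from the credential store>")
hconf.set("fs.s3a.endpoint.region", "us-east-1")

# Bucket Name + Table form the read path; File Type selects the reader.
df = spark.read.parquet("s3a://my-data-lake/raw/customers/2025/")

# Limit and Query are applied after the load.
df.limit(100000).createOrReplaceTempView("customers_raw")
spark.sql("SELECT * FROM customers_raw").show(10)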

Configuration — SNS Monitor Enabled

Enable this mode when your job is triggered by S3 events published to Amazon SNS. In this configuration the reader uses the event context to locate new/changed objects.

| Field | Required | Example | Description | Notes |
| --- | --- | --- | --- | --- |
| Access Key | * | AKIA...XXXX | AWS Access Key ID. | Use credentials that can read the bucket and subscribe to the SNS topic (if applicable). |
| Secret Key | * | wJalrXUtnFEMI/… | AWS Secret Access Key. | Use secure credential storage. |
| Table | * | raw/events | Logical object/table to read. | Often corresponds to an S3 prefix derived from the event payload. |
| File Type | * | CSV / JSON / PARQUET / AVRO / XML | Type of input files. | Determines additional options (see "File Type Options"). |
| Limit |  | 5000 | Optional cap on records. | Helpful for debugging event-driven runs. |
| Query |  | Spark SQL | Optional SQL to filter/join after load. | Join against other sources already registered in the job context. |
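
As background, the sketch below shows how bucket and object key are typically recovered from an S3 event notification that arrives wrapped in an SNS message. The task performs this step internally; the function name here is purely illustrative.

# Illustrative only: unpacking an S3 event notification delivered via SNS.
import json

def objects_from_sns_message(sns_message_body: str):
    # The SNS envelope carries the S3 event notification as a JSON string in "Message".
    envelope = json.loads(sns_message_body)
    event = json.loads(envelope["Message"])
    # Each record names the bucket and the object key that changed.
    return [
        (rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
        for rec in event.get("Records", [])
    ]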

Partition Columns

Provide a unique key column to partition the loaded DataFrame in Spark for parallel processing (e.g., `customer_id`, `event_time`).

  • Prefer evenly distributed numeric or timestamp columns.

  • Start with 8–32 partitions and adjust based on cluster cores and data size.

  • Partitioning happens on the Spark side; it does not alter or index data in S3.
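
A minimal PySpark sketch of partitioning on a key column, assuming the reader has already produced a DataFrame `df`; the column name and partition count are examples.

# Example: hash-partition the loaded DataFrame on customer_id into 16 partitions.
df = df.repartition(16, "customer_id")

# Confirm the resulting parallelism before heavy downstream work.
print(df.rdd.getNumPartitions())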

File Type Options

Additional fields appear after selecting a File Type. Configure as follows:

CSV

  • Header — Enable to treat the first row as column headers.

  • Infer Schema — Enable automatic type inference from sample rows.
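
In plain PySpark terms these two options correspond roughly to the following (the path is illustrative):

# CSV read with header handling and schema inference.
df = (spark.read
      .option("header", True)        # treat the first row as column names
      .option("inferSchema", True)   # sample rows to infer column types
      .csv("s3a://my-data-lake/raw/customers/2025/"))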

JSON

  • Multiline — Enable if JSON documents span multiple lines.

  • Charset — Specify character encoding (e.g., UTF-8).
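
A rough PySpark equivalent, with an illustrative path:

# JSON read for multi-line documents with an explicit character set.
df = (spark.read
      .option("multiLine", True)
      .option("encoding", "UTF-8")
      .json("s3a://my-data-lake/raw/events/"))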

PARQUET

  • No additional fields. Parquet is columnar and typically offers the best performance.

AVRO

  • Compression — Choose Deflate or Snappy.

  • Compression Level — Available for Deflate (0–9).
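
Assuming these fields map onto Spark's built-in Avro support (the spark-avro module must be on the classpath), the equivalent settings look roughly like this; the codec and level configs mainly matter when writing, since Spark reads compressed Avro blocks transparently.

# Avro read; compression settings shown for completeness (they apply when writing Avro).
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")
df = spark.read.format("avro").load("s3a://my-data-lake/raw/avro/")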

XML

  • Infer Schema — Enable automatic type inference for elements.

  • Path — Full S3 path or relative object path to the XML file(s).

  • Root Tag — The root XML element name.

  • Row Tags — Element(s) that represent individual rows/records.

  • Join Row Tags — Enable to join multiple row tags into a single record.
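
If you were reading the same files directly in Spark, the external spark-xml package covers most of these options; this is an assumption about equivalent tooling, not the task's implementation, and the platform-specific Join Row Tags behavior has no direct counterpart here.

# Illustrative read with the spark-xml package (com.databricks:spark-xml on the classpath).
df = (spark.read
      .format("xml")
      .option("rowTag", "record")    # element that represents one row
      .option("inferSchema", True)   # infer element types
      .load("s3a://my-data-lake/raw/xml/orders.xml"))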

Query Examples (Spark SQL)

If you specify a Query, it runs over the dataset loaded by the S3 Reader. Replace `<table>` with the value you provided in Table.

Filter & project:

SELECT order_id, customer_id, amount
FROM <table>
WHERE order_date >= DATE '2025-01-01'

Join with an already-registered view (e.g., `dim_customers` from another reader task):

SELECT f.order_id, d.customer_name, f.amount
FROM <table> AS f
JOIN dim_customers AS d
  ON f.customer_id = d.customer_id
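
For context, the join above assumes the second dataset is already registered as a view in the job's Spark session. If you were wiring this up in Spark yourself it would look roughly like the sketch below, where `orders_df` and `dim_customers_df` are hypothetical DataFrames produced by two reader tasks.

# Hypothetical registration step; the platform normally registers each reader's output for you.
orders_df.createOrReplaceTempView("orders")               # stands in for <table>
dim_customers_df.createOrReplaceTempView("dim_customers")
result = spark.sql("""
    SELECT f.order_id, d.customer_name, f.amount
    FROM orders AS f
    JOIN dim_customers AS d ON f.customer_id = d.customer_id
""")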

Validation Checklist

  • Verify credentials and bucket permissions by loading a small sample (use Limit).

  • Confirm that the selected File Type and options align with the actual files.

  • If using Query, test it on a small sample and validate row counts.

  • Check partitioning effectiveness by inspecting task parallelism and skew.
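
A few quick PySpark checks that cover the items above, assuming the loaded DataFrame is available as `df`:

from pyspark.sql.functions import spark_partition_id

# Small-sample smoke test (mirrors the Limit field).
print("sample rows:", df.limit(1000).count())

# Parallelism and skew: partition count, then rows per partition.
print("partitions:", df.rdd.getNumPartitions())
df.groupBy(spark_partition_id().alias("pid")).count().show()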

Performance & Scaling

  • Prefer Parquet or Avro for large datasets; avoid CSV for long-term storage.

  • Use filters in Query to prune data early (e.g., time windows).

  • Consolidate small files where possible; Spark performs better with fewer, larger files (see the compaction sketch after this list).

  • Use Partition Columns to increase parallelism and reduce skew.
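
As an illustration of the small-files point, compaction is often a one-off Spark job along these lines; paths and the partition count are placeholders.

# Read many small objects, rewrite them as fewer, larger Parquet files.
small = spark.read.json("s3a://my-data-lake/raw/events/")
small.repartition(32).write.mode("overwrite").parquet("s3a://my-data-lake/curated/events/")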

Security & Governance

  • Use least-privilege IAM policies (read-only access to required prefixes).

  • Rotate access keys regularly or prefer role-based access where available.

  • Encrypt data at rest (SSE-S3 or SSE-KMS) and in transit (HTTPS).

  • Avoid embedding secrets in plain text; use the platform’s credential store.

Save & Next Steps

  • Click Save Task In Storage to persist the S3 Reader configuration.

  • Proceed to downstream transforms and loads; schedule the job.

  • For event-driven pipelines, verify SNS topic/subscription and message format.