S3 Reader

The S3 Reader component connects to Amazon S3 and reads objects from a specified S3 bucket for use in pipeline workflows. It authenticates using AWS credentials (an Access Key ID and Secret Access Key).

The component supports both real-time and batch modes and can be configured with or without an SNS Monitor.
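
Conceptually, the read resembles fetching an object with the AWS SDK. The sketch below is only an illustration using Python's boto3 library, not the component's internal implementation; the bucket, key, and region values are placeholders.

import boto3

# Authenticate with an Access Key ID / Secret Access Key pair and read one
# object from a bucket (all names below are placeholders).
s3 = boto3.client(
    "s3",
    aws_access_key_id="<ACCESS_KEY_ID>",
    aws_secret_access_key="<SECRET_ACCESS_KEY>",
    region_name="us-west-2",
)
response = s3.get_object(Bucket="my-bucket", Key="data/customers.csv")
raw_bytes = response["Body"].read()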

Prerequisites

  • AWS credentials with read permissions for the target S3 bucket.

  • Bucket and file path details.

  • Data Pipeline Editor access.

  • Optional: SNS Monitor configuration (for event-driven processing).

Configuration Sections

Each S3 Reader configuration is grouped into the following sections:

  • Basic Information – General runtime and execution settings.

  • Meta Information – S3 connection details, file type, and schema settings.

  • Resource Configuration – Resource-specific execution parameters.

  • Connection Validation – Test connectivity and schema before saving.

Add the S3 Reader to a Workflow

  1. Open the Data Pipeline Editor.

  2. Expand the Reader section in the Component Palette.

  3. Drag and drop the S3 Reader into the Workflow Editor.

  4. Select the component to configure its properties in the tabs below.

Basic Information

The Basic Information tab opens by default.

  • Invocation Type – Select Real-Time or Batch.

  • Deployment Type – Pre-selected based on the deployment environment.

  • Container Image Version – Pre-selected; shows the Docker image version.

  • Failover Event – Choose a failover event from the drop-down menu.

  • Batch Size (min 10) – Maximum number of records processed per cycle; the minimum allowed value is 10 (see the sketch after this list).
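
For orientation only, the Basic Information settings can be thought of as a small configuration record. The keys and placeholder values below are illustrative assumptions, not the product's actual configuration schema.

# Hypothetical, illustrative grouping of the Basic Information settings.
basic_information = {
    "invocation_type": "Batch",                    # Real-Time or Batch
    "deployment_type": "<pre-selected>",           # fixed by the deployment environment
    "container_image_version": "<pre-selected>",   # Docker image version
    "failover_event": None,                        # optional failover event
    "batch_size": 10,                              # records per cycle; minimum is 10
}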

Meta Information

The Meta Information tab captures the S3-specific connection details. The available fields vary depending on whether the SNS Monitor is enabled.

Configuration (SNS Monitor Disabled)

  • Bucket Name (*) – Name of the S3 bucket.

  • Zone (*) – AWS region, e.g., us-west-2.

  • Access Key (*) – AWS Access Key ID.

  • Secret Key (*) – AWS Secret Access Key.

  • Table (*) – File name or table from the S3 location.

  • File Type (*) – Choose one of the supported formats: CSV, JSON, PARQUET, AVRO, XML, ORC.

  • Limit – Maximum records to read.

  • Query – Spark SQL query to run on the loaded data, using inputDf as the table name (see the sketch after this list).
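
As a point of reference, the sketch below shows how these fields roughly map onto a standard Spark read from S3. The fs.s3a.* settings and reader options are standard Spark/Hadoop names and are used here only as an assumption about how the component applies the values; the bucket, file, and region are placeholders, and the S3A connector (hadoop-aws) must be on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-reader-sketch").getOrCreate()

# Access Key, Secret Key, and Zone
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "<ACCESS_KEY_ID>")
hconf.set("fs.s3a.secret.key", "<SECRET_ACCESS_KEY>")
hconf.set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")  # Zone: us-west-2

# Bucket Name, Table (file), and File Type, with the Limit applied afterwards
inputDf = (
    spark.read.format("csv")
    .option("header", "true")
    .load("s3a://my-bucket/customers.csv")
    .limit(1000)
)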

Configuration (SNS Monitor Enabled)

  • Access Key (*) – AWS Access Key ID.

  • Secret Key (*) – AWS Secret Access Key.

  • Table (*) – File name or table from the S3 location.

  • File Type (*) – Choose one of the supported formats: CSV, JSON, PARQUET, AVRO.

  • Limit – Maximum records to read.

  • Query – Spark SQL query (use inputDf as the table name).

Sample Spark SQL query:

SELECT * FROM inputDf WHERE Gender = 'Male';
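
A minimal sketch of how such a query is evaluated against the loaded data, assuming the DataFrame is registered under the name inputDf; a tiny in-memory DataFrame stands in for the S3 data here.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-sketch").getOrCreate()
inputDf = spark.createDataFrame(
    [("Alice", "Female"), ("Bob", "Male")], ["Name", "Gender"]
)

# Register the DataFrame under the name used in the Query field, then run the query.
inputDf.createOrReplaceTempView("inputDf")
result = spark.sql("SELECT * FROM inputDf WHERE Gender = 'Male'")
result.show()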

File Type–Specific Settings

Depending on the selected File Type, additional fields appear (see the sketch after this list):

  • CSV – Options: Header, Infer Schema.

  • JSON – Options: Multiline, Charset.

  • Parquet – No extra fields.

  • Avro – Options:

    • Compression: Deflate or Snappy

    • Compression Level: 0–9 (for Deflate).

  • XML – Options:

    • Infer Schema

    • Path

    • Root Tag

    • Row Tags

    • Join Row Tags

  • ORC – Option: Push Down

    • True: Pushes predicate filters to the storage layer for faster queries.

    • False: Filtering happens after loading data into memory.
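
For orientation, the sketch below shows how several of these options correspond to standard Spark reader settings; the exact option names the component uses internally may differ, and the paths are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CSV: Header and Infer Schema
csv_df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://my-bucket/data.csv")
)

# JSON: Multiline and Charset
json_df = (
    spark.read.option("multiLine", "true")
    .option("encoding", "UTF-8")
    .json("s3a://my-bucket/data.json")
)

# ORC: Push Down (push predicate filters to the storage layer)
spark.conf.set("spark.sql.orc.filterPushdown", "true")
orc_df = spark.read.orc("s3a://my-bucket/data.orc")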

Additional Options

  • Selected Columns – Select specific columns to read, with optional aliasing and type definitions (see the sketch after this list).

  • Upload File – Upload schema files (CSV/JSON; max size 2 MB).

  • Download Data (Schema) – Download the schema structure in JSON format.

  • Partition Columns – Provide the key column name used for partitioning.
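
The sketch below illustrates Selected Columns (with aliasing and type definitions) and Partition Columns expressed as Spark operations; the column and file names are placeholders, not values from this guide.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true").csv("s3a://my-bucket/customers.csv")

# Selected Columns: choose columns, alias them, and set their data types.
selected = df.selectExpr("CAST(id AS INT) AS customer_id", "name AS customer_name")

# Partition Columns: partition the data on the chosen key column.
partitioned = selected.repartition("customer_id")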

Saving the Component

  1. After completing the configuration, click the Save Component in Storage icon.

  2. A notification confirms successful component configuration.

Notes

  • Fields marked with an asterisk (*) are mandatory.

  • Either a Table or a Query must be specified.

  • Ensure no data type mismatches when configuring Selected Columns.

  • The fields under Meta Information may vary based on the File Type.

  • For large datasets, enable schema inference cautiously to avoid long processing times.