GCS Reader

The GCS Reader component reads data from Google Cloud Storage (GCS) buckets into pipeline workflows. It retrieves files or objects stored in GCS and makes them available for further processing.

The GCS Reader can be configured in two modes:

  • Docker Deployment (GCS Reader with Docker)

  • PySpark Deployment (PySpark GCS Reader)

The component always requires a GCS Monitor in the pipeline to function.

Note: Refer to the GCS Monitor section of this guide for setup details before configuring the GCS Reader.

Prerequisites

  • A pipeline workflow with GCS Monitor and Event components.

  • Required access credentials for the target GCS bucket.

  • Correct deployment type selected (Docker or PySpark).

Component Configuration Sections

Each GCS Reader configuration includes:

  • Basic Information – Runtime and deployment settings.

  • Meta Information – GCS bucket, file path, and file type configuration.

  • Resource Configuration – Resource usage and execution parameters (if applicable).

GCS Reader with Docker Deployment

Configure the Component

  1. Open the Pipeline Workflow Editor for an existing workflow containing a GCS Monitor and Event component.

  2. From the Reader section of the component palette, drag GCS Reader into the workflow editor.

  3. Click the GCS Reader component to open its configuration tabs.

Basic Information (Docker Deployment)

  • Invocation Type – Select Real-Time or Batch.

  • Deployment Type – Auto-filled; pre-selected by the system.

  • Container Image Version – Auto-filled; displays image version for the Docker container.

  • Failover Event – Select a failover event from the drop-down.

  • Batch Size (min 10) – Enter the maximum records to process per execution cycle. Minimum is 10.

Meta Information (Docker Deployment)

  • Bucket Name – Enter the GCS bucket name.

  • Directory Path – Enter the path where the file(s) are located.

  • File Name – Enter the specific file name to read (the sketch after this list shows how these values resolve to a GCS object).
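
The three fields above resolve to a single object, gs://&lt;Bucket Name&gt;/&lt;Directory Path&gt;/&lt;File Name&gt;. A minimal sketch of an equivalent read using the google-cloud-storage Python client is shown below; the bucket, path, file, and key-file names are placeholders, not values from this guide.

```python
from google.cloud import storage

# Placeholder values; substitute the Bucket Name, Directory Path, and
# File Name configured in the Meta Information tab.
BUCKET_NAME = "my-gcs-bucket"
DIRECTORY_PATH = "incoming/orders"
FILE_NAME = "orders_2024.csv"

# Authenticate with a service account key file (path is assumed for this sketch).
client = storage.Client.from_service_account_json("service-account.json")

# Bucket + directory path + file name identify the single object to read.
blob = client.bucket(BUCKET_NAME).blob(f"{DIRECTORY_PATH}/{FILE_NAME}")
content = blob.download_as_bytes()
print(f"Read {len(content)} bytes from gs://{BUCKET_NAME}/{DIRECTORY_PATH}/{FILE_NAME}")
```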

PySpark GCS Reader

Configure the Component

  1. Open the Pipeline Workflow Editor for an existing workflow with a PySpark GCS Reader and an Event component, or create a new pipeline with these components.

  2. From the Reader section of the component palette, drag PySpark GCS Reader into the workflow editor.

  3. Click the component to open its configuration tabs.

Basic Information (Spark Deployment)

  • Invocation Type – Select Real-Time or Batch.

  • Deployment Type – Auto-filled by the system.

  • Container Image Version – Auto-filled; displays the image version of the Spark container.

  • Failover Event – Select a failover event from the drop-down.

  • Batch Size (min 10) – Enter the maximum records per cycle. Minimum is 10.

Meta Information (Spark Deployment)

  • Secret File (*) – Upload the JSON credentials (service account key) file used to access Google Cloud Storage (see the sketch after this list).

  • Bucket Name (*) – Enter the GCS bucket name.

  • Path – Enter the path to the file or directory.

  • Read Directory – Enable/disable reading all files from a directory.

  • Limit – Specify a record limit.

  • File Type (*) – Choose the file type from the drop-down. Supported types and their options are listed below, followed by a read-options sketch.
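
The Secret File, Bucket Name, and Path fields determine how Spark authenticates to GCS and which location it reads. A minimal sketch, assuming the standard GCS Hadoop connector is available on the Spark classpath and using placeholder names throughout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-reader-sketch").getOrCreate()

# Point the GCS connector at the uploaded service account key
# (the key-file path below is an assumption for this sketch).
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hconf.set("google.cloud.auth.service.account.enable", "true")
hconf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/secret-file.json")

# Bucket Name + Path form the gs:// location; with Read Directory enabled,
# the path would point at a directory rather than a single file.
bucket_name = "my-gcs-bucket"           # placeholder
path = "incoming/orders/orders.csv"     # placeholder
df = spark.read.csv(f"gs://{bucket_name}/{path}", header=True)
```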

Supported File Types

  • CSV – Options: Header, Multiline, Infer Schema.

  • JSON – Options: Multiline, Charset.

  • Parquet – No extra options required.

  • Avro – Options:

    • Compression: Deflate, Snappy

    • Compression Level: Levels 0–9 (for Deflate).

  • XML – Options:

    • Infer Schema

    • Path

    • Root Tag

    • Row Tags

    • Join Row Tags

    • Query (Spark SQL).
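
For reference, these options map closely onto standard Spark read options (the Avro and XML readers typically need the spark-avro and spark-xml packages on the classpath; Join Row Tags and Query are platform-specific and not shown). A rough sketch, assuming an existing SparkSession named spark and a placeholder base path:

```python
base = "gs://my-gcs-bucket/incoming"   # placeholder location

# CSV: Header, Multiline, Infer Schema
csv_df = (spark.read
          .option("header", True)
          .option("multiLine", True)
          .option("inferSchema", True)
          .csv(f"{base}/data.csv"))

# JSON: Multiline, Charset (exposed as "encoding" in Spark)
json_df = (spark.read
           .option("multiLine", True)
           .option("encoding", "UTF-8")
           .json(f"{base}/data.json"))

# Parquet: no extra options required
parquet_df = spark.read.parquet(f"{base}/data.parquet")

# Avro: requires the spark-avro package
avro_df = spark.read.format("avro").load(f"{base}/data.avro")

# XML: Root Tag, Row Tags, and Infer Schema map to spark-xml options
xml_df = (spark.read.format("xml")
          .option("rootTag", "records")   # placeholder tag names
          .option("rowTag", "record")
          .option("inferSchema", True)
          .load(f"{base}/data.xml"))
```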

Additional Options

  • Upload File – Upload CSV/JSON (max 2 MB) directly.

  • Download Data (Schema) – Download schema as JSON.

  • Column Filter – Select specific columns for output (see the sketch after this list).
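
Column Filter and Limit behave like a column projection and a row cap on the resulting data, and the downloadable schema corresponds to the DataFrame schema serialized as JSON. A rough sketch, assuming a DataFrame df produced by one of the reads above and placeholder column names:

```python
# Column Filter: keep only the selected columns (placeholder names).
filtered = df.select("order_id", "customer", "amount")

# Limit: cap the number of records passed downstream.
limited = filtered.limit(1000)

# Download Data (Schema): the schema of the result as a JSON string.
schema_json = limited.schema.json()
print(schema_json)
```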

Save Component Configuration

  1. After filling out the Basic Information and Meta Information tabs, click Save Component in Storage.

  2. A notification confirms that the component configuration has been successfully saved.

Notes:

  • Minimum Batch Size is 10 records.

  • Always configure the Secret File for Spark-based deployments.

  • GCS bucket, directory, and file names must match exactly to avoid failures.

  • Enable the Multiline option for CSV and JSON files that contain multiline records.

  • Schema can be inferred automatically (for CSV, JSON, and XML) or defined explicitly, as in the sketch after these notes.
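
Where inferring the schema is not desirable (for example, to pin column types), an explicit schema can be supplied instead. A minimal sketch with placeholder column names:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Explicitly defined schema used in place of Infer Schema (placeholder columns).
explicit_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

df = (spark.read
      .option("header", True)
      .schema(explicit_schema)
      .csv("gs://my-gcs-bucket/incoming/data.csv"))
```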