GCS Reader



The GCS Reader component is designed to read data from Google Cloud Storage (GCS), a cloud-based object storage service provided by Google Cloud Platform. A GCS Reader can be part of any application or system that needs to access data stored in GCS buckets. It allows you to retrieve, read, and process data from GCS, making it available for use cases such as data analysis, data processing, backups, and more.

The GCS Reader pulls data from the GCS Monitor, so the first step is to implement the GCS Monitor component.

Note: Users can refer to the GCS Monitor section of this document for details.

All component configurations are classified broadly into the following sections:

  • Basic Information

  • Meta Information

  • Resource Configuration

GCS Reader with Docker Deployment

  • Navigate to the Pipeline Workflow Editor page for an existing pipeline workflow that contains the GCS Monitor and an Event component.

  • Open the Reader section of the Component Palette.

  • Drag the GCS Reader component to the Workflow Editor.

  • Click the dragged GCS Reader component to open the component properties tabs below.

Basic Information

This is the default tab that opens while configuring the component.

  • Invocation Type: Select an invocation mode (‘Real-Time’ or ‘Batch’) from the drop-down menu.

  • Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.

  • Container Image Version: It displays the image version of the Docker container. This field comes pre-selected.

  • Failover Event: Select a failover Event from the drop-down menu.

  • Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (the minimum value for this field is 10). A conceptual sketch of how a batch size bounds each cycle follows this list.
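The batch size only bounds how many records the component handles in a single execution cycle. The following is a minimal, purely conceptual Python sketch of that idea; it is not the component's actual implementation, and the record source is a placeholder.

```python
# Conceptual sketch only: how a batch size caps the records handled per cycle.
# `records` is a placeholder iterable, not the component's real input.
def iter_batches(records, batch_size=10):
    """Yield successive chunks of at most `batch_size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # emit the final, possibly smaller, chunk

# 25 records with a batch size of 10 -> three execution cycles of 10, 10, and 5.
for cycle, batch in enumerate(iter_batches(range(25), batch_size=10), start=1):
    print(f"cycle {cycle}: {len(batch)} records")
```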

Steps to Configure the Meta Information of GCS Reader (with Docker Deployment Type)

  • Bucket Name: Enter the Bucket name for the GCS Reader. A bucket is a top-level container for storing objects in GCS.

  • Directory Path: Enter the path of the directory in which the file to be read is located.

  • File Name: Enter the name of the file to be read. The sketch below shows how these three fields together identify a single GCS object.
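Taken together, these three fields point at one object in GCS. The snippet below is an illustrative standalone sketch of that mapping using the google-cloud-storage Python client; the bucket, directory, and file names are placeholders, and the component performs the equivalent work internally.

```python
# Illustrative sketch of how Bucket Name, Directory Path, and File Name
# identify a GCS object. All values below are placeholders.
from google.cloud import storage

bucket_name = "my-bucket"          # Bucket Name field
directory_path = "input/data"      # Directory Path field
file_name = "records.csv"          # File Name field

client = storage.Client()          # uses the default GCP credentials
blob = client.bucket(bucket_name).blob(f"{directory_path}/{file_name}")
content = blob.download_as_text()  # fetch the object's content as text
print(content[:200])               # preview the first 200 characters
```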

PySpark GCS Reader

  • Navigate to the Pipeline Workflow Editor page for an existing pipeline workflow that contains the PySpark GCS Reader and an Event component.

OR

  • You may create a new pipeline with the mentioned components.

  • Open the Reader section of the Component Palette.

  • Drag the PySpark GCS Reader component to the Workflow Editor.

  • Click the dragged PySpark GCS Reader component to open the component properties tabs below.

Basic Information

  • Invocation Type: Select an invocation mode (‘Real-Time’ or ‘Batch’) from the drop-down menu.

  • Deployment Type: It displays the deployment type for the reader component. This field comes pre-selected.

  • Container Image Version: It displays the image version of the container. This field comes pre-selected.

  • Failover Event: Select a failover Event from the drop-down menu.

  • Batch Size (min 10): Provide the maximum number of records to be processed in one execution cycle (the minimum value for this field is 10).

Steps to Configure the Meta Information of GCS Reader (with Spark Deployment Type)

  • Secret File (*): Upload the credentials JSON file for Google Cloud Storage.

  • Bucket Name (*): Enter the Bucket name for the GCS Reader. A bucket is a top-level container for storing objects in GCS.

  • Path: Enter the path where the file to be read is located.

  • Read Directory: Enable this option to read the entire directory; disable it to read a single file from the directory.

  • Limit: Set a limit on the number of records to be read.

  • File-Type: Select the File Type from the drop-down.

    • File Type (*): The supported file formats are:

      • CSV: The Header, Multiline, and Infer Schema fields are displayed when CSV is selected as the File Type. Enable the Header option to read the header row of the file, enable the Infer Schema option to infer the true schema of the columns in the CSV file, and check the Multiline option if the file contains any multiline strings.

      • JSON: The Multiline and Charset fields are displayed when JSON is selected as the File Type. Check the Multiline option if the file contains any multiline strings.

      • PARQUET: No extra fields are displayed when PARQUET is selected as the File Type.

      • AVRO: This File Type provides two drop-down menus:

        • Compression: Select either Deflate or Snappy.

        • Compression Level: This field appears for the Deflate compression option. It provides levels 0 to 9 via a drop-down menu.

      • XML: Select this option to read an XML file. If this option is selected, the following fields are displayed:

        • Infer schema: Enable this option to get the true schema of the columns.

        • Path: Provide the path of the file.

        • Root Tag: Provide the root tag of the XML files.

        • Row Tags: Provide the row tags of the XML files.

        • Join Row Tags: Enable this option to join multiple row tags.

  • Query: Enter the Spark SQL query.

Select the desired columns using the Download Data and Upload File options, or use the Column Filter section to select columns.
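For reference, the sketch below is a rough standalone PySpark equivalent of the configuration described above: authenticating with a credentials JSON, reading a CSV file from a GCS path with the Header, Infer Schema, and Multiline options, applying a limit and a Spark SQL query, and selecting columns. The connector settings, paths, and column names are assumptions for illustration only; inside the pipeline, the GCS Reader component derives all of this from its Meta Information fields.

```python
# Hedged, standalone approximation of the PySpark GCS Reader configuration.
# Paths, bucket, and column names are placeholders; the connector settings are
# the usual ones for the open-source GCS Hadoop connector and are assumed here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-reader-sketch").getOrCreate()

# Point the GCS Hadoop connector at the credentials JSON (the Secret File).
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("google.cloud.auth.service.account.enable", "true")
hadoop_conf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/secret.json")

# Bucket Name + Path fields combined into a gs:// location (placeholder values).
path = "gs://my-bucket/input/data/records.csv"

df = (
    spark.read
    .option("header", "true")        # Header option (CSV)
    .option("inferSchema", "true")   # Infer Schema option (CSV)
    .option("multiLine", "true")     # Multiline option (CSV)
    .csv(path)
)
# Other file types map to e.g. spark.read.json(path, multiLine=True)
# or spark.read.parquet(path).

df = df.limit(1000)                  # Limit field

# Query field: run a Spark SQL query over the loaded data.
df.createOrReplaceTempView("gcs_data")
result = spark.sql("SELECT * FROM gcs_data")

# Column selection, comparable to the Column Filter section.
result.select("col_a", "col_b").show(5)
```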

Saving the Component Configuration

  • Click the Save Component in Storage icon after completing all the configurations to save the reader component.

  • A notification message appears confirming that the component configuration has been saved successfully.
