GCS Writer
The GCS Writer component enables BDB Data Pipeline users to write processed or transformed datasets to Google Cloud Storage (GCS). GCS is a highly durable, scalable, and cost-effective cloud object storage service widely used for data lakes, archival storage, machine learning workloads, and analytics pipelines, which makes the GCS Writer essential for teams that rely on Google Cloud Platform (GCP) for large-scale storage, ingestion, and downstream processing.
The GCS Writer supports configurable authentication, file formats, save modes, path-based writing, and column-level mapping. This flexibility makes it a key component for building cloud-native data pipelines.
Overview
The GCS Writer provides the ability to:
Export pipeline output data into GCS buckets
Store data in multiple file formats (CSV, JSON, Parquet, etc.)
Define naming and folder structure via a configurable object path
Apply custom schema and column mappings
Control overwrite or append behavior through a Save Mode setting
The component is commonly used for:
Data lake ingestion
Archival storage
ML training dataset creation
Cloud-based analytics workloads
Integration with BigQuery or downstream ETL tools
Component Placement
You can add the GCS Writer from:
Data Engineering → Pipelines → Components → Writers
Selecting the component opens two tabs:
Basic Information
Meta Information
Basic Information Tab
This tab includes:
Component Name
Description (optional)
These fields allow clear identification and documentation of the component within complex pipelines.
Meta Information Tab
The Meta Information tab contains all configuration fields required to authenticate, define storage details, and control data-writing behavior.
Authentication
Secret File*
Upload the Google Cloud service account JSON key file. This file contains credentials required for GCS access.
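For reference, authenticating from a service account key file looks roughly like the sketch below, which uses the official google-cloud-storage Python client. The key path is a placeholder; the Writer performs this step internally using the uploaded Secret File.

```python
# A minimal sketch of authenticating to GCS with a service account JSON key.
# "sa-key.json" is a placeholder path; the Writer handles this internally.
from google.cloud import storage

client = storage.Client.from_service_account_json("sa-key.json")
print(client.project)  # the project ID is inferred from the key file
```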
GCS Target Configuration
Bucket Name*
Name of the GCS bucket where the output files will be stored. Example: my-data-bucket.
Path*
The folder path or object prefix inside the bucket. Example: processed/sales/2025/.
The GCS Writer writes output files under this path. Path conventions follow a directory-like structure, but in GCS these are object prefixes rather than true folders.
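For illustration, here is how a bucket name and path prefix combine into a full object name with the google-cloud-storage client. The bucket and prefix reuse the examples above; the file name part-0000.csv is hypothetical.

```python
# Illustrative sketch: bucket name + path prefix + file name = object name.
from google.cloud import storage

client = storage.Client()  # assumes ambient credentials are configured
bucket = client.bucket("my-data-bucket")
blob = bucket.blob("processed/sales/2025/part-0000.csv")  # prefix + hypothetical file name
blob.upload_from_string("id,amount\n1,9.99\n", content_type="text/csv")
# Resulting object: gs://my-data-bucket/processed/sales/2025/part-0000.csv
```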
File Format and Schema
File Format*
Format of the exported file. Common supported formats are listed below, with a short serialization sketch after the list:
CSV
JSON
Parquet
Avro
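As a quick illustration of how the same records serialize under two of these formats, the sketch below uses only the Python standard library. Parquet and Avro would require extra libraries (for example pyarrow or fastavro); that is an assumption here, not a documented dependency of the component.

```python
# Sketch: the same rows rendered as CSV and as JSON Lines.
import csv
import io
import json

rows = [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 4.50}]

# CSV with a header row
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["id", "amount"])
writer.writeheader()
writer.writerows(rows)

# JSON Lines: one JSON object per line
json_lines = "\n".join(json.dumps(r) for r in rows)
```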
Write Behavior
Save Mode
Controls how output is written. Supported modes typically include the following (see the sketch after this list):
Append – Adds new files under the specified path.
Overwrite – Deletes existing files at the path and writes new output.
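The sketch below mirrors these two behaviors with the google-cloud-storage client: Overwrite first deletes every object under the prefix, while both modes then add a new file. This is only an interpretation of the documented semantics, not the component's actual implementation.

```python
# Hedged sketch of the Append / Overwrite semantics described above.
import uuid

from google.cloud import storage

def write_output(client: storage.Client, bucket_name: str, prefix: str,
                 payload: str, save_mode: str = "append") -> None:
    bucket = client.bucket(bucket_name)
    if save_mode == "overwrite":
        # Overwrite: remove all existing objects under the prefix first.
        for blob in client.list_blobs(bucket_name, prefix=prefix):
            blob.delete()
    # Both modes then add a new file under the specified path.
    name = f"{prefix}part-{uuid.uuid4().hex}.csv"
    bucket.blob(name).upload_from_string(payload, content_type="text/csv")
```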
Column Mapping (Selected Columns Section)
This section allows the user to specify which columns from the pipeline output should be written to GCS and how they should be represented.
Name
Name of the input dataset column.
Alias Name
Optional new name for the column in the output file.
Column Type
Specifies the data type for the output column—useful when writing strongly typed file formats.
Users can click Add New Column to add custom mappings; a sketch of the mapping behavior follows the use-case list below.
Use Cases:
Selecting only a subset of fields
Renaming columns for downstream compatibility
Enforcing specific schema types (e.g., converting string timestamps to Datetime)
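The following sketch shows the mapping behavior in plain Python: keep a subset of columns, rename them via Alias Name, and coerce values to the declared Column Type. The mapping structure is hypothetical; the component's internal representation is not documented.

```python
# Hypothetical sketch of the Selected Columns behavior.
from datetime import datetime

# Each entry: input column name, output alias, and a type-coercion function.
mapping = [
    {"name": "order_ts", "alias": "order_time",
     "cast": datetime.fromisoformat},  # string timestamp -> Datetime
    {"name": "amount", "alias": "amount", "cast": float},
]

def apply_mapping(row: dict) -> dict:
    # Columns absent from the mapping (e.g., "extra") are dropped.
    return {m["alias"]: m["cast"](row[m["name"]]) for m in mapping}

out = apply_mapping({"order_ts": "2025-01-15T10:30:00", "amount": "9.99", "extra": "x"})
# out == {"order_time": datetime(2025, 1, 15, 10, 30), "amount": 9.99}
```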
Execution Behavior
When the pipeline executes:
The Writer authenticates with GCS using the provided service account key.
Output data from upstream pipeline components is collected.
Columns are filtered and transformed based on the Selected Columns configuration.
Files are generated using the specified File Format.
Files are written to: gs://<bucket>/<path>/
Save Mode determines whether existing files are overwritten or new ones are appended.
If any error occurs (authentication failure, bucket not found, insufficient permissions), the pipeline marks the component as Failed and provides detailed logs.
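Those failure cases map onto standard exception types in the GCS Python client, as in this sketch. The Writer's own error handling and log format are not documented, so the messages here are illustrative.

```python
# Sketch of the failure cases listed above, using google-api-core exceptions.
from google.api_core.exceptions import Forbidden, NotFound
from google.cloud import storage

def try_write(bucket_name: str, object_name: str, data: str) -> None:
    client = storage.Client()
    try:
        client.bucket(bucket_name).blob(object_name).upload_from_string(data)
    except NotFound as exc:
        raise RuntimeError("Bucket not found: verify the Bucket Name field") from exc
    except Forbidden as exc:
        raise RuntimeError("Insufficient permissions: grant "
                           "roles/storage.objectCreator or higher") from exc
```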
Supported Write Operations
Write data in selected file formats (CSV, JSON, Parquet, etc.)
Create folder structures dynamically based on the provided path
Append or fully overwrite existing data
Utilize schema files for strongly typed output formats
Map columns and enforce output schema
Unsupported operations:
Row-level updates in existing GCS files
Deleting nested folder structures beyond the specified path
Direct BigQuery writes (handled through a separate writer)
Best Practices
Authentication & Security
Use least-privilege service accounts.
Rotate service account keys regularly.
Store secret files securely.
File Organization
Use structured paths such as year=2025/month=01/day=15/. This partitions data for downstream analytics (e.g., BigQuery external tables); see the sketch below.
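For example, a partitioned prefix can be derived from a record date as follows; the helper below is illustrative, not part of the component.

```python
# Illustrative helper: build a Hive-style partitioned prefix from a date,
# so tools such as BigQuery external tables can prune partitions.
from datetime import date

def partition_prefix(base: str, d: date) -> str:
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

print(partition_prefix("processed/sales", date(2025, 1, 15)))
# -> processed/sales/year=2025/month=01/day=15/
```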
File Format Selection
Parquet / Avro for analytics and ML workloads
CSV for lightweight interoperability
JSON for semi-structured or nested datasets
Save Mode Usage
Append for incremental writes
Overwrite for full refresh pipelines
Schema Control
Provide a schema file when writing complex data structures.
Ensure column types match the expected output format (see the sketch below).
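A simple pre-write check along these lines can catch type mismatches early. The expected-schema format shown is hypothetical, since the component's schema file format is not documented.

```python
# Hypothetical sketch: validate declared column types before writing.
EXPECTED_SCHEMA = {"order_time": "datetime", "amount": "double"}

def validate_schema(columns: dict) -> None:
    """Raise if any output column's declared type differs from the schema."""
    for name, expected_type in EXPECTED_SCHEMA.items():
        actual = columns.get(name)
        if actual != expected_type:
            raise ValueError(f"Column {name!r}: expected {expected_type}, got {actual!r}")

validate_schema({"order_time": "datetime", "amount": "double"})  # passes silently
```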
Troubleshooting Guide
Issue: Authentication error
Cause: Invalid service account key
Resolution: Re-upload a valid JSON key file.

Issue: "Bucket not found"
Cause: Incorrect bucket name
Resolution: Verify the bucket spelling and region.

Issue: Cannot write file
Cause: Insufficient permissions
Resolution: Assign roles/storage.objectCreator or higher.

Issue: Overwrite not happening
Cause: GCS folder still contains objects
Resolution: Ensure Save Mode is set to Overwrite.

Issue: Wrong schema in output
Cause: Column mapping mismatch
Resolution: Review the Selected Columns configuration and the schema file.