GCS Writer

The GCS Writer component enables BDB Data Pipeline users to write processed or transformed datasets into Google Cloud Storage (GCS), which makes it essential for teams that use Google Cloud Platform (GCP) for large-scale storage, ingestion, and downstream processing.

GCS is a highly durable, scalable, and cost-effective cloud object storage service widely used for data lakes, archival storage, machine learning workloads, and analytics pipelines.

The GCS Writer supports configurable authentication, file formats, save modes, path-based writing, and column-level mapping. This flexibility makes it a key component for building cloud-native data pipelines.

Overview

The GCS Writer provides the ability to:

  • Export pipeline output data into GCS buckets

  • Store data in multiple file formats (CSV, JSON, Parquet, etc.)

  • Define naming and folder structure via a configurable object path

  • Apply custom schema and column mappings

  • Control overwrite or append behavior through a Save Mode setting

The component is commonly used for:

  • Data lake ingestion

  • Archival storage

  • ML training dataset creation

  • Cloud-based analytics workloads

  • Integration with BigQuery or downstream ETL tools

Component Placement

You can add the GCS Writer from:

Data Engineering → Pipelines → Components → Writers

Selecting the component opens two tabs:

  • Basic Information

  • Meta Information

Basic Information Tab

This tab includes:

  • Component Name

  • Description (optional)

These fields allow clear identification and documentation of the component within complex pipelines.

Meta Information Tab

The Meta Information tab contains all configuration fields required to authenticate, define storage details, and control data-writing behavior.

Authentication

Secret File* – Upload the Google Cloud service account JSON key file. This file contains credentials required for GCS access.

Important: The service account must have permissions such as storage.objects.create, storage.objects.get, and storage.objects.list for the target bucket.
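
As an illustration only, the snippet below shows how such a key file authenticates a client using the google-cloud-storage Python library; the key file name and bucket name are placeholders, and the listing call doubles as a quick permission check.

```python
from google.cloud import storage

# Authenticate with the uploaded service account JSON key (the "Secret File").
client = storage.Client.from_service_account_json("service-account-key.json")

# Listing a few objects exercises storage.objects.list and fails fast
# if the service account lacks access to the target bucket.
blobs = client.list_blobs("my-data-bucket", max_results=5)
print([blob.name for blob in blobs])
```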

GCS Target Configuration

Bucket Name* – Name of the GCS bucket where the output files will be stored. Example: my-data-bucket.

Path* – The folder path or object prefix inside the bucket. Example: processed/sales/2025/. The GCS Writer writes files under this path. Path conventions typically follow a directory-like structure but represent object prefixes.
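
To make the prefix behavior concrete, here is a hypothetical sketch using the google-cloud-storage library; the "folders" exist only as the leading part of the object name.

```python
from google.cloud import storage

client = storage.Client.from_service_account_json("service-account-key.json")
bucket = client.bucket("my-data-bucket")

# "processed/sales/2025/" is not a real directory; it is simply an
# object prefix embedded in the object name.
blob = bucket.blob("processed/sales/2025/part-0000.csv")
blob.upload_from_string("id,amount\n1,9.99\n", content_type="text/csv")
# Result: gs://my-data-bucket/processed/sales/2025/part-0000.csv
```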

File Format and Schema

File Format* – Format of the exported file. Common supported formats (a short sketch follows the list):

  • CSV

  • JSON

  • Parquet

  • Avro
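
For orientation, the pandas calls below show how one dataset maps to three of these formats; the gs:// destinations are placeholders and assume the gcsfs package is installed so pandas can write to GCS directly.

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 24.50]})

# One dataset, three output formats under the same object prefix.
base = "gs://my-data-bucket/processed/sales/2025/orders"
df.to_csv(f"{base}.csv", index=False)
df.to_json(f"{base}.json", orient="records", lines=True)
df.to_parquet(f"{base}.parquet")  # columnar, strongly typed
```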

Write Behavior

Save Mode – Controls how output is written (a sketch follows the notes below). Supported modes typically include:

  • Append – Adds new files under the specified path.

  • Overwrite – Deletes existing files at the path and writes new output.

Notes:

  • Because GCS is object storage, overwrite behavior replaces objects but does not support row-level updates.

  • Append mode is preferred for incremental pipeline outputs.
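
If the pipeline runs on a Spark-style engine (an assumption here, not something this documentation guarantees), the two modes behave like the following PySpark sketch; the bucket, path, and GCS connector setup are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the GCS connector is configured
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Append: new part files are added under the prefix; existing objects stay.
df.write.mode("append").parquet("gs://my-data-bucket/processed/sales/2025/")

# Overwrite: objects under the prefix are replaced wholesale; there is no
# row-level update, matching GCS object-storage semantics.
df.write.mode("overwrite").parquet("gs://my-data-bucket/processed/sales/2025/")
```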

Column Mapping (Selected Columns Section)

This section allows the user to specify which columns from the pipeline output should be written to GCS and how they should be represented.

Name – Name of the input dataset column.

Alias Name – Optional new name for the column in the output file.

Column Type – Specifies the data type for the output column; useful when writing strongly typed file formats (see the sketch after the use cases below).

Users can click Add New Column to add custom mappings.

Use Cases:

  • Selecting only a subset of fields

  • Renaming columns for downstream compatibility

  • Enforcing specific schema types (e.g., converting string timestamps to Datetime)
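
A minimal pandas sketch of these three settings (the column names are hypothetical): select the Name columns, apply the Alias Name renames, then cast to the target Column Type.

```python
import pandas as pd

df = pd.DataFrame({
    "cust_name": ["Ada", "Grace"],
    "ts": ["2025-01-15 10:00:00", "2025-01-15 11:30:00"],
    "internal_flag": [True, False],  # not selected, so it is dropped
})

# Name -> select, Alias Name -> rename, Column Type -> cast.
out = df[["cust_name", "ts"]].rename(
    columns={"cust_name": "customer_name", "ts": "event_time"}
)
out["event_time"] = pd.to_datetime(out["event_time"])
```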

Execution Behavior

When the pipeline executes:

  1. The Writer authenticates with GCS using the provided service account key.

  2. Output data from upstream pipeline components is collected.

  3. Columns are filtered and transformed based on the Selected Columns configuration.

  4. Files are generated using the specified File Format.

  5. Files are written to: gs://<bucket>/<path>/

  6. Save Mode determines whether existing files are overwritten or new ones are appended.

If any error occurs (authentication failure, bucket not found, insufficient permissions), the pipeline marks the component as Failed and provides detailed logs.
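
The sequence can be pictured as the following sketch; the function, file names, and error handling are illustrative, not the component's actual implementation.

```python
from google.api_core import exceptions
from google.cloud import storage

def write_output(df, key_file: str, bucket_name: str, path: str) -> None:
    """Illustrative outline of the writer's execution order."""
    client = storage.Client.from_service_account_json(key_file)   # step 1
    payload = df.to_csv(index=False).encode()                     # steps 2-4
    blob = client.bucket(bucket_name).blob(f"{path.rstrip('/')}/part-0000.csv")
    try:
        blob.upload_from_string(payload, content_type="text/csv")  # step 5
    except (exceptions.NotFound, exceptions.Forbidden) as err:
        # Bucket missing or insufficient permissions: component marked Failed.
        raise RuntimeError(f"GCS Writer failed: {err}") from err
```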

Supported Write Operations

  • Write data in selected file formats (CSV, JSON, Parquet, etc.)

  • Create folder structures dynamically based on the provided path

  • Append or fully overwrite existing data

  • Utilize schema files for strongly typed output formats

  • Map columns and enforce output schema

Unsupported operations:

  • Row-level updates in existing GCS files

  • Deleting nested folder structures beyond the specified path

  • Direct BigQuery writes (handled through a separate writer)

Best Practices

Authentication & Security

  • Use least-privilege service accounts.

  • Rotate service account keys regularly.

  • Store secret files securely.

File Organization

  • Use structured paths such as year=2025/month=01/day=15/; this helps partition data for downstream analytics (e.g., BigQuery external tables), as sketched below.
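
A small helper (hypothetical, not part of the component) shows how such Hive-style partition paths are typically built:

```python
from datetime import date

def partitioned_path(prefix: str, d: date) -> str:
    # key=value folders let BigQuery external tables and most query
    # engines prune partitions by date.
    return f"{prefix}/year={d:%Y}/month={d:%m}/day={d:%d}/"

print(partitioned_path("processed/sales", date(2025, 1, 15)))
# processed/sales/year=2025/month=01/day=15/
```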

File Format Selection

  • Parquet / Avro for analytics and ML workloads

  • CSV for lightweight interoperability

  • JSON for semi-structured or nested datasets

Save Mode Usage

  • Append for incremental writes

  • Overwrite for full refresh pipelines

Schema Control

  • Provide a schema file when writing complex data structures.

  • Ensure column types match the expected output format.
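
As a sketch of what explicit schema control looks like, pyarrow is one common way to pin Parquet column types; the field names here are placeholders.

```python
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# An explicit schema pins column types for strongly typed formats.
schema = pa.schema([
    ("order_id", pa.int64()),
    ("amount", pa.float64()),
    ("event_time", pa.timestamp("ms")),
])
table = pa.Table.from_pydict(
    {"order_id": [1], "amount": [9.99], "event_time": [datetime(2025, 1, 15, 10, 0)]},
    schema=schema,
)
pq.write_table(table, "orders.parquet")
```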

Troubleshooting Guide

Issue – Possible Cause – Recommended Action

Authentication error – Invalid service account key – Re-upload a valid JSON key file.

“Bucket not found” – Incorrect bucket name – Verify bucket spelling and region.

Cannot write file – Insufficient permissions – Assign roles/storage.objectCreator or higher.

Overwrite not happening – Save Mode still set to Append – Set Save Mode to Overwrite.

Wrong schema in output – Column mapping mismatch – Review Selected Columns and the Schema File.
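
When permission problems are suspected, a quick diagnostic sketch with the google-cloud-storage library (the bucket and key file are placeholders) confirms exactly which permissions the service account holds:

```python
from google.cloud import storage

client = storage.Client.from_service_account_json("service-account-key.json")
bucket = client.bucket("my-data-bucket")

# Returns the subset of these permissions the caller actually holds;
# anything missing explains authentication and "Cannot write file" errors.
granted = bucket.test_iam_permissions(
    ["storage.objects.create", "storage.objects.get", "storage.objects.list"]
)
print(granted)
```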
