GCS Writer
The GCS Writer component enables BDB Data Pipeline users to write processed or transformed datasets to Google Cloud Storage (GCS). GCS is a highly durable, scalable, and cost-effective cloud object storage service widely used for data lakes, archival storage, machine learning workloads, and analytics pipelines, which makes the GCS Writer essential for teams that rely on Google Cloud Platform (GCP) for large-scale storage, ingestion, and downstream processing.
The GCS Writer supports configurable authentication, file formats, save modes, path-based writing, and column-level mapping. This flexibility makes it a key component for building cloud-native data pipelines.
Overview
The GCS Writer provides the ability to:
Export pipeline output data into GCS buckets
Store data in multiple file formats (CSV, JSON, Parquet, etc.)
Define naming and folder structure via a configurable object path
Apply custom schema and column mappings
Control overwrite or append behavior through a Save Mode setting
The component is commonly used for:
Data lake ingestion
Archival storage
ML training dataset creation
Cloud-based analytics workloads
Integration with BigQuery or downstream ETL tools
Component Placement
You can add the GCS Writer from:
Data Engineering → Pipelines → Components → Writers
Selecting the component opens two tabs:
Basic Information
Meta Information
Basic Information Tab
This tab includes:
Component Name
Description (optional)
These fields allow clear identification and documentation of the component within complex pipelines.
Meta Information Tab
The Meta Information tab contains all configuration fields required to authenticate, define storage details, and control data-writing behavior.
Authentication
Secret File*
Upload the Google Cloud service account JSON key file. This file contains credentials required for GCS access.
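For reference, authenticating from a service account key file looks roughly like the sketch below, which uses the official google-cloud-storage Python client. The key path is a placeholder; the Writer performs this step internally using the uploaded Secret File.

```python
# A minimal sketch of authenticating to GCS with a service account JSON key.
# "sa-key.json" is a placeholder path; the Writer handles this internally.
from google.cloud import storage

client = storage.Client.from_service_account_json("sa-key.json")
print(client.project)  # the project ID is inferred from the key file
```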
GCS Target Configuration
Bucket Name*
Name of the GCS bucket where the output files will be stored. Example: my-data-bucket.
Path*
The folder path or object prefix inside the bucket. Example: processed/sales/2025/.
The GCS Writer writes output files under this path. Path conventions follow a directory-like structure, but in GCS these are object prefixes rather than true folders.
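For illustration, here is how a bucket name and path prefix combine into a full object name with the google-cloud-storage client. The bucket and prefix reuse the examples above; the file name part-0000.csv is hypothetical.

```python
# Illustrative sketch: bucket name + path prefix + file name = object name.
from google.cloud import storage

client = storage.Client()  # assumes ambient credentials are configured
bucket = client.bucket("my-data-bucket")
blob = bucket.blob("processed/sales/2025/part-0000.csv")  # prefix + hypothetical file name
blob.upload_from_string("id,amount\n1,9.99\n", content_type="text/csv")
# Resulting object: gs://my-data-bucket/processed/sales/2025/part-0000.csv
```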
File Format and Schema
File Format*
Format of the exported file. Common supported formats are listed below, with a short serialization sketch after the list:
CSV
JSON
Parquet
Avro
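As a quick illustration of how the same records serialize under two of these formats, the sketch below uses only the Python standard library. Parquet and Avro would require extra libraries (for example pyarrow or fastavro); that is an assumption here, not a documented dependency of the component.

```python
# Sketch: the same rows rendered as CSV and as JSON Lines.
import csv
import io
import json

rows = [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 4.50}]

# CSV with a header row
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["id", "amount"])
writer.writeheader()
writer.writerows(rows)

# JSON Lines: one JSON object per line
json_lines = "\n".join(json.dumps(r) for r in rows)
```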
Write Behavior
Save Mode
Controls how output is written. Supported modes typically include the following (see the sketch after this list):
Append – Adds new files under the specified path.
Overwrite – Deletes existing files at the path and writes new output.
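The sketch below mirrors these two behaviors with the google-cloud-storage client: Overwrite first deletes every object under the prefix, while both modes then add a new file. This is only an interpretation of the documented semantics, not the component's actual implementation.

```python
# Hedged sketch of the Append / Overwrite semantics described above.
import uuid

from google.cloud import storage

def write_output(client: storage.Client, bucket_name: str, prefix: str,
                 payload: str, save_mode: str = "append") -> None:
    bucket = client.bucket(bucket_name)
    if save_mode == "overwrite":
        # Overwrite: remove all existing objects under the prefix first.
        for blob in client.list_blobs(bucket_name, prefix=prefix):
            blob.delete()
    # Both modes then add a new file under the specified path.
    name = f"{prefix}part-{uuid.uuid4().hex}.csv"
    bucket.blob(name).upload_from_string(payload, content_type="text/csv")
```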
Column Mapping (Selected Columns Section)
This section allows the user to specify which columns from the pipeline output should be written to GCS and how they should be represented.
Name
Name of the input dataset column.
Alias Name
Optional new name for the column in the output file.
Column Type
Specifies the data type for the output column—useful when writing strongly typed file formats.
Users can click Add New Column to add custom mappings; a sketch of the mapping behavior follows the use-case list below.
Use Cases:
Selecting only a subset of fields
Renaming columns for downstream compatibility
Enforcing specific schema types (e.g., converting string timestamps to Datetime)
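The following sketch shows the mapping behavior in plain Python: keep a subset of columns, rename them via Alias Name, and coerce values to the declared Column Type. The mapping structure is hypothetical; the component's internal representation is not documented.

```python
# Hypothetical sketch of the Selected Columns behavior.
from datetime import datetime

# Each entry: input column name, output alias, and a type-coercion function.
mapping = [
    {"name": "order_ts", "alias": "order_time",
     "cast": datetime.fromisoformat},  # string timestamp -> Datetime
    {"name": "amount", "alias": "amount", "cast": float},
]

def apply_mapping(row: dict) -> dict:
    # Columns absent from the mapping (e.g., "extra") are dropped.
    return {m["alias"]: m["cast"](row[m["name"]]) for m in mapping}

out = apply_mapping({"order_ts": "2025-01-15T10:30:00", "amount": "9.99", "extra": "x"})
# out == {"order_time": datetime(2025, 1, 15, 10, 30), "amount": 9.99}
```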
Execution Behavior
When the pipeline executes:
The Writer authenticates with GCS using the provided service account key.
Output data from upstream pipeline components is collected.
Columns are filtered and transformed based on the Selected Columns configuration.
Files are generated using the specified File Format.
Files are written to: gs://<bucket>/<path>/
Save Mode determines whether existing files are overwritten or new ones are appended.
If any error occurs (authentication failure, bucket not found, insufficient permissions), the pipeline marks the component as Failed and provides detailed logs.
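Those failure cases map onto standard exception types in the GCS Python client, as in this sketch. The Writer's own error handling and log format are not documented, so the messages here are illustrative.

```python
# Sketch of the failure cases listed above, using google-api-core exceptions.
from google.api_core.exceptions import Forbidden, NotFound
from google.cloud import storage

def try_write(bucket_name: str, object_name: str, data: str) -> None:
    client = storage.Client()
    try:
        client.bucket(bucket_name).blob(object_name).upload_from_string(data)
    except NotFound as exc:
        raise RuntimeError("Bucket not found: verify the Bucket Name field") from exc
    except Forbidden as exc:
        raise RuntimeError("Insufficient permissions: grant "
                           "roles/storage.objectCreator or higher") from exc
```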
Supported Write Operations
Write data in selected file formats (CSV, JSON, Parquet, etc.)
Create folder structures dynamically based on the provided path
Append or fully overwrite existing data
Utilize schema files for strongly typed output formats
Map columns and enforce output schema
Unsupported operations:
Row-level updates in existing GCS files
Deleting nested folder structures beyond the specified path
Direct BigQuery writes (handled through a separate writer)
Best Practices
Authentication & Security
Use least-privilege service accounts.
Rotate service account keys regularly.
Store secret files securely.
File Organization
Use structured paths such as year=2025/month=01/day=15/. This partitions data for downstream analytics (e.g., BigQuery external tables); see the sketch below.
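For example, a partitioned prefix can be derived from a record date as follows; the helper below is illustrative, not part of the component.

```python
# Illustrative helper: build a Hive-style partitioned prefix from a date,
# so tools such as BigQuery external tables can prune partitions.
from datetime import date

def partition_prefix(base: str, d: date) -> str:
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

print(partition_prefix("processed/sales", date(2025, 1, 15)))
# -> processed/sales/year=2025/month=01/day=15/
```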
File Format Selection
Parquet / Avro for analytics and ML workloads
CSV for lightweight interoperability
JSON for semi-structured or nested datasets
Save Mode Usage
Append for incremental writes
Overwrite for full refresh pipelines
Schema Control
Provide a schema file when writing complex data structures.
Ensure column types match the expected output format (see the sketch below).
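A simple pre-write check along these lines can catch type mismatches early. The expected-schema format shown is hypothetical, since the component's schema file format is not documented.

```python
# Hypothetical sketch: validate declared column types before writing.
EXPECTED_SCHEMA = {"order_time": "datetime", "amount": "double"}

def validate_schema(columns: dict) -> None:
    """Raise if any output column's declared type differs from the schema."""
    for name, expected_type in EXPECTED_SCHEMA.items():
        actual = columns.get(name)
        if actual != expected_type:
            raise ValueError(f"Column {name!r}: expected {expected_type}, got {actual!r}")

validate_schema({"order_time": "datetime", "amount": "double"})  # passes silently
```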
Troubleshooting Guide
Issue: Authentication error
Cause: Invalid service account key
Resolution: Re-upload a valid JSON key file.

Issue: "Bucket not found"
Cause: Incorrect bucket name
Resolution: Verify the bucket spelling and region.

Issue: Cannot write file
Cause: Insufficient permissions
Resolution: Assign roles/storage.objectCreator or higher.

Issue: Overwrite not happening
Cause: GCS folder still contains objects
Resolution: Ensure Save Mode is set to Overwrite.

Issue: Wrong schema in output
Cause: Column mapping mismatch
Resolution: Review the Selected Columns configuration and the schema file.