HDFS Writer

The HDFS Writer component writes data into the Hadoop Distributed File System (HDFS). HDFS is a scalable, fault-tolerant file system commonly used in big data ecosystems. The component supports multiple file formats, schema enforcement, and partitioning.

Configuration Sections

The HDFS Writer configuration is organized into three sections:

  • Basic Information

  • Meta Information

  • Resource Configuration

Meta Information Tab

| Parameter | Description | Example |
| --- | --- | --- |
| Host IP Address | Host IP address of the HDFS cluster (typically the NameNode host). | 192.168.1.50 |
| Port | Port on which the HDFS NameNode service listens (8020 is the common default). | 8020 |
| Table | Logical table or directory name under which the data is written. | employee_data |
| Zone | HDFS encryption zone: a directory whose contents are transparently encrypted and decrypted. | /zone1 |
| File Format | Output file format. Supported formats: CSV, JSON, PARQUET, AVRO, ORC. | PARQUET |
| Save Mode | Defines how the data is written to an existing target. Options: Append, Overwrite. | Overwrite |
| Schema File Name | Spark schema file in JSON format, uploaded to enforce a schema on the written data. | employee_schema.json |
| Partition Columns | Key column(s) used to partition the data in Spark, improving query performance. | department_id |
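These fields map closely onto a Spark DataFrame write. The sketch below shows a roughly equivalent write in PySpark, assuming a running SparkSession; the host, port, table name, and input data are hypothetical placeholders, not values the component itself emits.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-writer-sketch").getOrCreate()

# Hypothetical input data standing in for the upstream pipeline.
df = spark.createDataFrame(
    [(1, "Ada", "D01"), (2, "Grace", "D02")],
    ["employee_id", "name", "department_id"],
)

host, port, table = "192.168.1.50", 8020, "employee_data"  # Host IP / Port / Table

(df.write
    .mode("overwrite")                          # Save Mode
    .partitionBy("department_id")               # Partition Columns
    .parquet(f"hdfs://{host}:{port}/{table}"))  # File Format: PARQUET
```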
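For the schema file, a Spark JSON schema is the serialization produced by StructType.json(). The following sketch, using a hypothetical employee schema, prints contents suitable for a file such as employee_schema.json:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema corresponding to an employee_schema.json upload.
schema = StructType([
    StructField("employee_id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("department_id", StringType(), nullable=True),
])

# schema.json() produces the JSON form that Spark reads back with
# StructType.fromJson(); saving this string yields a valid schema file.
print(schema.json())
```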

Save Mode Options

  • Append: Adds the incoming data to the target directory; existing files are kept and new files are written alongside them.

  • Overwrite: Replaces the existing contents of the target directory with the new data.
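To make the difference concrete, here is a minimal PySpark sketch with hypothetical paths and data; it illustrates Spark's save-mode semantics, not the component's internal code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "hdfs://192.168.1.50:8020/employee_data"  # hypothetical target

day1 = spark.createDataFrame([(1, "Ada")], ["employee_id", "name"])
day2 = spark.createDataFrame([(2, "Grace")], ["employee_id", "name"])

day1.write.mode("overwrite").parquet(path)  # Overwrite: replaces any existing files
day2.write.mode("append").parquet(path)     # Append: adds new files; both rows remain
```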

Partitioning

Partitioning organizes data into subdirectories within HDFS for efficient querying and storage management.

Example: partitioning by a date column produces subdirectories such as:

/hdfs/employee_data/date=2025-01-01/
/hdfs/employee_data/date=2025-01-02/
/hdfs/employee_data/date=2025-01-03/
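A minimal PySpark sketch (hypothetical data and target path) that produces this layout via partitionBy:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "2025-01-01"), (2, "2025-01-02"), (3, "2025-01-03")],
    ["employee_id", "date"],
)

# Each distinct value of the partition column becomes a date=<value>/ subdirectory.
df.write.mode("overwrite").partitionBy("date").parquet(
    "hdfs://192.168.1.50:8020/hdfs/employee_data"  # hypothetical path
)
```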

Notes

  • Ensure that the HDFS client libraries are properly configured in your environment.

  • Use Parquet or ORC for production workloads due to better compression and performance.

  • Partitioning improves query performance by reducing scan size in analytical queries.

  • Encryption zones in HDFS provide transparent encryption and decryption of data at rest; have your HDFS administrator configure the zones.