HDFS Writer
The HDFS Writer component writes data into the Hadoop Distributed File System (HDFS). HDFS is a scalable, fault-tolerant file system commonly used in big data ecosystems. The component supports multiple file formats, schema enforcement, and partitioning.
Configuration Sections
The HDFS Writer component's configuration is organized into three sections:
Basic Information
Meta Information
Resource Configuration
Meta Information Tab
Host IP Address: IP address of the HDFS NameNode. Example: 192.168.1.50
Port: Port number of the HDFS NameNode RPC service. Example: 8020
Table: Logical table or directory name where the data will be written. Example: employee_data
Zone: HDFS encryption zone, a directory whose contents are transparently encrypted and decrypted. Example: /zone1
File Format: Output file format. Supported formats: CSV, JSON, PARQUET, AVRO, ORC. Example: PARQUET
Save Mode: Defines how data is written when files already exist at the target location. Options: Append, Overwrite (see Save Mode Options below). Example: Overwrite
Schema File Name: Upload a Spark schema file in JSON format to enforce a schema on the written data (see the sketch after this list). Example: employee_schema.json
Partition Columns: Key column(s) used to partition the data written by Spark, improving query performance. Example: department_id
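The Schema File Name field expects Spark's JSON serialization of a StructType. Below is a minimal sketch of what employee_schema.json might contain and how a Spark job could rebuild a schema from such a file; the field names (id, name, department_id) are hypothetical examples, not part of the component's contract:

```python
import json

from pyspark.sql.types import StructType

# Hypothetical contents of employee_schema.json (Spark's JSON
# serialization of a StructType):
# {
#   "type": "struct",
#   "fields": [
#     {"name": "id",            "type": "integer", "nullable": false, "metadata": {}},
#     {"name": "name",          "type": "string",  "nullable": true,  "metadata": {}},
#     {"name": "department_id", "type": "integer", "nullable": true,  "metadata": {}}
#   ]
# }

with open("employee_schema.json") as f:
    schema = StructType.fromJson(json.load(f))  # rebuild the StructType
```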
Save Mode Options
Append: Adds the new data alongside the data already present at the target location; existing files are left untouched.
Overwrite: Deletes the existing data at the target location and replaces it with the new data.
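To make the write semantics concrete, here is a minimal standalone PySpark sketch equivalent to the example configuration above. The target URI is built from the example Host IP Address, Port, and Table values; the input path is hypothetical, and the sketch illustrates the behavior rather than the component's internal implementation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-writer-example").getOrCreate()

# Hypothetical upstream data; in a pipeline this comes from the
# component's input.
df = spark.read.json("staging/employees.json")

# Host IP Address + Port + Table from the Meta Information tab.
target = "hdfs://192.168.1.50:8020/employee_data"

(df.write
   .mode("overwrite")             # Save Mode: Overwrite ("append" keeps existing data)
   .partitionBy("department_id")  # Partition Columns
   .format("parquet")             # File Format: PARQUET
   .save(target))
```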
Partitioning
Partitioning organizes data into subdirectories within HDFS for efficient querying and storage management.
Example: Partition by date column
/hdfs/employee_data/date=2025-01-01/
/hdfs/employee_data/date=2025-01-02/
/hdfs/employee_data/date=2025-01-03/
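A sketch of how such a layout is produced and consumed with Spark (paths and column names follow the example above): writing with a date partition column creates one subdirectory per value, and a filter on that column lets Spark scan only the matching subdirectories:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("staging/employees.json")  # hypothetical input with a "date" column

base = "hdfs://192.168.1.50:8020/hdfs/employee_data"

# Produces /hdfs/employee_data/date=2025-01-01/, date=2025-01-02/, ...
df.write.mode("append").partitionBy("date").parquet(base)

# Partition pruning: only the date=2025-01-02 subdirectory is read.
jan_2 = spark.read.parquet(base).filter("date = '2025-01-02'")
```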
Notes
Ensure that the HDFS client libraries are properly configured in your environment.
Use Parquet or ORC for production workloads; their columnar layout yields better compression and faster analytical reads.
Partitioning improves analytical query performance by reducing the amount of data scanned.
Encryption zones provide transparent at-rest encryption in HDFS; zones must be created by an HDFS administrator (typically with the hdfs crypto -createZone command).