HDFS Reader
The HDFS Reader component reads files stored in the Hadoop Distributed File System (HDFS). It allows you to pull structured or semi-structured data directly from HDFS into pipeline workflows for processing and analytics.
HDFS is designed to reliably store large datasets across distributed clusters with fault tolerance and scalability, making the HDFS Reader a key connector for big data applications.
Configuration Sections
The HDFS Reader configuration is divided into three main sections:
Basic Information – Runtime and execution details.
Meta Information – HDFS connection properties and file schema options.
Resource Configuration – Resource and execution tuning parameters.
Meta Information Configuration
Open the Meta Information tab and provide the following details:
Host IP Address – The IP address of the HDFS host (typically the NameNode).
Port – Port number for HDFS access.
Zone – HDFS encryption zone directory. Data written to this zone is transparently encrypted on write and decrypted on read.
File Type – Select the file type to be read. Supported types and their options are listed below.
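For orientation, readers of this kind typically combine the Host IP Address, Port, and file path into an hdfs:// URI. The sketch below illustrates the idea with Apache Spark; the host, port, and path values are hypothetical placeholders, not defaults of this component.

```python
from pyspark.sql import SparkSession

# Hypothetical connection details from the Meta Information tab.
HOST = "10.0.0.12"   # Host IP Address
PORT = 8020          # Port (8020 is a common NameNode RPC port)
PATH = "/data/input.csv"

spark = SparkSession.builder.appName("hdfs-reader").getOrCreate()

# Files are resolved as hdfs://<host>:<port>/<path>.
df = spark.read.csv(f"hdfs://{HOST}:{PORT}{PATH}", header=True)
```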
Supported File Types
CSV
Header – Enable to treat the first row of the file as column headers.
Infer Schema – Enable to automatically detect column data types.
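As a rough Spark equivalent, the two CSV options map onto a read like the one below, reusing the spark session from the earlier sketch; the file path is a hypothetical placeholder.

```python
# Reusing `spark` from the earlier sketch; the path is hypothetical.
df = (spark.read
      .option("header", "true")       # Header: treat the first row as column names
      .option("inferSchema", "true")  # Infer Schema: detect column data types
      .csv("hdfs://10.0.0.12:8020/data/input.csv"))
```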
JSON
Multiline – Enable if JSON records span multiple lines.
Charset – Specify the character set (if different from default).
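In Spark terms, Multiline and Charset roughly correspond to the multiLine and encoding JSON options; treating the component's fields as exact equivalents is an assumption, and the path is a placeholder.

```python
# Reusing `spark`; multiLine and encoding are standard Spark JSON options.
df = (spark.read
      .option("multiLine", "true")   # Multiline: records may span several lines
      .option("encoding", "UTF-8")   # Charset: override the default character set
      .json("hdfs://10.0.0.12:8020/data/input.json"))
```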
PARQUET
No additional fields are required; Parquet files embed their own schema.
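A Parquet read therefore needs only the path, as in this sketch with a hypothetical location:

```python
# Reusing `spark`; Parquet carries its own schema, so no extra options are needed.
df = spark.read.parquet("hdfs://10.0.0.12:8020/data/input.parquet")
```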
AVRO
Compression – Select compression type: Deflate or Snappy.
Compression Level – Available only for Deflate; choose level 0–9.
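For reference, Spark reads Avro through the external spark-avro package, and a Deflate/Snappy choice with a level is how Spark's own Avro compression settings work on the write side; treating the component's options as equivalents of these settings is an assumption.

```python
# Reusing `spark`; requires the spark-avro package (org.apache.spark:spark-avro).
df = spark.read.format("avro").load("hdfs://10.0.0.12:8020/data/input.avro")

# In Spark itself, the codec and Deflate level are write-side settings:
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")  # Compression Level 0-9
```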
XML
Infer Schema – Enable to detect schema from XML structure.
Path – Path to the XML file.
Root Tag – XML root tag.
Row Tags – XML row tags for iteration.
Join Row Tags – Enable to join multiple row tags.
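XML reading in Spark typically goes through the spark-xml package, whose rowTag and inferSchema options resemble the fields above; the package, row tag, and path here are assumptions for illustration.

```python
# Reusing `spark`; requires the spark-xml package (com.databricks:spark-xml).
df = (spark.read
      .format("xml")
      .option("rowTag", "record")     # Row Tags: element to iterate over (hypothetical tag)
      .option("inferSchema", "true")  # Infer Schema: derive types from the XML structure
      .load("hdfs://10.0.0.12:8020/data/input.xml"))
```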
ORC
Push Down – Controls predicate pushdown for query optimization:
True – Pushes filters down to the storage layer for faster queries.
False – Filtering happens after loading data into memory (slower).
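To see the effect of pushdown, compare with Spark's ORC filter pushdown flag; the column name and path below are hypothetical.

```python
# Reusing `spark`; with pushdown enabled, the filter is evaluated inside the ORC
# reader, so row groups that cannot match are skipped instead of being loaded.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
df = spark.read.orc("hdfs://10.0.0.12:8020/data/input.orc").filter("amount > 100")
```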
Additional Parameters
Path – Path to the file in HDFS.
Partition Columns – Unique key column (or columns) used to partition data in Spark for optimized parallel processing.
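In Spark, partitioning a DataFrame by a key column looks like the sketch below; the column name is a hypothetical example.

```python
# Reusing `spark`; repartitioning by a unique key spreads rows evenly across
# tasks so downstream stages can process partitions in parallel.
df = spark.read.parquet("hdfs://10.0.0.12:8020/data/input.parquet")
df = df.repartition("customer_id")  # hypothetical Partition Column
```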
Notes
The options displayed under File Type change dynamically with the selected format.
Enabling Infer Schema for CSV, JSON, or XML simplifies setup, but schema inference requires an extra pass over the data and can increase load times for very large files.
Predicate Pushdown (ORC) improves query performance by reducing I/O.
Ensure the HDFS client on the cluster has the correct permissions to read the target file or directory.