HDFS Reader

The HDFS Reader component reads files stored in the Hadoop Distributed File System (HDFS). It allows you to pull structured or semi-structured data directly from HDFS into pipeline workflows for processing and analytics.

HDFS is designed to reliably store large datasets across distributed clusters with fault tolerance and scalability, making the HDFS Reader a key connector for big data applications.

Configuration Sections

The HDFS Reader configuration is divided into three main sections:

  • Basic Information – Runtime and execution details.

  • Meta Information – HDFS connection properties and file schema options.

  • Resource Configuration – Resource and execution tuning parameters.

Meta Information Configuration

Open the Meta Information tab and provide the following details (a connection sketch follows this list):

  • Host IP Address – The IP address of the HDFS host.

  • Port – Port number for HDFS access.

  • Zone – HDFS encryption zone directory. Data written to an encryption zone is automatically encrypted on write and decrypted on read.

  • File Type – Select the file type to be read. Supported types and their options are listed below.
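
For orientation, the sketch below shows how these fields typically combine when the file is read through Spark. The host, port, and path values are illustrative assumptions, not defaults of the component.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-reader").getOrCreate()

    # Illustrative values -- substitute your own Meta Information fields.
    host = "10.0.0.12"              # Host IP Address
    port = 8020                     # Port (a common NameNode RPC port)
    path = "/data/zone/sales"       # file or directory inside HDFS

    # Host, Port, and Path combine into a single HDFS URI. Encryption
    # zones are transparent to the client, so no extra option is needed.
    uri = f"hdfs://{host}:{port}{path}"
    df = spark.read.parquet(uri)    # reader matches the selected File Type
    df.show(5)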

Supported File Types

CSV

  • Header – Enable to treat the first row of the file as column headers.

  • Infer Schema – Enable to automatically detect column data types.
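
As a point of reference, these two toggles correspond to the header and inferSchema options of Spark's CSV reader; the URI below is a hypothetical example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-read").getOrCreate()

    # Header treats the first row as column names; Infer Schema samples
    # the data to detect column types instead of reading everything as strings.
    df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("hdfs://10.0.0.12:8020/data/input.csv"))
    df.printSchema()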

JSON

  • Multiline – Enable if JSON records span multiple lines.

  • Charset – Specify the character set (if different from default).
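
In plain Spark these correspond to the multiLine and encoding options of the JSON reader; the URI and charset below are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-read").getOrCreate()

    # multiLine lets one record span several physical lines; encoding
    # overrides the default charset (UTF-8 is assumed otherwise).
    df = (spark.read
          .option("multiLine", True)
          .option("encoding", "UTF-16")
          .json("hdfs://10.0.0.12:8020/data/input.json"))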

PARQUET

  • No additional fields required.
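
No options are needed because Parquet files embed their own schema; a minimal read (with an illustrative URI) looks like:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-read").getOrCreate()

    # The schema comes from the file's own metadata, so no options are set.
    df = spark.read.parquet("hdfs://10.0.0.12:8020/data/input.parquet")
    df.printSchema()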

AVRO

  • Compression – Select compression type: Deflate or Snappy.

  • Compression Level – Available only for Deflate; choose level 0–9.
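
For comparison, plain Spark (with the spark-avro package) exposes the same codec and level choice through configuration settings, sketched below. Note that in Spark itself these settings govern Avro writes, while reads detect the codec from the file automatically.

    from pyspark.sql import SparkSession

    # Assumes the spark-avro package is on the classpath, e.g.
    # --packages org.apache.spark:spark-avro_2.12:3.5.0
    spark = SparkSession.builder.appName("avro-read").getOrCreate()

    # Codec and level selection; Deflate is the only codec with a level.
    spark.conf.set("spark.sql.avro.compression.codec", "deflate")
    spark.conf.set("spark.sql.avro.deflate.level", "5")

    df = spark.read.format("avro").load("hdfs://10.0.0.12:8020/data/input.avro")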

XML

  • Infer Schema – Enable to detect schema from XML structure.

  • Path – Path to the XML file.

  • Root Tag – XML root tag.

  • Row Tags – The XML tag(s) that delimit each record (row) to iterate over.

  • Join Row Tags – Enable to join multiple row tags.
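
Path, Infer Schema, and Row Tags map onto the spark-xml reader, which is assumed in the sketch below along with a hypothetical row tag of record; the remaining fields are component-level settings without a one-to-one Spark option.

    from pyspark.sql import SparkSession

    # Assumes the spark-xml package is available, e.g.
    # --packages com.databricks:spark-xml_2.12:0.17.0
    spark = SparkSession.builder.appName("xml-read").getOrCreate()

    # rowTag names the element that delimits one record; inferSchema
    # derives column types from the element structure.
    df = (spark.read.format("xml")
          .option("rowTag", "record")
          .option("inferSchema", True)
          .load("hdfs://10.0.0.12:8020/data/input.xml"))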

ORC

  • Push Down – Controls predicate pushdown for query optimization:

    • True – Pushes filters down to the storage layer so only matching data is read (faster).

    • False – Filtering happens after loading data into memory (slower).
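
In plain Spark this corresponds to the spark.sql.orc.filterPushdown setting; the sketch below uses an illustrative column and predicate.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orc-read").getOrCreate()

    # With pushdown enabled, the filter is evaluated inside the ORC reader
    # using file and stripe statistics, skipping data that cannot match.
    spark.conf.set("spark.sql.orc.filterPushdown", "true")

    df = (spark.read.orc("hdfs://10.0.0.12:8020/data/input.orc")
          .filter("amount > 1000"))      # hypothetical column and predicate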

Additional Parameters

  • Path – Path to the file in HDFS.

  • Partition Columns – A unique key column used to partition the data in Spark for optimized parallel processing.
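
One common reading of Partition Columns in Spark terms is a repartition on the named column, as in this sketch (the URI and column name are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-read").getOrCreate()

    df = spark.read.parquet("hdfs://10.0.0.12:8020/data/orders")

    # Repartitioning on a high-cardinality key spreads rows evenly across
    # tasks so downstream stages can run in parallel.
    df = df.repartition("order_id")      # hypothetical partition column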

Notes

  • The available option fields change dynamically depending on the selected File Type.

  • Enabling Infer Schema for CSV, JSON, or XML can simplify setup but may increase load times for very large files.

  • Predicate Pushdown (ORC) improves query performance by reducing I/O.

  • Ensure the HDFS client on the cluster has the correct permissions to read the target file or directory.