HDFS Reader
This section describes the steps required to access and configure the Meta Information tab fields when using the HDFS Reader Task.
The HDFS Reader Task is used to read files located in the Hadoop Distributed File System (HDFS). HDFS is a distributed, fault-tolerant, and scalable file system designed to store and manage large datasets.
Accessing the Meta Information Tab
Drag the HDFS Reader Task from the Task Panel to the Workspace.
Select the task to open its configuration tabs.
The Meta Information tab opens by default.
Configuring Meta Information Fields
Host IP Address
Description: Enter the host IP address of the HDFS cluster.
Example:
192.168.1.10
Port
Description: Provide the port number used by the HDFS service.
Example:
8020
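Together, the host IP address and port typically form the NameNode URI used to address the cluster, for example:
hdfs://192.168.1.10:8020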
Zone
Description: Specify the HDFS zone.
Note: A zone is a special directory whose contents are transparently encrypted on write and decrypted on read.
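Encryption zones are created by an HDFS administrator rather than from this task. For reference, a zone is typically set up with the Hadoop key and crypto command-line tools (the key name and path below are placeholders):
hadoop key create encryption_key
hdfs crypto -createZone -keyName encryption_key -path /user/hdfs/secure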
File Type
Description: Select the type of file to be read from HDFS. An illustrative read sketch for each file type is provided after the list of options below.
Options:
CSV
Additional Fields:
Header: Enable this option to treat the first row of the file as column headers.
Infer Schema: Enable this option to infer the data types of the columns automatically from the file contents.
JSON
Additional Fields:
Multiline: Enable this option if the file contains multi-line JSON records.
Charset: Specify the character set encoding of the file (e.g., UTF-8).
PARQUET
No additional configuration fields are required.
AVRO
Additional Fields:
Compression: Select the compression type. Supported values:
Deflate
Snappy
Compression Level: Appears only when "Deflate" is selected. Choose a level between 0 and 9.
XML
Additional Fields:
Infer Schema: Enable this option to infer the schema of the columns automatically.
Path: Provide the file path.
Root Tag: Enter the root tag name from the XML file.
Row Tags: Specify the row tag(s) to extract rows from.
Join Row Tags: Enable this option to join multiple row tags.
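The file-type fields above correspond closely to options of Spark's file readers. The following PySpark sketch is illustrative only, not the task's internal implementation; the paths, tag names, and the use of the spark-avro and spark-xml packages are assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CSV: Header uses the first row as column names; Infer Schema detects column types.
csv_df = (spark.read.option("header", True).option("inferSchema", True)
          .csv("hdfs://192.168.1.10:8020/user/hdfs/data/customers.csv"))

# JSON: Multiline allows records that span several lines; Charset maps to the encoding option.
json_df = (spark.read.option("multiLine", True).option("encoding", "UTF-8")
           .json("hdfs://192.168.1.10:8020/user/hdfs/data/events.json"))

# Parquet: no extra options are needed; schema and compression are stored in the file itself.
parquet_df = spark.read.parquet("hdfs://192.168.1.10:8020/user/hdfs/data/sales.parquet")

# Avro: assumes the spark-avro package; Deflate/Snappy-compressed files are read transparently.
avro_df = spark.read.format("avro").load("hdfs://192.168.1.10:8020/user/hdfs/data/logs.avro")

# XML: assumes the spark-xml package; the row tag selects the elements that become rows.
xml_df = (spark.read.format("xml").option("rowTag", "record")
          .load("hdfs://192.168.1.10:8020/user/hdfs/data/orders.xml"))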
Path
Description: Provide the full HDFS file path for the dataset to be read.
Example:
/user/hdfs/data/customers.csv
Partition Columns
Description: Provide a unique key column name to partition data in Spark.
Purpose: Ensures efficient distributed processing during downstream tasks.
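As an illustrative PySpark sketch (the column name and path are assumptions), partitioning on a key column corresponds roughly to repartitioning the DataFrame before downstream processing:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", True).csv("hdfs://192.168.1.10:8020/user/hdfs/data/customers.csv")

# Repartition on the key column so downstream tasks can process partitions in parallel.
partitioned_df = df.repartition("customer_id")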