HDFS Reader
This section describes the steps required to access and configure the Meta Information tab fields when using the HDFS Reader Task.
The HDFS Reader Task is used to read files located in the Hadoop Distributed File System (HDFS). HDFS is a distributed, fault-tolerant, and scalable file system designed to store and manage large datasets.
Accessing the Meta Information Tab
Drag the HDFS Reader Task from the Task Panel to the Workspace.
Select the task to open its configuration tabs.
The Meta Information tab opens by default.
Configuring Meta Information Fields
Host IP Address
Description: Enter the host IP address of the HDFS cluster.
Example:
192.168.1.10
Port
Description: Provide the port number used by the HDFS service.
Example:
8020
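Together, the host IP address and port typically form the NameNode URI used to address the cluster, for example:
hdfs://192.168.1.10:8020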
Zone
Description: Specify the HDFS zone.
Note: A zone is a special directory whose contents are transparently encrypted on write and decrypted on read.
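Encryption zones are created by an HDFS administrator rather than from this task. For reference, a zone is typically set up with the Hadoop key and crypto command-line tools (the key name and path below are placeholders):
hadoop key create encryption_key
hdfs crypto -createZone -keyName encryption_key -path /user/hdfs/secure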
File Type
Description: Select the type of file to be read from HDFS. An illustrative read sketch for each file type is provided after the list of options below.
Options:
CSV
Additional Fields:
Header: Enable this option to treat the first row of the file as column headers.
Infer Schema: Enable this option to infer the data types of the columns automatically from the file contents.
JSON
Additional Fields:
Multiline: Enable this option if the file contains multi-line JSON records.
Charset: Specify the character set encoding of the file (e.g., UTF-8).
PARQUET
No additional configuration fields are required.
AVRO
Additional Fields:
Compression: Select the compression type. Supported values:
Deflate
Snappy
Compression Level: Appears only when "Deflate" is selected. Choose a level between 0 and 9.
XML
Additional Fields:
Infer Schema: Enable this option to infer the schema of the columns automatically.
Path: Provide the file path.
Root Tag: Enter the root tag name from the XML file.
Row Tags: Specify the row tag(s) to extract rows from.
Join Row Tags: Enable this option to join multiple row tags.
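The file-type fields above correspond closely to options of Spark's file readers. The following PySpark sketch is illustrative only, not the task's internal implementation; the paths, tag names, and the use of the spark-avro and spark-xml packages are assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CSV: Header uses the first row as column names; Infer Schema detects column types.
csv_df = (spark.read.option("header", True).option("inferSchema", True)
          .csv("hdfs://192.168.1.10:8020/user/hdfs/data/customers.csv"))

# JSON: Multiline allows records that span several lines; Charset maps to the encoding option.
json_df = (spark.read.option("multiLine", True).option("encoding", "UTF-8")
           .json("hdfs://192.168.1.10:8020/user/hdfs/data/events.json"))

# Parquet: no extra options are needed; schema and compression are stored in the file itself.
parquet_df = spark.read.parquet("hdfs://192.168.1.10:8020/user/hdfs/data/sales.parquet")

# Avro: assumes the spark-avro package; Deflate/Snappy-compressed files are read transparently.
avro_df = spark.read.format("avro").load("hdfs://192.168.1.10:8020/user/hdfs/data/logs.avro")

# XML: assumes the spark-xml package; the row tag selects the elements that become rows.
xml_df = (spark.read.format("xml").option("rowTag", "record")
          .load("hdfs://192.168.1.10:8020/user/hdfs/data/orders.xml"))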
Path
Description: Provide the full HDFS file path for the dataset to be read.
Example:
/user/hdfs/data/customers.csv
Partition Columns
Description: Provide a unique key column name to partition data in Spark.
Purpose: Ensures efficient distributed processing during downstream tasks.
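As an illustrative PySpark sketch (the column name and path are assumptions), partitioning on a key column corresponds roughly to repartitioning the DataFrame before downstream processing:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", True).csv("hdfs://192.168.1.10:8020/user/hdfs/data/customers.csv")

# Repartition on the key column so downstream tasks can process partitions in parallel.
partitioned_df = df.repartition("customer_id")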