HDFS Writer
Use the HDFS Writer task to write datasets from your data job (Spark) into HDFS (Hadoop Distributed File System) in reliable, scalable, analytics‑ready formats.
Prerequisites
Network connectivity from the job compute to the HDFS NameNode (RPC) and DataNodes.
Access to the HDFS namespace (target directories) with appropriate permissions/ACLs.
If encryption zones are used: a configured KMS and permission to write into the target zone.
(Kerberos environments) A valid ticket/keytab is configured for the job runtime.
Agreed output schema, file format, and partitioning strategy.
Quick Start
Drag the HDFS Writer task to the workspace and open it (the Meta Information tab opens by default).
Enter Host IP Address (NameNode/HA name service) and Port (NameNode RPC port, typically 8020; the Web UI port, such as 9870, is not used for writes).
Provide Table (target directory or sink name) and, if applicable, the Zone (encryption zone path).
Choose File Format (CSV, JSON, PARQUET, AVRO) and Save Mode (Append/Overwrite/ErrorIfExists/Ignore).
(Optional) Upload a schema file (Spark schema in JSON) under Schema file name and set Partition Columns.
Save Task In Storage and run a small test write.
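For reference, a minimal PySpark sketch of an equivalent test write follows; the session setup, sample rows, host, and path are illustrative assumptions, not values the task generates.

from pyspark.sql import SparkSession

# Illustrative values only: adjust host, path, and columns to your cluster.
spark = SparkSession.builder.appName("hdfs_writer_test").getOrCreate()

df = spark.createDataFrame(
    [("t-001", 12.50, 2025, 9, 12)],
    ["txn_id", "amount", "year", "month", "day"],
)

(df.write
   .mode("append")                        # Save Mode
   .partitionBy("year", "month", "day")   # Partition Columns
   .parquet("hdfs://nameservice1/data/finance/fact_txn/"))   # Host + Table path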
Meta Information — Field Reference
Host IP Address (required)
Example: nn1.cluster.local or nameservice1
HDFS NameNode host or HA name service (preferred). Must be resolvable from the job cluster.

Port (required)
Example: 8020
NameNode RPC port for HDFS client I/O. Do not use the HTTP/HTTPS Web UI port (e.g., 9870/8443).

Table (required)
Example: /data/finance/fact_txn/
Target HDFS directory (absolute path). Some deployments label this “table” when paths mirror Hive tables.

Zone (optional)
Example: /zones/secure/
Encryption zone directory. Files written under this path are transparently encrypted/decrypted by HDFS. Ensure KMS and permissions are configured.

File Format (required)
Example: PARQUET
Output encoding (CSV, JSON, PARQUET, AVRO). Prefer PARQUET/AVRO for analytics; CSV/JSON for interoperability.

Save Mode (required)
Example: Overwrite
Write behavior: Append, Overwrite, ErrorIfExists, Ignore. See details below.

Schema file name (optional)
Example: schema_fact_txn.json
Optional Spark schema (JSON). Use it to enforce types and nullability; especially useful for CSV/JSON writes.

Partition Columns (optional)
Example: year,month,day
Comma-separated columns passed to partitionBy; creates a folder hierarchy such as .../year=2025/month=09/day=12/. Choose low-to-moderate cardinality columns.
File Format Options
PARQUET (recommended): Columnar and compressed by default (often Snappy). Best for query performance and cost. Supports nested types and predicate pushdown.
AVRO: Row-oriented with schema evolution support. Common for interchange and change data capture use cases.
CSV: Human-readable but larger and slower for big data. If you use it, set a consistent delimiter/quote/escape (see the sketch after this list) and consider supplying a schema to avoid type drift.
JSON: Flexible and self-descriptive, but larger than Parquet/Avro. Prefer line-delimited JSON for scalability.
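For CSV output in particular, the dialect can be pinned with standard Spark writer options. A minimal sketch follows; the sample rows and staging path are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("c-001", "Acme, Inc.")], ["customer_id", "name"])

# Pin the CSV dialect explicitly so every export is written the same way.
(df.write
   .mode("append")
   .option("header", "true")
   .option("delimiter", ",")
   .option("quote", '"')
   .option("escape", "\\")
   .csv("hdfs://nameservice1/staging/exports/customer/"))   # illustrative path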
Save Mode Semantics
Append
Adds new files under the target path/partitions.
Typical use: incremental loads without overwriting previous data.
Overwrite
Replaces existing data at the target path. Some writers support dynamic partition overwrite (only the touched partitions are replaced), as sketched below.
Typical use: daily rebuilds and backfills.
ErrorIfExists
Fails the write if data already exists at the target path.
Typical use: one-time loads that should not silently rerun over existing output.
Ignore
Skips the write and leaves existing data untouched if the target path already contains data.
Typical use: idempotent jobs that must not modify existing output.
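Whether Overwrite replaces the whole target path or only the touched partitions is controlled by a Spark setting. A minimal sketch, assuming Spark 2.3+ and the illustrative path used elsewhere on this page:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "dynamic": mode("overwrite") replaces only the partitions present in df.
# The default ("static") clears everything under the target path first.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame(
    [("t-001", 2025, 9, 12)],
    ["txn_id", "year", "month", "day"],
)

(df.write
   .mode("overwrite")
   .partitionBy("year", "month", "day")
   .parquet("hdfs://nameservice1/data/finance/fact_txn/"))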
Schema Handling
Schema file name (JSON) lets you supply a Spark schema to enforce data types on write (especially useful when the upstream schema is inferred from text).
Sample schema snippet:
{
  "type": "struct",
  "fields": [
    {"name": "txn_id", "type": "string", "nullable": false},
    {"name": "amount", "type": "decimal(18,2)", "nullable": true},
    {"name": "event_ts", "type": "timestamp", "nullable": true},
    {"name": "year", "type": "integer", "nullable": true},
    {"name": "month", "type": "integer", "nullable": true},
    {"name": "day", "type": "integer", "nullable": true}
  ]
}
Ensure precision/scale for decimals and timezone strategy for timestamps are consistent with downstream consumers.
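If you also want to apply or produce such a file in Spark directly, a minimal sketch follows; the file name and source path are illustrative, and some PySpark versions expect a "metadata": {} entry on each field when the file is written by hand.

import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

# Parse the schema JSON and enforce it when reading text data, so types and
# nullability are fixed before the HDFS write.
with open("schema_fact_txn.json") as f:
    schema = StructType.fromJson(json.load(f))

df = (spark.read
        .schema(schema)
        .option("header", "true")
        .csv("hdfs://nameservice1/staging/raw_txn/"))   # illustrative source path

# Going the other way: export an existing DataFrame's schema in the same form.
with open("schema_fact_txn.json", "w") as f:
    f.write(df.schema.json())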
Partitioning Strategy (Partition Columns)
Choose low-to-moderate cardinality columns (e.g., year, month, day, country, region). Avoid high-cardinality IDs; they create too many small directories and files.
Keep file sizes healthy (target 128–512 MB each) by using repartition/coalesce upstream to tune parallelism, as sketched below.
Example layout:
/data/finance/fact_txn/year=2025/month=09/day=12/part-00000-...snappy.parquet
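One common way to keep per-partition files well sized is to repartition by the partition columns before the write. A hedged sketch, with illustrative source and target paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("hdfs://nameservice1/staging/fact_txn_raw/")   # illustrative source

# Repartitioning by the partition columns groups each day's rows together, so
# each output directory gets a few well-sized files rather than one tiny file
# per upstream task.
(df.repartition("year", "month", "day")
   .write
   .mode("overwrite")
   .partitionBy("year", "month", "day")
   .parquet("hdfs://nameservice1/data/finance/fact_txn/"))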
Validation Checklist
Connectivity: The Host/Port resolves and is reachable from job nodes.
Permissions: You can create directories and write files under Table (and Zone, if used).
Schema: Validate key fields, nullability, and numeric/timestamp types match expectations.
Partitioning: Confirm the directory structure and that each partition produces reasonably sized files.
Save Mode: Verify intended behavior on rerun (e.g., Overwrite really replaces only what you expect).
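Most of this checklist can be covered with a quick read-back of the output, for example (the path is the illustrative one used above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the written output back and spot-check schema, counts, and partitions.
out = spark.read.parquet("hdfs://nameservice1/data/finance/fact_txn/")
out.printSchema()                                    # field types and nullability
print(out.count())                                   # total row count
out.groupBy("year", "month", "day").count().show()   # rows per partition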
Example Configurations
Daily partitioned Parquet (recommended)
Host: nameservice1 • Port: 8020
Table (path): /data/finance/fact_txn/
Zone: /zones/secure/ (optional)
File Format: PARQUET
Save Mode: Overwrite (dynamic partition)
Partition Columns: year,month,day
Schema file name: schema_fact_txn.json

CSV append to staging (interoperability)
Host: nn1.cluster.local • Port: 8020
Table: /staging/exports/customer/
File Format: CSV
Save Mode: Append
Save & Next Steps
Click Save Task In Storage to persist the HDFS Writer configuration.
Run a small test write and verify the directory layout, file sizes, and schema.
Promote to production with monitoring (file counts and sizes), and implement compaction if needed, as sketched below.
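If small files do accumulate, compaction can be as simple as reading a partition back and rewriting it with fewer, larger files into a staging location before swapping directories. A hedged sketch with illustrative paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact one day's partition into a staging path; the directory swap back
# into place is done afterwards with HDFS tooling, outside Spark.
src = "hdfs://nameservice1/data/finance/fact_txn/year=2025/month=09/day=12/"
dst = "hdfs://nameservice1/tmp/compaction/fact_txn/year=2025/month=09/day=12/"

(spark.read.parquet(src)
      .coalesce(4)          # aim for a handful of 128-512 MB files
      .write
      .mode("overwrite")
      .parquet(dst))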