HDFS Writer
Use the HDFS Writer task to write datasets from your data job (Spark) into HDFS (Hadoop Distributed File System) in reliable, scalable, analytics‑ready formats.
Prerequisites
Network connectivity from the job compute to the HDFS NameNode (RPC) and DataNodes.
Access to the HDFS namespace (target directories) with appropriate permissions/ACLs.
If encryption zones are used: a configured KMS and permission to write into the target zone.
(Kerberos environments) A valid ticket/keytab is configured for the job runtime.
Agreed output schema, file format, and partitioning strategy.
Quick Start
Drag the HDFS Writer task to the workspace and open it (the Meta Information tab opens by default).
Enter Host IP Address (NameNode/HA name service) and Port (NameNode RPC port, typically 8020; the Web UI port, such as 9870, is not used for writes).
Provide Table (target directory or sink name) and, if applicable, the Zone (encryption zone path).
Choose File Format (CSV, JSON, PARQUET, AVRO) and Save Mode (Append/Overwrite/ErrorIfExists/Ignore).
(Optional) Upload a schema file (Spark schema in JSON) under Schema file name and set Partition Columns.
Save Task In Storage and run a small test write.
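For reference, a minimal PySpark sketch of an equivalent test write follows; the session setup, sample rows, host, and path are illustrative assumptions, not values the task generates.

from pyspark.sql import SparkSession

# Illustrative values only: adjust host, path, and columns to your cluster.
spark = SparkSession.builder.appName("hdfs_writer_test").getOrCreate()

df = spark.createDataFrame(
    [("t-001", 12.50, 2025, 9, 12)],
    ["txn_id", "amount", "year", "month", "day"],
)

(df.write
   .mode("append")                        # Save Mode
   .partitionBy("year", "month", "day")   # Partition Columns
   .parquet("hdfs://nameservice1/data/finance/fact_txn/"))   # Host + Table path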
Meta Information — Field Reference
Host IP Address (required)
Example: nn1.cluster.local or nameservice1
HDFS NameNode host or HA name service (preferred). Must be resolvable from the job cluster.

Port (required)
Example: 8020
NameNode RPC port for HDFS client I/O. Do not use the HTTP/HTTPS Web UI port (e.g., 9870/8443).

Table (required)
Example: /data/finance/fact_txn/
Target HDFS directory (absolute path). Some deployments label this “table” when paths mirror Hive tables.

Zone (optional)
Example: /zones/secure/
Encryption zone directory. Files written under this path are transparently encrypted/decrypted by HDFS. Ensure KMS and permissions are configured.

File Format (required)
Example: PARQUET
Output encoding (CSV, JSON, PARQUET, AVRO). Prefer PARQUET/AVRO for analytics; CSV/JSON for interoperability.

Save Mode (required)
Example: Overwrite
Write behavior: Append, Overwrite, ErrorIfExists, Ignore. See details below.

Schema file name (optional)
Example: schema_fact_txn.json
Optional Spark schema (JSON). Use it to enforce types and nullability; especially useful for CSV/JSON writes.

Partition Columns (optional)
Example: year,month,day
Comma-separated columns passed to partitionBy; creates a folder hierarchy such as .../year=2025/month=09/day=12/. Choose low-to-moderate cardinality columns.
File Format Options
PARQUET (recommended): Columnar and compressed by default (often Snappy). Best for query performance and cost. Supports nested types and predicate pushdown.
AVRO: Row-oriented with schema evolution support. Common for interchange and change data capture use cases.
CSV: Human-readable but larger and slower for big data. If you use it, set a consistent delimiter/quote/escape (see the sketch after this list) and consider supplying a schema to avoid type drift.
JSON: Flexible and self-descriptive, but larger than Parquet/Avro. Prefer line-delimited JSON for scalability.
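For CSV output in particular, the dialect can be pinned with standard Spark writer options. A minimal sketch follows; the sample rows and staging path are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("c-001", "Acme, Inc.")], ["customer_id", "name"])

# Pin the CSV dialect explicitly so every export is written the same way.
(df.write
   .mode("append")
   .option("header", "true")
   .option("delimiter", ",")
   .option("quote", '"')
   .option("escape", "\\")
   .csv("hdfs://nameservice1/staging/exports/customer/"))   # illustrative path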
Save Mode Semantics
Append
Adds new files under the target path/partitions.
Typical use: incremental loads without overwriting previous data.
Overwrite
Replaces existing data at the target path. Some writers support dynamic partition overwrite (only the touched partitions are replaced), as sketched below.
Typical use: daily rebuilds and backfills.
ErrorIfExists
Fails the write if data already exists at the target path.
Typical use: one-time loads that should not silently rerun over existing output.
Ignore
Skips the write and leaves existing data untouched if the target path already contains data.
Typical use: idempotent jobs that must not modify existing output.
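Whether Overwrite replaces the whole target path or only the touched partitions is controlled by a Spark setting. A minimal sketch, assuming Spark 2.3+ and the illustrative path used elsewhere on this page:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "dynamic": mode("overwrite") replaces only the partitions present in df.
# The default ("static") clears everything under the target path first.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame(
    [("t-001", 2025, 9, 12)],
    ["txn_id", "year", "month", "day"],
)

(df.write
   .mode("overwrite")
   .partitionBy("year", "month", "day")
   .parquet("hdfs://nameservice1/data/finance/fact_txn/"))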
Schema Handling
Schema file name (JSON) lets you supply a Spark schema to enforce data types on write (especially useful when the upstream schema is inferred from text).
Sample schema snippet:
{
  "type": "struct",
  "fields": [
    {"name": "txn_id", "type": "string", "nullable": false},
    {"name": "amount", "type": "decimal(18,2)", "nullable": true},
    {"name": "event_ts", "type": "timestamp", "nullable": true},
    {"name": "year", "type": "integer", "nullable": true},
    {"name": "month", "type": "integer", "nullable": true},
    {"name": "day", "type": "integer", "nullable": true}
  ]
}
Ensure precision/scale for decimals and timezone strategy for timestamps are consistent with downstream consumers.
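If you also want to apply or produce such a file in Spark directly, a minimal sketch follows; the file name and source path are illustrative, and some PySpark versions expect a "metadata": {} entry on each field when the file is written by hand.

import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

# Parse the schema JSON and enforce it when reading text data, so types and
# nullability are fixed before the HDFS write.
with open("schema_fact_txn.json") as f:
    schema = StructType.fromJson(json.load(f))

df = (spark.read
        .schema(schema)
        .option("header", "true")
        .csv("hdfs://nameservice1/staging/raw_txn/"))   # illustrative source path

# Going the other way: export an existing DataFrame's schema in the same form.
with open("schema_fact_txn.json", "w") as f:
    f.write(df.schema.json())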
Partitioning Strategy (Partition Columns)
Choose low-to-moderate cardinality columns (e.g., year, month, day, country, region). Avoid high-cardinality IDs; they create too many small directories and files.
Keep file sizes healthy (target 128–512 MB each) by using repartition/coalesce upstream to tune parallelism, as sketched below.
Example layout:
/data/finance/fact_txn/year=2025/month=09/day=12/part-00000-...snappy.parquet
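One common way to keep per-partition files well sized is to repartition by the partition columns before the write. A hedged sketch, with illustrative source and target paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("hdfs://nameservice1/staging/fact_txn_raw/")   # illustrative source

# Repartitioning by the partition columns groups each day's rows together, so
# each output directory gets a few well-sized files rather than one tiny file
# per upstream task.
(df.repartition("year", "month", "day")
   .write
   .mode("overwrite")
   .partitionBy("year", "month", "day")
   .parquet("hdfs://nameservice1/data/finance/fact_txn/"))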
Validation Checklist
Connectivity: The Host/Port resolves and is reachable from job nodes.
Permissions: You can create directories and write files under Table (and Zone, if used).
Schema: Validate key fields, nullability, and numeric/timestamp types match expectations.
Partitioning: Confirm the directory structure and that each partition produces reasonably sized files.
Save Mode: Verify intended behavior on rerun (e.g., Overwrite really replaces only what you expect).
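Most of this checklist can be covered with a quick read-back of the output, for example (the path is the illustrative one used above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the written output back and spot-check schema, counts, and partitions.
out = spark.read.parquet("hdfs://nameservice1/data/finance/fact_txn/")
out.printSchema()                                    # field types and nullability
print(out.count())                                   # total row count
out.groupBy("year", "month", "day").count().show()   # rows per partition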
Example Configurations
Daily partitioned Parquet (recommended)
Host: nameservice1 • Port: 8020
Table (path): /data/finance/fact_txn/
Zone: /zones/secure/ (optional)
File Format: PARQUET
Save Mode: Overwrite (dynamic partition)
Partition Columns: year,month,day
Schema file name: schema_fact_txn.json

CSV append to staging (interoperability)
Host: nn1.cluster.local • Port: 8020
Table: /staging/exports/customer/
File Format: CSV
Save Mode: Append
Save & Next Steps
Click Save Task In Storage to persist the HDFS Writer configuration.
Run a small test write and verify the directory layout, file sizes, and schema.
Promote to production with monitoring (file counts and sizes), and implement compaction if needed, as sketched below.
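If small files do accumulate, compaction can be as simple as reading a partition back and rewriting it with fewer, larger files into a staging location before swapping directories. A hedged sketch with illustrative paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact one day's partition into a staging path; the directory swap back
# into place is done afterwards with HDFS tooling, outside Spark.
src = "hdfs://nameservice1/data/finance/fact_txn/year=2025/month=09/day=12/"
dst = "hdfs://nameservice1/tmp/compaction/fact_txn/year=2025/month=09/day=12/"

(spark.read.parquet(src)
      .coalesce(4)          # aim for a handful of 128-512 MB files
      .write
      .mode("overwrite")
      .parquet(dst))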