Azure Blob Reader

Configure the Azure Blob Reader task to ingest files from an Azure Blob Storage container into your data job (Spark) for downstream transforms and analytics.

Prerequisites

  • Access to the target Azure Storage account and container.

  • One of the supported authentication methods:

    • Shared Access Signature (SAS)

    • Account Secret Key

    • Azure AD Application (Service Principal): Client ID + Tenant ID + Client Secret

  • Network egress from the job’s compute environment to Azure Storage endpoints.

  • Knowledge of the container, path/prefix, and file type (CSV/JSON/Parquet/Avro/XML).

Tip: Start with a small sample (use a narrow Path and/or a LIMIT in Query) to validate connectivity and schema before full loads.

Quick Start

  1. Drag the Azure Blob Reader task to the workspace and open it.

  2. On Meta Information, choose Read using (SAS, Secret Key, or Principal Secret).

  3. Provide required fields (account, container, credentials).

  4. Select File type; complete the options that appear for that type.

  5. Set Path (and optionally Read Directory) and, if needed, enter a Spark SQL Query.

  6. Click Save Task In Storage to persist the configuration.

Read using (Authentication Modes)

You can authenticate in three ways. Configure the fields for the option you select.

1) Read using Shared Access Signature (SAS)

A SAS grants time‑bound, scoped access to account resources without sharing the account key.

| Field | Required | Example | Description / Notes |
| --- | --- | --- | --- |
| Shared Access Signature | * | ?sv=2024-01-01&ss=b&srt=co&sp=rl&se=2025-09-30T23:59Z&sig=… | Full SAS query string (or complete SAS URL if your UI supports it). Ensure scope includes the target container and permissions (read/list). |
| Account Name | * | mystorageacct | Storage account name. |
| Container | * | landing | Container holding the files. |
| File type | * | CSV / JSON / PARQUET / AVRO / XML | Determines additional options (see File Type Options). |
| Path | * | raw/2025/09/*.parquet | Object path/prefix inside the container. Supports wildcards where applicable. |
| Read Directory | | ✓ / blank | When enabled, reads all files under the specified path/prefix. |
| Query | | Spark SQL | Optional post-load SQL (filters, projections, joins). |

Best practice: Use least‑privilege SAS (only read/list), set a short expiry, and scope to the narrowest path you need.
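
For orientation, the SAS fields above map roughly to the following Spark setup. This is a minimal PySpark sketch, not the task's internal implementation; the account, container, token, and path are placeholders, and fs.azure.sas.<container>.<account> is the standard Hadoop setting for container-scoped SAS access over wasbs://.

# Minimal PySpark sketch of a SAS-scoped read (illustrative only; all values are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("azure-blob-reader-sas").getOrCreate()

account = "mystorageacct"
container = "landing"
sas_token = "sv=2024-01-01&ss=b&srt=co&sp=rl&se=2025-09-30T23:59Z&sig=..."  # SAS query string, no leading '?'

# Hadoop setting for container-scoped SAS access.
spark.conf.set(f"fs.azure.sas.{container}.{account}.blob.core.windows.net", sas_token)

# Path/prefix as configured in the task; wildcards are allowed.
df = spark.read.parquet(f"wasbs://{container}@{account}.blob.core.windows.net/raw/2025/09/*.parquet")
df.show(5)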


2) Read using Secret Key (Account Access Key)

Account keys provide full access; prefer SAS or service principal in production where possible.

| Field | Required | Example | Description / Notes |
| --- | --- | --- | --- |
| Account Key | * | Eby8vdM02x… | Storage account key. Store securely. |
| Account Name | * | mystorageacct | Storage account name. |
| Container | * | curated | Container holding the files. |
| File type | * | CSV / JSON / PARQUET / AVRO / XML | See File Type Options. |
| Path | * | sales/2025-09/*.json | Path/prefix to the files. |
| Read Directory | | ✓ / blank | Read all files under the prefix. |
| Query | | Spark SQL | Optional post-load SQL. |

Note: Because account keys are highly privileged, rotate them regularly and restrict task access to the minimum set of operators.
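
If you need to reproduce an account-key read outside the task, a hedged PySpark equivalent is shown below; the key value and path are placeholders, and fs.azure.account.key.<account> is the standard Hadoop account-key setting.

# Minimal PySpark sketch of an account-key read (illustrative; never hard-code a real key).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("azure-blob-reader-key").getOrCreate()

account = "mystorageacct"
container = "curated"
account_key = "<fetched-from-your-credential-store>"

# Hadoop setting for account-key access.
spark.conf.set(f"fs.azure.account.key.{account}.blob.core.windows.net", account_key)

df = spark.read.json(f"wasbs://{container}@{account}.blob.core.windows.net/sales/2025-09/*.json")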


3) Read using Principal Secret (Azure AD Service Principal)

Use a registered app with Azure AD and an assigned role (e.g., Storage Blob Data Reader) on the storage account/container.

| Field | Required | Example | Description / Notes |
| --- | --- | --- | --- |
| Client ID | * | 1e2a3b4c-… | Azure AD Application (client) ID of your app registration. |
| Tenant ID | * | 72f988bf-… | Azure AD Directory (tenant) ID. |
| Client Secret | * | **** | App secret used to authenticate the service principal. |
| Account Name | * | mystorageacct | Storage account name. |
| Container | * | bronze | Container holding the files. |
| File type | * | CSV / JSON / PARQUET / AVRO / XML | See File Type Options. |
| Path | * | events/2025/09/ | Path/prefix within the container. |
| Read Directory | | ✓ / blank | Read all files under the prefix. |
| Query | | Spark SQL | Optional post-load SQL. |

RBAC: Ensure your service principal has Storage Blob Data Reader (or equivalent) on the account or container scope.
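
For reference, the Client ID / Tenant ID / Client Secret trio corresponds to Spark's standard OAuth settings for ABFS. A minimal PySpark sketch, assuming placeholder credentials and the Hadoop client-credentials token provider:

# Minimal PySpark sketch of a service-principal (OAuth) read over abfss:// (illustrative placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("azure-blob-reader-sp").getOrCreate()

account = "mystorageacct"
container = "bronze"
client_id = "<application-client-id>"
tenant_id = "<directory-tenant-id>"
client_secret = "<fetched-from-your-credential-store>"

host = f"{account}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{host}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{host}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{host}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{host}", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{host}",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

df = spark.read.parquet(f"abfss://{container}@{host}/events/2025/09/")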


File Type Options

After selecting File type, additional options appear:

CSV

  • Header — Treat first row as column names.

  • Infer Schema — Infer column types from sample rows.

Tips

  • For production, prefer an explicit schema downstream for reliability.

  • Keep delimiter/quote/escape consistent across files.
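
To see what Header and Infer Schema do, and what an explicit schema looks like instead, here is a hedged PySpark sketch; the path and columns are placeholders, and storage auth is assumed to be configured as shown earlier.

# CSV read mirroring the Header / Infer Schema options.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.getOrCreate()
path = "wasbs://landing@mystorageacct.blob.core.windows.net/hr/hires/2025/09/*.csv"

df = (spark.read
      .option("header", True)        # Header: first row holds column names
      .option("inferSchema", True)   # Infer Schema: sample rows to guess types
      .csv(path))

# Production alternative: declare the schema explicitly instead of inferring it.
schema = StructType([
    StructField("name", StringType()),
    StructField("team", StringType()),
    StructField("salary", DoubleType()),
    StructField("hire_date", DateType()),
])
df_strict = spark.read.option("header", True).schema(schema).csv(path)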

JSON

  • Multiline — Enable if records span multiple lines.

  • Charset — Character encoding (e.g., UTF‑8).

Tip: Line‑delimited JSON (one JSON object per line) reads most efficiently at scale.
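
A hedged PySpark equivalent of the JSON options (placeholder path; auth assumed configured):

# JSON read mirroring the Multiline and Charset options.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("multiLine", True)    # enable only if records span multiple lines
      .option("encoding", "UTF-8")  # Charset
      .json("wasbs://landing@mystorageacct.blob.core.windows.net/events/2025/09/*.json"))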

PARQUET

  • No extra fields. Columnar, compressed; best default for performance and schema fidelity.

AVRO

  • Compression — Deflate or Snappy.

  • Compression Level — 0–9 (when Deflate is selected).
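
In open-source Spark, reading Avro needs only the avro format (the spark-avro package on the classpath); the Deflate/Snappy choice and level correspond to write-side settings. A hedged sketch with placeholder paths:

# Avro read; compression settings apply when writing Avro, not reading it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("avro").load("wasbs://landing@mystorageacct.blob.core.windows.net/ingest/*.avro")

# Write-side equivalents of Compression / Compression Level:
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")  # 0-9, Deflate only
df.write.format("avro").mode("overwrite").save("wasbs://landing@mystorageacct.blob.core.windows.net/out/avro/")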

XML

  • Infer schema — Attempt to infer element types.

  • Path — Full or relative path to the XML file(s) within the container path context.

  • Root Tag — Root XML element name.

  • Row Tags — Element name(s) that represent individual records.

  • Join Row Tags — Merge elements from multiple row tags into a single record.
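
Infer schema, Path, and Row Tags map closely to the spark-xml data source; Root Tag and Join Row Tags are task-level fields without a one-to-one open-source equivalent, so the hedged sketch below (placeholder tag name and path) covers only the former.

# XML read via the spark-xml data source (library must be on the classpath).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("xml")
      .option("rowTag", "order")    # Row Tags: element representing one record
      .option("inferSchema", True)  # Infer schema
      .load("wasbs://landing@mystorageacct.blob.core.windows.net/xml/orders/*.xml"))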


Path & Directory Selection

  • Path points to a directory, prefix, or file pattern under the specified Container (e.g., raw/2025/09/*.parquet).

  • Read Directory processes all files beneath the Path (recursive when supported by your runtime).

Tip: Organize data by partition‑like prefixes (e.g., …/year=2025/month=09/day=11/) to enable selective reads and faster queries.
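
The sketch below contrasts a partition-style prefix read (Spark turns year=/month= folders into prunable columns) with a recursive directory read; paths are placeholders and auth is assumed to be configured.

# Partition-style prefix: reads only the matching folders and exposes year/month as columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base = "wasbs://landing@mystorageacct.blob.core.windows.net/events"

df_month = spark.read.parquet(f"{base}/year=2025/month=09/")

# Directory-style read of everything under the prefix (note: recursive lookup disables partition inference).
df_all = spark.read.option("recursiveFileLookup", True).parquet(f"{base}/")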

Query (Spark SQL)

Use Query to filter, project, or join after the dataset is registered by the reader.

SELECT order_id, customer_id, amount
FROM <table>
WHERE order_date >= DATE '2025-01-01'

Join with an already‑registered dimension

SELECT f.order_id, d.customer_name, f.amount
FROM <table> AS f
JOIN dim_customers AS d
  ON f.customer_id = d.customer_id

Note: Replace <table> with the logical name your platform assigns to the reader’s dataset (often derived from Container/Path or a Table field if present).
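
Under the hood this usually amounts to registering the reader's DataFrame under a logical name and running SQL against it. A hedged PySpark sketch, using orders as a stand-in for <table> and a placeholder source path:

# Register the reader's dataset under a logical name, then run the post-load SQL against it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("wasbs://landing@mystorageacct.blob.core.windows.net/orders/")  # placeholder source

df.createOrReplaceTempView("orders")  # "orders" stands in for <table>

result = spark.sql("""
    SELECT order_id, customer_id, amount
    FROM orders
    WHERE order_date >= DATE '2025-01-01'
""")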

Validation Checklist

  • Credentials & RBAC: Confirm SAS scope/expiry or service principal role is valid.

  • Path: Verify container and prefix exist; test with a small Path subset or add a LIMIT in Query.

  • File Type: Match options to the actual data (e.g., CSV Header/Infer Schema, JSON Multiline).

  • Schema: Validate expected columns and types with a small sample.

  • Query: Run against a small sample to confirm filters/joins and row counts.

Performance & Scaling

  • Prefer Parquet/Avro over CSV for large‑scale reads.

  • Prune early via partitioned prefixes (e.g., date‑based folders) and Query filters.

  • Avoid many small files; where possible, compact upstream to fewer, larger files.

  • Ensure adequate executor parallelism for large directories; monitor task skew and adjust the layout (e.g., balance file sizes across prefixes).
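
As a rough illustration of the compaction advice above, this PySpark sketch (placeholder paths and partition count) rewrites many small CSV files as fewer, larger Parquet files:

# Compact fragmented small files into fewer, larger Parquet files upstream of the reader.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

small = spark.read.option("header", True).csv("wasbs://landing@mystorageacct.blob.core.windows.net/raw/clicks/")

(small.repartition(16)                 # choose a count that yields roughly 128 MB-1 GB per file
      .write.mode("overwrite")
      .parquet("wasbs://curated@mystorageacct.blob.core.windows.net/clicks_compacted/"))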

Security & Governance

  • SAS: Issue read/list‑only, narrow scope, short expiry.

  • Secret Key: Restrict usage, rotate regularly; prefer SAS or service principal for production.

  • Service Principal: Use Azure AD with Storage Blob Data Reader role; rotate Client Secret per policy.

  • Always use HTTPS endpoints; avoid embedding secrets in plain text—use the platform’s credential store.

Troubleshooting

| Symptom / Error | Likely Cause | Resolution |
| --- | --- | --- |
| AuthorizationPermissionMismatch / AuthenticationFailed | Insufficient permissions or invalid credentials | Verify SAS scope/expiry; confirm service principal role; check account key validity. |
| BlobNotFound / InvalidResourceName | Wrong container/path or typo | Confirm container name and exact path/prefix; beware of case sensitivity. |
| Signature fields not well formed (SAS) | Malformed or URL-encoded SAS string | Paste the full SAS query string or URL; ensure it hasn’t expired; avoid extra whitespace. |
| Unexpected schema / parse errors | File type option mismatch (e.g., CSV header off, JSON multiline off) | Align options to actual data; test on a single file first. |
| Slow reads / too many tasks | Fragmented small files or wide scans | Compact to Parquet, narrow Path/Query, and ensure balanced file sizes. |

Save & Next Steps

  • Click Save Task In Storage to persist your configuration.

  • Connect downstream transforms and loads; schedule the data job.

  • For ongoing ingestion, align your container prefix layout with business partitioning (e.g., by date or source system) to simplify incremental processing.

Example Configurations

SAS (CSV, header, and schema inference)

  • Read using: Shared Access Signature

  • Account Name: mystorageacct

  • Container: landing

  • File type: CSV → Header=On, Infer Schema=On

  • Path: hr/hires/2025/09/*.csv

  • Query:

SELECT name, team, CAST(salary AS DOUBLE) AS salary
FROM <table>
WHERE hire_date >= DATE '2025-01-01'

Service Principal (Parquet, directory read)

  • Read using: Principal Secret (Client ID / Tenant ID / Client Secret)

  • Account Name: mystorageacct

  • Container: curated

  • File type: PARQUET

  • Path: sales/year=2025/month=09/

  • Read Directory: Enabled