Azure Blob Reader

Configure the Azure Blob Reader task to ingest files from an Azure Blob Storage container into your data job (Spark) for downstream transforms and analytics.

Prerequisites

  • Access to the target Azure Storage account and container.

  • One of the supported authentication methods:

    • Shared Access Signature (SAS)

    • Account Secret Key

    • Azure AD Application (Service Principal): Client ID + Tenant ID + Client Secret

  • Network egress from the job’s compute environment to Azure Storage endpoints.

  • Knowledge of the container, path/prefix, and file type (CSV/JSON/Parquet/Avro/XML).

Tip: Start with a small sample (use a narrow Path and/or a LIMIT in Query) to validate connectivity and schema before full loads.

Quick Start

  1. Drag the Azure Blob Reader task to the workspace and open it.

  2. On Meta Information, choose Read using (SAS, Secret Key, or Principal Secret).

  3. Provide required fields (account, container, credentials).

  4. Select File type; complete the options that appear for that type.

  5. Set Path (and optionally Read Directory) and, if needed, enter a Spark SQL Query.

  6. Click Save Task In Storage to persist the configuration.

Read using (Authentication Modes)

You can authenticate in three ways. Configure the fields for the option you select.

1) Read using Shared Access Signature (SAS)

A SAS grants time‑bound, scoped access to account resources without sharing the account key.

| Field | Required | Example | Description / Notes |
| --- | --- | --- | --- |
| Shared Access Signature | * | ?sv=2024-01-01&ss=b&srt=co&sp=rl&se=2025-09-30T23:59Z&sig=… | Full SAS query string (or complete SAS URL if your UI supports it). Ensure scope includes the target container and permissions (read/list). |
| Account Name | * | mystorageacct | Storage account name. |
| Container | * | landing | Container holding the files. |
| File type | * | CSV / JSON / PARQUET / AVRO / XML | Determines additional options (see File Type Options). |
| Path | * | raw/2025/09/*.parquet | Object path/prefix inside the container. Supports wildcards where applicable. |
| Read Directory | | ✓ / blank | When enabled, reads all files under the specified path/prefix. |
| Query | | Spark SQL | Optional post-load SQL (filters, projections, joins). |

Best practice: Use least‑privilege SAS (only read/list), set a short expiry, and scope to the narrowest path you need.
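
For orientation, the SAS fields above map roughly to the following Spark setup. This is a minimal PySpark sketch, not the task's internal implementation; the account, container, token, and path are placeholders, and fs.azure.sas.<container>.<account> is the standard Hadoop setting for container-scoped SAS access over wasbs://.

# Minimal PySpark sketch of a SAS-scoped read (illustrative only; all values are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("azure-blob-reader-sas").getOrCreate()

account = "mystorageacct"
container = "landing"
sas_token = "sv=2024-01-01&ss=b&srt=co&sp=rl&se=2025-09-30T23:59Z&sig=..."  # SAS query string, no leading '?'

# Hadoop setting for container-scoped SAS access.
spark.conf.set(f"fs.azure.sas.{container}.{account}.blob.core.windows.net", sas_token)

# Path/prefix as configured in the task; wildcards are allowed.
df = spark.read.parquet(f"wasbs://{container}@{account}.blob.core.windows.net/raw/2025/09/*.parquet")
df.show(5)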


2) Read using Secret Key (Account Access Key)

Account keys provide full access; prefer SAS or service principal in production where possible.

| Field | Required | Example | Description / Notes |
| --- | --- | --- | --- |
| Account Key | * | Eby8vdM02x… | Storage account key. Store securely. |
| Account Name | * | mystorageacct | Storage account name. |
| Container | * | curated | Container holding the files. |
| File type | * | CSV / JSON / PARQUET / AVRO / XML | See File Type Options. |
| Path | * | sales/2025-09/*.json | Path/prefix to the files. |
| Read Directory | | ✓ / blank | Read all files under the prefix. |
| Query | | Spark SQL | Optional post-load SQL. |

Note: Because account keys are highly privileged, rotate them regularly and restrict task access to the minimum set of operators.
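
If you need to reproduce an account-key read outside the task, a hedged PySpark equivalent is shown below; the key value and path are placeholders, and fs.azure.account.key.<account> is the standard Hadoop account-key setting.

# Minimal PySpark sketch of an account-key read (illustrative; never hard-code a real key).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("azure-blob-reader-key").getOrCreate()

account = "mystorageacct"
container = "curated"
account_key = "<fetched-from-your-credential-store>"

# Hadoop setting for account-key access.
spark.conf.set(f"fs.azure.account.key.{account}.blob.core.windows.net", account_key)

df = spark.read.json(f"wasbs://{container}@{account}.blob.core.windows.net/sales/2025-09/*.json")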


3) Read using Principal Secret (Azure AD Service Principal)

Use a registered app with Azure AD and an assigned role (e.g., Storage Blob Data Reader) on the storage account/container.

| Field | Required | Example | Description / Notes |
| --- | --- | --- | --- |
| Client ID | * | 1e2a3b4c-… | Azure AD Application (client) ID of your app registration. |
| Tenant ID | * | 72f988bf-… | Azure AD Directory (tenant) ID. |
| Client Secret | * | **** | App secret used to authenticate the service principal. |
| Account Name | * | mystorageacct | Storage account name. |
| Container | * | bronze | Container holding the files. |
| File type | * | CSV / JSON / PARQUET / AVRO / XML | See File Type Options. |
| Path | * | events/2025/09/ | Path/prefix within the container. |
| Read Directory | | ✓ / blank | Read all files under the prefix. |
| Query | | Spark SQL | Optional post-load SQL. |

RBAC: Ensure your service principal has Storage Blob Data Reader (or equivalent) on the account or container scope.
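
For reference, the Client ID / Tenant ID / Client Secret trio corresponds to Spark's standard OAuth settings for ABFS. A minimal PySpark sketch, assuming placeholder credentials and the Hadoop client-credentials token provider:

# Minimal PySpark sketch of a service-principal (OAuth) read over abfss:// (illustrative placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("azure-blob-reader-sp").getOrCreate()

account = "mystorageacct"
container = "bronze"
client_id = "<application-client-id>"
tenant_id = "<directory-tenant-id>"
client_secret = "<fetched-from-your-credential-store>"

host = f"{account}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{host}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{host}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{host}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{host}", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{host}",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

df = spark.read.parquet(f"abfss://{container}@{host}/events/2025/09/")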


File Type Options

After selecting File type, additional options appear:

CSV

  • Header — Treat first row as column names.

  • Infer Schema — Infer column types from sample rows.

Tips

  • For production, prefer an explicit schema downstream for reliability.

  • Keep delimiter/quote/escape consistent across files.
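
To see what Header and Infer Schema do, and what an explicit schema looks like instead, here is a hedged PySpark sketch; the path and columns are placeholders, and storage auth is assumed to be configured as shown earlier.

# CSV read mirroring the Header / Infer Schema options.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.getOrCreate()
path = "wasbs://landing@mystorageacct.blob.core.windows.net/hr/hires/2025/09/*.csv"

df = (spark.read
      .option("header", True)        # Header: first row holds column names
      .option("inferSchema", True)   # Infer Schema: sample rows to guess types
      .csv(path))

# Production alternative: declare the schema explicitly instead of inferring it.
schema = StructType([
    StructField("name", StringType()),
    StructField("team", StringType()),
    StructField("salary", DoubleType()),
    StructField("hire_date", DateType()),
])
df_strict = spark.read.option("header", True).schema(schema).csv(path)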

JSON

  • Multiline — Enable if records span multiple lines.

  • Charset — Character encoding (e.g., UTF‑8).

Tip: Line‑delimited JSON (one JSON object per line) reads most efficiently at scale.
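
A hedged PySpark equivalent of the JSON options (placeholder path; auth assumed configured):

# JSON read mirroring the Multiline and Charset options.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("multiLine", True)    # enable only if records span multiple lines
      .option("encoding", "UTF-8")  # Charset
      .json("wasbs://landing@mystorageacct.blob.core.windows.net/events/2025/09/*.json"))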

PARQUET

  • No extra fields. Columnar, compressed; best default for performance and schema fidelity.

AVRO

  • Compression — Deflate or Snappy.

  • Compression Level — 0–9 (when Deflate is selected).
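
In open-source Spark, reading Avro needs only the avro format (the spark-avro package on the classpath); the Deflate/Snappy choice and level correspond to write-side settings. A hedged sketch with placeholder paths:

# Avro read; compression settings apply when writing Avro, not reading it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("avro").load("wasbs://landing@mystorageacct.blob.core.windows.net/ingest/*.avro")

# Write-side equivalents of Compression / Compression Level:
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")  # 0-9, Deflate only
df.write.format("avro").mode("overwrite").save("wasbs://landing@mystorageacct.blob.core.windows.net/out/avro/")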

XML

  • Infer schema — Attempt to infer element types.

  • Path — Full or relative path to the XML file(s) within the container path context.

  • Root Tag — Root XML element name.

  • Row Tags — Element name(s) that represent individual records.

  • Join Row Tags — Merge elements from multiple row tags into a single record.
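
Infer schema, Path, and Row Tags map closely to the spark-xml data source; Root Tag and Join Row Tags are task-level fields without a one-to-one open-source equivalent, so the hedged sketch below (placeholder tag name and path) covers only the former.

# XML read via the spark-xml data source (library must be on the classpath).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("xml")
      .option("rowTag", "order")    # Row Tags: element representing one record
      .option("inferSchema", True)  # Infer schema
      .load("wasbs://landing@mystorageacct.blob.core.windows.net/xml/orders/*.xml"))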


Path & Directory Selection

  • Path points to a directory, prefix, or file pattern under the specified Container (e.g., raw/2025/09/*.parquet).

  • Read Directory processes all files beneath the Path (recursive when supported by your runtime).

Tip: Organize data by partition‑like prefixes (e.g., …/year=2025/month=09/day=11/) to enable selective reads and faster queries.
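
The sketch below contrasts a partition-style prefix read (Spark turns year=/month= folders into prunable columns) with a recursive directory read; paths are placeholders and auth is assumed to be configured.

# Partition-style prefix: reads only the matching folders and exposes year/month as columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base = "wasbs://landing@mystorageacct.blob.core.windows.net/events"

df_month = spark.read.parquet(f"{base}/year=2025/month=09/")

# Directory-style read of everything under the prefix (note: recursive lookup disables partition inference).
df_all = spark.read.option("recursiveFileLookup", True).parquet(f"{base}/")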

Query (Spark SQL)

Use Query to filter, project, or join after the dataset is registered by the reader.

SELECT order_id, customer_id, amount
FROM <table>
WHERE order_date >= DATE '2025-01-01'

Join with an already‑registered dimension

SELECT f.order_id, d.customer_name, f.amount
FROM <table> AS f
JOIN dim_customers AS d
  ON f.customer_id = d.customer_id

Note: Replace <table> with the logical name your platform assigns to the reader’s dataset (often derived from Container/Path or a Table field if present).
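
Under the hood this usually amounts to registering the reader's DataFrame under a logical name and running SQL against it. A hedged PySpark sketch, using orders as a stand-in for <table> and a placeholder source path:

# Register the reader's dataset under a logical name, then run the post-load SQL against it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("wasbs://landing@mystorageacct.blob.core.windows.net/orders/")  # placeholder source

df.createOrReplaceTempView("orders")  # "orders" stands in for <table>

result = spark.sql("""
    SELECT order_id, customer_id, amount
    FROM orders
    WHERE order_date >= DATE '2025-01-01'
""")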

Validation Checklist

  • Credentials & RBAC: Confirm SAS scope/expiry or service principal role is valid.

  • Path: Verify container and prefix exist; test with a small Path subset or add a LIMIT in Query.

  • File Type: Match options to the actual data (e.g., CSV Header/Infer Schema, JSON Multiline).

  • Schema: Validate expected columns and types with a small sample.

  • Query: Run against a small sample to confirm filters/joins and row counts.

Performance & Scaling

  • Prefer Parquet/Avro over CSV for large‑scale reads.

  • Prune early via partitioned prefixes (e.g., date‑based folders) and Query filters.

  • Avoid many small files; where possible, compact upstream to fewer, larger files.

  • Ensure adequate executor parallelism for large directories; monitor task skew and adjust the layout (e.g., balance file sizes across prefixes).
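
As a rough illustration of the compaction advice above, this PySpark sketch (placeholder paths and partition count) rewrites many small CSV files as fewer, larger Parquet files:

# Compact fragmented small files into fewer, larger Parquet files upstream of the reader.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

small = spark.read.option("header", True).csv("wasbs://landing@mystorageacct.blob.core.windows.net/raw/clicks/")

(small.repartition(16)                 # choose a count that yields roughly 128 MB-1 GB per file
      .write.mode("overwrite")
      .parquet("wasbs://curated@mystorageacct.blob.core.windows.net/clicks_compacted/"))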

Security & Governance

  • SAS: Issue read/list‑only, narrow scope, short expiry.

  • Secret Key: Restrict usage, rotate regularly; prefer SAS or service principal for production.

  • Service Principal: Use Azure AD with Storage Blob Data Reader role; rotate Client Secret per policy.

  • Always use HTTPS endpoints; avoid embedding secrets in plain text—use the platform’s credential store.

Troubleshooting

| Symptom / Error | Likely Cause | Resolution |
| --- | --- | --- |
| AuthorizationPermissionMismatch / AuthenticationFailed | Insufficient permissions or invalid credentials | Verify SAS scope/expiry; confirm service principal role; check account key validity. |
| BlobNotFound / InvalidResourceName | Wrong container/path or typo | Confirm container name and exact path/prefix; beware of case sensitivity. |
| Signature fields not well formed (SAS) | Malformed or URL-encoded SAS string | Paste the full SAS query string or URL; ensure it hasn’t expired; avoid extra whitespace. |
| Unexpected schema / parse errors | File type option mismatch (e.g., CSV header off, JSON multiline off) | Align options to actual data; test on a single file first. |
| Slow reads / too many tasks | Fragmented small files or wide scans | Compact to Parquet, narrow Path/Query, and ensure balanced file sizes. |

Save & Next Steps

  • Click Save Task In Storage to persist your configuration.

  • Connect downstream transforms and loads; schedule the data job.

  • For ongoing ingestion, align your container prefix layout with business partitioning (e.g., by date or source system) to simplify incremental processing.

Example Configurations

SAS (CSV, header, and schema inference)

  • Read using: Shared Access Signature

  • Account Name: mystorageacct

  • Container: landing

  • File type: CSV → Header=On, Infer Schema=On

  • Path: hr/hires/2025/09/*.csv

  • Query:

SELECT name, team, CAST(salary AS DOUBLE) AS salary
FROM <table>
WHERE hire_date >= DATE '2025-01-01'

Service Principal (Parquet, directory read)

  • Read using: Principal Secret (Client ID / Tenant ID / Client Secret)

  • Account Name: mystorageacct

  • Container: curated

  • File type: PARQUET

  • Path: sales/year=2025/month=09/

  • Read Directory: Enabled