Azure Blob Reader
Configure the Azure Blob Reader task to ingest files from an Azure Blob Storage container into your data job (Spark) for downstream transforms and analytics.
Prerequisites
Access to the target Azure Storage account and container.
One of the supported authentication methods:
Shared Access Signature (SAS)
Account Secret Key
Azure AD Application (Service Principal) — Client ID + Tenant ID + Client Secret
Network egress from the job’s compute environment to Azure Storage endpoints.
Knowledge of the container, path/prefix, and file type (CSV/JSON/Parquet/Avro/XML).
Quick Start
Drag the Azure Blob Reader task to the workspace and open it.
Under Meta Information, choose a Read using option (SAS, Secret Key, or Principal Secret).
Provide required fields (account, container, credentials).
Select File type; complete the options that appear for that type.
Set Path (and optionally Read Directory) and, if needed, enter a Spark SQL Query.
Click Save Task In Storage to persist the configuration.
Read using (Authentication Modes)
You can authenticate in three ways. Configure the fields for the option you select.
1) Read using Shared Access Signature (SAS)
A SAS grants time‑bound, scoped access to account resources without sharing the account key.
Shared Access Signature (required): Full SAS query string (or complete SAS URL, if your UI supports it). Ensure the scope includes the target container and read/list permissions. Example: ?sv=2024-01-01&ss=b&srt=co&sp=rl&se=2025-09-30T23:59Z&sig=…
Account Name (required): Storage account name. Example: mystorageacct
Container (required): Container holding the files. Example: landing
File type (required): CSV, JSON, PARQUET, AVRO, or XML. Determines which additional options appear (see File Type Options).
Path (required): Object path/prefix inside the container; supports wildcards where applicable. Example: raw/2025/09/*.parquet
Read Directory (optional checkbox): When enabled, reads all files under the specified path/prefix.
Query (optional, Spark SQL): Post-load SQL for filters, projections, or joins.
Best practice: Use least‑privilege SAS (only read/list), set a short expiry, and scope to the narrowest path you need.
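The task applies the SAS token for you; for reference only, a minimal PySpark sketch of the equivalent manual setup, assuming the hadoop-azure (WASB) connector is on the classpath and reusing the example account, container, and path above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the SAS token for the landing container on mystorageacct (WASB scheme).
# Supply the SAS query string, typically without the leading '?'.
spark.conf.set(
    "fs.azure.sas.landing.mystorageacct.blob.core.windows.net",
    "sv=2024-01-01&ss=b&srt=co&sp=rl&se=2025-09-30T23:59Z&sig=...",
)

# Read the Parquet files under the configured Path.
df = spark.read.parquet(
    "wasbs://landing@mystorageacct.blob.core.windows.net/raw/2025/09/*.parquet"
)
df.show(5)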
2) Read using Secret Key (Account Access Key)
Account keys provide full access; prefer SAS or service principal in production where possible.
Account Key (required): Storage account key; store it securely. Example: Eby8vdM02x…
Account Name (required): Storage account name. Example: mystorageacct
Container (required): Container holding the files. Example: curated
File type (required): CSV, JSON, PARQUET, AVRO, or XML. See File Type Options.
Path (required): Path/prefix to the files. Example: sales/2025-09/*.json
Read Directory (optional checkbox): When enabled, reads all files under the prefix.
Query (optional, Spark SQL): Post-load SQL.
Note: Because account keys are highly privileged, rotate them regularly and restrict task access to the minimum set of operators.
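For comparison, a rough PySpark equivalent of account-key access (same connector assumption as the SAS sketch above; values are the examples from this section):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Make the account key available to the connector; in real jobs pull it from the
# platform's credential store rather than hard-coding it.
spark.conf.set(
    "fs.azure.account.key.mystorageacct.blob.core.windows.net",
    "Eby8vdM02x...",
)

# Read the JSON files under the configured Path.
df = spark.read.json(
    "wasbs://curated@mystorageacct.blob.core.windows.net/sales/2025-09/*.json"
)
df.printSchema()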
3) Read using Principal Secret (Azure AD Service Principal)
Use a registered app with Azure AD and an assigned role (e.g., Storage Blob Data Reader) on the storage account/container.
Client ID (required): Azure AD Application (client) ID of your app registration. Example: 1e2a3b4c-…
Tenant ID (required): Azure AD Directory (tenant) ID. Example: 72f988bf-…
Client Secret (required): App secret used to authenticate the service principal.
Account Name (required): Storage account name. Example: mystorageacct
Container (required): Container holding the files. Example: bronze
File type (required): CSV, JSON, PARQUET, AVRO, or XML. See File Type Options.
Path (required): Path/prefix within the container. Example: events/2025/09/
Read Directory (optional checkbox): When enabled, reads all files under the prefix.
Query (optional, Spark SQL): Post-load SQL.
RBAC: Ensure your service principal has Storage Blob Data Reader (or equivalent) on the account or container scope.
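The task performs the equivalent wiring internally; for illustration only, here is a minimal PySpark sketch of OAuth client-credentials access through the ABFS driver, assuming the hadoop-azure connector, an abfss:// endpoint, and the example names above (Parquet is just one possible File type for this path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

account = "mystorageacct"
tenant_id = "72f988bf-..."   # Directory (tenant) ID
client_id = "1e2a3b4c-..."   # Application (client) ID
client_secret = "..."        # fetch from the credential store; never hard-code

# Standard hadoop-azure OAuth (client credentials) settings for the ABFS driver.
base = f"{account}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{base}", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{base}",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{base}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{base}", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{base}",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)

# Read everything under the configured prefix (format must match your File type).
df = spark.read.parquet(f"abfss://bronze@{base}/events/2025/09/")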
File Type Options
After selecting File type, additional options appear:
CSV
Header — Treat first row as column names.
Infer Schema — Infer column types from sample rows.
Tips
For production, prefer an explicit schema downstream for reliability.
Keep delimiter/quote/escape consistent across files.
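These switches map to the standard Spark CSV reader options; a small sketch, assuming authentication is already configured as in one of the sections above and reusing the CSV example path from later on this page:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", True)       # Header: first row becomes column names
    .option("inferSchema", True)  # Infer Schema: sample rows to derive types
    .csv("wasbs://landing@mystorageacct.blob.core.windows.net/hr/hires/2025/09/*.csv")
)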
JSON
Multiline — Enable if records span multiple lines.
Charset — Character encoding (e.g., UTF‑8).
Tip: Line‑delimited JSON (one JSON object per line) reads most efficiently at scale.
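A corresponding JSON sketch, where Multiline and Charset translate to Spark's multiLine and encoding reader options (auth assumed configured as above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("multiLine", True)    # Multiline: records span multiple lines
    .option("encoding", "UTF-8")  # Charset
    .json("wasbs://curated@mystorageacct.blob.core.windows.net/sales/2025-09/*.json")
)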
PARQUET
No extra fields. Columnar, compressed; best default for performance and schema fidelity.
AVRO
Compression — Deflate or Snappy.
Compression Level — 0–9 (when Deflate is selected).
XML
Infer schema — Attempt to infer element types.
Path — Full or relative path to the XML file(s) within the container path context.
Root Tag — Root XML element name.
Row Tags — Element name(s) that represent individual records.
Join Row Tags — Merge elements from multiple row tags into a single record.
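XML reads in Spark are commonly handled by the spark-xml package; a hedged sketch of how Row Tags and Infer schema translate, assuming com.databricks:spark-xml is available on the cluster and using a hypothetical orders path and tag name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("xml")
    .option("rowTag", "order")    # Row Tags: the element that represents one record (hypothetical)
    .option("inferSchema", True)  # Infer schema: attempt to infer element types
    .load("wasbs://landing@mystorageacct.blob.core.windows.net/orders/2025/09/*.xml")
)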
Path & Directory Selection
Path points to a directory, prefix, or file pattern under the specified Container (e.g., raw/2025/09/*.parquet). Read Directory processes all files beneath the Path (recursive when supported by your runtime).
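In Spark terms, both styles are simply load paths; for example (auth assumed configured, names reused from earlier examples on this page):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A wildcard pattern reads only the matching objects...
pattern_df = spark.read.parquet(
    "wasbs://landing@mystorageacct.blob.core.windows.net/raw/2025/09/*.parquet"
)

# ...while a prefix with Read Directory enabled picks up every file beneath it.
directory_df = spark.read.parquet(
    "abfss://curated@mystorageacct.dfs.core.windows.net/sales/year=2025/month=09/"
)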
Query (Spark SQL)
Use Query to filter, project, or join after the dataset is registered by the reader.
SELECT order_id, customer_id, amount
FROM <table>
WHERE order_date >= DATE '2025-01-01'
Join with an already‑registered dimension
SELECT f.order_id, d.customer_name, f.amount
FROM <table> AS f
JOIN dim_customers AS d
ON f.customer_id = d.customer_id
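Conceptually, the reader loads the files, registers them under a table name (the <table> placeholder above), and then runs your Query against that name. A simplified PySpark sketch of that flow, using a hypothetical view name "orders":

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The reader loads the files and exposes them under a table name ("orders" here).
orders = spark.read.parquet(
    "wasbs://landing@mystorageacct.blob.core.windows.net/raw/2025/09/*.parquet"
)
orders.createOrReplaceTempView("orders")

# Your Query then runs against that view in place of <table>.
result = spark.sql("""
    SELECT order_id, customer_id, amount
    FROM orders
    WHERE order_date >= DATE '2025-01-01'
""")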
Validation Checklist
Credentials & RBAC: Confirm SAS scope/expiry or service principal role is valid.
Path: Verify container and prefix exist; test with a small Path subset or add a LIMIT in Query.
File Type: Match options to the actual data (e.g., CSV Header/Infer Schema, JSON Multiline).
Schema: Validate expected columns and types with a small sample.
Query: Run against a small sample to confirm filters/joins and row counts.
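One convenient way to work through this checklist is to load a single file (or a narrow pattern) and inspect it; a sketch, using a hypothetical file name under the earlier CSV example path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Point at one file (or add a LIMIT in the Query) to validate options cheaply.
sample = spark.read.option("header", True).csv(
    "wasbs://landing@mystorageacct.blob.core.windows.net/hr/hires/2025/09/part-0001.csv"
)
sample.printSchema()
sample.show(10)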
Performance & Scaling
Prefer Parquet/Avro over CSV for large‑scale reads.
Prune early via partitioned prefixes (e.g., date‑based folders) and Query filters.
Avoid many small files; where possible, compact upstream to fewer, larger files.
Ensure adequate executor parallelism for large directories; monitor task skew and adjust the layout (e.g., balance file sizes across prefixes).
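Where you control the upstream layout, a simple compaction pass can rewrite many small files into fewer, larger Parquet files; a sketch with a hypothetical target prefix (events_compacted) and file count, reusing the bronze/curated example containers:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

small_files = spark.read.json(
    "abfss://bronze@mystorageacct.dfs.core.windows.net/events/2025/09/"
)

# coalesce() reduces the output file count without a full shuffle; aim for files
# large enough (hundreds of MB) to keep per-file overhead low on later reads.
(
    small_files
    .coalesce(16)  # hypothetical target file count
    .write
    .mode("overwrite")
    .parquet("abfss://curated@mystorageacct.dfs.core.windows.net/events_compacted/2025/09/")
)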
Security & Governance
SAS: Issue read/list‑only, narrow scope, short expiry.
Secret Key: Restrict usage, rotate regularly; prefer SAS or service principal for production.
Service Principal: Use Azure AD with Storage Blob Data Reader role; rotate Client Secret per policy.
Always use HTTPS endpoints; avoid embedding secrets in plain text—use the platform’s credential store.
Troubleshooting
AuthorizationPermissionMismatch / AuthenticationFailed
Cause: Insufficient permissions or invalid credentials. Fix: Verify SAS scope/expiry, confirm the service principal role, and check account key validity.
BlobNotFound / InvalidResourceName
Cause: Wrong container/path or a typo. Fix: Confirm the container name and exact path/prefix; beware of case sensitivity.
Signature fields not well formed (SAS)
Cause: Malformed or URL-encoded SAS string. Fix: Paste the full SAS query string or URL, ensure it has not expired, and avoid extra whitespace.
Unexpected schema / parse errors
Cause: File type option mismatch (e.g., CSV Header off, JSON Multiline off). Fix: Align options with the actual data; test on a single file first.
Slow reads / too many tasks
Cause: Fragmented small files or wide scans. Fix: Compact to Parquet, narrow Path/Query, and ensure balanced file sizes.
Save & Next Steps
Click Save Task In Storage to persist your configuration.
Connect downstream transforms and loads; schedule the data job.
For ongoing ingestion, align your container prefix layout with business partitioning (e.g., by date or source system) to simplify incremental processing.
Example Configurations
SAS (CSV, header, and schema inference)
Read using: Shared Access Signature
Account Name: mystorageacct
Container: landing
File type: CSV → Header=On, Infer Schema=On
Path: hr/hires/2025/09/*.csv
Query:
SELECT name, team, CAST(salary AS DOUBLE) AS salary
FROM <table>
WHERE hire_date >= DATE '2025-01-01'
Service Principal (Parquet, directory read)
Read using: Principal Secret (Client ID / Tenant ID / Client Secret)
Account Name: mystorageacct
Container: curated
File type: PARQUET
Path: sales/year=2025/month=09/
Read Directory: Enabled
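For reference, the same directory read expressed directly in PySpark (authentication as in the Principal Secret sketch earlier; year and month act as partition columns because of the folder layout):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the configured month directly...
df = spark.read.parquet(
    "abfss://curated@mystorageacct.dfs.core.windows.net/sales/year=2025/month=09/"
)

# ...or point at the sales/ root and let a filter on the partition columns prune folders.
all_sales = spark.read.parquet("abfss://curated@mystorageacct.dfs.core.windows.net/sales/")
september = all_sales.where("year = 2025 AND month = 9")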