Sandbox Reader
Configure the Sandbox Reader task to read files from the Network pool of the Data Sandbox into your data job (Spark) for downstream transformations and analytics.
Prerequisites
Files have been published to the Sandbox Network pool by your upstream process (Data Sandbox / Data Prep).
You know which Sandbox file to read and its file type (CSV, JSON, Parquet, Avro, or XML).
The job’s compute environment can access the Sandbox storage location (managed by the platform).
Quick Start
Drag the Sandbox Reader task to the workspace and open it.
On Meta Information, review Storage Type (pre‑defined) and choose your Sandbox File from the drop‑down.
Select the File Type; configure any additional options that appear for that type.
(Optional) Provide a Spark SQL Query for filtering, projection, or joins.
Click Save Task In Storage to persist the configuration.
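Conceptually, the steps above behave like the following PySpark sketch. This is illustrative only: the platform performs the actual read and table registration, and the file path, the CSV options, and the table name sandbox_input are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sandbox-reader-sketch").getOrCreate()

# Read the selected Sandbox file with the configured file-type options (CSV here).
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("hr/hires/2025-09-10.csv"))

# The reader registers the dataset; the optional Query then runs against it.
df.createOrReplaceTempView("sandbox_input")
result = spark.sql("SELECT * FROM sandbox_input WHERE hire_date >= DATE '2025-01-01'")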
Meta Information — Field Reference
Storage Type (required)
Example: Sandbox (Network Pool)
Pre-defined by the platform; indicates reads occur from the managed Sandbox storage.

Sandbox File (required)
Example: hr/hires/2025-09-10.csv
Select from files that are currently available in the Network pool. If a file is missing, ensure it has been published from Data Sandbox/Data Prep.

File Type (required)
Example: CSV / JSON / PARQUET / AVRO / XML
Determines which additional options are shown (see File Type Options).

Query (optional)
Format: Spark SQL
Optional post-load Spark SQL executed on the registered dataset (filters, projections, joins).
File Type Options
There are five supported file types: CSV, JSON, PARQUET, AVRO, and XML. After selecting a type, additional options (if any) will appear.
CSV
Header — Treat the first row as column names.
Infer Schema — Infer column types from sample rows.
Best practices
Prefer an explicit schema over Infer Schema for large datasets or when columns can drift (see the sketch below).
Keep delimiter/quote/escape consistent across files. (If your build exposes delimiter/quote options, set them to match your data.)
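To illustrate the explicit-schema recommendation, here is a minimal PySpark sketch; the column names are hypothetical and the spark session is assumed to exist (see the Quick Start sketch).

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# An explicit schema skips the inference pass and guards against column drift.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("team", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("hire_date", DateType(), True),
])
df = spark.read.option("header", True).schema(schema).csv("hr/hires/2025-09-10.csv")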
JSON
Multiline — Enable if JSON records span multiple lines.
Charset — Character encoding (e.g., UTF‑8).
Best practices
Line‑delimited JSON (one object per line) reads most efficiently at scale.
If nested structures are present, consider normalizing downstream.
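If the task maps to Spark's JSON reader (an assumption), the options above correspond to the following sketch; the file paths are placeholders.

# Multiline JSON: records may span several lines (e.g., a pretty-printed array).
df_multi = (spark.read
            .option("multiLine", True)
            .option("encoding", "UTF-8")
            .json("path/to/records.json"))

# Line-delimited JSON (one object per line) is Spark's default and scales better.
df_lines = spark.read.json("path/to/records.jsonl")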
PARQUET
No additional fields.
Columnar, compressed; typically the best default for performance and schema fidelity.
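A minimal read sketch, assuming the task wraps Spark's Parquet reader; the path is a placeholder.

# Parquet files carry their own schema, so no header or inference options are needed.
df = spark.read.parquet("sales/2025/09/")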
AVRO
Compression — Deflate or Snappy.
Compression Level — 0–9 (when Deflate is selected).
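If these settings map to Spark's built-in Avro support (an assumption), reading needs only the format, while the compression settings correspond to write-side configs:

# Reading Avro requires the spark-avro package on the classpath.
df = spark.read.format("avro").load("path/to/data.avro")

# The Compression / Compression Level fields likely map to these Spark configs,
# which take effect when Avro is written rather than read.
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")  # 0-9; used only with deflate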
XML
The common options include:
Infer schema — Attempt to infer element types.
Path — Local path within the Sandbox context if further scoping is needed.
Root Tag — Root XML element name.
Row Tags — Element(s) that represent individual records.
Join Row Tags — Merge elements from multiple row tags into a single record.
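For orientation, the spark-xml library exposes similarly named options; whether this task wraps spark-xml is an assumption, and the row tag and path below are placeholders.

# Requires the spark-xml package (com.databricks:spark-xml).
df = (spark.read.format("xml")
      .option("rowTag", "record")     # element that represents one record
      .option("inferSchema", True)    # attempt to infer element types
      .load("path/to/data.xml"))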
Query (Spark SQL)
Use Query to filter, project, or join the dataset after the Sandbox Reader registers it as a table in the job context. Replace <table> with the logical name your platform assigns to this reader's dataset (often derived from the Sandbox File).
Filter & project
SELECT candidate_id, team, CAST(monthly_salary AS DOUBLE) AS monthly_salary
FROM <table>
WHERE joining_date >= DATE '2025-01-01'
Join with another reader’s dataset
SELECT f.order_id, d.customer_name, f.amount
FROM <table> AS f
JOIN dim_customers AS d
ON f.customer_id = d.customer_id
WHERE f.order_ts >= CURRENT_DATE - INTERVAL '30' DAY
Validation Checklist
File selection: Confirm Sandbox File points to the expected object and format.
Schema: If using Header/Infer Schema, validate column names and types against a known sample.
Query: Test on a small subset; verify expected row counts and datatypes (see the sketch after this list).
Consistency: Ensure all files in a run have compatible schemas (for partitioned directories or wildcards).
Downstream: Validate joins/aggregations and numeric/date casting in subsequent transforms.
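A minimal validation sketch in PySpark; df stands for the dataset this reader produced, and the expected counts are yours to fill in.

# Spot-check structure and volume before wiring downstream transforms.
df.printSchema()                     # verify column names and types
print("rows:", df.count())           # compare against the expected row count
df.limit(10).show(truncate=False)    # eyeball a small sample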
Performance & Scaling
Prefer Parquet/Avro over CSV for large‑scale processing.
Prune early using Query (date filters, projections).
Avoid many small files; where possible, compact upstream (e.g., via Data Prep) into fewer, larger files for better task parallelism (see the compaction sketch after this list).
Keep schemas stable across runs; schema drift increases planning time and may cause failures.
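One way to compact small files upstream, sketched in PySpark; the input path and the target file count of 16 are arbitrary examples.

# Coalesce many small input files into fewer, larger Parquet files.
df = spark.read.parquet("sales/2025/09/")
df.coalesce(16).write.mode("overwrite").parquet("sales/2025/09_compacted/")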
Security & Governance
Access to the Sandbox Network pool is governed by platform roles—use least‑privilege principles.
Avoid placing sensitive data in broad, shared locations; isolate by folder/prefix where possible.
Maintain lineage: note which Sandbox prep steps created the file and the version of transformations applied.
Save & Next Steps
Click Save Task In Storage to persist your Sandbox Reader configuration.
Chain downstream transforms (cleaning, joins, aggregations) and configure your load/sink tasks.
Schedule the job and monitor for schema drift or upstream changes.
Example Configurations
CSV with headers (quick validation)
Sandbox File:
hr/hires/2025-09-10.csv
File Type: CSV (Header=On, Infer Schema=On)
Query:
SELECT name, team, CAST(salary AS DOUBLE) AS salary
FROM <table>
WHERE hire_date >= DATE '2025-01-01'
Parquet (preferred production format)
Sandbox File:
sales/2025/09/*.parquet
File Type: PARQUET
Query:
SELECT order_id, customer_id, amount
FROM <table>
WHERE order_date >= CURRENT_DATE - INTERVAL '7' DAY