Sandbox Reader

Configure the Sandbox Reader task to read files from the Network pool of the Data Sandbox into your data job (Spark) for downstream transformations and analytics.

Prerequisites

  • Files have been published to the Sandbox Network pool by your upstream process (Data Sandbox / Data Prep).

  • You know which Sandbox file to read and its file type (CSV, JSON, Parquet, Avro, or XML).

  • The job’s compute environment can access the Sandbox storage location (managed by the platform).

Tip: Start with a small sample (use a narrow file/prefix if applicable and a LIMIT in your Spark SQL Query) to validate schema and datatypes before full loads.
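
For example, a quick sampling query to sanity-check schema and datatypes; <table> is a placeholder for the logical name your platform assigns to this reader’s dataset (see Query below):

SELECT *
FROM <table>
LIMIT 100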

Quick Start

  1. Drag the Sandbox Reader task to the workspace and open it.

  2. On Meta Information, review Storage Type (pre‑defined) and choose your Sandbox File from the drop‑down.

  3. Select the File Type; configure any additional options that appear for that type.

  4. (Optional) Provide a Spark SQL Query for filtering, projection, or joins.

  5. Click Save Task In Storage to persist the configuration.

Meta Information — Field Reference

  • Storage Type (required). Example: Sandbox (Network Pool). Pre‑defined by the platform; indicates reads occur from the managed Sandbox storage.

  • Sandbox File (required). Example: hr/hires/2025-09-10.csv. Select from files that are currently available in the Network pool. If a file is missing, ensure it has been published from Data Sandbox/Data Prep.

  • File Type (required). One of CSV, JSON, PARQUET, AVRO, or XML. Determines which additional options are shown (see File Type Options).

  • Query (optional). A Spark SQL statement executed on the registered dataset after load (filters, projections, joins).

Note: Only one source is read per Sandbox Reader instance. To combine multiple files or formats, use multiple readers and join downstream via Query or a transform task (see the join example under Query below).

File Type Options

There are five supported file types: CSV, JSON, PARQUET, AVRO, and XML. After selecting a type, additional options (if any) will appear.

CSV

  • Header — Treat the first row as column names.

  • Infer Schema — Infer column types from sample rows.

Best practices

  • Prefer an explicit schema downstream for large datasets or when columns can drift (see the sketch after this list).

  • Keep delimiter/quote/escape consistent across files. (If your build exposes delimiter/quote options, set them to match your data.)
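
As a sketch of the first point above, explicit casts in the Query can pin down types that inference might mis-detect; the column names (order_id, amount, order_date) are illustrative and assume ISO-formatted dates:

SELECT
  CAST(order_id AS BIGINT) AS order_id,
  CAST(amount AS DOUBLE) AS amount,
  CAST(order_date AS DATE) AS order_date
FROM <table>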

JSON

  • Multiline — Enable if JSON records span multiple lines.

  • Charset — Character encoding (e.g., UTF‑8).

Best practices

  • Line‑delimited JSON (one object per line) reads most efficiently at scale.

  • If nested structures are present, consider normalizing downstream.
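
As a minimal sketch of normalizing nested records downstream, assuming each record carries a hypothetical customer struct and an items array, dot paths and explode in the Query flatten the data into one row per item:

SELECT
  customer.id AS customer_id,
  item.sku AS sku,
  item.qty AS qty
FROM <table>
LATERAL VIEW explode(items) item_tbl AS item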

PARQUET

  • No additional fields.

  • Columnar, compressed; typically the best default for performance and schema fidelity.

AVRO

  • Compression — Deflate or Snappy.

  • Compression Level — 0–9 (when Deflate is selected).

XML

  • The common options include:

    • Infer schema — Attempt to infer element types.

    • Path — Local path within the Sandbox context if further scoping is needed.

    • Root Tag — Root XML element name.

    • Row Tags — Element(s) that represent individual records.

    • Join Row Tags — Merge elements from multiple row tags into a single record.
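
As a hypothetical illustration of how these options map to a file: if the content looks like <employees><employee>…</employee><employee>…</employee></employees>, Root Tag would be employees and Row Tags would be employee, so each employee element becomes one record.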

Query (Spark SQL)

Use Query to filter, project, or join the dataset after the Sandbox Reader registers it as a table in the job context. Replace <table> with the logical name your platform assigns to this reader’s dataset (often derived from the Sandbox File).

Filter & project

SELECT candidate_id, team, CAST(monthly_salary AS DOUBLE) AS monthly_salary
FROM <table>
WHERE joining_date >= DATE '2025-01-01'

Join with another reader’s dataset

SELECT f.order_id, d.customer_name, f.amount
FROM <table> AS f
JOIN dim_customers AS d
  ON f.customer_id = d.customer_id
WHERE f.order_ts >= CURRENT_DATE - INTERVAL '30' DAY

Validation Checklist

  1. File selection: Confirm Sandbox File points to the expected object and format.

  2. Schema: If using Header/Infer Schema, validate column names and types against a known sample.

  3. Query: Test on a small subset; verify expected row counts and datatypes (see the sketch after this list).

  4. Consistency: Ensure all files in a run have compatible schemas (for partitioned directories or wildcards).

  5. Downstream: Validate joins/aggregations and numeric/date casting in subsequent transforms.
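
As a sketch of the Query check in step 3, a quick profile query can confirm row counts and value ranges before scaling up; the date column name is illustrative:

SELECT
  COUNT(*) AS row_count,
  MIN(order_date) AS min_order_date,
  MAX(order_date) AS max_order_date
FROM <table>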

Performance & Scaling

  • Prefer Parquet/Avro over CSV for large‑scale processing.

  • Prune early using Query (date filters, projections).

  • Avoid large numbers of small files; where possible, compact upstream (e.g., via Data Prep) into fewer, larger files to reduce per-file read and scheduling overhead.

  • Keep schemas stable across runs; schema drift increases planning time and may cause failures.

Security & Governance

  • Access to the Sandbox Network pool is governed by platform roles—use least‑privilege principles.

  • Avoid placing sensitive data in broad, shared locations; isolate by folder/prefix where possible.

  • Maintain lineage: note which Sandbox prep steps created the file and the version of transformations applied.

Save & Next Steps

  • Click Save Task In Storage to persist your Sandbox Reader configuration.

  • Chain downstream transforms (cleaning, joins, aggregations) and configure your load/sink tasks.

  • Schedule the job and monitor for schema drift or upstream changes.

Example Configurations

CSV with headers (quick validation)

  • Sandbox File: hr/hires/2025-09-10.csv

  • File Type: CSV (Header=On, Infer Schema=On)

  • Query:

SELECT name, team, CAST(salary AS DOUBLE) AS salary
FROM <table>
WHERE hire_date >= DATE '2025-01-01'

Parquet (preferred production format)

  • Sandbox File: sales/2025/09/*.parquet

  • File Type: PARQUET

  • Query:

SELECT order_id, customer_id, amount
FROM <table>
WHERE order_date >= CURRENT_DATE - INTERVAL '7' DAY