Sandbox Reader

The Sandbox Reader component reads data files stored in a configured sandbox environment. It supports multiple file formats, column filtering, partitioned reads, and Spark SQL queries for flexible data exploration.

Note: Before using the Sandbox Reader, upload a file to Data Sandbox under the Data Center module.

Configuration Sections

The Sandbox Reader configuration is organized into the following sections:

  • Basic Information

  • Meta Information

  • Resource Configuration

  • Connection Validation

Using Sandbox Reader

  1. Navigate to the Data Pipeline Editor.

  2. Expand the Readers section in the component palette.

  3. Drag and drop the Sandbox Reader into the workflow editor.

  4. Click the component to open its configuration tabs.

Basic Information Tab

| Parameter | Description | Example | Required |
|---|---|---|---|
| Invocation Type | Select how the reader runs: Real-Time or Batch. | Batch | Yes |
| Deployment Type | Deployment type (pre-selected). | Kubernetes | No |
| Container Image Version | Docker container image version (pre-selected). | v1.2.3 | No |
| Failover Event | Select a failover event. | retry_event | Optional |
| Batch Size | Maximum records processed per cycle (minimum = 10). | 1000 | Yes |

Meta Information Tab

The Meta Information tab contains parameters that vary depending on the Storage Type and File Type selected.

Storage Type Options

  • Network (default): Reads files using folder paths.

  • Platform: Reads files directly from sandbox-managed datasets.

Network Mode

| Parameter | Description | Example |
|---|---|---|
| File Type | File type to read (CSV, JSON, PARQUET, AVRO, XML, ORC). | CSV |
| Schema | Spark schema in JSON format (optional; see the example below). | employee_schema.json |
| Sandbox Folder Path | Folder path containing the part files. | /sandbox/sales/ |
| Limit | Maximum number of records to read. | 5000 |
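
A hedged example of what the Schema field expects, assuming it takes Spark's standard StructType JSON representation. The file name employee_schema.json and the field names are illustrative only:

{
  "type": "struct",
  "fields": [
    {"name": "employee_id", "type": "string", "nullable": true, "metadata": {}},
    {"name": "team", "type": "string", "nullable": true, "metadata": {}},
    {"name": "monthly_salary", "type": "double", "nullable": true, "metadata": {}}
  ]
}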

Platform Mode

| Parameter | Description | Example |
|---|---|---|
| File Type | File type to read. | PARQUET |
| Sandbox Name | Sandbox name for the selected file type. | employee_data |
| Sandbox File | File name (auto-filled after selecting the sandbox). | employee_2025.parquet |
| Limit | Maximum number of records to read. | 1000 |
| Query | Spark SQL query; use inputDf as the table name (see the sketch below). | SELECT * FROM inputDf |
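
The Query field runs against the loaded data registered under the table name inputDf. A minimal PySpark sketch of the equivalent behavior; the file path is hypothetical, and the view name matches the one documented above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sandbox-reader-sketch").getOrCreate()

# Load the sandbox file and expose it under the table name the reader uses.
df = spark.read.parquet("/sandbox/employee_2025.parquet")
df.createOrReplaceTempView("inputDf")

# The Query field is standard Spark SQL against that view; Limit caps the rows read.
result = spark.sql("SELECT * FROM inputDf LIMIT 1000")
result.show()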

Column Filtering

You can select specific columns instead of retrieving the full dataset (the sketch after the table shows the equivalent Spark operation).

| Field | Description | Example |
|---|---|---|
| Source Field | Column name in the file. | employee_id |
| Destination Field | Alias for the column. | emp_id |
| Column Type | Data type of the column. | STRING |
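
In Spark terms, each Source Field / Destination Field / Column Type row behaves like a select with a cast and an alias. A minimal sketch using the example values above; df stands for the DataFrame the reader loaded:

from pyspark.sql.functions import col

# Keep only employee_id, cast it to the declared type, and rename it emp_id.
filtered = df.select(col("employee_id").cast("string").alias("emp_id"))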

Additional Options:

  • Upload File: Upload CSV/JSON files (<2 MB) to auto-populate schema.

  • Download Data: Export schema structure in JSON format.

  • Delete Data: Clear all column filter details.

Partition Columns

  • Enter the name of a partition column to read data from specific partitions (see the sketch below).
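
A hedged PySpark illustration of what partition-based reads correspond to; the region column and folder path are hypothetical, and the data is assumed to have been written with partitionBy("region"):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Spark prunes partitions when a filter references the partition column.
sales = spark.read.parquet("/sandbox/sales/")
eu_sales = sales.where(col("region") == "EU")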

File Type-Specific Behavior

CSV

  • Header: Enable if the first row contains column names.

  • Infer Schema: Automatically detect schema.

  • Multiline: Enable for multiline CSV records.

  • Schema: Provide a Spark schema (JSON) to detect and filter bad records (see the sketch after this list).

    • To handle bad records, map the failover event in the Basic Information tab.
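
The CSV options above map closely to Spark's CSV reader. A minimal PySpark sketch, assuming a hypothetical /sandbox/sales/ folder; PERMISSIVE mode is one common Spark technique for surfacing bad records against an explicit schema, not necessarily what this component does internally:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Header, Infer Schema, and Multiline correspond to standard reader options.
df = (spark.read
      .option("header", "true")       # first row holds column names
      .option("inferSchema", "true")  # let Spark detect column types
      .option("multiLine", "true")    # allow records that span lines
      .csv("/sandbox/sales/"))

# With an explicit schema, malformed rows stay visible instead of being inferred away.
schema = StructType([
    StructField("employee_id", StringType(), True),
    StructField("monthly_salary", DoubleType(), True),
])
df_checked = (spark.read
              .schema(schema)                # StructType built from the JSON schema file
              .option("mode", "PERMISSIVE")  # keep bad rows, nulling unparsable fields
              .csv("/sandbox/sales/"))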

JSON

  • Multiline: Enable for multiline JSON.

  • Charset: Specify the encoding (UTF-8, ISO-8859-1, etc.); see the sketch below.
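
A hedged PySpark equivalent of these two options; the path is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("multiLine", "true")       # JSON records may span lines
      .option("encoding", "ISO-8859-1")  # charset of the source file
      .json("/sandbox/events.json"))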

PARQUET

  • No extra fields required.

AVRO

  • Compression: Select Deflate or Snappy (default).

  • Compression Level: Available when Deflate is selected (0–9).

XML

  • Infer Schema: Enable to detect column schema.

  • Path: File path.

  • Root Tag: Root element in XML.

  • Row Tags: Row-level tag(s) that delimit one record (see the sketch below).

  • Join Row Tags: Enable to join multiple row tags.
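
Assuming XML support comes from the widely used spark-xml data source (an assumption, not confirmed by this page), a row-tag read looks like the following; the tag and path are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("xml")                 # provided by the spark-xml package
      .option("rowTag", "employee")  # tag that delimits one record
      .load("/sandbox/employees.xml"))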

ORC

  • Push Down: Predicate pushdown option (illustrated below):

    • True: Enables pushdown, improving query performance by filtering at the storage level.

    • False: Filtering happens only after the data is loaded (slower).
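
In plain Spark, this behavior corresponds to the spark.sql.orc.filterPushdown setting; the mapping to the component's toggle is an assumption, and the path and column are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.orc.filterPushdown", "true")

# With pushdown enabled, this filter can be evaluated at the storage layer
# instead of after the full scan.
orders = spark.read.orc("/sandbox/orders_orc/")
recent = orders.where(col("year") >= 2024)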

Example Query

SELECT 
    team AS Team, 
    AVG(monthly_salary) AS Average_Salary, 
    AVG(experience) AS Average_Experience 
FROM inputDf 
GROUP BY team;

Notes

  • Fields marked with (*) are mandatory.

  • Either table or query must be specified (except for SFTP Reader).

  • Ensure column types in Column Filter definitions match the source data types.

  • Available fields vary by File Type selection.