# DB Reader — Meta Information Tab (Configuration Guide)
Configure the DB Reader task to ingest data from relational/analytical databases into your pipeline.
Supported sources: MySQL, Microsoft SQL Server (MSSQL), Oracle, ClickHouse, Snowflake, PostgreSQL, Amazon Redshift.
## Prerequisites

- Network access from the pipeline runtime to the target database host/port.
- A database user with read privileges on the target database/schema/table(s).
- (Recommended) Encrypted connections (TLS/SSL) enabled on the database server.
- Familiarity with the source tables or SQL, so you can supply either Table name(s) or a Query (only one of the two is required).
## Quick Start

1. Drag the DB Reader task to the workspace and open it.
2. In Meta Information, select a Driver, then provide the Host IP Address, Port, Database name, and credentials.
3. Provide either Table name(s) or a Query (Spark SQL).
4. (Optional) Set Fetch Size and the Create Partition options for performance.
5. Click Save Task in Storage to persist the configuration.
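These fields typically map onto a Spark JDBC read under the hood. The following is a minimal PySpark sketch of the equivalent read, not the platform's actual implementation; the URL format, host, and credentials shown are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db-reader-sketch").getOrCreate()

# Driver + Host IP Address + Port + Database name, assembled into a JDBC URL
# (illustrative PostgreSQL values; the platform builds this URL for you).
jdbc_url = "jdbc:postgresql://10.10.20.25:5432/sales_dw"

orders = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.orders")   # Table name field
    .option("user", "pipeline_reader")    # User name field
    .option("password", "********")       # resolved from the secret store in practice
    .option("fetchsize", "50000")         # Fetch Size field
    .load()
)
orders.printSchema()
```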
## Meta Information — Field Reference

Fields marked * are required.

| Field | Required | Applies To | Example | Description |
|---|---|---|---|---|
| Driver | * | All | PostgreSQL | Choose among MySQL, MSSQL, Oracle, ClickHouse, Snowflake, PostgreSQL, and Redshift. |
| Host IP Address | * | All | `10.10.20.25` or `acme.snowflakecomputing.com` | Hostname or IP of the database endpoint. |
| Port | * | All | See driver defaults below | Must be open from the pipeline runtime. |
| Database name | * | All | `sales_dw` | Database/catalog to read from. For platforms with schemas, reference the schema in Table name or Query. |
| Table name | | All | `public.orders` or `hr.emp,hr.dept` | Comma-separated list for multiple tables. Leave empty if using Query. |
| User name | * | All | `pipeline_reader` | Use least-privilege credentials (read-only). |
| Password | * | All | •••••••• | Stored securely by the platform. |
| Fetch Size | | All | `50000` | Maximum rows fetched per execution cycle. Larger values reduce round trips but use more memory. |
| Create Partition | | All | Enabled/Disabled | Enables client-side partitioned reads for parallelism (see Partitioning below). |
| Partition By | | When Create Partition is enabled | Auto Increment or Index | Select Auto Increment for automatic numeric ranges, or Index to partition by a chosen Partition column. |
| Query | | All | Spark SQL | A Spark SQL statement over the given table(s). Use this for filters, projections, or joins across multiple tables. |
## Driver default ports & notes

| Driver | Default Port | Example Host | Notes |
|---|---|---|---|
| MySQL | 3306 | `mysql.internal.company.com` | Use schema-qualified tables in queries (e.g., `sales.orders`). |
| MSSQL | 1433 | `sqlsrv01.corp.local` | SQL Server Browser may alter ports; confirm with your admin if non-default. |
| Oracle | 1521 | `orcl.prod.local` | Uses the service/SID in the JDBC URL (the platform builds the URL from these fields). |
| ClickHouse | 8123 (HTTP) | `ch01.analytics.local` | Important: Spark components use the HTTP port (8123), not TCP port 9000. |
| Snowflake | 443 | `acme-xy12345.snowflakecomputing.com` | HTTPS endpoint; database/schema are referenced in Table name/Query. |
| PostgreSQL | 5432 | `pgdw.prod.local` | Set `search_path` in the database if you omit the schema in Table name. |
| Redshift | 5439 | `redshift-cluster.xxxxx.us-east-1.redshift.amazonaws.com` | Database/schema/table naming is similar to PostgreSQL. |
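Before configuring the task, it can save time to confirm the port is actually reachable from the pipeline runtime. A small connectivity probe, assuming Python is available on that host (hosts and ports below are illustrative):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: ClickHouse must be probed on the HTTP port (8123), not TCP 9000.
print(port_open("ch01.analytics.local", 8123))
print(port_open("pgdw.prod.local", 5432))
```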
## Partitioning (Performance)

Enable Create Partition to parallelize reads.

- Auto Increment: The reader creates evenly sized numeric ranges over an implicit/available numeric key (or row sequence) and reads them in parallel. Use this when a natural integer key (e.g., `id`) exists and is roughly uniform.
- Index: You choose a Partition column (typically an indexed numeric or timestamp column), and the reader creates partition ranges over that column. Best when the column is monotonic (e.g., `order_id`, `event_time`) and well distributed.

### Guidelines

- Prefer numeric or timestamp columns for partitioning.
- Start with 8–32 partitions for medium tables; scale with cluster cores.
- Avoid highly skewed columns (a few hot values).
- Partitioning is performed by the reader; it does not create indexes on the source server. (See the sketch below.)
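In Spark's JDBC source, Index-style partitioning maps to the partitionColumn, lowerBound, upperBound, and numPartitions options. A hedged sketch of what such a read looks like (the bounds, names, and host are illustrative; the platform derives them from the Partition column you select):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-read-sketch").getOrCreate()

# Spark issues numPartitions parallel queries, each scanning one slice of
# [lowerBound, upperBound] on the partition column.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://10.10.20.25:5432/sales_dw")
    .option("dbtable", "sales.orders")
    .option("user", "pipeline_reader")
    .option("password", "********")
    .option("partitionColumn", "order_id")  # the chosen Partition column
    .option("lowerBound", "1")              # observed minimum of order_id
    .option("upperBound", "10000000")       # observed maximum of order_id
    .option("numPartitions", "16")          # start with 8-32; scale with cores
    .load()
)
print(orders.rdd.getNumPartitions())  # expect 16
```

Note that rows outside the bounds are still read (they fall into the first and last partitions), so the bounds control parallelism, not filtering.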
## Query Authoring (Spark SQL)

You can leave Table name empty and supply a Spark SQL statement in Query.

Single table:

```sql
SELECT order_id, customer_id, order_ts, total_amount
FROM sales.orders
WHERE order_ts >= DATE '2025-01-01';
```

Multiple tables (join):

```sql
SELECT o.order_id,
       o.order_ts,
       c.customer_name,
       o.total_amount
FROM sales.orders AS o
JOIN sales.customers AS c
  ON o.customer_id = c.customer_id
WHERE o.order_ts >= DATE '2025-01-01';
```
## Validation Checklist

- Connectivity: test with a simple query (`SELECT 1`) or a small time-bound filter.
- Credentials: verify the user can `SELECT` from the intended tables.
- Schema: confirm that column names and types match downstream expectations.
- Performance: start with a modest Fetch Size (e.g., 10,000–50,000 rows) and enable Create Partition with a suitable column.
- Joins: if using Query, run the same SQL in a native client to validate syntax and results.
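One way to work through several of these checks at once is a single bounded probe. A sketch using Spark's `query` option (a standard alternative to `dbtable` in the JDBC source); the table and filter are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db-reader-validation").getOrCreate()

probe = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://10.10.20.25:5432/sales_dw")
    .option("user", "pipeline_reader")
    .option("password", "********")
    .option("query", "SELECT * FROM sales.orders "
                     "WHERE order_ts >= DATE '2025-01-01' LIMIT 100")
    .load()
)
probe.printSchema()   # schema check: do names and types match downstream expectations?
print(probe.count())  # connectivity + credentials check: returns up to 100 rows
```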
## Performance & Scaling

- Push down filters in Query (WHERE clauses) to minimize data transfer.
- Project only the required columns; avoid `SELECT *` in production.
- Tune Fetch Size: higher values reduce round trips but increase memory use; benchmark.
- Use Create Partition with a uniformly distributed numeric or timestamp column to parallelize reads.
- For very large tables, consider incremental ingestion (e.g., `WHERE updated_at >= :last_watermark`); a sketch follows.
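For the incremental pattern, a watermark kept by your orchestration can be substituted into the Query before each run. A minimal sketch; the watermark storage and the `updated_at` column are assumptions, not documented platform features:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-read-sketch").getOrCreate()

# Hypothetical watermark persisted between runs by your orchestration layer.
last_watermark = "2025-01-01 00:00:00"

incremental_sql = (
    "SELECT order_id, customer_id, order_ts, total_amount, updated_at "
    "FROM sales.orders "
    f"WHERE updated_at >= TIMESTAMP '{last_watermark}'"
)

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://10.10.20.25:5432/sales_dw")
    .option("user", "pipeline_reader")
    .option("password", "********")
    .option("query", incremental_sql)
    .load()
)

# After a successful load, advance the watermark to the max updated_at just read.
new_watermark = df.agg(F.max("updated_at")).first()[0]
```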
## Security & Governance

- Use least-privilege accounts (read-only).
- Prefer TLS/SSL connections and validate server certificates (see the URL example below).
- Protect credentials in the platform's connection/secret store.
- Log and review changes to the Driver, Host, and Query fields.
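For PostgreSQL-family drivers, TLS is typically requested through JDBC URL parameters. The `ssl` and `sslmode` parameters below are standard PostgreSQL JDBC options, though how (or whether) your platform exposes them is an assumption to verify:

```python
# verify-full enforces encryption and validates the server certificate + hostname.
jdbc_url = (
    "jdbc:postgresql://pgdw.prod.local:5432/sales_dw"
    "?ssl=true&sslmode=verify-full"
)
```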
## Save & Next Steps

1. Click Save Task in Storage to persist your configuration.
2. Proceed to downstream tasks (mapping, transformations, loads) and run the job.