DB Reader — Meta Information Tab (Configuration Guide)

Configure the DB Reader task to ingest data from relational/analytical databases into your pipeline.

Supported sources: MySQL, Microsoft SQL Server (MSSQL), Oracle, ClickHouse, Snowflake, PostgreSQL, Amazon Redshift.

Prerequisites

  • Network access from the pipeline runtime to the target database host/port.

  • A database user with read privileges on the target database/schema/table(s) (a sample read‑only grant follows this list).

  • (Recommended) Encrypted connections (TLS/SSL) enabled on the database server.

  • Table or SQL knowledge to define either Table name(s) or a Query (only one is required).
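
The exact statements depend on your database platform; as a minimal sketch, a PostgreSQL-style read-only account could be provisioned like this (pipeline_reader, sales_dw, and public are placeholder names):

-- Hypothetical read-only account for the DB Reader (PostgreSQL syntax).
-- All names and the password are placeholders; adapt to your platform.
CREATE USER pipeline_reader WITH PASSWORD 'change_me';
GRANT CONNECT ON DATABASE sales_dw TO pipeline_reader;
GRANT USAGE ON SCHEMA public TO pipeline_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO pipeline_reader;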

Quick Start

  1. Drag the DB Reader task to the workspace and open it.

  2. In Meta Information, select Driver, then provide Host IP Address, Port, Database name, and credentials.

  3. Provide either Table name(s) or a Query (Spark SQL).

  4. (Optional) Set Fetch Size and Create Partition options for performance.

  5. Click Save Task in Storage to persist the configuration.

Meta Information — Field Reference

Please note: Only one of Table name or Query is required. If you provide a query, leave the Table name field blank.

| Field | Required | Applies To | Example | Notes / Best Practices |
|-------|----------|------------|---------|------------------------|
| Driver | * | All | PostgreSQL | Choose among: MySQL, MSSQL, Oracle, ClickHouse, Snowflake, PostgreSQL, Redshift. |
| Host IP Address | * | All | 10.10.20.25 or acme.snowflakecomputing.com | Hostname or IP of the database endpoint. |
| Port | * | All | See driver defaults below | Must be open from the pipeline runtime. |
| Database name | * | All | sales_dw | Database/catalog to read from. For platforms with schemas, reference the schema in Table name or Query. |
| Table name |  | All | public.orders or hr.emp,hr.dept | Comma‑separated list for multiple tables. Leave empty if using Query. |
| User name | * | All | pipeline_reader | Use least‑privilege credentials (read‑only). |
| Password | * | All | •••••••• | Stored securely by the platform. |
| Fetch Size |  | All | 50000 | Maximum rows fetched per execution cycle. Larger values reduce round trips but use more memory. |
| Create Partition |  | All | Enabled/Disabled | Enables client‑side partitioned reads for parallelism (see Partitioning below). |
| Partition By |  | When Create Partition is enabled | Auto Increment or Index | Select Auto Increment for automatic numeric ranges; Index to partition by a chosen Partition column. |
| Query |  | All | Spark SQL | Provide a Spark SQL statement for the given table(s). Use this for filters, projections, or joins across multiple tables. |

Driver default ports & notes

| Driver | Default Port | Host format example | Notes |
|--------|--------------|---------------------|-------|
| MySQL | 3306 | mysql.internal.company.com | Use schema-qualified tables in queries (e.g., sales.orders). |
| MSSQL | 1433 | sqlsrv01.corp.local | SQL Server Browser may alter ports; confirm with your admin if non‑default. |
| Oracle | 1521 | orcl.prod.local | Use the service/SID in the JDBC URL (the platform builds it from these fields). |
| ClickHouse | 8123 (HTTP) | ch01.analytics.local | Important: Spark components use the HTTP port (8123), not TCP 9000. |
| Snowflake | 443 | acme-xy12345.snowflakecomputing.com | HTTPS endpoint; database/schema are referenced in Table name or Query. |
| PostgreSQL | 5432 | pgdw.prod.local | Set search_path in the database if you omit the schema in Table name. |
| Redshift | 5439 | redshift-cluster.xxxxx.us-east-1.redshift.amazonaws.com | Database/schema/table naming is similar to PostgreSQL. |

Please note (joins): For multi‑table reads (join queries), write the join directly in Query. You do not need to list all tables in Table name when using Query.

Partitioning (Performance)

Enable Create Partition to parallelize reads.

  • Auto Increment: The reader creates evenly sized numeric ranges over an implicit/available numeric key (or row sequence) and reads them in parallel. Use when a natural integer key (e.g., id) exists and is roughly uniform.

  • Index: You choose a Partition column (typically an indexed numeric or timestamp column). The reader creates partition ranges over that column. Best when the column is monotonic (e.g., order_id, event_time) and well distributed.

Guidelines

  • Prefer numeric or timestamp columns for partitioning.

  • Start with 8–32 partitions for medium tables; scale with cluster cores.

  • Avoid highly skewed columns (few hot values); the profiling query below can help check this.

  • Partitioning is performed by the reader; it does not create indexes on the source server.
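
Before enabling Create Partition with Index, it helps to profile the candidate column for range and skew. A minimal Spark SQL sketch, using the illustrative sales.orders table and order_id column from the examples in the next section:

-- Profile a candidate partition column (illustrative names).
-- A fairly even spread between min_id and max_id, with distinct_ids close
-- to row_count, suggests the column will yield balanced partitions.
SELECT MIN(order_id)            AS min_id,
       MAX(order_id)            AS max_id,
       COUNT(*)                 AS row_count,
       COUNT(DISTINCT order_id) AS distinct_ids
FROM sales.orders;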

Query Authoring (Spark SQL)

You can leave Table name empty and supply a Spark SQL statement in Query:

  • Single table

SELECT order_id, customer_id, order_ts, total_amount
FROM sales.orders
WHERE order_ts >= DATE '2025-01-01';

  • Multiple tables (join)

SELECT o.order_id,
       o.order_ts,
       c.customer_name,
       o.total_amount
FROM sales.orders AS o
JOIN sales.customers AS c
  ON o.customer_id = c.customer_id
WHERE o.order_ts >= DATE '2025-01-01';

Please note:

  • ClickHouse: The ClickHouse driver used by the Spark components connects over the HTTP port (8123), not the native TCP port (9000).

  • Snowflake: Host resolves to *.snowflakecomputing.com over 443; qualify objects as database.schema.table in queries if needed (example below).
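
For example, a fully qualified Snowflake read (database, schema, and table names are illustrative):

SELECT order_id, order_ts, total_amount
FROM SALES_DB.PUBLIC.ORDERS
WHERE order_ts >= DATE '2025-01-01';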

Validation Checklist

  1. Connectivity: Test with a simple query (SELECT 1) or a small time‑bound filter (examples below).

  2. Credentials: Verify the user can SELECT from the intended tables.

  3. Schema: Confirm the column names and types match downstream expectations.

  4. Performance: Start with a modest Fetch Size (e.g., 10–50k) and enable Create Partition with a suitable column.

  5. Joins: If using Query, run the same SQL in a native client to validate syntax and results.
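
For steps 1 and 2, a pair of throwaway queries is usually enough (table and column names are illustrative):

-- Step 1: connectivity smoke test.
SELECT 1;

-- Step 2: confirm SELECT privileges and sanity-check a small, time-bound slice.
SELECT COUNT(*) AS recent_rows
FROM sales.orders
WHERE order_ts >= DATE '2025-01-01';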

Performance & Scaling

  • Push down filters in Query (WHERE clauses) to minimize data transfer.

  • Project only required columns—avoid SELECT * in production.

  • Tune Fetch Size: higher reduces round trips but increases memory; benchmark.

  • Use Create Partition with a uniform numeric or timestamp column to parallelize reads.

  • For very large tables, consider incremental ingestion (e.g., WHERE updated_at >= :last_watermark); see the sketch below.
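
A sketch of such an incremental read, combining column projection, filter pushdown, and a watermark (updated_at and the literal timestamp are placeholders; substitute the watermark recorded by your previous run):

-- Incremental ingestion: project only required columns and read rows
-- changed since the last successful load.
SELECT order_id, customer_id, updated_at, total_amount
FROM sales.orders
WHERE updated_at >= TIMESTAMP '2025-01-01 00:00:00';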

Security & Governance

  • Use least‑privilege accounts (read‑only).

  • Prefer TLS/SSL connections; validate server certificates.

  • Protect credentials in the platform’s connection/secret store.

  • Log and review changes to Driver, Host, and Query fields.

Save & Next Steps

  • Click Save Task in Storage to persist your configuration.

  • Proceed to downstream tasks (mapping, transformations, loads) and run the job.