Creating Your First ETL Job

This page walks through the end-to-end process of creating an ETL job.

The BDB Jobs module enables users to define and automate ETL (Extract, Transform, Load) workflows for batch data processing. This guide describes how to configure extraction, transformation, and loading components, and how to leverage scripting and scheduling to build production-ready ETL pipelines.

Sample Use Case: Scheduled ETL with Spark, S3, and PostgreSQL

Scenario: An organization processes daily sales reports. Raw sales data is stored as files in an Amazon S3 bucket. Each day, the system extracts new files, applies cleaning and aggregation transformations, and loads the structured data into a PostgreSQL database for business intelligence and reporting.

Configuring the ETL Workflow

1. Extract (Source Component)

  • Add an Extraction component in the Jobs visual interface.

  • Configure the component to connect to the designated S3 bucket where raw sales data is stored (a PySpark sketch of the equivalent read follows this list).

  • Schedule the job to run at a specific time each day to ingest new files automatically.
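For orientation, the snippet below is a minimal PySpark sketch of the read that such an Extraction component performs. The bucket name, prefix, and CSV format are placeholder assumptions for this use case, not values taken from the BDB interface.

```python
from pyspark.sql import SparkSession

# Hypothetical example: application name, bucket, and path are placeholders.
spark = (
    SparkSession.builder
    .appName("daily-sales-etl")
    .getOrCreate()
)

# Read the day's raw sales files from S3. The s3a:// scheme assumes the
# Hadoop S3A connector and AWS credentials are already configured.
raw_sales = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://example-sales-bucket/raw/daily/")
)
```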

2. Transform (SQL Component with PySpark)

  • Connect the Extraction component to a SQL Transformation component.

  • The transformation component leverages Apache Spark for distributed processing.

  • Write a SQL or PySpark query to perform the required transformations (a sketch follows this list), such as:

    • Filtering out incomplete or invalid records.

    • Joining data from multiple input files.

    • Aggregating sales totals by product and region.
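As an illustration, the Spark SQL query below filters out invalid records and aggregates sales totals by product and region. The column names (product_id, region, amount) are assumptions about the raw sales schema, and raw_sales is the DataFrame produced in the extraction step above.

```python
# Register the extracted DataFrame so it can be queried with Spark SQL.
raw_sales.createOrReplaceTempView("raw_sales")

# Hypothetical cleaning and aggregation; column names are placeholders.
daily_summary = spark.sql("""
    SELECT
        product_id,
        region,
        SUM(amount) AS total_sales,
        COUNT(*)    AS order_count
    FROM raw_sales
    WHERE product_id IS NOT NULL
      AND amount IS NOT NULL
      AND amount > 0
    GROUP BY product_id, region
""")
```

A join against additional input files (for example, a product reference table) would follow the same pattern: register each DataFrame as a view and combine them in the SQL statement.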

3. Load (Destination Component)

  • Connect the Transformation component to a Load component.

  • Configure the Load component to connect to the PostgreSQL database (a PySpark JDBC sketch of this write follows the list).

  • Specify the target schema and table where the transformed data will be inserted.
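The snippet below sketches the JDBC write that such a Load component performs. The host, database, schema, table, and credentials are placeholders, not values from the BDB interface, and the PostgreSQL JDBC driver is assumed to be available on the Spark classpath.

```python
# Hypothetical example: connection details and table names are placeholders.
(
    daily_summary.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/analytics")
    .option("dbtable", "reporting.daily_sales_summary")
    .option("user", "etl_user")
    .option("password", "********")
    .option("driver", "org.postgresql.Driver")
    .mode("append")  # append today's summary to the existing table
    .save()
)
```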

Please note: The entire process, from file extraction to loading into the database, is executed as a single Spark job.

4. Automation & Scheduling

  • Job Scheduling: Define recurring schedules (e.g., daily at midnight) to automate ETL execution; the illustration after this list shows the same schedule as a cron expression.

  • Inter-Job Triggering: Configure downstream jobs to run automatically once the ETL job completes successfully. This ensures smooth integration into larger workflows.
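For readers who think of schedules as cron expressions, the snippet below shows "daily at midnight" evaluated with the croniter package. This is only an illustration of the schedule itself; the BDB Jobs scheduler is configured through its own interface, not through this code.

```python
from datetime import datetime
from croniter import croniter  # pip install croniter

# "0 0 * * *" = every day at 00:00. The starting point is an arbitrary example.
schedule = croniter("0 0 * * *", datetime(2024, 1, 1, 6, 30))
print(schedule.get_next(datetime))  # -> 2024-01-02 00:00:00
```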
