Creating Your First ETL Job
This page walks through the end-to-end process of creating an ETL Job.
The BDB Jobs module enables users to define and automate ETL (Extract, Transform, Load) workflows for batch data processing. This guide describes how to configure extraction, transformation, and loading components, and how to leverage scripting and scheduling to build production-ready ETL pipelines.
Sample Use Case: Scheduled ETL with Spark, S3, and PostgreSQL
Scenario: An organization processes daily sales reports. Raw sales data is stored as files in an Amazon S3 bucket. Each day, the system extracts new files, applies cleaning and aggregation transformations, and loads the structured data into a PostgreSQL database for business intelligence and reporting.
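The sketch below illustrates this pipeline conceptually in PySpark: read the day's raw files from S3, clean and aggregate them, and write the result to PostgreSQL over JDBC. It is not BDB component code; the bucket path, credentials, JDBC URL, table, and column names are illustrative placeholders.

```python
# Conceptual PySpark sketch of the daily sales pipeline (not BDB component code).
# Bucket, paths, credentials, JDBC URL, and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_sales_etl").getOrCreate()

# Extract: read the day's raw sales files from the S3 bucket.
raw_sales = (
    spark.read
    .option("header", True)
    .csv("s3a://example-sales-bucket/raw/2024-01-01/")
)

# Transform: drop incomplete records and aggregate totals by product and region.
clean_sales = raw_sales.dropna(subset=["product_id", "region", "amount"])
daily_totals = (
    clean_sales
    .groupBy("product_id", "region")
    .agg(F.sum("amount").alias("total_sales"))
)

# Load: append the structured result to a PostgreSQL table for BI and reporting.
(
    daily_totals.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/analytics")
    .option("dbtable", "daily_sales_totals")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .mode("append")
    .save()
)
```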
Configuring the ETL Workflow
Transform (SQL Component with PySpark)
Connect the Extraction component to a SQL Transformation component.
The transformation component leverages Apache Spark for distributed processing.
Write a SQL or PySpark query to perform the required transformations (a sketch follows this list), such as:
Filtering out incomplete or invalid records.
Joining data from multiple input files.
Aggregating sales totals by product and region.
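The following sketch shows how such a transformation might look when expressed as Spark SQL over temporary views. It assumes the SparkSession from the sketch above and two DataFrames, sales_df and products_df, produced by the extraction step; all view and column names are illustrative placeholders, and within the BDB SQL component the extracted data would be exposed through the component's own input tables.

```python
# Illustrative Spark SQL transformation; view and column names are placeholders.
# Assumes `spark`, `sales_df`, and `products_df` come from the extraction step.
sales_df.createOrReplaceTempView("sales")
products_df.createOrReplaceTempView("products")

transformed = spark.sql("""
    SELECT p.product_name,
           s.region,
           SUM(s.amount) AS total_sales
    FROM sales s
    JOIN products p                       -- join data from multiple input files
      ON s.product_id = p.product_id
    WHERE s.amount IS NOT NULL            -- filter out incomplete or invalid records
      AND s.region IS NOT NULL
    GROUP BY p.product_name, s.region     -- aggregate sales totals by product and region
""")
```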