Implementation Steps
Step 1: Database Connection Setup
The pipeline uses psycopg2 to connect to PostgreSQL. Before running it, ensure that:
- The database is accessible from your execution environment.
- It contains the required tables and views.
- The connecting user has SELECT permissions on the target tables.
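As a minimal sketch of the connection setup, the snippet below assembles a libpq-style DSN from the standard PG* environment variables so credentials stay out of the source. The helper name and the environment-variable fallbacks are illustrative assumptions, not the pipeline's actual API:

```python
import os

# Hypothetical helper; the pipeline's real connection code may differ.
def build_dsn(overrides=None):
    """Build a libpq-style DSN string from PG* environment variables."""
    settings = {
        "host": os.environ.get("PGHOST", "localhost"),
        "port": os.environ.get("PGPORT", "5432"),
        "dbname": os.environ.get("PGDATABASE", "postgres"),
        "user": os.environ.get("PGUSER", "postgres"),
    }
    settings.update(overrides or {})  # allow explicit values to win
    return " ".join(f"{k}={v}" for k, v in settings.items())

# psycopg2 usage (requires the psycopg2 package and a reachable server):
# import psycopg2
# with psycopg2.connect(build_dsn()) as conn, conn.cursor() as cur:
#     cur.execute("SELECT 1")
```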
Step 2: Query Configuration
Define your queries in the queries list within the main() function:
queries = [
    {
        "label": "table_name",  # Used for S3 file naming
        "sql": """
            SELECT column1, column2, column3
            FROM your_schema.your_table
            WHERE date_column::date BETWEEN '{start_date}' AND '{end_date}'
        """
    }
]
Step 3: S3 Structure Planning
The pipeline creates the following S3 structure:
s3://bucket-name/
└── s3-prefix/
└── table-label/
└── insertion_date=YYYY-MM-DD/
└── table-label-YYYYMMDD_HHMMSS.parquet
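The layout above can be sketched as a key-building helper. The function name and signature are hypothetical, introduced only to show how the label and run timestamp map onto the object key:

```python
from datetime import datetime

# Illustrative helper, not the pipeline's actual API.
def build_s3_key(prefix, label, run_time):
    """Compose an object key matching the layout:
    <prefix>/<label>/insertion_date=YYYY-MM-DD/<label>-YYYYMMDD_HHMMSS.parquet
    """
    date_part = run_time.strftime("%Y-%m-%d")     # Hive-style partition value
    stamp = run_time.strftime("%Y%m%d_%H%M%S")    # unique per-run suffix
    return f"{prefix}/{label}/insertion_date={date_part}/{label}-{stamp}.parquet"

# build_s3_key("s3-prefix", "orders", datetime(2024, 1, 31, 14, 5, 9))
# -> "s3-prefix/orders/insertion_date=2024-01-31/orders-20240131_140509.parquet"
```

The `insertion_date=` partition convention lets query engines such as Athena prune partitions by date.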
Step 4: Date Parameter Configuration
The pipeline supports three date modes:
Relative Date (using day parameter):
main(..., day=1) # Yesterday's data
Date Range:
main(..., start_date='2024-01-01', end_date='2024-01-31')
Single Date:
main(..., start_date='2024-01-01') # end_date defaults to start_date
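The three modes above can be sketched as a small resolver that turns the parameters into the `(start_date, end_date)` pair substituted into the SQL templates. This is a hypothetical helper under assumed semantics (e.g. `day=1` meaning yesterday); the pipeline's internals may differ:

```python
from datetime import date, timedelta

# Illustrative resolver for the three date modes; not the pipeline's API.
def resolve_dates(day=None, start_date=None, end_date=None, today=None):
    """Return (start, end) as ISO date strings.
    - day=N: the single day N days before today
    - start_date + end_date: explicit range
    - start_date only: end_date defaults to start_date
    """
    today = today or date.today()
    if day is not None:
        d = (today - timedelta(days=day)).isoformat()
        return d, d
    if start_date is None:
        raise ValueError("provide either day or start_date")
    return start_date, end_date or start_date

# The resolved pair then fills the query template:
# sql = query["sql"].format(start_date=start, end_date=end)
```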