Athena Query Content

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.

The Athena Query Executor task enables users to read data directly from external tables created in AWS Athena.

Prerequisites

  • AWS account with permissions (a minimal policy sketch follows this list):

    • Athena: StartQueryExecution, GetQueryExecution, GetQueryResults.

    • AWS Glue/Athena Data Catalog (if applicable): read metadata.

    • Amazon S3: GetObject on data buckets and PutObject/ListBucket on the query result location bucket/prefix.

  • The Athena database and external table(s) already exist.

  • The Workgroup and Query result location are known (or you will supply an explicit S3 results path).

  • Network egress from your job’s compute to Athena and S3 endpoints for the specified Region.
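
A minimal IAM policy sketch covering the permissions above. Bucket names and the results prefix are placeholders; tighten the Resource entries to your own ARNs, and note that some setups also need additional actions such as athena:GetWorkGroup or s3:GetBucketLocation.

{
  "Version": "2012-10-17",
  "Statement": [
    { "Sid": "AthenaExecution",
      "Effect": "Allow",
      "Action": ["athena:StartQueryExecution", "athena:GetQueryExecution", "athena:GetQueryResults"],
      "Resource": "*" },
    { "Sid": "GlueCatalogRead",
      "Effect": "Allow",
      "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
      "Resource": "*" },
    { "Sid": "S3DataRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-data-bucket", "arn:aws:s3:::my-data-bucket/*"] },
    { "Sid": "S3ResultsWrite",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::athena-query-results-123456789012-us-east-1",
                   "arn:aws:s3:::athena-query-results-123456789012-us-east-1/jobs/*"] }
  ]
}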

Tip: Start with a small LIMIT to validate connectivity, permissions, and schema before running full scans.
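
For example, a minimal smoke test (the table and column names here are hypothetical):

SELECT order_id, customer_id
FROM sales.orders
LIMIT 10;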

Quick Start

  1. Drag the Athena Query Executor task to the workspace and open it.

  2. In Meta Information, enter Region, Access Key, and Secret Key.

  3. Specify Database Name, Table Name, and (optionally) Limit.

  4. Provide Data Source (e.g., AwsDataCatalog) and Workgroup (e.g., primary).

  5. Set Query location (S3 path for query results), e.g., s3://athena-query-results-<acct>-<region>/jobs/.

  6. (Optional) Enter a Query to override the simple table read (see examples).

  7. Click Save Task In Storage to persist the configuration.

Meta Information — Field Reference

Field | Required | Example | Description / Best Practices
------|----------|---------|-----------------------------
Region | * | us-east-1 | AWS Region of your data and Athena Workgroup. Must match the S3 buckets used for data and results.
Access Key | * | AKIA… | AWS access key for the account/role executing queries. Prefer short‑lived credentials when possible.
Secret Key | * | wJalrXUtnF… | AWS secret key corresponding to the Access Key. Store it in the platform's secret store.
Table Name | * | sales.orders | Name of the external table. Prefer a fully qualified database.table name.
Database Name | * | sales | Athena database containing the table(s).
Limit |  | 100000 | Caps the result rows returned to Spark. Useful for sampling; combine with a SQL LIMIT when appropriate.
Data Source | * | AwsDataCatalog | Athena data source (catalog/connector). For S3‑backed tables this is typically AwsDataCatalog.
Workgroup | * | primary | Athena Workgroup used for execution, resource separation, and cost tracking.
Query location | * | s3://athena-query-results-123456789012-us-east-1/jobs/ | S3 path where Athena writes CSV results. Must be writable by your credentials.
Query |  | SELECT … FROM sales.orders | SQL statement to execute in Athena; if omitted, the task reads directly from Table Name. Use Athena (Trino) SQL syntax, not Spark SQL.

Fields marked * are required.

Usage Patterns

A. Read a table (no custom SQL)

Provide Database Name and Table Name; leave Query blank. Optionally set Limit to cap rows.
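
Conceptually, this pattern behaves like a plain scan capped by Limit; a sketch of the equivalent SQL (hypothetical table, Limit of 100000):

SELECT *
FROM sales.orders
LIMIT 100000;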

B. Filter and project with SQL

Provide Query (fully qualify table names) and an optional Limit.

SELECT order_id, customer_id, order_ts, total_amount
FROM sales.orders
WHERE order_ts >= DATE '2025-01-01'
LIMIT 100000;

C. Join multiple tables

SELECT o.order_id,
       o.order_ts,
       c.customer_name,
       o.total_amount
FROM sales.orders AS o
JOIN sales.customers AS c
  ON o.customer_id = c.customer_id
WHERE o.order_ts >= DATE '2025-01-01';

D. Partition pruning

If your table is partitioned (e.g., year, month, day), constrain the partitions:

SELECT *
FROM sales.fact_sales
WHERE year = 2025 AND month = 9 AND day >= 1 AND day <= 7;

Validation Checklist

  1. Permissions: Can your credentials write to Query location? A failure like Insufficient permissions for output location indicates missing S3 rights.

  2. Region alignment: Region, source S3 buckets, and Query location bucket must match or be allowed via appropriate endpoints.

  3. Table resolution: database.table exists and is accessible in the specified Data Source / Workgroup.

  4. Row sample: Run a small SELECT … LIMIT 10 sample and compare columns to the expected schema.

  5. Cost sanity: Confirm partition predicates and projections reduce bytes scanned (a validation sketch follows this list).
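
A validation sketch for items 3–5. Database, table, and column names are hypothetical; EXPLAIN is available on Athena engine version 2 and later.

-- 3. Table resolution: confirm the table is visible in the database
SHOW TABLES IN sales;

-- 4. Row sample: compare returned columns to the expected schema
SELECT * FROM sales.orders LIMIT 10;

-- 5. Cost sanity: inspect the plan without scanning data
EXPLAIN
SELECT order_id, total_amount
FROM sales.orders
WHERE order_ts >= DATE '2025-01-01';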

Performance & Cost Optimization

  • Partitioned tables: Always filter by partition columns.

  • Columnar formats: Prefer Parquet/ORC over CSV/JSON to reduce scan size.

  • Projection: Select only required columns; avoid SELECT * in production.

  • Workgroups: Use Workgroup settings for query limits, enforced output location, and cost controls.

  • Result reuse: Re‑reading the same large result repeatedly? Consider CTAS to persist optimized, partitioned outputs (if permitted by your process; see the sketch below).
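
A CTAS sketch that persists a filtered result as partitioned Parquet. All names and the output path are hypothetical, and external_location only works when the Workgroup does not enforce its own output location:

CREATE TABLE sales.orders_curated
WITH (
  format = 'PARQUET',
  external_location = 's3://my-curated-bucket/orders/',  -- hypothetical path
  partitioned_by = ARRAY['year', 'month']                -- partition columns must come last in the SELECT
) AS
SELECT
  order_id,
  customer_id,
  order_ts,
  total_amount,
  year(order_ts)  AS year,
  month(order_ts) AS month
FROM sales.orders
WHERE order_ts >= DATE '2025-01-01';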

Security & Governance

  • Use least‑privilege IAM policies; separate read permissions for data buckets and write permissions for result buckets.

  • Keep credentials in the platform’s secret store; rotate keys regularly or prefer assumed roles/temporary credentials.

  • Configure Workgroup to a controlled Query result location and enable query metrics/logging for auditability.

Save & Next Steps

  • Click Save Task In Storage to persist your configuration.

  • Chain downstream transforms/loads; schedule the job.

  • Monitor Workgroup query metrics and S3 bytes scanned to manage cost and performance.

Example Configurations

Sample read with LIMIT

  • Region: us-east-1

  • Access/Secret Key: (stored securely)

  • Database Name: sales

  • Table Name: orders

  • Data Source: AwsDataCatalog

  • Workgroup: primary

  • Query location: s3://athena-query-results-123456789012-us-east-1/jobs/

  • Query:

SELECT order_id, customer_id, order_ts, total_amount
FROM sales.orders
WHERE order_ts >= DATE '2025-01-01'
LIMIT 10000;

Join across two tables

Same as above; Query:

SELECT o.order_id, c.customer_name, o.total_amount
FROM sales.orders o
JOIN sales.customers c ON o.customer_id = c.customer_id
WHERE o.order_ts >= current_date - interval '30' day;