Conversation
Thank you for submitting this pull request! We appreciate your contribution to the project. Before we can merge it, we need to review the changes you've made to ensure they align with our code standards and meet the requirements of the project. We'll get back to you as soon as we can with feedback. Thanks again!
Reviewer's Guide

Adds a new technical blog post describing a step-by-step workflow for querying the Software Heritage S3 Graph Dataset using Amazon Athena, including a schema explanation, incremental Athena queries, a cost breakdown, and example results.

ER diagram for Software Heritage graph tables used in the Athena workflow:

```mermaid
erDiagram
    origin {
        string id
        string url
    }
    origin_visit_status {
        string origin
        string date
        string snapshot
    }
    snapshot {
        string id
    }
    snapshot_branch {
        string snapshot_id
        string target
        string target_type
        varbinary name
    }
    revision {
        string id
        string directory
    }
    revision_history {
        string revision_id
        string parent_id
    }
    directory {
        string id
    }
    directory_entry {
        string directory_id
        string target
        string type
        varbinary name
    }
    content {
        string sha1_git
        string sha1
    }
    release {
        string id
        string target_type
        string target
    }
    origin ||--o{ origin_visit_status : has_visits
    origin_visit_status }o--|| snapshot : references
    snapshot ||--o{ snapshot_branch : has_branches
    snapshot_branch }o--|| revision : targets_revision
    revision ||--|| directory : has_root_directory
    revision ||--o{ revision_history : has_parent
    directory ||--o{ directory_entry : contains
    directory_entry }o--|| directory : subdirectory
    directory_entry }o--|| content : file_content
    release }o--|| revision : targets_revision
    release }o--|| directory : targets_directory
```
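To make the first join in the pipeline concrete, here is a minimal sketch of pairing origin URLs with their visit dates (the `url_and_date` step). The blog post runs this in Amazon Athena over the SWH Parquet tables; this sketch uses Python's `sqlite3` with made-up toy rows instead, and the join key (`origin_visit_status.origin` referencing `origin.id`) is an assumption read off the ER diagram above, not confirmed against the real dataset schema.

```python
import sqlite3

# Toy stand-ins for the SWH tables shown in the ER diagram above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE origin (id TEXT, url TEXT);
CREATE TABLE origin_visit_status (origin TEXT, date TEXT, snapshot TEXT);

INSERT INTO origin VALUES ('o1', 'https://github.com/example/repo');
INSERT INTO origin_visit_status VALUES
  ('o1', '2023-01-05', 's1'),
  ('o1', '2024-03-17', 's2');
""")

# Step 2 of the pipeline: join origin with origin_visit_status so each
# URL is paired with its visit dates (materialized as url_and_date in Athena).
rows = con.execute("""
SELECT o.url, v.date, v.snapshot
FROM origin AS o
JOIN origin_visit_status AS v ON v.origin = o.id
ORDER BY v.date
""").fetchall()

for url, date, snapshot in rows:
    print(url, date, snapshot)
```

In Athena the same statement would be wrapped in `CREATE TABLE url_and_date AS ...` so the result is materialized for the next step rather than recomputed.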
Flow diagram for the stepwise Athena query pipeline over the SWH graph dataset:

```mermaid
flowchart TD
    A["Step 1<br/>Define Athena external tables<br/>over SWH Parquet data"] --> B["Step 2<br/>Create table url_and_date<br/>Join origin with origin_visit_status"]
    B --> C["Step 2<br/>Create table url_date_snapshot_2a<br/>Add snapshot_id from origin_visit_status"]
    C --> D["Step 3<br/>Create table snapshot_branch_filtered<br/>Filter snapshot_branch to main and master"]
    D --> E["Step 3<br/>Create table url_date_branch_2b<br/>Join url_date_snapshot_2a with snapshot_branch_filtered"]
    E --> F["Step 3<br/>Create table url_date_rev_2c<br/>Join with revision to get directory_id"]
    F --> G["Step 4<br/>Create table directory_entry_readme<br/>Filter directory_entry to README file names"]
    G --> H["Step 5<br/>Create table url_date_directory_sha_3b<br/>Join url_date_rev_2c with directory_entry_readme"]
    H --> I["Step 5<br/>Create table filtered_directory_sha1<br/>Distinct content_sha1_git"]
    I --> J["Step 5<br/>Create table content_matched<br/>Join filtered_directory_sha1 with content to get sha1"]
    J --> K["Step 5<br/>Create table url_content_final<br/>Join url_date_directory_sha_3b with content_matched"]
    K --> L["Step 6<br/>Create table filtered_github_total_table<br/>Filter to GitHub URLs"]
    L --> M["Step 6<br/>Create table filtered_github_unique<br/>Deduplicate by url using MAX_BY on visit_date"]
```
Thanks for making updates to your pull request. Our team will take a look and provide feedback as soon as possible. Please wait for any GitHub Actions to complete before editing your pull request. If you have any additional questions or concerns, feel free to let us know. Thank you for your contributions!
Hey - I've found 3 issues and left some high-level feedback:
- In the `filtered_github_unique` aggregation you drop the `sha1` column even though the text says you retain a single record with its associated content hash; consider adding something like `MAX_BY(sha1, visit_date) AS sha1` so the deduplicated table actually carries the README content hash for the latest visit.
- For consistency with other examples and to avoid ambiguity in Athena, you may want to fully qualify `url_content_final` as `default.url_content_final` in the `CREATE TABLE default.filtered_github_total_table` statement.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the `filtered_github_unique` aggregation you drop the `sha1` column even though the text says you retain a single record with its associated content hash; consider adding something like `MAX_BY(sha1, visit_date) AS sha1` so the deduplicated table actually carries the README content hash for the latest visit.
- For consistency with other examples and to avoid ambiguity in Athena, you may want to fully qualify `url_content_final` as `default.url_content_final` in the `CREATE TABLE default.filtered_github_total_table` statement.
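The first comment's suggested fix hinges on the semantics of Athena's `MAX_BY(x, y)`: it returns the value of `x` from the row where `y` is greatest, so `MAX_BY(sha1, visit_date)` keeps the content hash belonging to the latest visit. A small local illustration of that behavior, using `sqlite3` with toy data (sqlite has no `MAX_BY`, but its documented bare-column-with-`MAX()` behavior yields the same per-group result, which is enough to show what the fix would produce):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE url_content_final (url TEXT, visit_date TEXT, sha1 TEXT);
INSERT INTO url_content_final VALUES
  ('https://github.com/a/x', '2023-05-01', 'old-hash'),
  ('https://github.com/a/x', '2024-08-01', 'new-hash'),
  ('https://github.com/b/y', '2024-01-15', 'only-hash');
""")

# sqlite equivalent of the suggested Athena aggregation:
#   SELECT url, MAX(visit_date) AS visit_date,
#          MAX_BY(sha1, visit_date) AS sha1
#   FROM url_content_final GROUP BY url
# In sqlite, a bare column alongside a single MAX() aggregate is taken
# from the row where the maximum occurred, mimicking MAX_BY.
dedup = con.execute("""
SELECT url, MAX(visit_date) AS visit_date, sha1
FROM url_content_final
GROUP BY url
ORDER BY url
""").fetchall()
```

Each URL survives exactly once, carrying the `sha1` from its most recent visit, which is what the review says the deduplicated `filtered_github_unique` table should contain.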
## Individual Comments
### Comment 1
<location path="blog/swh-s3-athena.md" line_range="17" />
<code_context>
+ - Software Heritage
+ - AWS
+ - AWS S3 bucket
+ - Athena api
+ - Parquet
+---
</code_context>
<issue_to_address>
**suggestion (typo):** Consider capitalizing "API" in the tag "Athena api" for consistency with standard acronym usage.
This will align with the capitalization of other acronyms in the list (e.g., AWS, S3) and help it read less like a typo.
```suggestion
- Athena API
```
</issue_to_address>
### Comment 2
<location path="blog/swh-s3-athena.md" line_range="37" />
<code_context>
+ - An AWS account with an IAM user or role
+ - Correct permissions attached to your IAM user or role
+
+### Setup AWS S3 bucket
+
+1. Create (or confirm) an S3 bucket for Athena outputs
</code_context>
<issue_to_address>
**suggestion (typo):** Use "Set up" instead of "Setup" when used as a verb in the heading.
Because this heading is an instruction, the verb phrase "Set up AWS S3 bucket" is more appropriate than the noun "Setup."
```suggestion
### Set up AWS S3 bucket
```
</issue_to_address>
### Comment 3
<location path="blog/swh-s3-athena.md" line_range="213" />
<code_context>
+
+### Step 5. Resolving Git SHA-1 to Canonical SHA-1
+
+Once we retrieve the directory-level sha1_git values, we decompose the query into three incremental steps. First, we extract the distinct content_sha1_git values from the intermediate result set. Next, we join this reduced set against the content table to retrieve only the matching SHA1 _git and SHA1 pairs. Finally, we perform the join between the original URL/date dataset and the filtered content results.
+
+By materializing intermediate tables and reducing the join cardinality at each stage, we are able to avoid exhaustion errors and complete the retrieval successfully.
</code_context>
<issue_to_address>
**issue (typo):** Fix the spacing in "SHA1 _git" to match the intended identifier name.
Here you have a space in "SHA1 _git" that doesn’t appear elsewhere (e.g., "sha1_git"). Please remove the space and align the casing with the rest of the document.
```suggestion
Once we retrieve the directory-level sha1_git values, we decompose the query into three incremental steps. First, we extract the distinct content_sha1_git values from the intermediate result set. Next, we join this reduced set against the content table to retrieve only the matching sha1_git and sha1 pairs. Finally, we perform the join between the original URL/date dataset and the filtered content results.
```
</issue_to_address>
Summary by Sourcery
Documentation: