
chore: add athena api blog #753

Open
Aydawka wants to merge 13 commits into `main` from `athena-api-blog`

Conversation


@Aydawka (Collaborator) commented Mar 20, 2026

Summary by Sourcery

Documentation:

  • Add a user-facing guide that walks through querying the Software Heritage Graph Dataset in S3 with Amazon Athena to extract and deduplicate GitHub README content hashes, including cost considerations and example outputs.

@Aydawka Aydawka requested a review from megasanjay as a code owner March 20, 2026 17:25
@fairdataihub-bot

Thank you for submitting this pull request! We appreciate your contribution to the project. Before we can merge it, we need to review the changes you've made to ensure they align with our code standards and meet the requirements of the project. We'll get back to you as soon as we can with feedback. Thanks again!


sourcery-ai bot commented Mar 20, 2026

Reviewer's Guide

Adds a new technical blog post describing a step-by-step workflow for querying the Software Heritage S3 Graph Dataset using Amazon Athena, including schema explanation, incremental Athena queries, cost breakdown, and example results.

ER diagram for Software Heritage graph tables used in the Athena workflow

erDiagram
    origin {
        string id
        string url
    }
    origin_visit_status {
        string origin
        string date
        string snapshot
    }
    snapshot {
        string id
    }
    snapshot_branch {
        string snapshot_id
        string target
        string target_type
        varbinary name
    }
    revision {
        string id
        string directory
    }
    revision_history {
        string revision_id
        string parent_id
    }
    directory {
        string id
    }
    directory_entry {
        string directory_id
        string target
        string type
        varbinary name
    }
    content {
        string sha1_git
        string sha1
    }
    release {
        string id
        string target_type
        string target
    }

    origin ||--o{ origin_visit_status : has_visits
    origin_visit_status }o--|| snapshot : references
    snapshot ||--o{ snapshot_branch : has_branches
    snapshot_branch }o--|| revision : targets_revision
    revision ||--|| directory : has_root_directory
    revision ||--o{ revision_history : has_parent
    directory ||--o{ directory_entry : contains
    directory_entry }o--|| directory : subdirectory
    directory_entry }o--|| content : file_content
    release }o--|| revision : targets_revision
    release }o--|| directory : targets_directory

Flow diagram for the stepwise Athena query pipeline over the SWH graph dataset

flowchart TD
    A["Step 1<br/>Define Athena external tables<br/>over SWH Parquet data"] --> B["Step 2<br/>Create table url_and_date<br/>Join origin with origin_visit_status"]
    B --> C["Step 2<br/>Create table url_date_snapshot_2a<br/>Add snapshot_id from origin_visit_status"]
    C --> D["Step 3<br/>Create table snapshot_branch_filtered<br/>Filter snapshot_branch to main and master"]
    D --> E["Step 3<br/>Create table url_date_branch_2b<br/>Join url_date_snapshot_2a with snapshot_branch_filtered"]
    E --> F["Step 3<br/>Create table url_date_rev_2c<br/>Join with revision to get directory_id"]
    F --> G["Step 4<br/>Create table directory_entry_readme<br/>Filter directory_entry to README file names"]
    G --> H["Step 5<br/>Create table url_date_directory_sha_3b<br/>Join url_date_rev_2c with directory_entry_readme"]
    H --> I["Step 5<br/>Create table filtered_directory_sha1<br/>Distinct content_sha1_git"]
    I --> J["Step 5<br/>Create table content_matched<br/>Join filtered_directory_sha1 with content to get sha1"]
    J --> K["Step 5<br/>Create table url_content_final<br/>Join url_date_directory_sha_3b with content_matched"]
    K --> L["Step 6<br/>Create table filtered_github_total_table<br/>Filter to GitHub URLs"]
    L --> M["Step 6<br/>Create table filtered_github_unique<br/>Deduplicate by url using MAX_BY on visit_date"]
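The staged CTAS pattern in the flow diagram can be sketched in Athena SQL. This is a sketch only: the table and column names below are taken from the diagram and may not match the blog post's actual statements.

```sql
-- Step 2 (sketch): materialize a narrow intermediate table instead of
-- attempting a single multi-table join that exceeds Athena's limits.
-- Names (url_and_date, origin, origin_visit_status) follow the diagram.
CREATE TABLE default.url_and_date
WITH (format = 'PARQUET') AS
SELECT
    o.url,
    ovs.date     AS visit_date,
    ovs.snapshot AS snapshot_id
FROM default.origin o
JOIN default.origin_visit_status ovs
    ON o.id = ovs.origin;
```

Each subsequent step joins the previous materialized table against exactly one new source table, which keeps the per-query scan volume (and therefore Athena's per-TB cost) bounded.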

File-Level Changes

Introduce new Athena/SWH how-to blog post with detailed, stepwise SQL workflow.
  • Create new markdown blog entry with front matter metadata (title, authors, date, tags, hero image, etc.).
  • Document prerequisites and AWS setup, including S3 bucket creation and required IAM permissions for Athena and Glue.
  • Explain SWH graph schema and initial naive multi-table join that exceeds Athena limits, motivating an incremental strategy.
  • Define Athena external tables over SWH Parquet data and implement staged CTAS queries to progressively join origin, origin_visit_status, snapshot_branch, revision, directory_entry, and content tables.
  • Filter snapshot branches to main/master heads and restrict directory_entry scans to README-like filenames using hex-encoded names to control data volume and cost.
  • Materialize intermediate tables to resolve git SHA-1 to canonical SHA-1, filter to GitHub URLs, deduplicate by URL using visit_date, and produce final URL–SHA1 table.
  • Provide computational cost breakdown per step using Athena’s per-TB pricing and summarize the resulting dataset and its downstream applications.
Files: `blog/swh-s3-athena.md`
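The sha1_git-to-canonical-sha1 resolution described in the bullets above can be sketched as follows; the table names come from the flow diagram and are assumptions about the post's actual SQL:

```sql
-- Step 5 (sketch): join the deduplicated directory-level git hashes
-- against the content table to recover the canonical sha1 for each file.
CREATE TABLE default.content_matched AS
SELECT DISTINCT
    c.sha1_git,
    c.sha1
FROM default.filtered_directory_sha1 f
JOIN default.content c
    ON f.content_sha1_git = c.sha1_git;
```

Extracting the distinct hashes first shrinks the join's left side, which is what lets the query complete within Athena's resource limits.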

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@vercel

vercel bot commented Mar 20, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| fairdataihub-website | Ready | Preview, Comment | Mar 20, 2026 7:07pm |


@fairdataihub-bot

Thanks for making updates to your pull request. Our team will take a look and provide feedback as soon as possible. Please wait for any GitHub Actions to complete before editing your pull request. If you have any additional questions or concerns, feel free to let us know. Thank you for your contributions!


@sourcery-ai sourcery-ai bot left a comment


Hey, I've found 3 issues and left some high-level feedback:

  • In the filtered_github_unique aggregation you drop the sha1 column even though the text says you retain a single record with its associated content hash; consider adding something like MAX_BY(sha1, visit_date) AS sha1 so the deduplicated table actually carries the README content hash for the latest visit.
  • For consistency with other examples and to avoid ambiguity in Athena, you may want to fully qualify url_content_final as default.url_content_final in the CREATE TABLE default.filtered_github_total_table statement.
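Applying the first suggestion, the deduplication query would look roughly like this (a sketch, assuming the table and column names used elsewhere in the review):

```sql
-- Keep one row per GitHub URL, carrying the README hash of the latest visit.
CREATE TABLE default.filtered_github_unique AS
SELECT
    url,
    MAX(visit_date)          AS visit_date,
    MAX_BY(sha1, visit_date) AS sha1
FROM default.filtered_github_total_table
GROUP BY url;
```

MAX_BY(x, y) returns the value of x from the row with the maximum y, so the retained sha1 corresponds to the most recent visit_date for each URL.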
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the `filtered_github_unique` aggregation you drop the `sha1` column even though the text says you retain a single record with its associated content hash; consider adding something like `MAX_BY(sha1, visit_date) AS sha1` so the deduplicated table actually carries the README content hash for the latest visit.
- For consistency with other examples and to avoid ambiguity in Athena, you may want to fully qualify `url_content_final` as `default.url_content_final` in the `CREATE TABLE default.filtered_github_total_table` statement.

## Individual Comments

### Comment 1
<location path="blog/swh-s3-athena.md" line_range="17" />
<code_context>
+  - Software Heritage
+  - AWS
+  - AWS S3 bucket
+  - Athena api
+  - Parquet
+---
</code_context>
<issue_to_address>
**suggestion (typo):** Consider capitalizing "API" in the tag "Athena api" for consistency with standard acronym usage.

This will align with the capitalization of other acronyms in the list (e.g., AWS, S3) and help it read less like a typo.

```suggestion
  - Athena API
```
</issue_to_address>

### Comment 2
<location path="blog/swh-s3-athena.md" line_range="37" />
<code_context>
+  - An AWS account with an IAM user or role
+  - Correct permissions attached to your IAM user or role
+
+### Setup AWS S3 bucket
+
+1. Create (or confirm) an S3 bucket for Athena outputs
</code_context>
<issue_to_address>
**suggestion (typo):** Use "Set up" instead of "Setup" when used as a verb in the heading.

Because this heading is an instruction, the verb phrase "Set up AWS S3 bucket" is more appropriate than the noun "Setup."

```suggestion
### Set up AWS S3 bucket
```
</issue_to_address>

### Comment 3
<location path="blog/swh-s3-athena.md" line_range="213" />
<code_context>
+
+### Step 5. Resolving Git SHA-1 to Canonical SHA-1
+
+Once we retrieve the directory-level sha1_git values, we decompose the query into three incremental steps. First, we extract the distinct content_sha1_git values from the intermediate result set. Next, we join this reduced set against the content table to retrieve only the matching SHA1 _git and SHA1 pairs. Finally, we perform the join between the original URL/date dataset and the filtered content results.
+
+By materializing intermediate tables and reducing the join cardinality at each stage, we are able to avoid exhaustion errors and complete the retrieval successfully.
</code_context>
<issue_to_address>
**issue (typo):** Fix the spacing in "SHA1 _git" to match the intended identifier name.

Here you have a space in "SHA1 _git" that doesn’t appear elsewhere (e.g., "sha1_git"). Please remove the space and align the casing with the rest of the document.

```suggestion
Once we retrieve the directory-level sha1_git values, we decompose the query into three incremental steps. First, we extract the distinct content_sha1_git values from the intermediate result set. Next, we join this reduced set against the content table to retrieve only the matching sha1_git and sha1 pairs. Finally, we perform the join between the original URL/date dataset and the filtered content results.
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

