Conversation
Thank you for submitting this pull request! We appreciate your contribution to the project. Before we can merge it, we need to review the changes you've made to ensure they align with our code standards and meet the requirements of the project. We'll get back to you as soon as we can with feedback. Thanks again!
Reviewer's Guide

Adds a new technical blog post describing a step-by-step workflow for querying the Software Heritage S3 Graph Dataset using Amazon Athena, including a schema explanation, incremental Athena queries, a cost breakdown, and example results.

ER diagram for Software Heritage graph tables used in the Athena workflow:

```mermaid
erDiagram
    origin {
        string id
        string url
    }
    origin_visit_status {
        string origin
        string date
        string snapshot
    }
    snapshot {
        string id
    }
    snapshot_branch {
        string snapshot_id
        string target
        string target_type
        varbinary name
    }
    revision {
        string id
        string directory
    }
    revision_history {
        string revision_id
        string parent_id
    }
    directory {
        string id
    }
    directory_entry {
        string directory_id
        string target
        string type
        varbinary name
    }
    content {
        string sha1_git
        string sha1
    }
    release {
        string id
        string target_type
        string target
    }
    origin ||--o{ origin_visit_status : has_visits
    origin_visit_status }o--|| snapshot : references
    snapshot ||--o{ snapshot_branch : has_branches
    snapshot_branch }o--|| revision : targets_revision
    revision ||--|| directory : has_root_directory
    revision ||--o{ revision_history : has_parent
    directory ||--o{ directory_entry : contains
    directory_entry }o--|| directory : subdirectory
    directory_entry }o--|| content : file_content
    release }o--|| revision : targets_revision
    release }o--|| directory : targets_directory
```
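To make the first join in the pipeline concrete, here is a minimal sketch of pairing origin URLs with their visit dates (the `url_and_date` step). The blog post runs this in Amazon Athena over the SWH Parquet tables; this sketch uses Python's `sqlite3` with made-up toy rows instead, and the join key (`origin_visit_status.origin` referencing `origin.id`) is an assumption read off the ER diagram above, not confirmed against the real dataset schema.

```python
import sqlite3

# Toy stand-ins for the SWH tables shown in the ER diagram above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE origin (id TEXT, url TEXT);
CREATE TABLE origin_visit_status (origin TEXT, date TEXT, snapshot TEXT);

INSERT INTO origin VALUES ('o1', 'https://github.com/example/repo');
INSERT INTO origin_visit_status VALUES
  ('o1', '2023-01-05', 's1'),
  ('o1', '2024-03-17', 's2');
""")

# Step 2 of the pipeline: join origin with origin_visit_status so each
# URL is paired with its visit dates (materialized as url_and_date in Athena).
rows = con.execute("""
SELECT o.url, v.date, v.snapshot
FROM origin AS o
JOIN origin_visit_status AS v ON v.origin = o.id
ORDER BY v.date
""").fetchall()

for url, date, snapshot in rows:
    print(url, date, snapshot)
```

In Athena the same statement would be wrapped in `CREATE TABLE url_and_date AS ...` so the result is materialized for the next step rather than recomputed.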
Flow diagram for the stepwise Athena query pipeline over the SWH graph dataset:

```mermaid
flowchart TD
    A["Step 1<br/>Define Athena external tables<br/>over SWH Parquet data"] --> B["Step 2<br/>Create table url_and_date<br/>Join origin with origin_visit_status"]
    B --> C["Step 2<br/>Create table url_date_snapshot_2a<br/>Add snapshot_id from origin_visit_status"]
    C --> D["Step 3<br/>Create table snapshot_branch_filtered<br/>Filter snapshot_branch to main and master"]
    D --> E["Step 3<br/>Create table url_date_branch_2b<br/>Join url_date_snapshot_2a with snapshot_branch_filtered"]
    E --> F["Step 3<br/>Create table url_date_rev_2c<br/>Join with revision to get directory_id"]
    F --> G["Step 4<br/>Create table directory_entry_readme<br/>Filter directory_entry to README file names"]
    G --> H["Step 5<br/>Create table url_date_directory_sha_3b<br/>Join url_date_rev_2c with directory_entry_readme"]
    H --> I["Step 5<br/>Create table filtered_directory_sha1<br/>Distinct content_sha1_git"]
    I --> J["Step 5<br/>Create table content_matched<br/>Join filtered_directory_sha1 with content to get sha1"]
    J --> K["Step 5<br/>Create table url_content_final<br/>Join url_date_directory_sha_3b with content_matched"]
    K --> L["Step 6<br/>Create table filtered_github_total_table<br/>Filter to GitHub URLs"]
    L --> M["Step 6<br/>Create table filtered_github_unique<br/>Deduplicate by url using MAX_BY on visit_date"]
```
Thanks for making updates to your pull request. Our team will take a look and provide feedback as soon as possible. Please wait for any GitHub Actions to complete before editing your pull request. If you have any additional questions or concerns, feel free to let us know. Thank you for your contributions!
Hey - I've found 3 issues and left some high-level feedback:
- In the `filtered_github_unique` aggregation you drop the `sha1` column even though the text says you retain a single record with its associated content hash; consider adding something like `MAX_BY(sha1, visit_date) AS sha1` so the deduplicated table actually carries the README content hash for the latest visit.
- For consistency with other examples and to avoid ambiguity in Athena, you may want to fully qualify `url_content_final` as `default.url_content_final` in the `CREATE TABLE default.filtered_github_total_table` statement.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the `filtered_github_unique` aggregation you drop the `sha1` column even though the text says you retain a single record with its associated content hash; consider adding something like `MAX_BY(sha1, visit_date) AS sha1` so the deduplicated table actually carries the README content hash for the latest visit.
- For consistency with other examples and to avoid ambiguity in Athena, you may want to fully qualify `url_content_final` as `default.url_content_final` in the `CREATE TABLE default.filtered_github_total_table` statement.
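The first comment's suggested fix hinges on the semantics of Athena's `MAX_BY(x, y)`: it returns the value of `x` from the row where `y` is greatest, so `MAX_BY(sha1, visit_date)` keeps the content hash belonging to the latest visit. A small local illustration of that behavior, using `sqlite3` with toy data (sqlite has no `MAX_BY`, but its documented bare-column-with-`MAX()` behavior yields the same per-group result, which is enough to show what the fix would produce):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE url_content_final (url TEXT, visit_date TEXT, sha1 TEXT);
INSERT INTO url_content_final VALUES
  ('https://github.com/a/x', '2023-05-01', 'old-hash'),
  ('https://github.com/a/x', '2024-08-01', 'new-hash'),
  ('https://github.com/b/y', '2024-01-15', 'only-hash');
""")

# sqlite equivalent of the suggested Athena aggregation:
#   SELECT url, MAX(visit_date) AS visit_date,
#          MAX_BY(sha1, visit_date) AS sha1
#   FROM url_content_final GROUP BY url
# In sqlite, a bare column alongside a single MAX() aggregate is taken
# from the row where the maximum occurred, mimicking MAX_BY.
dedup = con.execute("""
SELECT url, MAX(visit_date) AS visit_date, sha1
FROM url_content_final
GROUP BY url
ORDER BY url
""").fetchall()
```

Each URL survives exactly once, carrying the `sha1` from its most recent visit, which is what the review says the deduplicated `filtered_github_unique` table should contain.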
## Individual Comments
### Comment 1
<location path="blog/swh-s3-athena.md" line_range="17" />
<code_context>
+ - Software Heritage
+ - AWS
+ - AWS S3 bucket
+ - Athena api
+ - Parquet
+---
</code_context>
<issue_to_address>
**suggestion (typo):** Consider capitalizing "API" in the tag "Athena api" for consistency with standard acronym usage.
This will align with the capitalization of other acronyms in the list (e.g., AWS, S3) and help it read less like a typo.
```suggestion
- Athena API
```
</issue_to_address>
### Comment 2
<location path="blog/swh-s3-athena.md" line_range="37" />
<code_context>
+ - An AWS account with an IAM user or role
+ - Correct permissions attached to your IAM user or role
+
+### Setup AWS S3 bucket
+
+1. Create (or confirm) an S3 bucket for Athena outputs
</code_context>
<issue_to_address>
**suggestion (typo):** Use "Set up" instead of "Setup" when used as a verb in the heading.
Because this heading is an instruction, the verb phrase "Set up AWS S3 bucket" is more appropriate than the noun "Setup."
```suggestion
### Set up AWS S3 bucket
```
</issue_to_address>
### Comment 3
<location path="blog/swh-s3-athena.md" line_range="213" />
<code_context>
+
+### Step 5. Resolving Git SHA-1 to Canonical SHA-1
+
+Once we retrieve the directory-level sha1_git values, we decompose the query into three incremental steps. First, we extract the distinct content_sha1_git values from the intermediate result set. Next, we join this reduced set against the content table to retrieve only the matching SHA1 _git and SHA1 pairs. Finally, we perform the join between the original URL/date dataset and the filtered content results.
+
+By materializing intermediate tables and reducing the join cardinality at each stage, we are able to avoid exhaustion errors and complete the retrieval successfully.
</code_context>
<issue_to_address>
**issue (typo):** Fix the spacing in "SHA1 _git" to match the intended identifier name.
Here you have a space in "SHA1 _git" that doesn’t appear elsewhere (e.g., "sha1_git"). Please remove the space and align the casing with the rest of the document.
```suggestion
Once we retrieve the directory-level sha1_git values, we decompose the query into three incremental steps. First, we extract the distinct content_sha1_git values from the intermediate result set. Next, we join this reduced set against the content table to retrieve only the matching sha1_git and sha1 pairs. Finally, we perform the join between the original URL/date dataset and the filtered content results.
```
</issue_to_address>
Summary by Sourcery
Documentation: