Conversation


@jazware jazware commented Jun 29, 2025

Just playing around for now with a more efficient package for crawling the whole network and a little tool to dump the network to a JSONL file.
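For a sense of the output, here is a minimal sketch of what one line of that dump could look like. The `repoRecord` fields and the `jsonlWriter` helper are illustrative names, not necessarily what the tool actually emits; the idea is just one JSON object per repo, written through a mutex so workers from many per-PDS crawlers can share a single file:

```go
package main

import (
	"bufio"
	"encoding/json"
	"os"
	"sync"
	"time"
)

// repoRecord is a hypothetical per-repo line in the JSONL dump; the real
// tool's fields may differ.
type repoRecord struct {
	Did       string    `json:"did"`
	PDS       string    `json:"pds"`
	RepoBytes int       `json:"repo_bytes"` // size of the fetched repo CAR
	FetchedAt time.Time `json:"fetched_at"`
}

// jsonlWriter appends one JSON object per line to a single file; the mutex
// lets workers from many per-PDS crawlers share it safely.
type jsonlWriter struct {
	mu  sync.Mutex
	buf *bufio.Writer
}

func newJSONLWriter(f *os.File) *jsonlWriter {
	return &jsonlWriter{buf: bufio.NewWriter(f)}
}

func (w *jsonlWriter) Write(rec repoRecord) error {
	line, err := json.Marshal(rec)
	if err != nil {
		return err
	}
	w.mu.Lock()
	defer w.mu.Unlock()
	if _, err := w.buf.Write(line); err != nil {
		return err
	}
	return w.buf.WriteByte('\n')
}

func (w *jsonlWriter) Flush() error {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.buf.Flush()
}
```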

The basic premise of this crawling strategy is to initialize a crawler per PDS and walk each PDS's listRepos responses concurrently, enqueueing getRepo jobs. Each crawler then enforces its own per-PDS concurrency limit for getRepo, letting you scale network crawling horizontally without putting outsized load on any one node in the network.
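Roughly, the structure looks like this. This is a sketch rather than the actual package, written against the public `com.atproto.sync.listRepos` and `com.atproto.sync.getRepo` XRPC endpoints, with `golang.org/x/sync/semaphore` and `golang.org/x/time/rate` standing in for the per-PDS limits:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"

	"golang.org/x/sync/semaphore"
	"golang.org/x/time/rate"
)

// listReposPage holds the parts of a com.atproto.sync.listRepos response we need.
type listReposPage struct {
	Cursor string `json:"cursor"`
	Repos  []struct {
		Did string `json:"did"`
	} `json:"repos"`
}

// pdsCrawler walks one PDS's listRepos pages and fans getRepo work out to a
// bounded, rate-limited set of workers, so no single host sees outsized load.
type pdsCrawler struct {
	host     string // e.g. "https://pds.example.com" (hypothetical)
	client   *http.Client
	limiter  *rate.Limiter       // caps getRepo calls per second on this PDS
	sem      *semaphore.Weighted // caps in-flight getRepo calls on this PDS
	inFlight int64
}

func newPDSCrawler(host string, maxInFlight int64, reqPerSec float64) *pdsCrawler {
	return &pdsCrawler{
		host:     host,
		client:   http.DefaultClient,
		limiter:  rate.NewLimiter(rate.Limit(reqPerSec), 1),
		sem:      semaphore.NewWeighted(maxInFlight),
		inFlight: maxInFlight,
	}
}

// crawl pages through listRepos, enqueueing a rate-limited getRepo job per DID.
func (c *pdsCrawler) crawl(ctx context.Context, handle func(did string, repo []byte)) error {
	cursor := ""
	for {
		page, err := c.listRepos(ctx, cursor)
		if err != nil {
			return err
		}
		for _, r := range page.Repos {
			did := r.Did
			if err := c.sem.Acquire(ctx, 1); err != nil {
				return err
			}
			go func() {
				defer c.sem.Release(1)
				if err := c.limiter.Wait(ctx); err != nil {
					return
				}
				repo, err := c.get(ctx, c.host+"/xrpc/com.atproto.sync.getRepo?did="+url.QueryEscape(did))
				if err == nil { // a real tool would log/retry failed fetches
					handle(did, repo)
				}
			}()
		}
		if page.Cursor == "" {
			return c.sem.Acquire(ctx, c.inFlight) // drain in-flight getRepo workers
		}
		cursor = page.Cursor
	}
}

func (c *pdsCrawler) listRepos(ctx context.Context, cursor string) (*listReposPage, error) {
	u := c.host + "/xrpc/com.atproto.sync.listRepos?limit=1000"
	if cursor != "" {
		u += "&cursor=" + url.QueryEscape(cursor)
	}
	body, err := c.get(ctx, u)
	if err != nil {
		return nil, err
	}
	var page listReposPage
	return &page, json.Unmarshal(body, &page)
}

func (c *pdsCrawler) get(ctx context.Context, u string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	resp, err := c.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("GET %s: %s", u, resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```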

Ideally you should be able to crawl the whole network in under 16 hours if you have the compute and bandwidth for it, with each crawler maxing out at 10 getRepo calls per second per PDS.
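Wiring it together might look something like the following. The host list is a placeholder (in practice you would enumerate PDS hosts from a relay or an existing inventory), and `newPDSCrawler`, `jsonlWriter`, and `repoRecord` are the hypothetical helpers from the sketches above, assumed to live in the same package:

```go
package main

import (
	"context"
	"log"
	"os"
	"sync"
	"time"
)

func main() {
	// Placeholder host list; enumerate real PDS hosts however you like.
	hosts := []string{
		"https://pds1.example.com",
		"https://pds2.example.com",
	}

	f, err := os.Create("network.jsonl")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	out := newJSONLWriter(f)

	ctx := context.Background()
	var wg sync.WaitGroup
	for _, host := range hosts {
		wg.Add(1)
		go func(host string) {
			defer wg.Done()
			// Up to 10 in-flight getRepos, capped at 10 requests/second per PDS.
			c := newPDSCrawler(host, 10, 10)
			err := c.crawl(ctx, func(did string, repo []byte) {
				_ = out.Write(repoRecord{
					Did:       did,
					PDS:       host,
					RepoBytes: len(repo),
					FetchedAt: time.Now().UTC(),
				})
			})
			if err != nil {
				log.Printf("crawl %s: %v", host, err)
			}
		}(host)
	}
	wg.Wait()
	if err := out.Flush(); err != nil {
		log.Fatal(err)
	}
}
```

Because the concurrency and rate caps are scoped per host, adding more PDSes to the list raises total throughput without raising the load on any single PDS.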
