fix: RateLimiter burst race, Retry-After headers, deep crawl dispatcher (#1095) #1835
Open
…er (#1095)

Three fixes for RateLimiter ineffectiveness:

1. Burst race: Added per-domain asyncio.Lock in wait_if_needed() so concurrent tasks serialize properly. Previously all tasks read last_request_time simultaneously and fired together.
2. Retry-After headers: Added optional response_headers param to update_delay() with parsing for Retry-After (seconds and HTTP-date). Both dispatcher call sites now pass result.response_headers.
3. Deep crawl dispatcher: Added dispatcher param to BFS, DFS, and BestFirst strategies, forwarded to all arun_many() calls so users can configure rate limiting for deep crawls.
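For reviewers unfamiliar with the two Retry-After formats mentioned in item 2, the parsing could look roughly like the sketch below. It is an illustration only, not the code in this PR: the function name parse_retry_after, its now parameter, and its return convention (seconds as a float, or None when the header is missing or unparseable) are assumptions made for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def parse_retry_after(
    headers: Mapping[str, str], now: Optional[datetime] = None
) -> Optional[float]:
    """Hypothetical helper: return the Retry-After delay in seconds, or None."""
    value = None
    # Header names are case-insensitive, so compare in lowercase.
    for name, raw in headers.items():
        if name.lower() == "retry-after":
            value = raw.strip()
            break
    if not value:
        return None

    # Format 1: delay-seconds, a non-negative integer such as "5".
    if value.isascii() and value.isdigit():
        return float(value)

    # Format 2: HTTP-date, such as "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC (an assumption).
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no extra wait is needed.
    return max(0.0, (when - now).total_seconds())
```

Returning None for a missing or malformed header lets the caller fall back to its existing backoff logic instead of guessing a delay.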
Summary
1. Burst race (concurrent tasks bypass rate limiting)
wait_if_needed() had no synchronization: concurrent tasks all read last_request_time at the same instant, computed wait_time ≈ 0, and fired together. Added a per-domain asyncio.Lock so tasks serialize and each waits its proper turn.

Before: 9/10 requests fire at +1.7s simultaneously (0ms gaps)
After: Requests spaced 1.2-1.8s apart across 13.7s total
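The serialization idea can be shown in isolation with a minimal sketch. This is not the RateLimiter code from this PR: DomainThrottle, its fixed delay, and the demo() helper are hypothetical names used only to illustrate how a per-domain asyncio.Lock turns the read, sleep, and update of last_request_time into one critical section.

```python
import asyncio
import time
from collections import defaultdict
from typing import Dict, List


class DomainThrottle:
    """Hypothetical stand-in for a rate limiter; not crawl4ai's RateLimiter."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self.last_request_time: Dict[str, float] = {}
        # One lock per domain, created lazily on first use.
        self.locks: Dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def wait_if_needed(self, domain: str) -> None:
        # Holding the lock while sleeping forces waiters to queue up, so each
        # task sees the timestamp written by the task ahead of it.
        async with self.locks[domain]:
            last = self.last_request_time.get(domain)
            if last is not None:
                wait_time = self.delay - (time.monotonic() - last)
                if wait_time > 0:
                    await asyncio.sleep(wait_time)
            self.last_request_time[domain] = time.monotonic()


async def demo(n_tasks: int = 4, delay: float = 0.05) -> List[float]:
    """Launch n_tasks concurrently and return the times at which they proceed."""
    throttle = DomainThrottle(delay)
    stamps: List[float] = []

    async def worker() -> None:
        await throttle.wait_if_needed("example.com")
        stamps.append(time.monotonic())

    await asyncio.gather(*(worker() for _ in range(n_tasks)))
    return sorted(stamps)


if __name__ == "__main__":
    times = asyncio.run(demo())
    print([round(b - a, 3) for a, b in zip(times, times[1:])])
```

Without the lock, every worker would read an empty last_request_time before any of them wrote to it, which matches the burst described above. Keeping the lock per domain means requests to different domains do not block each other.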
2. Retry-After header support
update_delay() only accepted (url, status_code), so server rate-limit headers were completely ignored. Added an optional response_headers param with parsing for Retry-After (both delay-seconds and HTTP-date formats). Both dispatcher call sites now pass result.response_headers.

Before: 429 with Retry-After: 5 → blind exponential backoff (1.9s)
After: 429 with Retry-After: 5 → delay set to 5.0s as server instructed

3. Deep crawl dispatcher configurability
BFS, DFS, and BestFirst strategies hardcoded arun_many() calls without passing a dispatcher. Added a dispatcher param to all three, forwarded to every arun_many() call.

Changes
crawl4ai/async_dispatcher.py: Per-domain lock in wait_if_needed(), response_headers param + _parse_retry_after() in update_delay(), both call sites updated
crawl4ai/deep_crawling/bfs_strategy.py: Added dispatcher param, forwarded to arun_many()
crawl4ai/deep_crawling/dfs_strategy.py: Forwarded self.dispatcher to arun_many()
crawl4ai/deep_crawling/bff_strategy.py: Added dispatcher param, forwarded to arun_many()

Test plan