
fix: RateLimiter burst race, Retry-After headers, deep crawl dispatcher (#1095) #1835

Open
ntohidi wants to merge 1 commit into develop from fix/rate-limiter-burst-and-headers-1095

ntohidi (Collaborator) commented Mar 16, 2026

Summary

1. Burst race (concurrent tasks bypass rate limiting)

wait_if_needed() had no synchronization: concurrent tasks all read last_request_time at the same instant, computed wait_time ≈ 0, and fired together. Added a per-domain asyncio.Lock so tasks targeting the same domain serialize and each waits its proper turn.

Before: 9/10 requests fire at +1.7s simultaneously (0ms gaps)
After: Requests spaced 1.2-1.8s apart across 13.7s total
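
For reviewers who want to see the locking pattern in isolation, here is a minimal sketch of per-domain serialization. It is not the code from this PR: the class name `SketchRateLimiter`, the fixed `delay`, and the `demo()` helper are assumptions made for illustration, and only `wait_if_needed`, `last_request_time`, and `asyncio.Lock` are taken from the description above.

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlparse


class SketchRateLimiter:
    """Illustrative stand-in, not the crawl4ai RateLimiter."""

    def __init__(self, delay: float = 0.05) -> None:
        self.delay = delay  # hypothetical fixed spacing between requests
        self.last_request_time: dict[str, float] = {}
        # One lock per domain, created lazily on first use.
        self._locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def wait_if_needed(self, url: str) -> None:
        domain = urlparse(url).netloc
        # Holding the lock across read, sleep, and write means the next
        # task observes the updated timestamp instead of a stale one.
        async with self._locks[domain]:
            last = self.last_request_time.get(domain)
            if last is not None:
                wait_time = self.delay - (time.monotonic() - last)
                if wait_time > 0:
                    await asyncio.sleep(wait_time)
            self.last_request_time[domain] = time.monotonic()


async def demo(n: int = 5, delay: float = 0.05) -> list[float]:
    """Fire n concurrent tasks at one domain and return their start times."""
    limiter = SketchRateLimiter(delay)
    stamps: list[float] = []

    async def worker() -> None:
        await limiter.wait_if_needed("https://example.com/page")
        stamps.append(time.monotonic())

    await asyncio.gather(*(worker() for _ in range(n)))
    return sorted(stamps)
```

Without the `async with` block, every worker in `demo()` would read an empty `last_request_time` before any of them wrote to it, which is the burst described above.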

2. Retry-After header support

update_delay() only accepted (url, status_code) — server rate-limit headers were completely ignored. Added optional response_headers param with parsing for Retry-After (both delay-seconds and HTTP-date formats). Both dispatcher call sites now pass result.response_headers.

Before: 429 with Retry-After: 5 → blind exponential backoff (1.9s)
After: 429 with Retry-After: 5 → delay set to 5.0s as server instructed
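
As a rough illustration of the two Retry-After formats, the sketch below parses a header value into a delay in seconds. It is an assumption about how such parsing could look, not the body of `_parse_retry_after()` from this PR; the function name `parse_retry_after`, the `now` parameter, and the choice to return `None` on unparseable input are all hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def parse_retry_after(
    headers: Optional[Mapping[str, str]],
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Return the delay in seconds requested by Retry-After, or None."""
    if not headers:
        return None
    # Header names are case-insensitive, so match on the lowered key.
    value = next(
        (v for k, v in headers.items() if k.lower() == "retry-after"), None
    )
    if value is None:
        return None
    value = value.strip()

    # Format 1: delay-seconds, a non-negative integer.
    if value.isascii() and value.isdigit():
        return float(value)

    # Format 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable: caller falls back to its own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no extra wait is required.
    return max(0.0, (when - now).total_seconds())
```

With this sketch, `parse_retry_after({"Retry-After": "5"})` yields `5.0`, matching the "After" behavior described above.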

3. Deep crawl dispatcher configurability

BFS, DFS, and BestFirst strategies hardcoded arun_many() calls without passing a dispatcher. Added dispatcher param to all three, forwarded to every arun_many() call.
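
The passthrough pattern can be sketched with stand-in classes. Nothing below is the real strategy or crawler code: `FakeCrawler`, `SketchStrategy`, the `run()` method, and the generated child URLs are hypothetical, and the only assumption carried over from the description is that `arun_many()` accepts a `dispatcher` argument.

```python
import asyncio
from typing import Any, Optional


class FakeCrawler:
    """Stand-in crawler that records the dispatcher passed to each call."""

    def __init__(self) -> None:
        self.seen: list[Any] = []

    async def arun_many(
        self, urls: list[str], dispatcher: Optional[Any] = None
    ) -> list[str]:
        self.seen.append(dispatcher)
        return [f"crawled {u}" for u in urls]


class SketchStrategy:
    """Hypothetical level-by-level strategy, not the real BFS strategy."""

    def __init__(
        self, max_depth: int = 2, dispatcher: Optional[Any] = None
    ) -> None:
        self.max_depth = max_depth
        # Stored once, then forwarded to every arun_many() call.
        self.dispatcher = dispatcher

    async def run(self, crawler: FakeCrawler, start_url: str) -> list[str]:
        results: list[str] = []
        level = [start_url]
        for _ in range(self.max_depth):
            batch = await crawler.arun_many(level, dispatcher=self.dispatcher)
            results.extend(batch)
            # Pretend each page links to exactly one child page.
            level = [f"{u}/child" for u in level]
        return results
```

Leaving `dispatcher` as `None` keeps whatever default the crawler applies, so existing callers are unaffected in this sketch.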

Changes

  • crawl4ai/async_dispatcher.py: Per-domain lock in wait_if_needed(), response_headers param + _parse_retry_after() in update_delay(), both call sites updated
  • crawl4ai/deep_crawling/bfs_strategy.py: Added dispatcher param, forwarded to arun_many()
  • crawl4ai/deep_crawling/dfs_strategy.py: Forwarded self.dispatcher to arun_many()
  • crawl4ai/deep_crawling/bff_strategy.py: Added dispatcher param, forwarded to arun_many()

Test plan

  • Reproduction script verifies all three fixes (burst serialization, Retry-After parsing, dispatcher passthrough)
  • Deep crawl with rate-limited site (e.g. gamesjobslive.niceboard.co) to verify end-to-end

…er (#1095)

Three fixes for RateLimiter ineffectiveness:

1. Burst race: Added per-domain asyncio.Lock in wait_if_needed() so
   concurrent tasks serialize properly. Previously all tasks read
   last_request_time simultaneously and fired together.

2. Retry-After headers: Added optional response_headers param to
   update_delay() with parsing for Retry-After (seconds and HTTP-date).
   Both dispatcher call sites now pass result.response_headers.

3. Deep crawl dispatcher: Added dispatcher param to BFS, DFS, and
   BestFirst strategies, forwarded to all arun_many() calls so users
   can configure rate limiting for deep crawls.