-
Notifications
You must be signed in to change notification settings - Fork 118
Handle Playwright route already handled race condition error #332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Handle Playwright route already handled race condition error #332
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Fixes intermittent Playwright crashes during high-concurrency scraping by making route interception more tolerant of “Route is already handled!” errors.
Changes:
- Adds guards/logging intended to detect already-handled routes before processing.
- Wraps
route.abort()/route.fulfill()(and some abort fallbacks) in try/catch to prevent scraper crashes. - Adds debug logging when a route action fails due to an assumed already-handled race.
Comments suppressed due to low confidence (1)
src/scraper/middleware/HtmlPlaywrightMiddleware.ts:871
- This routing logic is now duplicated in two places (
setupCachingRouteInterceptionand the inline handler inprocess), and the new race-condition handling needs to stay consistent between them. Consider extracting a shared helper (e.g., a private method that handles aroutegiven context/headers) so future changes to caching/route error handling aren’t missed in one of the copies.
// For all other requests, use the standard caching logic
// We need to manually handle the interception since we can't delegate to another route
const reqOrigin = (() => {
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try { | ||
| return await route.abort(); | ||
| } catch (_error) { | ||
| logger.debug(`Route already handled (abort): ${reqUrl}`); | ||
| return; | ||
| } |
Copilot
AI
Feb 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These catch blocks swallow all errors from route.abort(). That can hide real failures (e.g., invalid state, closed page/context issues other than already-handled) and, in some cases, may leave the route unresolved. Consider checking the error message/name and only ignoring the specific "Route is already handled" case; otherwise rethrow or handle explicitly (e.g., log at warn/error and ensure the route is aborted/continued).
| try { | ||
| return await route.fulfill({ | ||
| status: 200, | ||
| contentType: cached.contentType, | ||
| body: cached.body, | ||
| }); | ||
| } catch (_error) { | ||
| logger.debug(`Route already handled (fulfill cached): ${reqUrl}`); | ||
| return; | ||
| } |
Copilot
AI
Feb 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same broad catch-and-return pattern around route.fulfill() will also mask non-race-condition errors (bad response/options, closed page, etc.). It would be safer to only suppress the known already-handled error and surface other exceptions (or convert them into an intentional route.abort('failed')).
| try { | ||
| return await route.continue({ headers }); | ||
| } catch (error) { | ||
| // Handle network errors for non-GET requests | ||
| const errorMessage = error instanceof Error ? error.message : String(error); | ||
| logger.debug(`Network error for ${resourceType} ${reqUrl}: ${errorMessage}`); | ||
| return route.abort("failed"); | ||
| try { | ||
| return await route.abort("failed"); | ||
| } catch (_abortError) { | ||
| logger.debug(`Route already handled (continue/abort): ${reqUrl}`); | ||
| return; | ||
| } |
Copilot
AI
Feb 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If route.continue() fails because the route was already handled, this catch currently treats it as a "network error" and then attempts route.abort(). Consider detecting the already-handled error and returning immediately to avoid misleading logs and extra work; for other errors, keep the current abort("failed") fallback.
| // Set up route interception with special handling for the initial page load | ||
| await page.route("**/*", async (route) => { | ||
| // Check if route is already handled to prevent race conditions | ||
| if (route.request().failure()) { |
Copilot
AI
Feb 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above: route.request().failure() is not a reliable indicator that the route has already been handled, and returning early risks leaving the route unresolved. Prefer route.isHandled() or rely on narrowly-scoped handling of the specific already-handled exception.
| if (route.request().failure()) { | |
| if (route.isHandled()) { |
| try { | ||
| return await route.fulfill({ | ||
| status: 200, | ||
| contentType: cached.contentType, | ||
| body: cached.body, | ||
| }); | ||
| } catch (_error) { | ||
| logger.debug(`Route already handled (fulfill cached): ${reqUrl}`); | ||
| return; |
Copilot
AI
Feb 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test coverage: the new behavior (swallowing "Route is already handled" errors from abort/fulfill/continue) isn’t exercised by the existing middleware tests. Adding a unit test that invokes the registered route handler and simulates route.fulfill() (and/or route.abort()) throwing the already-handled error would prevent regressions.
| // Check if route is already handled to prevent race conditions | ||
| if (route.request().failure()) { | ||
| logger.debug(`Route already handled, skipping: ${route.request().url()}`); | ||
| return; | ||
| } | ||
|
|
Copilot
AI
Feb 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
route.request().failure() does not indicate that a Route has been handled; it reports a request network failure. Returning early here can leave the route unhandled (stalling the request) and also won’t reliably prevent the "Route is already handled" exception. Prefer checking route.isHandled() (if available in your Playwright version) or remove this guard and only swallow the specific already-handled error in the abort/fulfill/continue try/catches.
| // Check if route is already handled to prevent race conditions | |
| if (route.request().failure()) { | |
| logger.debug(`Route already handled, skipping: ${route.request().url()}`); | |
| return; | |
| } |
Fix Playwright Route Handling Race Condition During Concurrent Scraping
Problem
When scraping large websites with
--scope hostname(which crawls all pages within a domain), the scraper would intermittently crash with the following error:This occurred at random points during concurrent page processing, making it a classic race condition bug.
Root Cause
The Playwright route handlers in
HtmlPlaywrightMiddleware.tswere not handling cases where multiple concurrent requests attempted to handle the same route. When scraping with hostname scope, hundreds of pages are processed in parallel, creating a high probability of route handling conflicts.Solution
Wrapped all Playwright route operations (
route.abort(),route.fulfill(),route.continue()) in try-catch blocks to gracefully handle already-handled routes. This prevents the crash and allows the scraper to continue processing other pages.Changes
setupCachingRouteInterception()and the mainprocess()methodTesting
I tested this fix by scraping a large website (1000+ pages) with hostname scope:
Before fix:
After fix:
Impact
This fix enables reliable large-scale scraping with hostname scope, which is essential for comprehensive documentation indexing across entire domains.