@kenzaelk98

Fix Playwright Route Handling Race Condition During Concurrent Scraping

Problem

When scraping large websites with --scope hostname (which crawls all pages within a domain), the scraper would intermittently crash with the following error:

route.fulfill: Route is already handled!
at /app/dist/index.js:5520:26

This occurred at random points during concurrent page processing, making it a classic race condition bug.

Root Cause

The Playwright route handlers in HtmlPlaywrightMiddleware.ts were not handling cases where multiple concurrent requests attempted to handle the same route. When scraping with hostname scope, hundreds of pages are processed in parallel, creating a high probability of route handling conflicts.
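To make the failure mode concrete, here is a minimal sketch of the once-only rule behind the error. `FakeRoute` and `resolveTwice` are hypothetical stand-ins written for this illustration; they are not Playwright's `Route` class and not code from this PR. The sketch only assumes that a route may be resolved once and that a second attempt throws.

```typescript
// Hypothetical stand-in for a Playwright Route; NOT the real class.
class FakeRoute {
  private handled = false;

  // Mirrors the rule that a route can be resolved only once.
  private markHandled(): void {
    if (this.handled) {
      throw new Error("Route is already handled!");
    }
    this.handled = true;
  }

  async fulfill(_response: { status: number; body: string }): Promise<void> {
    this.markHandled();
  }

  async abort(_errorCode?: string): Promise<void> {
    this.markHandled();
  }
}

// Resolve the same route twice and report what the second call did.
async function resolveTwice(route: FakeRoute): Promise<string> {
  await route.fulfill({ status: 200, body: "cached" });
  try {
    await route.abort("failed");
    return "second call succeeded";
  } catch (error) {
    return error instanceof Error ? error.message : String(error);
  }
}
```

Any code path that can reach a second `fulfill()`, `abort()`, or `continue()` on the same route object is therefore a candidate source of the crash.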

Solution

Wrapped all Playwright route operations (route.abort(), route.fulfill(), route.continue()) in try-catch blocks to gracefully handle already-handled routes. This prevents the crash and allows the scraper to continue processing other pages.

Changes

  • Added error handling around all route operations in both setupCachingRouteInterception() and the main process() method
  • Prefixed unused error variables with underscore to satisfy linting rules
  • Added debug logging for route handling conflicts

Testing

I tested this fix by scraping a large website (1000+ pages) with hostname scope:

Before fix:

  • ❌ Crashed with "Route is already handled!" error during early stages of scraping
  • Occurred intermittently but consistently prevented completion

After fix:

  • ✅ Successfully scraped hundreds of pages without any route handling errors
  • Only normal operational warnings (timeouts, network errors)
  • Scraping continues smoothly with concurrent page processing

Impact

This fix enables reliable large-scale scraping with hostname scope, which is essential for comprehensive documentation indexing across entire domains.


Copilot AI left a comment

Pull request overview

Fixes intermittent Playwright crashes during high-concurrency scraping by making route interception more tolerant of “Route is already handled!” errors.

Changes:

  • Adds guards/logging intended to detect already-handled routes before processing.
  • Wraps route.abort() / route.fulfill() (and some abort fallbacks) in try/catch to prevent scraper crashes.
  • Adds debug logging when a route action fails due to an assumed already-handled race.

Comments suppressed due to low confidence (1)

src/scraper/middleware/HtmlPlaywrightMiddleware.ts:871

  • This routing logic is now duplicated in two places (setupCachingRouteInterception and the inline handler in process), and the new race-condition handling needs to stay consistent between them. Consider extracting a shared helper (e.g., a private method that handles a route given context/headers) so future changes to caching/route error handling aren’t missed in one of the copies.
        // For all other requests, use the standard caching logic
        // We need to manually handle the interception since we can't delegate to another route
        const reqOrigin = (() => {

Comment on lines +536 to +541
try {
  return await route.abort();
} catch (_error) {
  logger.debug(`Route already handled (abort): ${reqUrl}`);
  return;
}

Copilot AI Feb 10, 2026

These catch blocks swallow all errors from route.abort(). That can hide real failures (e.g., invalid state, closed page/context issues other than already-handled) and, in some cases, may leave the route unresolved. Consider checking the error message/name and only ignoring the specific "Route is already handled" case; otherwise rethrow or handle explicitly (e.g., log at warn/error and ensure the route is aborted/continued).
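One way to narrow the catch is sketched below. The names `AbortableRoute`, `isAlreadyHandledError`, and `abortIgnoringHandled` are hypothetical, and the message check is an assumption based on the crash log in the PR description ("Route is already handled!"); the exact error text may differ across Playwright versions.

```typescript
// Minimal structural type so the sketch runs without Playwright installed.
interface AbortableRoute {
  abort(errorCode?: string): Promise<void>;
}

// Assumption: the race surfaces as an Error whose message contains this text,
// as in the reported crash ("route.fulfill: Route is already handled!").
function isAlreadyHandledError(error: unknown): boolean {
  return error instanceof Error && error.message.includes("Route is already handled");
}

// Abort the route, ignoring only the already-handled race; rethrow anything else.
async function abortIgnoringHandled(
  route: AbortableRoute,
  reqUrl: string,
  debug: (message: string) => void,
): Promise<void> {
  try {
    await route.abort();
  } catch (error) {
    if (!isAlreadyHandledError(error)) {
      throw error;
    }
    debug(`Route already handled (abort): ${reqUrl}`);
  }
}
```

With this shape, a closed page or context still surfaces as an exception instead of being logged as a race.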

Comment on lines +550 to +559
try {
  return await route.fulfill({
    status: 200,
    contentType: cached.contentType,
    body: cached.body,
  });
} catch (_error) {
  logger.debug(`Route already handled (fulfill cached): ${reqUrl}`);
  return;
}

Copilot AI Feb 10, 2026

The same broad catch-and-return pattern around route.fulfill() will also mask non-race-condition errors (bad response/options, closed page, etc.). It would be safer to only suppress the known already-handled error and surface other exceptions (or convert them into an intentional route.abort('failed')).
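A sketch of that idea for the cached-response path follows. `fulfillFromCache`, `FulfillableRoute`, `CachedResponse`, and the `log` object are hypothetical names; the cached body is typed as a string here for simplicity, which may not match the real cache entry type.

```typescript
// Minimal structural types so the sketch runs without Playwright installed.
interface CachedResponse {
  contentType: string;
  body: string; // assumption: the real cache may store a Buffer instead
}

interface FulfillableRoute {
  fulfill(response: { status: number; contentType: string; body: string }): Promise<void>;
  abort(errorCode?: string): Promise<void>;
}

interface Log {
  debug(message: string): void;
  warn(message: string): void;
}

// Assumption: the race is recognizable by this message fragment.
function isAlreadyHandledError(error: unknown): boolean {
  return error instanceof Error && error.message.includes("Route is already handled");
}

// Serve a cached response. Suppress only the already-handled race; turn any
// other fulfill failure into an explicit abort so the request does not stall.
async function fulfillFromCache(
  route: FulfillableRoute,
  cached: CachedResponse,
  reqUrl: string,
  log: Log,
): Promise<"fulfilled" | "already-handled" | "aborted"> {
  try {
    await route.fulfill({ status: 200, contentType: cached.contentType, body: cached.body });
    return "fulfilled";
  } catch (error) {
    if (isAlreadyHandledError(error)) {
      log.debug(`Route already handled (fulfill cached): ${reqUrl}`);
      return "already-handled";
    }
    const message = error instanceof Error ? error.message : String(error);
    log.warn(`Fulfill failed for ${reqUrl}: ${message}`);
    // If abort itself throws (for example, the page is gone), let it propagate.
    await route.abort("failed");
    return "aborted";
  }
}
```

Returning a status string is only a convenience for callers and tests; the real handler could simply return `void`.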

Comment on lines +623 to +634
try {
  return await route.continue({ headers });
} catch (error) {
  // Handle network errors for non-GET requests
  const errorMessage = error instanceof Error ? error.message : String(error);
  logger.debug(`Network error for ${resourceType} ${reqUrl}: ${errorMessage}`);
  try {
    return await route.abort("failed");
  } catch (_abortError) {
    logger.debug(`Route already handled (continue/abort): ${reqUrl}`);
    return;
  }

Copilot AI Feb 10, 2026

If route.continue() fails because the route was already handled, this catch currently treats it as a "network error" and then attempts route.abort(). Consider detecting the already-handled error and returning immediately to avoid misleading logs and extra work; for other errors, keep the current abort("failed") fallback.
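The control flow suggested here could look roughly like the sketch below. `continueOrAbort` and `ContinuableRoute` are hypothetical names, and the message-based detection is an assumption taken from the crash log rather than a documented Playwright contract.

```typescript
// Minimal structural type so the sketch runs without Playwright installed.
interface ContinuableRoute {
  continue(overrides?: { headers?: Record<string, string> }): Promise<void>;
  abort(errorCode?: string): Promise<void>;
}

// Assumption: the race is recognizable by this message fragment.
function isAlreadyHandledError(error: unknown): boolean {
  return error instanceof Error && error.message.includes("Route is already handled");
}

// Continue the request; on failure, distinguish the race from real errors.
async function continueOrAbort(
  route: ContinuableRoute,
  headers: Record<string, string>,
  reqUrl: string,
  debug: (message: string) => void,
): Promise<void> {
  try {
    await route.continue({ headers });
  } catch (error) {
    if (isAlreadyHandledError(error)) {
      // Nothing left to do: another code path already resolved this route.
      debug(`Route already handled (continue): ${reqUrl}`);
      return;
    }
    const errorMessage = error instanceof Error ? error.message : String(error);
    debug(`Network error for ${reqUrl}: ${errorMessage}`);
    try {
      await route.abort("failed");
    } catch (abortError) {
      if (!isAlreadyHandledError(abortError)) {
        throw abortError;
      }
      debug(`Route already handled (abort): ${reqUrl}`);
    }
  }
}
```

This keeps the "Network error" log line for genuine failures and skips the redundant `abort("failed")` call when the route was simply resolved elsewhere.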

// Set up route interception with special handling for the initial page load
await page.route("**/*", async (route) => {
  // Check if route is already handled to prevent race conditions
  if (route.request().failure()) {

Copilot AI Feb 10, 2026

As with the other route.request().failure() guard in this PR: failure() is not a reliable indicator that the route has already been handled, and returning early risks leaving the route unresolved. Prefer route.isHandled() (if available in your Playwright version) or rely on narrowly scoped handling of the specific already-handled exception.

Suggested change
- if (route.request().failure()) {
+ if (route.isHandled()) {

Comment on lines +550 to +558
try {
  return await route.fulfill({
    status: 200,
    contentType: cached.contentType,
    body: cached.body,
  });
} catch (_error) {
  logger.debug(`Route already handled (fulfill cached): ${reqUrl}`);
  return;

Copilot AI Feb 10, 2026

Test coverage: the new behavior (swallowing "Route is already handled" errors from abort/fulfill/continue) isn’t exercised by the existing middleware tests. Adding a unit test that invokes the registered route handler and simulates route.fulfill() (and/or route.abort()) throwing the already-handled error would prevent regressions.
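A framework-free sketch of such a test is shown below so that it runs standalone; in the repository it would presumably be ported to the project's own test runner and pointed at the real handler. `handleRoute`, `routeRejectingWith`, and `runHandler` are hypothetical, and `handleRoute` only stands in for the handler registered through `page.route()`.

```typescript
// Smallest slice of a route that the handler under test touches.
interface FakeRoute {
  fulfill(response: { status: number; body: string }): Promise<void>;
}

// Assumption: the race is recognizable by this message fragment.
function isAlreadyHandledError(error: unknown): boolean {
  return error instanceof Error && error.message.includes("Route is already handled");
}

// Hypothetical stand-in for the handler registered with page.route("**/*", ...).
async function handleRoute(route: FakeRoute, debugLog: string[]): Promise<void> {
  try {
    await route.fulfill({ status: 200, body: "cached body" });
  } catch (error) {
    if (!isAlreadyHandledError(error)) {
      throw error;
    }
    debugLog.push("Route already handled (fulfill cached)");
  }
}

// Build a fake route whose fulfill() always rejects with the given message.
function routeRejectingWith(message: string): FakeRoute {
  return {
    fulfill: async () => {
      throw new Error(message);
    },
  };
}

// Run the handler and report whether it swallowed or propagated the error.
async function runHandler(
  message: string,
): Promise<{ outcome: "resolved" | "rejected"; debugLog: string[] }> {
  const debugLog: string[] = [];
  try {
    await handleRoute(routeRejectingWith(message), debugLog);
    return { outcome: "resolved", debugLog };
  } catch {
    return { outcome: "rejected", debugLog };
  }
}
```

The two cases worth asserting are that the already-handled message resolves quietly with one debug entry, and that any other error still rejects.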

Comment on lines +518 to +523
// Check if route is already handled to prevent race conditions
if (route.request().failure()) {
  logger.debug(`Route already handled, skipping: ${route.request().url()}`);
  return;
}


Copilot AI Feb 10, 2026

route.request().failure() does not indicate that a Route has been handled; it reports a request network failure. Returning early here can leave the route unhandled (stalling the request) and also won’t reliably prevent the "Route is already handled" exception. Prefer checking route.isHandled() (if available in your Playwright version) or remove this guard and only swallow the specific already-handled error in the abort/fulfill/continue try/catches.

Suggested change
- // Check if route is already handled to prevent race conditions
- if (route.request().failure()) {
-   logger.debug(`Route already handled, skipping: ${route.request().url()}`);
-   return;
- }
