Add JS-render fallback for URL ingest when static fetch indexes blocked/interstitial pages #85

@mfittko

Problem

Some URL ingests are currently reported as successful but index blocked/interstitial text instead of article content.

Observed with Reuters URL:

Current behavior:

  • Running raged ingest --url <reuters-url> --collection docs reports success.
  • The stored chunk text is "Please enable JS and disable any ad blocker."
  • The reader-proxy retry (r.jina.ai) can also be blocked (HTTP 451), and that blocked payload is then indexed.

Impact:

  • Retrieval quality is poor because embeddings are generated from blocked text.
  • This looks like "ingestion did not land" to users, even when rows/chunks exist.

Reproduction

  1. Ingest the Reuters URL above.
  2. Query for terms like "zhipu glm 5 reuters".
  3. Inspect the stored chunk text: it contains interstitial/blocked content instead of the article body.

Proposal

Implement a fallback strategy for URL ingest:

  1. Fast path: current static HTTP fetch.
  2. Blocked-content detection heuristics (e.g. "enable JS", "disable ad blocker", anti-bot templates, too-short boilerplate text).
  3. Fallback renderer: Playwright for flagged pages.
  4. If still blocked: mark ingestion as blocked/unreadable with explicit reason and avoid embedding junk content.
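
The four steps above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: every name here (detectBlocked, fetchWithFallback, the reason strings, the length thresholds) is hypothetical, and the phrase list is only a starting point that would need tuning against real blocked pages. The fetchers are injected so that the sketch runs without a browser; both are assumed to return already-extracted page text.

```typescript
// All names and thresholds below are assumptions for illustration only.
type FetchMode = "static" | "playwright";
type Fetcher = (url: string) => Promise<string>; // resolves to extracted text

type FetchOutcome =
  | { status: "ok"; mode: FetchMode; text: string }
  | { status: "blocked"; mode: FetchMode; reason: string };

// Illustrative interstitial phrases, each mapped to a short reason code.
const BLOCK_PATTERNS: Array<[RegExp, string]> = [
  [/enable\s+(js|javascript)/i, "requires-js"],
  [/disable\s+(any\s+|your\s+)?ad\s*-?\s*blocker/i, "ad-blocker-wall"],
  [/verify\s+(that\s+)?you\s+are\s+(a\s+)?human|captcha/i, "anti-bot"],
];
const MIN_TEXT_LENGTH = 200; // assumed: anything shorter is treated as boilerplate
const MAX_INTERSTITIAL_LENGTH = 1000; // assumed: phrases only count on short pages

// Returns a reason code when the text looks blocked, or null when it looks usable.
function detectBlocked(text: string): string | null {
  const trimmed = text.trim();
  // Only match phrases on short pages, so a long article that merely
  // mentions JavaScript or ad blockers is not flagged.
  if (trimmed.length < MAX_INTERSTITIAL_LENGTH) {
    for (const [pattern, reason] of BLOCK_PATTERNS) {
      if (pattern.test(trimmed)) return reason;
    }
  }
  return trimmed.length < MIN_TEXT_LENGTH ? "too-short" : null;
}

// Static fetch first; render only flagged pages; report "blocked" instead of
// returning junk text for embedding.
async function fetchWithFallback(
  url: string,
  staticFetch: Fetcher,
  render?: Fetcher, // left undefined when the fallback is disabled
): Promise<FetchOutcome> {
  const staticText = await staticFetch(url);
  const staticReason = detectBlocked(staticText);
  if (staticReason === null) {
    return { status: "ok", mode: "static", text: staticText };
  }
  if (!render) {
    return { status: "blocked", mode: "static", reason: staticReason };
  }
  const renderedText = await render(url);
  const renderedReason = detectBlocked(renderedText);
  return renderedReason === null
    ? { status: "ok", mode: "playwright", text: renderedText }
    : { status: "blocked", mode: "playwright", reason: renderedReason };
}
```

A Playwright-backed implementation of render would be supplied by the caller. Error handling for failed requests (such as the HTTP 451 response from the reader proxy) is left to the fetchers in this sketch.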

Acceptance Criteria

  • JS-rendered pages are ingested with meaningful text when accessible.
  • If blocked, ingestion records explicit blocked status/reason.
  • Feature is gated by config/env flag to avoid global overhead.
  • Observability includes fetch mode (static | playwright), detection reason, and extracted text length.
  • Tests cover static success, blocked-page detection, and fallback success/failure.
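
As one possible shape for the gating and observability criteria, the sketch below reads a flag from an environment map and builds a structured log entry. The variable name RAGED_URL_RENDER_FALLBACK, the accepted values, and the field names are all assumptions made for illustration; the issue does not specify them.

```typescript
// Hypothetical names; nothing here is taken from the existing codebase.
type FetchMode = "static" | "playwright";

interface FetchLogEntry {
  url: string;
  fetchMode: FetchMode;
  detectionReason: string | null; // null when no blocked-content signal fired
  textLength: number; // length of the extracted text after trimming
}

// The environment is passed in (e.g. process.env) so the check is easy to test.
// The fallback stays off unless the assumed flag is set to "1" or "true".
function isRenderFallbackEnabled(
  env: Record<string, string | undefined>,
): boolean {
  const raw = (env["RAGED_URL_RENDER_FALLBACK"] ?? "").trim().toLowerCase();
  return raw === "1" || raw === "true";
}

// Collects the three observability fields named in the criteria into one record.
function buildFetchLogEntry(
  url: string,
  fetchMode: FetchMode,
  text: string,
  detectionReason: string | null,
): FetchLogEntry {
  return { url, fetchMode, detectionReason, textLength: text.trim().length };
}
```

Defaulting to disabled keeps the static path unchanged for deployments that never set the flag, which is one way to meet the "avoid global overhead" criterion.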

Notes

  • Prefer Playwright over Selenium for Node ecosystem fit.
  • Respect robots/site terms and applicable legal constraints.
