Problem
Some URL ingests are currently reported as successful but index blocked/interstitial text instead of article content.
Observed with Reuters URL:
Current behavior:
- `raged ingest --url <reuters-url> --collection docs` reports success.
- Stored chunk text is `Please enable JS and disable any ad blocker.`
- Reader-proxy retry (`r.jina.ai`) can also be blocked (HTTP 451), and that blocked payload gets indexed.
Impact:
- Retrieval quality is poor because embeddings are generated from blocked text.
- This looks like "ingestion did not land" to users, even when rows/chunks exist.
Reproduction
- Ingest the Reuters URL above.
- Query for terms like `zhipu glm 5 reuters`.
- Inspect stored chunk text: interstitial/blocked content instead of article body.
Proposal
Implement a fallback strategy for URL ingest:
- Fast path: current static HTTP fetch.
- Blocked-content detection heuristics (e.g. `enable JS`, `disable ad blocker`, anti-bot templates, too-short boilerplate text).
- Fallback renderer: Playwright for flagged pages.
- If still blocked: mark ingestion as blocked/unreadable with explicit reason and avoid embedding junk content.
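To make the detection step concrete, here is a minimal sketch of what the heuristics could look like. Every name in it (`detectBlockedContent`, `BLOCK_PATTERNS`, `MIN_TEXT_LENGTH`, `MAX_MARKER_PAGE_LENGTH`), the regexes, and the thresholds are assumptions for illustration and would need tuning against real pages; none of them exist in the codebase today.

```typescript
// Hypothetical sketch: names, patterns, and thresholds are illustrative assumptions.
interface Detection {
  blocked: boolean;
  reason: string | null; // short machine-readable reason, null when not blocked
  textLength: number;    // length of the whitespace-normalized text
}

// Phrases typical of interstitial pages, each paired with a reason label.
const BLOCK_PATTERNS: Array<[RegExp, string]> = [
  [/enable\s+(js|javascript)/i, "enable-js-prompt"],
  [/disable\s+(any\s+|your\s+)?ad\s*-?\s*blocker/i, "ad-blocker-prompt"],
  [/verify\s+(that\s+)?you\s+are\s+(a\s+)?human/i, "anti-bot-template"],
];

// 451 is the status seen in this issue; 403 is an added assumption.
const BLOCKED_STATUSES: number[] = [403, 451];

const MIN_TEXT_LENGTH = 200;         // assumed: shorter pages are treated as boilerplate
const MAX_MARKER_PAGE_LENGTH = 2000; // assumed: long articles may mention these phrases legitimately

function detectBlockedContent(text: string, httpStatus = 200): Detection {
  const normalized = text.replace(/\s+/g, " ").trim();
  const textLength = normalized.length;

  // 1. Status codes that signal a refusal rather than content.
  if (BLOCKED_STATUSES.indexOf(httpStatus) !== -1) {
    return { blocked: true, reason: "http-" + httpStatus, textLength };
  }

  // 2. Interstitial phrases, only trusted as a block signal on short pages.
  if (textLength < MAX_MARKER_PAGE_LENGTH) {
    for (const [pattern, reason] of BLOCK_PATTERNS) {
      if (pattern.test(normalized)) {
        return { blocked: true, reason, textLength };
      }
    }
  }

  // 3. Too little text to be an article body.
  if (textLength < MIN_TEXT_LENGTH) {
    return { blocked: true, reason: "too-short", textLength };
  }

  return { blocked: false, reason: null, textLength };
}
```

With these assumed patterns, the chunk text observed in this issue (`Please enable JS and disable any ad blocker.`) would be flagged with reason `enable-js-prompt` before any embedding is generated.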
Acceptance Criteria
- JS-rendered pages are ingested with meaningful text when accessible.
- If blocked, ingestion records explicit blocked status/reason.
- Feature is gated by config/env flag to avoid global overhead.
- Observability includes fetch mode (`static|playwright`), detection reason, and extracted text length.
- Tests cover static success, blocked-page detection, and fallback success/failure.
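One possible shape for the flow these criteria describe is sketched below. It is not the existing ingest code: `fetchWithFallback`, `IngestOutcome`, and the option names are hypothetical, and the fetchers and detector are injected so the flow can be tested without network access. In practice `renderedFetch` could be backed by Playwright and `fallbackEnabled` could be read from a config or env flag.

```typescript
// Hypothetical sketch: all type and function names here are illustrative assumptions.
type FetchMode = "static" | "playwright";

interface PageResult {
  status: number; // HTTP status of the fetch
  text: string;   // extracted page text
}

type Fetcher = (url: string) => Promise<PageResult>;
// Returns a detection reason when the page looks blocked, otherwise null.
type Detector = (page: PageResult) => string | null;

interface IngestOutcome {
  status: "ok" | "blocked";
  fetchMode: FetchMode;           // which path produced the final result
  detectionReason: string | null; // why the page was flagged, if it was
  textLength: number;
  text: string | null;            // null when blocked, so nothing gets embedded
}

interface FallbackOptions {
  staticFetch: Fetcher;     // fast path: plain HTTP fetch
  renderedFetch: Fetcher;   // slow path: headless browser rendering
  detect: Detector;
  fallbackEnabled: boolean; // the config/env gate from the criteria above
}

function outcome(
  status: "ok" | "blocked",
  fetchMode: FetchMode,
  detectionReason: string | null,
  page: PageResult
): IngestOutcome {
  return {
    status,
    fetchMode,
    detectionReason,
    textLength: page.text.length,
    text: status === "ok" ? page.text : null,
  };
}

async function fetchWithFallback(url: string, opts: FallbackOptions): Promise<IngestOutcome> {
  const first = await opts.staticFetch(url);
  const firstReason = opts.detect(first);
  if (firstReason === null) {
    return outcome("ok", "static", null, first);
  }
  if (!opts.fallbackEnabled) {
    // Gate is off: report the block instead of indexing junk.
    return outcome("blocked", "static", firstReason, first);
  }

  const second = await opts.renderedFetch(url);
  const secondReason = opts.detect(second);
  if (secondReason === null) {
    // Keep the reason that triggered the fallback for observability.
    return outcome("ok", "playwright", firstReason, second);
  }
  return outcome("blocked", "playwright", secondReason, second);
}
```

Returning `text: null` for blocked outcomes is a deliberate choice in this sketch: the caller cannot embed interstitial text by accident, while `detectionReason` and `textLength` remain available for logging.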
Notes
- Prefer Playwright over Selenium for Node ecosystem fit.
- Respect robots/site terms and applicable legal constraints.