-
Notifications
You must be signed in to change notification settings - Fork 341
Open
Description
Context
Currently, the Chrome Side Panel extension for summarize extracts main text from the rendered DOM using Readability, with media URLs (YouTube, Twitter, direct audio/video, podcasts) always routed to the backend daemon (URL mode/extractOnly) via shouldPreferUrlMode(url). Reddit threads could be more robustly extracted through Reddit's .json API, but the extension only leverages this in CLI if users manually add .json.
Problem
- Page content extraction for Reddit threads is unstable/noisy using Readability.
- Reddit exposes a solid
.jsonAPI that returns well-structured data for both the initial post and comments. - We want to improve "in-browser/extension" summarization experience for Reddit, while keeping all media-specific and
preferUrllogic unaffected for all other URLs.
Hard Requirements
DO NOT alter the preferUrl logic, its flow, or cache rules.
- If
shouldPreferUrlMode(url)is true for a URL (e.g., YouTube/Twitter/media/podcast):- The router must skip all new extractors and run only the original url-daemon extraction logic. No regression or behavioral change is allowed here.
- Only for
preferUrl === falseURLs should the new extractors run.
Implementation Scope
A) Modular Extractor Router for extension background
- Define an "Extractor" interface:
match(ctx)andextract(ctx)→ExtractorResult | null - Routing logic:
- If preferUrl: run url-daemon, do NOT try new extractors (and log this fact).
- Otherwise: try
reddit-thread=>page-readability=>url-daemon, in order, falling back as needed.
- Maintain all existing caching logic; router outputs are cached for summarize/chat.
B) RedditThreadExtractor (for .json API, structured text output)
- Match: reddit.com URLs with
/comments/, extracting subreddit+postId. - Fetch
.json(async, never sync XHR) from Reddit with browser creds, parse, and format:- Post header (author, subreddit, time, title, post body, comment count).
- Comments: recursive with 2-space indentation per level,
[ISO date] author (score:N): body. Truncate per-comment bodies and stop at max comments/depth/sum length.
- Only trigger if
preferUrlis false. - On any fetch/parse failure, return null (router fallback).
C) Logging
- All extraction attempts must add diagnostic logs (
extractor.route.start,extractor.try,extractor.success,extractor.fail,extractor.route.preferUrlHardSwitchfor preferUrl skips).
Not in Scope
- Do not modify
shouldPreferUrlMode(no new logic, no changes allowed here at all). - Do not touch daemon/core extractor router or its logic.
- Do not sync browser cookies/session to daemon.
- No settings UI for these behaviors in this phase.
Acceptance Criteria
- For all media URLs (
preferUrltrue), logs show router does not run new extractors (must logpreferUrlHardSwitch). No regression allowed. - For Reddit threads (
/comments/), output fromreddit-threadextractor is used (with structured post + comments), falls back to page or url-daemon if needed, never crashes. - Logs diagnostically record every routing/extractor decision for debugging.
- Daemon is never given browser cookies/session; logs are privacy safe.
Reference
- Reddit thread extractor prototype (logic only, do not use sync XHR).
- WIP: douo@efc3edf
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels