Skip to content

[Feature] Chrome extension: Extractor Router framework & Reddit .json thread extraction (preserve preferUrl logic) #174

@douo

Description

@douo

Context

Currently, the Chrome Side Panel extension for summarize extracts main text from the rendered DOM using Readability, with media URLs (YouTube, Twitter, direct audio/video, podcasts) always routed to the backend daemon (URL mode/extractOnly) via shouldPreferUrlMode(url). Reddit threads could be more robustly extracted through Reddit's .json API, but the extension only leverages this in CLI if users manually add .json.

Problem

  • Page content extraction for Reddit threads is unstable/noisy using Readability.
  • Reddit exposes a solid .json API that returns well-structured data for both the initial post and comments.
  • We want to improve "in-browser/extension" summarization experience for Reddit, while keeping all media-specific and preferUrl logic unaffected for all other URLs.

Hard Requirements

DO NOT alter the preferUrl logic, its flow, or cache rules.

  • If shouldPreferUrlMode(url) is true for a URL (e.g., YouTube/Twitter/media/podcast):
    • The router must skip all new extractors and run only the original url-daemon extraction logic. No regression or behavioral change is allowed here.
  • Only for preferUrl === false URLs should the new extractors run.

Implementation Scope

A) Modular Extractor Router for extension background

  • Define an "Extractor" interface: match(ctx) and extract(ctx)ExtractorResult | null
  • Routing logic:
    • If preferUrl: run url-daemon, do NOT try new extractors (and log this fact).
    • Otherwise: try reddit-thread => page-readability => url-daemon, in order, falling back as needed.
  • Maintain all existing caching logic; router outputs are cached for summarize/chat.

B) RedditThreadExtractor (for .json API, structured text output)

  • Match: reddit.com URLs with /comments/, extracting subreddit+postId.
  • Fetch .json (async, never sync XHR) from Reddit with browser creds, parse, and format:
    • Post header (author, subreddit, time, title, post body, comment count).
    • Comments: recursive with 2-space indentation per level, [ISO date] author (score:N): body. Truncate per-comment bodies and stop at max comments/depth/sum length.
  • Only trigger if preferUrl is false.
  • On any fetch/parse failure, return null (router fallback).

C) Logging

  • All extraction attempts must add diagnostic logs (extractor.route.start, extractor.try, extractor.success, extractor.fail, extractor.route.preferUrlHardSwitch for preferUrl skips).

Not in Scope

  • Do not modify shouldPreferUrlMode (no new logic, no changes allowed here at all).
  • Do not touch daemon/core extractor router or its logic.
  • Do not sync browser cookies/session to daemon.
  • No settings UI for these behaviors in this phase.

Acceptance Criteria

  1. For all media URLs (preferUrl true), logs show router does not run new extractors (must log preferUrlHardSwitch). No regression allowed.
  2. For Reddit threads (/comments/), output from reddit-thread extractor is used (with structured post + comments), falls back to page or url-daemon if needed, never crashes.
  3. Logs diagnostically record every routing/extractor decision for debugging.
  4. Daemon is never given browser cookies/session; logs are privacy safe.

Reference

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions