[Feature] Chrome extension: Extractor Router framework & Reddit .json thread extraction (preserve preferUrl logic)

## Context

Currently, the Chrome Side Panel extension for `summarize` extracts main text from the rendered DOM using Readability, with media URLs (YouTube, Twitter, direct audio/video, podcasts) always routed to the backend daemon (URL mode/extractOnly) via `shouldPreferUrlMode(url)`. Reddit threads could be more robustly extracted through Reddit's `.json` API, but the extension only leverages this in CLI if users manually add `.json`.

#### Problem
- Page content extraction for Reddit threads is unstable/noisy using Readability.
- Reddit exposes a solid `.json` API that returns well-structured data for both the initial post and comments.
- We want to improve "in-browser/extension" summarization experience for Reddit, while keeping all media-specific and `preferUrl` logic unaffected for all other URLs.

## Hard Requirements

**DO NOT alter the preferUrl logic, its flow, or cache rules.**
- If `shouldPreferUrlMode(url)` is true for a URL (e.g., YouTube/Twitter/media/podcast):
  - The router must skip all new extractors and run only the original url-daemon extraction logic. No regression or behavioral change is allowed here.
- Only for `preferUrl === false` URLs should the new extractors run.

## Implementation Scope

**A) Modular Extractor Router for extension background**
- Define an "Extractor" interface: `match(ctx)` and `extract(ctx)` → `ExtractorResult | null`
- Routing logic:
  - If preferUrl: run url-daemon, do NOT try new extractors (and log this fact).
  - Otherwise: try `reddit-thread` => `page-readability` => `url-daemon`, in order, falling back as needed.
- Maintain all existing caching logic; router outputs are cached for summarize/chat.

**B) RedditThreadExtractor (for .json API, structured text output)**
- Match: reddit.com URLs with `/comments/`, extracting subreddit+postId.
- Fetch `.json` (async, never sync XHR) from Reddit with browser creds, parse, and format:
  - Post header (author, subreddit, time, title, post body, comment count).
  - Comments: recursive with 2-space indentation per level, `[ISO date] author (score:N): body`. Truncate per-comment bodies and stop at max comments/depth/sum length.
- Only trigger if `preferUrl` is false.
- On any fetch/parse failure, return null (router fallback).

**C) Logging**
- All extraction attempts must add diagnostic logs (`extractor.route.start`, `extractor.try`, `extractor.success`, `extractor.fail`, `extractor.route.preferUrlHardSwitch` for preferUrl skips).

## Not in Scope
- Do not modify `shouldPreferUrlMode` (no new logic, no changes allowed here at all).
- Do not touch daemon/core extractor router or its logic.
- Do not sync browser cookies/session to daemon.
- No settings UI for these behaviors in this phase.

## Acceptance Criteria

1. For all media URLs (`preferUrl` true), logs show router does not run new extractors (must log `preferUrlHardSwitch`). No regression allowed.
2. For Reddit threads (`/comments/`), output from `reddit-thread` extractor is used (with structured post + comments), falls back to page or url-daemon if needed, never crashes.
3. Logs diagnostically record every routing/extractor decision for debugging.
4. Daemon is never given browser cookies/session; logs are privacy safe.

## Reference
- [Reddit thread extractor prototype](https://github.com/douo/webpage-summary/blob/3727a37ff6bd78268a5ac875e6f0740a2a24236c/packages/ext/src/utils/extractors/reddit-thread.ts) (logic only, do not use sync XHR).
- WIP: https://github.com/douo/summarize/commit/efc3edfb59f199c2c48e4b7f54125262097b5242

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Chrome extension: Extractor Router framework & Reddit .json thread extraction (preserve preferUrl logic) #174

Context

Problem

Hard Requirements

Implementation Scope

Not in Scope

Acceptance Criteria

Reference

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

[Feature] Chrome extension: Extractor Router framework & Reddit .json thread extraction (preserve preferUrl logic) #174

Description

Context

Problem

Hard Requirements

Implementation Scope

Not in Scope

Acceptance Criteria

Reference

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions