Self-Hosted RSS Reader with Auto Folders, Dedup, and Smart Digests (PWA + Synology + Tailscale)

0) Summary

A single-user, self-hosted RSS reader that:

  • runs on Synology via Docker
  • is accessed over Tailscale (PWA)
  • sorts feeds into auto folders (simple categories, no manual filing)
  • dedups “same story across outlets” into one story card
  • supports mute-with-breakout filtering (e.g., “hide Roblox unless it’s a major incident”)
  • generates AI summaries + digests and learns ranking from your behavior

1) Goals

  • Replace Feedly for daily reading.
  • Minimal mental overhead: Folders are automatic and simple.
  • Collapse duplicates across outlets into one view.
  • Fast catch-up: digest mode when you’re away or behind.
  • Preference learning: sort what you’ll likely care about higher.
  • Maintain good UX: always show headline + hero image + source.

2) Non-goals (v1)

  • Multi-user accounts
  • Push notifications
  • Full offline archive of every article body
  • Perfect extraction for all paywalled sites

3) Deployment constraints

  • Host: Synology NAS
  • Runtime: Docker Compose
  • Network: Tailscale (private access)
  • Single user; iPhone is the priority device, but desktop browsers must work too

3.1 Initial stack decisions

  • Frontend: Next.js PWA (TypeScript)
  • Backend API: Node.js + Fastify (TypeScript)
  • Worker: Node.js (TypeScript)
  • Queue/jobs: Postgres-backed jobs via pg-boss in MVP1 (no Redis required)
  • Database: Postgres latest stable major at deployment time
  • Vector extension: no pgvector in MVP1
  • AI default provider: OpenAI

Rationale:

  • Use one language across web, API, and worker to reduce maintenance overhead.
  • Keep a single repo with clearly separated services and shared contracts/types.

Suggested repository structure:

  • apps/web (Next.js PWA, TypeScript)
  • apps/api (Fastify HTTP API, TypeScript)
  • apps/worker (polling, extraction, clustering, digests; TypeScript)
  • packages/contracts (shared schemas/types and generated API client)
  • infra (docker-compose, env templates, deployment scripts)
  • db (migrations, seed data)

3.2 Python adoption gates (deferred path)

Add a Python sidecar only if the TypeScript implementation misses the quality targets below for two consecutive weeks after tuning:

  • extraction success rate for priority sources < 90%
  • cluster correction rate (manual split requests) > 12%
  • worker CPU saturation causes p95 ingest latency to exceed configured target

3.3 Backup policy (recommended)

  • Nightly Postgres backup (pg_dump custom format)
  • Retention: 7 daily + 4 weekly snapshots
  • Store backups on NAS volume with optional encrypted off-device sync
  • Run a restore verification at least monthly
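The "7 daily + 4 weekly" retention rule can be sketched as a pruning function. This is a minimal illustration, not a prescribed implementation: `selectBackupsToKeep` is a hypothetical helper, and treating "weekly" as "the newest backup in each 7-day bucket counted from the Unix epoch" is an assumption the policy above does not spell out.

```typescript
// Hypothetical helper: pick which nightly backups survive a
// "7 daily + 4 weekly" retention policy. Bucketing rule is an assumption.
const DAY_MS = 24 * 60 * 60 * 1000;

function selectBackupsToKeep(
  backupTimes: Date[],
  dailyCount = 7,
  weeklyCount = 4,
): Date[] {
  // Sort newest first without mutating the input.
  const sorted = [...backupTimes].sort((a, b) => b.getTime() - a.getTime());
  const keep = new Set<number>();

  // Keep the most recent `dailyCount` backups.
  for (const d of sorted.slice(0, dailyCount)) keep.add(d.getTime());

  // Keep the newest backup from each of the most recent `weeklyCount`
  // 7-day buckets (buckets counted from the Unix epoch).
  const seenWeeks = new Set<number>();
  for (const d of sorted) {
    const week = Math.floor(d.getTime() / (7 * DAY_MS));
    if (seenWeeks.has(week)) continue;
    if (seenWeeks.size >= weeklyCount) break;
    seenWeeks.add(week);
    keep.add(d.getTime());
  }

  return sorted.filter((d) => keep.has(d.getTime()));
}
```

Anything not returned by the function is a candidate for deletion; daily and weekly picks can overlap, so the kept set is often smaller than 11.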

4) UX Spec (PWA)

4.1 Navigation

  • Home (All Stories)
  • Folders (Auto folders tabs/list)
  • Digest
  • Saved
  • Sources (manage feeds)
  • Settings

4.2 Home (primary view)

Infinite scroll list of Story Cards (clusters)

Sort options:

  • For You (default: personal score with recency floor)
  • Latest (strict reverse chronological)

Card fields:

  • headline
  • hero image
  • primary source name + time
  • “+N outlets” (cluster size)
  • folder label
  • optional AI “1–2 sentence summary”
  • optional badge: “Muted topic breakout” (with reason)

Card actions:

  • Open (cluster detail)
  • Save
  • Mark read
  • “Not interested”
  • “Mute keyword…” (creates a mute rule)
  • “Prefer this source” (source weight +)
  • “Mute this source”

4.3 Cluster detail

Header: headline + hero + primary source

Sections:

  • AI “Story so far” summary (optional)
  • Outlets list (members): each opens the article view

Actions:

  • Save cluster
  • Mark read
  • Split cluster (escape hatch)
  • Mute keyword/topic extracted from title (quick creation)

4.4 Article view (in-app)

  • Render via reader mode (extracted text) when available
  • Fallback: embedded page view
  • If embed is blocked by site CSP/X-Frame headers, show a clear "Open original" action

Capture analytics:

  • time on article view
  • scroll depth
  • quick bounce (leaving in under 10–15s)

4.5 Digest view

Show a banner on Home when the trigger conditions are met (“You were away… View digest”)

Digest sections:

  • Top picks for you
  • Big stories (most outlets / high-rep sources)
  • Quick scan (one-liners)

Tap entry → cluster detail

4.6 Saved

  • List of saved clusters
  • Sort by saved date; optional folder filter

4.7 Sources

List feeds with:

  • assigned folder (single)
  • “trial” flag (optional)
  • weight slider (Prefer / Neutral / Deprioritize)

Actions:

  • Add feed URL
  • OPML import

On add: prompt “I categorized this as Gaming. Change?”

4.8 Settings

AI mode:

  • Off
  • Summaries + digest
  • Full (summaries + auto foldering assist + smart ranking)

Digest triggers (defaults):

  • Away ≥ 24h OR backlog ≥ 50 unread clusters

Retention: see section 10

Filters: manage mute rules

Provider selection: OpenAI / Claude / Local

AI budget cap:

  • Monthly cap is configurable (default $20)
  • On cap hit: fall back to the local model, but only when a local provider is configured
  • If no local provider is configured, the fallback option is not selectable and hosted AI is paused until the monthly budget resets
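The cap behavior reduces to a small decision function. A minimal sketch, assuming the app tracks month-to-date spend somewhere; `BudgetState`, `AiRoute`, and `resolveAiRoute` are illustrative names, not part of the spec.

```typescript
// Hypothetical names; only the cap/fallback rules come from the spec.
type AiRoute = "hosted" | "local" | "paused";

interface BudgetState {
  monthlyCapUsd: number; // spec default: 20
  spentThisMonthUsd: number;
  localProviderConfigured: boolean;
}

function resolveAiRoute(state: BudgetState): AiRoute {
  // Under the cap: hosted AI keeps working as configured.
  if (state.spentThisMonthUsd < state.monthlyCapUsd) return "hosted";
  // Cap hit: use the local model only if one is configured,
  // otherwise pause hosted AI until the budget resets.
  return state.localProviderConfigured ? "local" : "paused";
}
```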

Feed polling:

  • Poll interval is configurable (default 60 minutes)

5) Auto Folders (simple, site-first)

5.1 Folder list

Default folders (editable but keep small):

  • Tech
  • Gaming
  • Security
  • Business
  • Politics
  • Sports
  • Design
  • Local
  • World
  • Other

5.2 Feed → folder assignment (on add)

Inputs:

  • feed title/description
  • site title/description (if available)
  • sample of last 10 titles

Process:

  • Rules/keywords classifier (fast)
  • If AI enabled, LLM classifier as tie-breaker
  • Prompt user with suggestion + dropdown override

Store:

  • feed.folder_id
  • feed.folder_confidence
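The fast rules/keywords step could look like the sketch below. The keyword lists, the confidence formula (share of keyword hits won by the top folder), and the 0.6 tie-break cutoff are all assumptions made for illustration; only the folder names and the "LLM as tie-breaker when AI is enabled" rule come from this section.

```typescript
interface FolderGuess {
  folder: string;
  confidence: number; // 0..1, stored as feed.folder_confidence
}

// Illustrative keyword lists only; a real list would be larger and tuned.
const FOLDER_KEYWORDS: Record<string, string[]> = {
  Gaming: ["game", "gaming", "playstation", "xbox", "nintendo"],
  Security: ["security", "breach", "malware", "vulnerability"],
  Tech: ["tech", "software", "gadget", "startup"],
};

// `texts` = feed title/description, site title/description, recent item titles.
function classifyFeedByKeywords(texts: string[]): FolderGuess {
  const tokens = texts
    .join(" ")
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);

  let bestFolder = "Other";
  let bestHits = 0;
  let totalHits = 0;
  for (const folder of Object.keys(FOLDER_KEYWORDS)) {
    const keywords = FOLDER_KEYWORDS[folder] ?? [];
    const hits = tokens.filter((t) => keywords.indexOf(t) !== -1).length;
    totalHits += hits;
    if (hits > bestHits) {
      bestHits = hits;
      bestFolder = folder;
    }
  }
  return {
    folder: bestFolder,
    confidence: totalHits === 0 ? 0 : bestHits / totalHits,
  };
}

// Escalate to the LLM classifier only when AI is on and the rules are unsure.
function needsLlmTieBreak(
  guess: FolderGuess,
  aiEnabled: boolean,
  minConfidence = 0.6,
): boolean {
  return aiEnabled && guess.confidence < minConfidence;
}
```

Either way, the result is only a suggestion: the user still sees the prompt with a dropdown override.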

5.3 Drift detection (optional, weekly)

Weekly job samples last 30 items:

  • if >35% classify to a different folder → prompt:
    • “This feed looks more like Tech lately. Move it?”

Actions: Keep / Move / Create folder (rare)

Note: This weekly prompt is the only time the app asks about moving a feed; there is no continuous reorganization.
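The weekly drift check can be sketched as follows. One reading of the ">35%" rule is assumed here: the prompt fires when a single other folder accounts for more than 35% of the sampled items. `detectFolderDrift` and `DriftResult` are hypothetical names.

```typescript
interface DriftResult {
  drifted: boolean;
  suggestedFolder: string | null;
  share: number; // share of sampled items in the top competing folder
}

// `sampledFolders` holds the per-item classification of the last ~30 items.
function detectFolderDrift(
  currentFolder: string,
  sampledFolders: string[],
  threshold = 0.35,
): DriftResult {
  // Count items that classify into a folder other than the current one.
  const counts: Record<string, number> = {};
  for (const folder of sampledFolders) {
    if (folder === currentFolder) continue;
    counts[folder] = (counts[folder] ?? 0) + 1;
  }

  let top: string | null = null;
  let topCount = 0;
  for (const folder of Object.keys(counts)) {
    const c = counts[folder] ?? 0;
    if (c > topCount) {
      top = folder;
      topCount = c;
    }
  }

  const share =
    sampledFolders.length === 0 ? 0 : topCount / sampledFolders.length;
  const drifted = share > threshold;
  return { drifted, suggestedFolder: drifted ? top : null, share };
}
```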

6) Dedup: story clustering across outlets

6.1 Definition

A cluster represents one story covered by multiple outlets.

6.2 Inputs (per item)

  • canonical_url
  • title
  • summary/excerpt
  • extracted snippet (if available)
  • published timestamp
  • folder inherited from feed (site-first)

6.3 Clustering algorithm (v1)

Time-windowed near-duplicate matching (48h window):

Candidate selection: items within 48h, same language

Similarity score:

  • title simhash distance
  • token Jaccard overlap
  • optional embedding cosine (if AI mode full)

Decision:

  • if score ≥ threshold → join cluster
  • else create new cluster

Representative item selection:

  • highest source weight
  • else most complete extracted text
  • else earliest
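The non-embedding part of the similarity score (title simhash distance plus token Jaccard overlap) can be sketched as below. The 32-bit fingerprint, the FNV-1a style token hash, and the 50/50 weighting are assumptions chosen to keep the example small; the spec only names the signals and the 48h same-language window.

```typescript
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// FNV-1a style 32-bit hash over UTF-16 code units.
function fnv1a32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// 32-bit simhash: every token votes +1/-1 on each bit; the sign of the
// total decides that bit of the fingerprint.
function simhash32(tokens: string[]): number {
  const votes: number[] = new Array<number>(32).fill(0);
  for (const token of tokens) {
    const h = fnv1a32(token);
    for (let bit = 0; bit < 32; bit++) {
      const vote = ((h >>> bit) & 1) === 1 ? 1 : -1;
      votes[bit] = (votes[bit] ?? 0) + vote;
    }
  }
  let fingerprint = 0;
  for (let bit = 0; bit < 32; bit++) {
    if ((votes[bit] ?? 0) > 0) fingerprint |= 1 << bit;
  }
  return fingerprint >>> 0;
}

function hammingDistance32(a: number, b: number): number {
  let x = (a ^ b) >>> 0;
  let count = 0;
  while (x !== 0) {
    count += x & 1;
    x >>>= 1;
  }
  return count;
}

function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let intersection = 0;
  setA.forEach((t) => {
    if (setB.has(t)) intersection += 1;
  });
  const union = setA.size + setB.size - intersection;
  return union === 0 ? 0 : intersection / union;
}

// Combined score in [0, 1]; the equal weighting is an assumption.
function titleSimilarity(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  const simhashSim = 1 - hammingDistance32(simhash32(ta), simhash32(tb)) / 32;
  return 0.5 * simhashSim + 0.5 * jaccard(ta, tb);
}

// Candidate selection: same language, published within 48h of each other.
const CLUSTER_WINDOW_MS = 48 * 60 * 60 * 1000;

function isClusterCandidate(
  aPublished: Date,
  bPublished: Date,
  aLang: string,
  bLang: string,
): boolean {
  const gap = Math.abs(aPublished.getTime() - bPublished.getTime());
  return aLang === bLang && gap <= CLUSTER_WINDOW_MS;
}
```

An item would join the best-scoring candidate cluster when `titleSimilarity` meets the configured threshold, and start a new cluster otherwise. The threshold itself is left to tuning, with split corrections as the feedback signal.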

6.4 Folder assignment for clusters

To keep UI simple:

  • cluster folder = representative item’s feed folder

6.5 Escape hatch

“Split cluster”:

  • creates a new cluster and moves selected members
  • logs a correction event (can tune thresholds later)

7) Filtering: “Mute with breakout”

7.1 Filter types

  • Mute (default): hide matches unless breakout triggers
  • Hard block: never show (rare)

7.2 Matching scope

Pre-filter on title + feed summary

Post-cluster filter on representative’s title + summary + extracted snippet

The post-cluster pass prevents a muted story from leaking through when another outlet in the same cluster words its headline differently.

Muted matches are soft-hidden before clustering, not dropped. They still participate in clustering and breakout checks.

7.3 Breakout logic

If a mute rule matches, allow through when any of:

  • Severity keywords appear (e.g., hack/breach/0day/arrest/DOJ/CISA/state-backed/outage/porn)
  • Source is “high reputation” list (user-configurable)
  • Cluster size ≥ N outlets within 24h (default N=4)

Cluster size for breakout includes outlets that are muted/hidden by the same rule.

If allowed through:

  • badge story as “Muted topic breakout” + reason

7.4 Example

Rule: keyword="roblox", mode=mute

Normal Roblox content hidden

“Roblox hacked…” passes due to severity keyword + cluster size
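The breakout decision for an item that already matched a mute rule can be sketched as below. The prefix-style keyword match (so that "hacked" matches "hack") and the shortened keyword list are assumptions; `evaluateBreakout` and its types are hypothetical names. The reasons it returns are what the "Muted topic breakout" badge would display.

```typescript
interface MutedCandidate {
  title: string;
  sourceIsHighReputation: boolean;
  // Outlets covering the story within 24h, including ones hidden by the rule.
  outletsIn24h: number;
}

interface BreakoutDecision {
  show: boolean;
  reasons: string[];
}

// Subset of the spec's severity list, matched as token prefixes (assumption).
const SEVERITY_PREFIXES = ["hack", "breach", "0day", "arrest", "outage"];

function evaluateBreakout(
  candidate: MutedCandidate,
  minOutlets = 4,
): BreakoutDecision {
  const tokens = candidate.title
    .toLowerCase()
    .split(/[^a-z0-9-]+/)
    .filter((t) => t.length > 0);

  const reasons: string[] = [];
  const hasSeverity = tokens.some((t) =>
    SEVERITY_PREFIXES.some((prefix) => t.indexOf(prefix) === 0),
  );
  if (hasSeverity) reasons.push("severity keyword");
  if (candidate.sourceIsHighReputation) reasons.push("high-reputation source");
  if (candidate.outletsIn24h >= minOutlets) reasons.push("cluster size");

  // Any single trigger is enough to let the story through.
  return { show: reasons.length > 0, reasons };
}
```

Applied to the example: a "Roblox hacked" headline covered by five outlets returns both "severity keyword" and "cluster size", while a routine Roblox item from one ordinary outlet stays hidden.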

8) AI Features (optional but supported)

8.1 Provider abstraction (switchable)

Define interface:

  • embed(texts[]) -> vectors[]
  • summarize(text, style) -> summary
  • classify(text, labels[]) -> label/confidence

Implement providers:

  • OpenAI
  • Anthropic (Claude)
  • Local (Ollama / llama.cpp)

Routing:

  • embeddings: local or cheapest
  • summaries/digest: hosted by default
  • classification for folders: hosted only when uncertain

Config:

  • AI_PROVIDER=openai|anthropic|local
  • optional per-task overrides
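In TypeScript the provider contract and the config lookup might look like the following sketch. The method names mirror the interface above; everything else (`SummaryStyle`, `AiTask`, `AiConfig`, falling back to "openai" on an unrecognized value) is an assumption for illustration, and no concrete provider SDK calls are shown.

```typescript
type SummaryStyle = "card" | "story_so_far";

interface Classification {
  label: string;
  confidence: number;
}

// Each provider (OpenAI, Anthropic, Local) would implement this contract.
interface AiProvider {
  embed(texts: string[]): Promise<number[][]>;
  summarize(text: string, style: SummaryStyle): Promise<string>;
  classify(text: string, labels: string[]): Promise<Classification>;
}

type ProviderName = "openai" | "anthropic" | "local";
type AiTask = "embed" | "summarize" | "classify";

interface AiConfig {
  defaultProvider: ProviderName;
  overrides?: Partial<Record<AiTask, ProviderName>>;
}

// Parse the AI_PROVIDER value; unknown or missing values fall back to the
// spec's default provider (the fallback behavior itself is an assumption).
function parseProviderName(raw: string | undefined): ProviderName {
  if (raw === "openai" || raw === "anthropic" || raw === "local") return raw;
  return "openai";
}

// Per-task overrides win over the global default.
function providerForTask(task: AiTask, config: AiConfig): ProviderName {
  const override = config.overrides ? config.overrides[task] : undefined;
  return override ?? config.defaultProvider;
}
```

For example, a config with `defaultProvider: "openai"` and `overrides: { embed: "local" }` keeps summaries hosted while routing embeddings to the local model, matching the routing preferences listed above.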

8.2 Summaries

Generate per cluster:

  • 1–2 sentence “card summary”
  • longer “story so far” in cluster detail

Cache and regenerate only when cluster materially changes.

Sensitive handling:

  • if headline indicates sensitive content, generate a short sanitized summary or skip.

8.3 Personalized ranking (v1)

Goal: order clusters by “you’ll likely care”.

Signals:

  • recency decay
  • folder affinity (learned)
  • source weight
  • engagement history (opens, dwell, scroll, saves, not interested)
  • diversity penalty (avoid same folder/source repeating)

Start with heuristic scoring; later upgrade to a lightweight learned model.

Ranking guardrails:

  • Add exploration quota to avoid permanent starvation of low-ranked stories
  • Always provide a user-visible sort toggle (For You and Latest)
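A starting point for the heuristic score might look like this sketch. The weights, the 24h half-life, and the input ranges are placeholders to be tuned, not values from the spec, and the exploration quota is not modeled here.

```typescript
interface RankSignals {
  ageHours: number;
  folderAffinity: number; // 0..1, learned per folder
  sourceWeight: number; // -1 (deprioritize) .. 1 (prefer)
  engagement: number; // 0..1, from opens, dwell, scroll, saves
  repeatsInWindow: number; // same folder/source already shown nearby
}

// Weighted sum with exponential recency decay and a diversity penalty.
// All coefficients are illustrative assumptions.
function personalScore(signals: RankSignals, halfLifeHours = 24): number {
  const recency = Math.pow(0.5, signals.ageHours / halfLifeHours);
  const base =
    0.4 * recency +
    0.25 * signals.folderAffinity +
    0.15 * signals.sourceWeight +
    0.2 * signals.engagement;
  const diversityPenalty = 0.05 * signals.repeatsInWindow;
  return base - diversityPenalty;
}

// "For You" ordering: highest score first, without mutating the input.
function sortForYou<T extends { signals: RankSignals }>(clusters: T[]): T[] {
  return [...clusters].sort(
    (a, b) => personalScore(b.signals) - personalScore(a.signals),
  );
}
```

Keeping the score a plain weighted sum makes it easy to log each term and later swap in a lightweight learned model behind the same function signature.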

9) Recommendations: new outlets (optional v2)

  • Suggest-only initially (no auto-add)
  • Based on folders you read + sources you prefer
  • Trial feeds: add 1–3 per week if enabled, easy remove
  • Promote/demote based on engagement

10) Retention and states

10.1 States

  • Unread
  • Read (hidden)
  • Saved (persist)

10.2 Policy

  • Keep Unread until read or older than an optional max-age (default: no max-age)
  • When marked Read:
    • hide from UI immediately
    • keep lightweight record for ranking + dedup memory
    • purge extracted text after N days (default 14–30) to save space
  • Saved:
    • keep indefinitely
    • keep metadata + canonical link only (no guaranteed full text retention)

11) Hero images + metadata (must-have)

11.1 Item-level extraction order

  • RSS media fields (media:content, media:thumbnail, enclosure)
  • Article HTML meta (og:image, twitter:image)
  • Fallback: first large image in extracted content

Store:

  • hero_image_url
  • optionally hero_image_cached_path (download/cache)

11.2 Cluster-level hero

  • Use representative item’s hero; fallback to first available among members.
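The two fallback chains can be written as one small helper. A sketch only: the field names on `ItemHeroCandidates` are hypothetical, and treating blank strings as missing is an assumption.

```typescript
interface ItemHeroCandidates {
  rssMedia?: string | null; // media:content, media:thumbnail, enclosure
  htmlMeta?: string | null; // og:image, twitter:image
  firstLargeContentImage?: string | null;
}

// Return the first candidate that is present and not blank.
function firstNonEmpty(
  candidates: Array<string | null | undefined>,
): string | null {
  for (const c of candidates) {
    if (c !== null && c !== undefined && c.trim().length > 0) return c;
  }
  return null;
}

// Item level: RSS media, then HTML meta, then first large content image.
function pickItemHero(c: ItemHeroCandidates): string | null {
  return firstNonEmpty([c.rssMedia, c.htmlMeta, c.firstLargeContentImage]);
}

// Cluster level: representative's hero, else first available among members.
function pickClusterHero(
  representativeHero: string | null,
  memberHeroes: Array<string | null>,
): string | null {
  return firstNonEmpty([representativeHero, ...memberHeroes]);
}
```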

12) Data Model (minimum tables)

Core

  • folder(id, name)
  • feed(id, url, url_normalized, title, site_url, folder_id, folder_confidence, weight, muted, trial, created_at, last_polled_at, etag, last_modified)
  • item(id, feed_id, url, canonical_url, title, summary, published_at, author, guid, hero_image_url, extracted_text, extracted_at)
  • cluster(id, rep_item_id, folder_id, created_at, updated_at, size)
  • cluster_member(cluster_id, item_id, added_at)
  • read_state(cluster_id, read_at, saved_at)

Auth

  • user_account(id, username, password_hash, created_at, last_login_at)
  • auth_session(id, user_id, refresh_token_hash, created_at, expires_at, last_seen_at, revoked_at)

Filtering

  • filter_rule(id, pattern, type=phrase|regex, mode=mute|block, breakout_enabled, created_at)
  • filter_event(rule_id, cluster_id, action=hidden|breakout_shown, ts)

Analytics

  • event(id, ts, type, payload_json, idempotency_key) (batched from PWA)

Digests

  • digest(id, created_at, start_ts, end_ts, title, body, entries_json)

12.1 Constraints and idempotency requirements

  • feed.url_normalized unique
  • item unique on (feed_id, guid) when guid exists
  • fallback uniqueness for guid-less entries: (feed_id, canonical_url, published_at)
  • cluster_member unique on (cluster_id, item_id)
  • read_state.cluster_id is primary key
  • event accepts client idempotency_key to dedupe retries
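The item uniqueness rules exist so that re-polling a feed never creates duplicates. The sketch below models that with an in-memory map standing in for the database unique index; `itemUniqueKey` and `upsertItems` are hypothetical helpers, and in practice the same effect would come from the table constraints plus an upsert.

```typescript
interface ParsedItem {
  feedId: number;
  guid: string | null;
  canonicalUrl: string;
  publishedAt: string; // ISO timestamp
  title: string;
}

// Mirror the constraints: (feed_id, guid) when a guid exists,
// otherwise (feed_id, canonical_url, published_at).
function itemUniqueKey(item: ParsedItem): string {
  if (item.guid !== null && item.guid.length > 0) {
    return "guid|" + item.feedId + "|" + item.guid;
  }
  return "url|" + item.feedId + "|" + item.canonicalUrl + "|" + item.publishedAt;
}

// Idempotent upsert: returns how many items were new. Re-running it with
// the same input updates rows in place and reports zero inserts.
function upsertItems(
  store: Map<string, ParsedItem>,
  incoming: ParsedItem[],
): number {
  let inserted = 0;
  for (const item of incoming) {
    const key = itemUniqueKey(item);
    if (!store.has(key)) inserted += 1;
    store.set(key, item);
  }
  return inserted;
}
```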

13) Pipeline

13.1 Worker stages

  • Poll feeds (conditional GET)
  • Parse items → upsert
  • Canonicalize URL
  • Pre-filter soft gate (mute/block) using title+summary
  • Selective extraction (policy-based)
  • Compute features (simhash; embeddings if enabled)
  • Cluster assignment
  • Post-cluster filter (mute-with-breakout)
  • Summary generation (optional)
  • Digest generation (if triggers met)

13.2 Reliability policy

  • Retries with exponential backoff for poll/extract/AI stages
  • Stage timeouts and per-feed circuit breaker
  • Dead-letter queue/table for repeated failures
  • Structured error logging with feed/item identifiers
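Two pieces of this policy, the backoff schedule and the per-feed circuit breaker, are sketched below as pure functions of time so they are easy to test. The base delay, cap, failure threshold, and cooldown are placeholder values, and `FeedCircuitBreaker` is a hypothetical class rather than a pg-boss feature.

```typescript
// Exponential backoff: base * 2^attempt, capped at maxMs (attempt starts at 0).
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Per-feed breaker: after N consecutive failures, skip the feed for a
// cooldown period, then allow polling again.
class FeedCircuitBreaker {
  private consecutiveFailures = 0;
  private openedAtMs: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold = 5, cooldownMs = 60 * 60 * 1000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAtMs = null;
  }

  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAtMs = nowMs;
    }
  }

  // Open means "do not poll this feed right now".
  isOpen(nowMs: number): boolean {
    return this.openedAtMs !== null && nowMs - this.openedAtMs < this.cooldownMs;
  }
}
```

A job that still fails after its final retry would then be moved to the dead-letter queue/table with the feed and item identifiers attached.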

13.3 Selective extraction policy (default)

Extract if any:

  • summary missing or < N chars (e.g., 280)
  • title matches generic patterns (“briefing”, “top stories”, “update”)
  • item becomes cluster representative
  • source weight is high
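The "extract if any" rule translates directly into a predicate. In this sketch the 280-character default and the generic title patterns come from the list above, while the numeric meaning of a "high" source weight (here, at or above 1) is an assumption.

```typescript
interface ExtractionInput {
  summary: string | null;
  title: string;
  isClusterRepresentative: boolean;
  sourceWeight: number;
}

// Titles that usually say little about the actual content.
const GENERIC_TITLE = /\b(briefing|top stories|update)\b/i;

function shouldExtract(
  input: ExtractionInput,
  minSummaryChars = 280,
  highWeightThreshold = 1, // assumed cutoff for "source weight is high"
): boolean {
  const summaryTooShort =
    input.summary === null || input.summary.trim().length < minSummaryChars;
  return (
    summaryTooShort ||
    GENERIC_TITLE.test(input.title) ||
    input.isClusterRepresentative ||
    input.sourceWeight >= highWeightThreshold
  );
}
```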

14) Digest triggers

Default triggers:

  • Away ≥ 24h OR backlog ≥ 50 unread clusters

Manual “Generate digest now”

Away is defined from last_active_at, updated on app foreground and interaction events.
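The default trigger check is a simple OR over the two conditions. A minimal sketch, with `shouldOfferDigest` as a hypothetical name and the thresholds passed in so the Settings values can override the defaults.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// True when the user has been away long enough OR the backlog is large enough.
function shouldOfferDigest(
  lastActiveAt: Date,
  now: Date,
  unreadClusters: number,
  awayThresholdHours = 24,
  backlogThreshold = 50,
): boolean {
  const awayHours = (now.getTime() - lastActiveAt.getTime()) / HOUR_MS;
  return awayHours >= awayThresholdHours || unreadClusters >= backlogThreshold;
}
```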

Digest generation:

  • rank clusters
  • produce multi-section digest
  • cache for the session/day

15) API (v1)

  • GET /v1/clusters?folder_id=&cursor=&limit=&state=unread|saved|all&sort=personal|latest
  • GET /v1/clusters/{id}
  • POST /v1/clusters/{id}/read
  • POST /v1/clusters/{id}/save
  • POST /v1/clusters/{id}/split
  • POST /v1/clusters/{id}/feedback (not_interested, split_request)
  • GET /v1/folders
  • GET /v1/feeds
  • POST /v1/feeds (add)
  • PATCH /v1/feeds/{id} (folder, weight, muted, trial)
  • POST /v1/opml/import
  • GET /v1/filters
  • POST /v1/filters
  • PATCH /v1/filters/{id}
  • DELETE /v1/filters/{id}
  • GET /v1/digests
  • POST /v1/events (batch)
  • GET /v1/settings
  • POST /v1/settings
  • POST /v1/auth/login
  • POST /v1/auth/logout
  • POST /v1/auth/refresh

Auth:

  • single-user login (password) with access + refresh tokens

16) Docker Compose (components)

  • web (Next.js PWA)
  • api (Node.js Fastify)
  • worker (Node.js polling + processing)
  • postgres (required; also backs the pg-boss job queue)
  • optional redis (future caching/rate-limit use, not required for MVP1)

17) MVP milestones

MVP1: Replacement

  • OPML import + add feed
  • Auto folder assignment prompt
  • Story clustering
  • Home + Cluster detail + Saved
  • Login screen + single-user auth session
  • Sort toggle: For You / Latest
  • Hero images
  • Read/hide behavior
  • Pre/post filters (mute-with-breakout)

MVP2: Catch-up

  • Digest view + triggers
  • AI summaries (provider switchable)

MVP3: Your algorithm

  • Preference learning ranking improvements
  • Recommendations (trial feeds)

18) Defaults you asked for (locked)

  • Auto folders = site-first, one folder per feed, minimal UI concepts
  • Dedup = cluster stories across outlets
  • Roblox-like filtering = mute-with-breakout
  • Hero image + headline always captured and stored
  • Retention = unread persists; read hidden but lightweight history kept; saved persists

19) Interview decisions captured

  • Chosen stack direction: TypeScript-only (Next.js + Fastify + Node worker)
  • Queue recommendation accepted: Postgres-backed jobs first (pg-boss)
  • DB recommendation accepted: latest stable Postgres major
  • pgvector deferred
  • OpenAI is default provider
  • AI monthly budget cap starts at $20 and is configurable
  • Auth model is login screen (single user)
  • Ranking default is personal score with sort fallback to latest
  • Muted stories still count for breakout conditions
  • Polling interval is configurable; default 60 minutes with conditional GET and backoff
  • Saved entries retain metadata + canonical link only
  • Python is deferred unless quality gates in section 3.2 fail
