feat(refresh): extend scheduled refresh to support Reddit/YouTube/GitHub re-ingestion #41
Summary
axon refresh currently only supports periodic URL re-crawling — it fetches web pages, compares content hashes, and re-embeds changed content. But our ingest sources (Reddit, YouTube, GitHub) have no scheduled refresh story at all. A subreddit we ingested last week has new posts. A YouTube channel published new videos. A GitHub repo merged new PRs. Today there's no way to keep those indexed sources fresh without manually re-running axon ingest.
Current State
Refresh system (crates/jobs/refresh/)
- RefreshJobConfig holds urls: Vec<String> (URL-only)
- url_processor.rs fetches each URL via HTTP, checks ETag/Last-Modified/content-hash, re-embeds changed pages
- RefreshSchedule table supports seed_url or urls_json; both expect web URLs
- Worker processes refresh jobs via worker_lane.rs
Ingest system (crates/jobs/ingest.rs)
- IngestSource enum: Github { repo, include_source }, Reddit { target }, Youtube { target }, Sessions { ... }
- process_ingest_job() dispatches to source-specific handlers
- Each source has its own data pipeline (OAuth2 for Reddit, yt-dlp for YouTube, REST API for GitHub)
- No scheduling, no delta detection, no periodic re-ingestion
Proposed Design
1. Extend RefreshSchedule to support ingest sources
Add an ingest_source column (nullable JSONB) to axon_refresh_schedules:
ALTER TABLE axon_refresh_schedules
  ADD COLUMN ingest_source JSONB; -- serialized IngestSource variant
When ingest_source IS NOT NULL, the schedule creates an ingest job instead of a refresh job on tick. The existing seed_url/urls_json fields remain for URL-based refresh; the two modes are mutually exclusive per schedule row.
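The mutual-exclusivity rule could be checked whenever a schedule row is loaded. The following is a minimal sketch, not existing code: ScheduleRow and ScheduleMode are hypothetical names, and ingest_source is held as raw JSON text so the example needs no external crates.

```rust
/// Hypothetical in-memory view of one axon_refresh_schedules row.
/// Field names mirror the columns discussed above; the struct is an assumption.
#[derive(Debug, Clone, Default)]
pub struct ScheduleRow {
    pub seed_url: Option<String>,
    pub urls_json: Option<String>,
    pub ingest_source: Option<String>, // raw JSONB text
}

/// Which kind of job a schedule tick should create (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum ScheduleMode {
    UrlRefresh,
    Ingest,
}

/// Resolve the mode for a row, rejecting rows that set both modes or neither.
pub fn schedule_mode(row: &ScheduleRow) -> Result<ScheduleMode, String> {
    let has_urls = row.seed_url.is_some() || row.urls_json.is_some();
    let has_ingest = row.ingest_source.is_some();
    match (has_urls, has_ingest) {
        (true, false) => Ok(ScheduleMode::UrlRefresh),
        (false, true) => Ok(ScheduleMode::Ingest),
        (true, true) => Err("schedule sets both URL fields and ingest_source".to_string()),
        (false, false) => Err("schedule sets neither URL fields nor ingest_source".to_string()),
    }
}
```

The same rule could also be expressed as a CHECK constraint on the table; doing it in Rust keeps the error message close to the CLI.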
2. Delta-aware re-ingestion
Each source needs a "what's new since last run" strategy:
| Source | Delta strategy | Key field |
|---|---|---|
| Reddit | after param in Reddit API: fetch posts/comments newer than last ingested reddit_post_id | last_ingested_id in schedule metadata |
| YouTube | yt-dlp playlist download with --dateafter or compare against already-embedded video IDs in Qdrant | last_ingested_date or video ID set |
| GitHub | GitHub API since param on issues/PRs/commits; compare file tree SHA against last known commit | last_commit_sha in schedule metadata |
Store delta cursors in a new cursor_json JSONB column on axon_refresh_schedules.
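One way to model what goes into cursor_json is a small enum with one variant per source. This is a sketch under assumptions: DeltaCursor and advance are hypothetical names, the fields follow the "Key field" column above, and serialization to JSONB is left out to keep the example std-only.

```rust
/// Hypothetical per-source delta cursor; variant and field names follow the
/// key fields proposed above but are otherwise assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DeltaCursor {
    Reddit { last_ingested_id: String },
    Youtube { last_ingested_date: String }, // YYYYMMDD, a form yt-dlp's --dateafter accepts
    Github { last_commit_sha: String },
}

impl DeltaCursor {
    /// Compute the cursor to store after a run.
    /// A run that found nothing new reports `None`, and the old cursor is kept.
    /// A cursor for a different source is rejected rather than silently stored.
    pub fn advance(
        current: Option<DeltaCursor>,
        reported: Option<DeltaCursor>,
    ) -> Result<Option<DeltaCursor>, String> {
        match (current, reported) {
            (cur, None) => Ok(cur),
            (None, Some(new)) => Ok(Some(new)),
            (Some(cur), Some(new)) => {
                if std::mem::discriminant(&cur) == std::mem::discriminant(&new) {
                    Ok(Some(new))
                } else {
                    Err(format!("cursor source changed: {:?} -> {:?}", cur, new))
                }
            }
        }
    }
}
```

Keeping the enum on the refresh side and treating the payload as opaque JSONB in the database is one way to avoid coupling the schedule table to ingest internals.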
3. CLI surface
# Schedule periodic Reddit re-ingestion (every 6 hours)
axon refresh schedule create \
--name "rust-subreddit" \
--ingest-target "r/rust" \
--every 6h
# Schedule periodic YouTube channel re-ingestion (daily)
axon refresh schedule create \
--name "fireship-channel" \
--ingest-target "@fireship" \
--every 24h
# Schedule periodic GitHub repo re-ingestion (every 12 hours)
axon refresh schedule create \
--name "axon-repo" \
--ingest-target "jmagar/axon_rust" \
--every 12h
# List all schedules (URL + ingest)
axon refresh schedule list
# Existing URL refresh still works unchanged
axon refresh schedule create --name "rust-docs" --seed-url "https://doc.rust-lang.org" --every 6h
4. Worker changes
The refresh schedule tick loop (handle_refresh_schedule_run_due) currently calls start_refresh_job_with_pool(). For ingest-type schedules, it should call start_ingest_job() instead, passing the stored IngestSource + any delta cursor.
After the ingest job completes, update the schedule's cursor_json with the new high-water mark (latest post ID, commit SHA, video date, etc.).
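The branching described here can be sketched as a pure decision function. The signatures of start_ingest_job() and start_refresh_job_with_pool() are not shown in this issue, so the sketch below only decides which one to call; DueSchedule, TickAction and plan_tick are hypothetical names.

```rust
/// Hypothetical slice of a due schedule: only the fields needed for dispatch.
#[derive(Debug, Clone)]
pub struct DueSchedule {
    pub urls: Vec<String>,             // resolved from seed_url / urls_json
    pub ingest_source: Option<String>, // serialized IngestSource, if any
    pub cursor_json: Option<String>,   // delta cursor from the previous run
}

/// The job the tick loop should start. The real code would call
/// start_ingest_job() or start_refresh_job_with_pool() at this point.
#[derive(Debug, PartialEq, Eq)]
pub enum TickAction {
    StartIngest { source: String, cursor: Option<String> },
    StartRefresh { urls: Vec<String> },
}

/// Branch on ingest_source: ingest schedules carry their cursor along,
/// URL schedules behave exactly as before.
pub fn plan_tick(schedule: &DueSchedule) -> TickAction {
    match &schedule.ingest_source {
        Some(source) => TickAction::StartIngest {
            source: source.clone(),
            cursor: schedule.cursor_json.clone(),
        },
        None => TickAction::StartRefresh {
            urls: schedule.urls.clone(),
        },
    }
}
```

Separating the decision from the side effect makes the branch easy to unit-test without a database pool.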
5. classify_target() reuse
The existing classify_target() in crates/ingest/classify.rs already auto-detects source type from a target string. The --ingest-target flag can pass through classify_target() at schedule creation time and store the resulting IngestSource variant.
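For illustration only, the kind of classification that --ingest-target relies on might look like the sketch below. This is not the real classify_target(): its rules are not shown in this issue, and the ones here are inferred solely from the three CLI examples above (r/rust, @fireship, jmagar/axon_rust). SketchSource and classify_target_sketch are hypothetical names.

```rust
/// Hypothetical stand-in for the IngestSource variants a schedule can store.
#[derive(Debug, PartialEq, Eq)]
pub enum SketchSource {
    Reddit { target: String },
    Youtube { target: String },
    Github { repo: String },
}

/// Guess the source type from a target string.
/// Assumed rules: "r/<name>" is Reddit, "@<handle>" is YouTube,
/// "<owner>/<repo>" is GitHub; anything else is rejected.
pub fn classify_target_sketch(target: &str) -> Option<SketchSource> {
    let t = target.trim();
    if let Some(sub) = t.strip_prefix("r/") {
        if !sub.is_empty() && !sub.contains('/') {
            return Some(SketchSource::Reddit { target: t.to_string() });
        }
        return None;
    }
    if t.starts_with('@') && t.len() > 1 {
        return Some(SketchSource::Youtube { target: t.to_string() });
    }
    // Exactly two non-empty segments separated by one slash.
    let mut parts = t.split('/');
    match (parts.next(), parts.next(), parts.next()) {
        (Some(owner), Some(repo), None) if !owner.is_empty() && !repo.is_empty() => {
            Some(SketchSource::Github { repo: t.to_string() })
        }
        _ => None,
    }
}
```

Rejecting unrecognized targets at schedule creation time means a bad --ingest-target fails immediately instead of on the first tick.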
Implementation Notes
- RefreshScheduleCreate struct needs an ingest_source: Option<IngestSource> field
- Delta cursors are source-specific; use an enum or freeform JSONB to avoid coupling refresh to ingest internals
- The refresh worker already runs as a separate service; no new worker is needed, just dispatch logic branching on ingest_source.is_some()
- Reddit OAuth2 tokens have short TTLs; the ingest handler already handles token refresh, so periodic re-ingestion should Just Work
- YouTube playlists/channels already enumerate all videos; delta detection can diff against Qdrant's existing yt_video_id payloads
- GitHub since param on the issues/commits API is the cleanest delta mechanism
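The YouTube diff mentioned in the notes above reduces to a set difference. A minimal sketch, assuming the already-indexed yt_video_id values have been fetched from Qdrant into a HashSet beforehand (the fetch itself is not shown, and new_video_ids is a hypothetical helper):

```rust
use std::collections::HashSet;

/// Return the video IDs from a fresh channel/playlist listing that are not
/// already indexed, preserving listing order and dropping duplicates.
pub fn new_video_ids(listed: &[String], already_indexed: &HashSet<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    listed
        .iter()
        .filter(|id| !already_indexed.contains(*id))
        .filter(|id| seen.insert((*id).clone()))
        .cloned()
        .collect()
}
```

Only the returned IDs would need to be downloaded and embedded, so a channel with no new uploads results in no work.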
Out of Scope
- Sessions re-ingestion (local files, different lifecycle)
- Automatic schedule creation on first axon ingest (could be a follow-up with an --auto-refresh flag like crawl has)