feat(refresh): extend scheduled refresh to support Reddit/YouTube/GitHub re-ingestion #41

@jmagar

Summary

axon refresh currently only supports periodic URL re-crawling — it fetches web pages, compares content hashes, and re-embeds changed content. But our ingest sources (Reddit, YouTube, GitHub) have no scheduled refresh story at all. A subreddit we ingested last week has new posts. A YouTube channel published new videos. A GitHub repo merged new PRs. Today there's no way to keep those indexed sources fresh without manually re-running axon ingest.

Current State

Refresh system (crates/jobs/refresh/)

  • RefreshJobConfig holds urls: Vec<String> — URL-only
  • url_processor.rs fetches each URL via HTTP, checks ETag/Last-Modified/content-hash, re-embeds changed pages
  • RefreshSchedule table supports seed_url or urls_json — both expect web URLs
  • Worker processes refresh jobs via worker_lane.rs

Ingest system (crates/jobs/ingest.rs)

  • IngestSource enum: Github { repo, include_source }, Reddit { target }, Youtube { target }, Sessions { ... }
  • process_ingest_job() dispatches to source-specific handlers
  • Each source has its own data pipeline (OAuth2 for Reddit, yt-dlp for YouTube, REST API for GitHub)
  • No scheduling, no delta detection, no periodic re-ingestion

Proposed Design

1. Extend RefreshSchedule to support ingest sources

Add an ingest_source column (nullable JSONB) to axon_refresh_schedules:

ALTER TABLE axon_refresh_schedules
ADD COLUMN ingest_source JSONB;  -- serialized IngestSource variant

When ingest_source IS NOT NULL, the schedule creates an ingest job instead of a refresh job on tick. The existing seed_url/urls_json fields remain for URL-based refresh — the two modes are mutually exclusive per schedule row.
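The mutual-exclusivity rule could be enforced when a schedule is created or loaded. A minimal sketch, assuming a simplified row type: `ScheduleRow`, `ScheduleMode`, `ScheduleError`, and `resolve_mode` are hypothetical names, and the JSON columns are held as raw strings so the example needs no external crates.

```rust
/// Hypothetical, simplified mirror of one `axon_refresh_schedules` row.
/// JSON columns are kept as raw strings for the sake of the sketch.
#[derive(Debug, Clone, Default)]
pub struct ScheduleRow {
    pub seed_url: Option<String>,
    pub urls_json: Option<String>,
    pub ingest_source: Option<String>,
}

/// Which kind of job a schedule enqueues on tick.
#[derive(Debug, PartialEq, Eq)]
pub enum ScheduleMode {
    UrlRefresh,
    Ingest,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ScheduleError {
    /// A URL field and `ingest_source` are both set.
    Ambiguous,
    /// Neither a URL field nor `ingest_source` is set.
    Empty,
}

/// Resolve the mode of a row, rejecting rows that mix or omit both modes.
pub fn resolve_mode(row: &ScheduleRow) -> Result<ScheduleMode, ScheduleError> {
    let has_urls = row.seed_url.is_some() || row.urls_json.is_some();
    let has_ingest = row.ingest_source.is_some();
    match (has_urls, has_ingest) {
        (true, true) => Err(ScheduleError::Ambiguous),
        (false, false) => Err(ScheduleError::Empty),
        (true, false) => Ok(ScheduleMode::UrlRefresh),
        (false, true) => Ok(ScheduleMode::Ingest),
    }
}
```

The same rule could also live in the database as a CHECK constraint; validating in Rust as well gives the CLI a clearer error message.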

2. Delta-aware re-ingestion

Each source needs a "what's new since last run" strategy:

| Source | Delta strategy | Key field |
|---|---|---|
| Reddit | before param on Reddit listings (listings are newest-first, so before returns items newer than the anchor, while after pages toward older items); fetch posts/comments newer than the last ingested reddit_post_id | last_ingested_id in schedule metadata |
| YouTube | yt-dlp playlist download with --dateafter, or compare against already-embedded video IDs in Qdrant | last_ingested_date or video ID set |
| GitHub | GitHub API since param on issues/commits (the issues endpoint also returns PRs); compare file tree SHA against last known commit | last_commit_sha in schedule metadata |

Store delta cursors in a new cursor_json JSONB column on axon_refresh_schedules.
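One possible shape for the cursor is a typed enum that mirrors the key fields in the table. This is a sketch, not a settled design: `DeltaCursor`, `advance`, and `base36` are hypothetical names, and it assumes Reddit IDs are base-36 strings that grow over time and that YouTube dates use yt-dlp's YYYYMMDD form.

```rust
/// Hypothetical per-source high-water marks; field names follow the table above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DeltaCursor {
    Reddit { last_ingested_id: String },
    /// Date in YYYYMMDD form, so string order equals chronological order.
    Youtube { last_ingested_date: String },
    Github { last_commit_sha: String },
}

impl DeltaCursor {
    /// Merge the mark observed by the latest run into the stored cursor.
    /// The cursor never moves backwards; a mismatched variant is ignored.
    pub fn advance(self, observed: DeltaCursor) -> DeltaCursor {
        match (self, observed) {
            (
                DeltaCursor::Reddit { last_ingested_id: old },
                DeltaCursor::Reddit { last_ingested_id: new },
            ) => {
                // Compare numerically: as plain strings, "100" sorts before "zz".
                let newer = base36(&new) > base36(&old);
                DeltaCursor::Reddit { last_ingested_id: if newer { new } else { old } }
            }
            (
                DeltaCursor::Youtube { last_ingested_date: old },
                DeltaCursor::Youtube { last_ingested_date: new },
            ) => DeltaCursor::Youtube { last_ingested_date: old.max(new) },
            // Commit SHAs have no natural order, so the observed head simply wins.
            (DeltaCursor::Github { .. }, DeltaCursor::Github { last_commit_sha }) => {
                DeltaCursor::Github { last_commit_sha }
            }
            // Source mismatch: keep the stored cursor untouched.
            (current, _) => current,
        }
    }
}

/// Parse a base-36 ID; `None` (unparseable) compares lower than any parsed value.
fn base36(id: &str) -> Option<u64> {
    u64::from_str_radix(id, 36).ok()
}
```

Freeform JSONB remains the lighter option if refresh should not know the variants at all; the enum trades that decoupling for compile-time checks on which field belongs to which source.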

3. CLI surface

# Schedule periodic Reddit re-ingestion (every 6 hours)
axon refresh schedule create \
  --name "rust-subreddit" \
  --ingest-target "r/rust" \
  --every 6h

# Schedule periodic YouTube channel re-ingestion (daily)
axon refresh schedule create \
  --name "fireship-channel" \
  --ingest-target "@fireship" \
  --every 24h

# Schedule periodic GitHub repo re-ingestion (every 12 hours)
axon refresh schedule create \
  --name "axon-repo" \
  --ingest-target "jmagar/axon_rust" \
  --every 12h

# List all schedules (URL + ingest)
axon refresh schedule list

# Existing URL refresh still works unchanged
axon refresh schedule create --name "rust-docs" --seed-url "https://doc.rust-lang.org" --every 6h
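The `--every` values above (`6h`, `12h`, `24h`) need to become an interval at some point. A rough sketch of such a parser follows; `parse_every` is a hypothetical helper, not the existing CLI code, and the `m` and `d` suffixes are assumptions beyond the `h` suffix the examples use.

```rust
use std::time::Duration;

/// Parse an interval such as "6h" into a `Duration`.
/// Returns `None` for empty input, a zero interval, an unknown unit, or overflow.
pub fn parse_every(input: &str) -> Option<Duration> {
    let s = input.trim();
    let unit = s.chars().last()?;
    // Everything before the final character must be the count.
    let digits = &s[..s.len() - unit.len_utf8()];
    let count: u64 = digits.parse().ok()?;
    if count == 0 {
        return None;
    }
    let secs_per_unit: u64 = match unit {
        'm' => 60,
        'h' => 3_600,
        'd' => 86_400,
        _ => return None,
    };
    count.checked_mul(secs_per_unit).map(Duration::from_secs)
}
```

For example, `parse_every("6h")` yields a six-hour duration, while `parse_every("6")` and `parse_every("0h")` are rejected.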

4. Worker changes

The refresh schedule tick loop (handle_refresh_schedule_run_due) currently calls start_refresh_job_with_pool(). For ingest-type schedules, it should call start_ingest_job() instead, passing the stored IngestSource + any delta cursor.

After the ingest job completes, update the schedule's cursor_json with the new high-water mark (latest post ID, commit SHA, video date, etc.).
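The branching and the cursor update could look roughly like this. It is a sketch with stand-in types: `DueSchedule`, `JobRequest`, `job_for_tick`, and `record_completion` are hypothetical, and the real code would call `start_refresh_job_with_pool()` or `start_ingest_job()` where this sketch only builds a request value.

```rust
/// Hypothetical view of a schedule that is due; JSON columns are raw strings here.
#[derive(Debug, Clone, Default)]
pub struct DueSchedule {
    pub name: String,
    pub urls: Vec<String>,
    pub ingest_source: Option<String>,
    pub cursor_json: Option<String>,
}

/// What the tick loop would enqueue for a schedule.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum JobRequest {
    Refresh { urls: Vec<String> },
    Ingest { source: String, cursor: Option<String> },
}

/// Branch on `ingest_source`: ingest schedules carry their stored cursor along.
pub fn job_for_tick(schedule: &DueSchedule) -> JobRequest {
    match &schedule.ingest_source {
        Some(source) => JobRequest::Ingest {
            source: source.clone(),
            cursor: schedule.cursor_json.clone(),
        },
        None => JobRequest::Refresh { urls: schedule.urls.clone() },
    }
}

/// Store the new high-water mark only when the job succeeded and saw new items.
/// A failed run leaves the cursor alone, so the next tick retries the same window.
pub fn record_completion(schedule: &mut DueSchedule, outcome: Result<Option<String>, String>) {
    if let Ok(Some(mark)) = outcome {
        schedule.cursor_json = Some(mark);
    }
}
```

Updating the cursor only on success is an assumption about the desired semantics. It favors re-processing a window over silently skipping items, which relies on re-embedding the same item being idempotent.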

5. classify_target() reuse

The existing classify_target() in crates/ingest/classify.rs already auto-detects source type from a target string. The --ingest-target flag can pass through classify_target() at schedule creation time and store the resulting IngestSource variant.
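To show the kind of mapping `--ingest-target` relies on, here is an illustrative stand-in. It is not the real `classify_target()`, whose rules and signature may differ: the field types on `IngestSource`, the `include_source: false` default, and the matching rules are all assumptions, and the `Sessions` variant is left out.

```rust
/// Assumed shape of the ingest source enum (field types are guesses).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum IngestSource {
    Github { repo: String, include_source: bool },
    Reddit { target: String },
    Youtube { target: String },
}

/// Hypothetical classifier covering only the three target shapes in the CLI examples.
pub fn classify_target_sketch(target: &str) -> Option<IngestSource> {
    let t = target.trim();

    // "r/<subreddit>" is treated as Reddit.
    if let Some(sub) = t.strip_prefix("r/") {
        if sub.is_empty() || sub.contains('/') {
            return None;
        }
        return Some(IngestSource::Reddit { target: t.to_string() });
    }

    // "@<handle>" is treated as a YouTube channel handle.
    if let Some(handle) = t.strip_prefix('@') {
        if handle.is_empty() {
            return None;
        }
        return Some(IngestSource::Youtube { target: t.to_string() });
    }

    // "<owner>/<repo>": exactly one slash with non-empty halves. Web URLs
    // fall through because "https://host" splits into three parts.
    let mut parts = t.split('/');
    match (parts.next(), parts.next(), parts.next()) {
        (Some(owner), Some(repo), None) if !owner.is_empty() && !repo.is_empty() => {
            Some(IngestSource::Github { repo: t.to_string(), include_source: false })
        }
        _ => None,
    }
}
```

Checking the `r/` prefix first means a GitHub owner literally named `r` would be read as a subreddit in this sketch; the real classifier may resolve that ambiguity differently.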

Implementation Notes

  • RefreshScheduleCreate struct needs an ingest_source: Option<IngestSource> field
  • Delta cursors are source-specific — use an enum or freeform JSONB to avoid coupling refresh to ingest internals
  • The refresh worker already runs as a separate service — no new worker needed, just dispatch logic branching on ingest_source.is_some()
  • Reddit OAuth2 tokens have short TTLs. The ingest handler already refreshes tokens, so periodic re-ingestion should need no extra auth handling
  • YouTube playlists/channels already enumerate all videos — delta detection can diff against Qdrant's existing yt_video_id payloads
  • GitHub since param on the issues/commits API is the cleanest delta mechanism
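The YouTube note above amounts to a set difference between the enumerated video IDs and the IDs already stored. A minimal sketch, assuming the existing `yt_video_id` payloads have already been fetched into a `HashSet`; `new_video_ids` is a hypothetical helper and the Qdrant query itself is not shown.

```rust
use std::collections::HashSet;

/// Return the enumerated video IDs that are not indexed yet,
/// preserving enumeration order and dropping duplicates.
pub fn new_video_ids(enumerated: &[String], already_indexed: &HashSet<String>) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut fresh = Vec::new();
    for id in enumerated {
        if already_indexed.contains(id) {
            continue; // already embedded, nothing to do
        }
        // `insert` returns false when the ID was already seen in this listing.
        if seen.insert(id.as_str()) {
            fresh.push(id.clone());
        }
    }
    fresh
}
```

Diffing by ID also picks up older videos that were added to a playlist late, which a pure date cursor would miss.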

Out of Scope

  • Sessions re-ingestion (local files, different lifecycle)
  • Automatic schedule creation on first axon ingest (could be a follow-up with --auto-refresh flag like crawl has)
