feat(ingest): audit and leverage structured metadata across all ingest sources #33

@jmagar

Summary

We recently started collecting rich structured metadata on ingested content but we're not fully leveraging it. Audit every source type, document what's being stored, identify gaps, and wire metadata into search/filter/UI surfaces.

What We're Collecting Today

GitHub (crates/ingest/github/meta.rs) — well-structured:

  • Repo chunks: gh_owner, gh_stars, gh_forks, gh_open_issues, gh_language, gh_topics, gh_created_at, gh_pushed_at, gh_is_fork, gh_is_archived
  • Issue chunks: gh_issue_number, gh_state, gh_author, gh_created_at, gh_updated_at, gh_comment_count, gh_labels, gh_is_pr
  • PR chunks: the issue fields above, plus gh_merged_at, gh_is_draft

Reddit (crates/ingest/reddit/meta.rs) — audit needed, document what fields exist

Sessions (crates/ingest/sessions/) — no structured metadata:

  • embed_text_with_metadata(cfg, text, url, "claude_session", title) — just a source type string
  • No session_platform, session_project, session_date, session_turn_count, etc.

Local file embeds (axon embed <path>) — audit needed:

  • File path, MIME type, last modified — are any of these captured in the Qdrant payload?

What Needs To Happen

1. Audit pass — document everything

Add a Qdrant Payload Fields section to docs/SCHEMA.md listing every field stored per source type, its type, and example values. This is the source of truth.

2. Add structured metadata to sessions chunks

// Currently (no structured metadata):
embed_text_with_metadata(cfg, text, url, "claude_session", title)

// Target (structured payload):
embed_text_with_extra_payload(cfg, text, url, "claude_session", title, json!({
    "session_platform": "claude",
    "session_project": project_name,
    "session_date": session_date,       // ISO 8601
    "session_file": file_path,
    "session_turn_count": turn_count,
    "session_model": model_name,        // where parseable from export
}))

Same for codex_session and gemini_session.
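One way to keep the three session ingesters consistent is to build the payload from a shared struct. The sketch below is illustrative only: SessionMeta, PayloadValue, source_type, and to_payload are hypothetical names, not existing code, and it uses a std-only map where the real call would presumably hand a serde_json value to embed_text_with_extra_payload.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value (assumption: the real code would use
/// `serde_json::Value`, as the `json!` snippets in this issue suggest).
#[derive(Debug, Clone, PartialEq)]
pub enum PayloadValue {
    Str(String),
    Int(u64),
}

/// Hypothetical container for the session fields proposed in this issue.
pub struct SessionMeta {
    pub platform: String,      // "claude", "codex", or "gemini"
    pub project: String,
    pub date: String,          // ISO 8601
    pub file: String,
    pub turn_count: u64,
    pub model: Option<String>, // only where parseable from the export
}

impl SessionMeta {
    /// Source type string passed alongside the payload, e.g. "claude_session".
    pub fn source_type(&self) -> String {
        format!("{}_session", self.platform)
    }

    /// Builds the extra payload; `session_model` is omitted when unknown
    /// rather than stored as an empty string.
    pub fn to_payload(&self) -> BTreeMap<&'static str, PayloadValue> {
        let mut payload = BTreeMap::new();
        payload.insert("session_platform", PayloadValue::Str(self.platform.clone()));
        payload.insert("session_project", PayloadValue::Str(self.project.clone()));
        payload.insert("session_date", PayloadValue::Str(self.date.clone()));
        payload.insert("session_file", PayloadValue::Str(self.file.clone()));
        payload.insert("session_turn_count", PayloadValue::Int(self.turn_count));
        if let Some(model) = &self.model {
            payload.insert("session_model", PayloadValue::Str(model.clone()));
        }
        payload
    }
}
```

Leaving session_model out when it cannot be parsed keeps "field is missing" distinct from "field is empty", which may matter once payload filters exist.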

3. Add structured metadata to local file embed chunks

json!({
    "embed_source": "local_file",
    "file_path": path.to_string_lossy(),
    "file_extension": ext,
    "file_size_bytes": size,
    "file_modified_at": mtime.to_rfc3339(),
})
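The fields in this payload can all be read from the filesystem. The following is a rough, std-only sketch of that collection step; LocalFileMeta and collect_local_file_meta are hypothetical names, and the modification time is kept as Unix seconds here, on the assumption that the real code would convert it to RFC 3339 with a date library before storing it as file_modified_at.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::UNIX_EPOCH;

/// Hypothetical holder for the local-file fields proposed in this issue.
#[derive(Debug, Clone, PartialEq)]
pub struct LocalFileMeta {
    pub file_path: String,
    pub file_extension: Option<String>,
    pub file_size_bytes: u64,
    /// Seconds since the Unix epoch; formatting as RFC 3339 is left to the caller.
    pub file_modified_unix_secs: u64,
}

/// Reads size, extension, and mtime for `path`. Returns an error if the file
/// is missing or the platform does not report modification times.
pub fn collect_local_file_meta(path: &Path) -> io::Result<LocalFileMeta> {
    let metadata = fs::metadata(path)?;
    let modified = metadata.modified()?;
    // A timestamp before 1970 is treated as 0 instead of failing the embed.
    let secs = modified
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);

    Ok(LocalFileMeta {
        file_path: path.to_string_lossy().into_owned(),
        file_extension: path
            .extension()
            .map(|ext| ext.to_string_lossy().into_owned()),
        file_size_bytes: metadata.len(),
        file_modified_unix_secs: secs,
    })
}
```

Files without an extension yield None, so the payload can omit file_extension instead of storing an empty string.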

4. Wire metadata into search + filter

  • axon query / axon ask: add --filter key=value flag for Qdrant payload filtering
    • axon query "memory leak" --filter gh_language=rust --filter gh_is_pr=false
    • axon query "async" --filter session_platform=claude --filter session_project=axon_rust
  • axon sources / axon domains: break down by source type, show metadata summary
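Parsing the proposed --filter key=value argument could look roughly like the sketch below. PayloadFilter, FilterValue, and parse_filter are hypothetical names, and inferring the value type from its text (bool, then integer, then string) is an assumption of this sketch, not settled behaviour; translating the result into an actual Qdrant filter is not shown.

```rust
/// Typed value inferred from the right-hand side of `key=value`.
#[derive(Debug, Clone, PartialEq)]
pub enum FilterValue {
    Bool(bool),
    Int(i64),
    Text(String),
}

/// One parsed `--filter` argument (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct PayloadFilter {
    pub key: String,
    pub value: FilterValue,
}

/// Splits on the first `=` so values may themselves contain `=`.
pub fn parse_filter(raw: &str) -> Result<PayloadFilter, String> {
    let (key, value) = raw
        .split_once('=')
        .ok_or_else(|| format!("expected key=value, got `{}`", raw))?;
    let key = key.trim();
    let value = value.trim();
    if key.is_empty() {
        return Err(format!("missing key in `{}`", raw));
    }

    // Try the narrowest types first; anything else is kept as text.
    let value = if let Ok(b) = value.parse::<bool>() {
        FilterValue::Bool(b)
    } else if let Ok(n) = value.parse::<i64>() {
        FilterValue::Int(n)
    } else {
        FilterValue::Text(value.to_string())
    };

    Ok(PayloadFilter { key: key.to_string(), value })
}
```

With this approach, gh_is_pr=false becomes a boolean match and gh_language=rust a string match. The trade-off is that a string field whose value happens to look numeric would be parsed as an integer, so a schema-driven lookup against docs/SCHEMA.md may be the safer long-term choice.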

5. Surface metadata in Cortex UI

  • Stats page: metadata distribution (top languages, top labels, issues vs PRs vs files)
  • GitHub-specific filters in search: open issues only, PRs only, specific language
  • Session search: filter by platform, project, date range
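The "top languages" and "top labels" breakdowns on the stats page reduce to counting payload values and ranking them. A minimal sketch of that ranking step follows, assuming the values have already been read out of the payloads; top_values is a hypothetical helper, and a real implementation might instead rely on server-side aggregation if the vector store offers it.

```rust
use std::collections::HashMap;

/// Counts occurrences of each value and returns the `limit` most frequent,
/// ordered by count (descending) and then by name for stable output.
pub fn top_values<'a, I>(values: I, limit: usize) -> Vec<(String, usize)>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for value in values {
        *counts.entry(value).or_insert(0) += 1;
    }

    let mut ranked: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(name, count)| (name.to_string(), count))
        .collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(limit);
    ranked
}
```

Sorting ties by name keeps the displayed order deterministic between page loads, which a plain HashMap iteration would not guarantee.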

Files

| File | Action |
| --- | --- |
| crates/ingest/sessions/claude.rs | Switch to embed_text_with_extra_payload with session metadata |
| crates/ingest/sessions/codex.rs | Same |
| crates/ingest/sessions/gemini.rs | Same |
| crates/vector/ops/commands/query.rs | Add --filter key=value metadata filter support |
| docs/SCHEMA.md | Add Qdrant payload fields section (all fields per source type) |

Acceptance Criteria

  • docs/SCHEMA.md documents all Qdrant payload fields per source type (GitHub, Reddit, YouTube, sessions, local files)
  • Session chunks have structured extra payload: session_platform, session_project, session_date, session_turn_count
  • Local file embed chunks have structured payload: embed_source, file_path, file_extension, file_modified_at
  • axon query --filter key=value filters results by Qdrant payload field
  • Cortex stats page shows metadata distribution
  • cargo clippy clean, all tests pass

Labels: enhancement
