feat(ingest): audit and leverage structured metadata across all ingest sources #33
Open
Labels
enhancement (New feature or request)
Description
Summary
We recently started collecting rich structured metadata on ingested content but we're not fully leveraging it. Audit every source type, document what's being stored, identify gaps, and wire metadata into search/filter/UI surfaces.
What We're Collecting Today
GitHub (crates/ingest/github/meta.rs), well-structured:
- Repo chunks: gh_owner, gh_stars, gh_forks, gh_open_issues, gh_language, gh_topics, gh_created_at, gh_pushed_at, gh_is_fork, gh_is_archived
- Issue chunks: gh_issue_number, gh_state, gh_author, gh_created_at, gh_updated_at, gh_comment_count, gh_labels, gh_is_pr
- PR chunks: the issue fields plus gh_merged_at, gh_is_draft
Reddit (crates/ingest/reddit/meta.rs) — audit needed, document what fields exist
Sessions (crates/ingest/sessions/), no structured metadata:
- embed_text_with_metadata(cfg, text, url, "claude_session", title): just a source type string
- No session_platform, session_project, session_date, session_turn_count, etc.
Local file embeds (axon embed <path>) — audit needed:
- File path, MIME type, last modified — are any of these captured in the Qdrant payload?
What Needs To Happen
1. Audit pass — document everything
Add a Qdrant Payload Fields section to docs/SCHEMA.md listing every field stored per source type, its type, and example values. This is the source of truth.
2. Add structured metadata to sessions chunks
// Currently (no structured metadata):
embed_text_with_metadata(cfg, text, url, "claude_session", title)
// Target (structured payload):
embed_text_with_extra_payload(cfg, text, url, "claude_session", title, json!({
"session_platform": "claude",
"session_project": project_name,
"session_date": session_date, // ISO 8601
"session_file": file_path,
"session_turn_count": turn_count,
"session_model": model_name, // where parseable from export
}))
Same for codex_session and gemini_session.
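Because the three session sources share the same fields, the payload could be built in one place. The following is a minimal std-only sketch, not the crate's actual code: `SessionMeta`, `PayloadValue`, `platform_from_source_type`, and `session_payload` are hypothetical names, and it assumes the platform can be derived from the source type string by stripping a `_session` suffix. Real code would more likely produce a `json!` value for embed_text_with_extra_payload.

```rust
use std::collections::BTreeMap;

/// Hypothetical typed payload value (stand-in for a JSON value).
#[derive(Debug, Clone, PartialEq)]
pub enum PayloadValue {
    Str(String),
    Int(u64),
}

/// Hypothetical container for the session fields proposed above.
#[derive(Debug, Clone)]
pub struct SessionMeta {
    pub project: String,
    pub date: String, // ISO 8601
    pub file: String,
    pub turn_count: u64,
    pub model: Option<String>, // only where parseable from the export
}

/// Derive `session_platform` from a source type such as "claude_session".
/// Assumption: every session source type ends in "_session".
pub fn platform_from_source_type(source_type: &str) -> Option<&str> {
    source_type
        .strip_suffix("_session")
        .filter(|platform| !platform.is_empty())
}

/// Build the structured payload shared by claude, codex, and gemini sessions.
/// Returns `None` when the source type is not a session source type.
pub fn session_payload(
    source_type: &str,
    meta: &SessionMeta,
) -> Option<BTreeMap<&'static str, PayloadValue>> {
    let platform = platform_from_source_type(source_type)?;
    let mut payload = BTreeMap::new();
    payload.insert("session_platform", PayloadValue::Str(platform.to_string()));
    payload.insert("session_project", PayloadValue::Str(meta.project.clone()));
    payload.insert("session_date", PayloadValue::Str(meta.date.clone()));
    payload.insert("session_file", PayloadValue::Str(meta.file.clone()));
    payload.insert("session_turn_count", PayloadValue::Int(meta.turn_count));
    // `session_model` is omitted entirely when it could not be parsed.
    if let Some(model) = &meta.model {
        payload.insert("session_model", PayloadValue::Str(model.clone()));
    }
    Some(payload)
}
```

Omitting `session_model` rather than storing an empty string keeps "unknown" distinguishable from a real value when filtering later.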
3. Add structured metadata to local file embed chunks
json!({
"embed_source": "local_file",
"file_path": path.to_string_lossy(),
"file_extension": ext,
"file_size_bytes": size,
"file_modified_at": mtime.to_rfc3339(),
})
4. Wire metadata into search + filter
- axon query / axon ask: add a --filter key=value flag for Qdrant payload filtering
  axon query "memory leak" --filter gh_language=rust --filter gh_is_pr=false
  axon query "async" --filter session_platform=claude --filter session_project=axon_rust
- axon sources / axon domains: break down by source type, show metadata summary
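The argument-parsing half of the --filter flag could look roughly like the sketch below. It is std-only and makes several assumptions: `FieldValue`, `parse_filter`, and `matches_all` are hypothetical names, values are typed by trying bool, then integer, then falling back to text, and repeated flags combine with AND semantics. In the real command the parsed pairs would presumably be translated into Qdrant match conditions rather than checked in memory; `matches_all` only illustrates the intended semantics.

```rust
use std::collections::HashMap;

/// Hypothetical typed filter value parsed from the command line.
#[derive(Debug, Clone, PartialEq)]
pub enum FieldValue {
    Bool(bool),
    Int(i64),
    Text(String),
}

/// Parse one `key=value` argument. Splits on the first `=`, so the
/// value itself may contain further `=` characters.
pub fn parse_filter(arg: &str) -> Result<(String, FieldValue), String> {
    let (key, raw) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected key=value, got `{arg}`"))?;
    let key = key.trim();
    if key.is_empty() {
        return Err(format!("empty key in `{arg}`"));
    }
    let raw = raw.trim();
    // Try the narrowest types first; anything else is treated as text.
    let value = if let Ok(b) = raw.parse::<bool>() {
        FieldValue::Bool(b)
    } else if let Ok(n) = raw.parse::<i64>() {
        FieldValue::Int(n)
    } else {
        FieldValue::Text(raw.to_string())
    };
    Ok((key.to_string(), value))
}

/// True when every filter matches the payload exactly (AND semantics).
/// A filter on a field the payload lacks never matches.
pub fn matches_all(
    payload: &HashMap<String, FieldValue>,
    filters: &[(String, FieldValue)],
) -> bool {
    filters.iter().all(|(key, want)| payload.get(key) == Some(want))
}
```

One consequence of the type guessing is that `--filter gh_is_pr=false` compares against a boolean, so it only matches if the stored payload field is also a boolean rather than the string "false". The audit in step 1 should confirm the stored types.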
5. Surface metadata in Cortex UI
- Stats page: metadata distribution (top languages, top labels, issues vs PRs vs files)
- GitHub-specific filters in search: open issues only, PRs only, specific language
- Session search: filter by platform, project, date range
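For the stats page, a "top languages" or "top labels" view reduces to counting field values and ranking them. The sketch below is a std-only illustration over values that have already been fetched; `top_values` is a hypothetical helper, and how the values are read out of Qdrant is deliberately left open.

```rust
use std::collections::HashMap;

/// Count how often each value occurs and return the `n` most frequent.
/// Ties are broken alphabetically so the output order is stable.
pub fn top_values<'a, I>(values: I, n: usize) -> Vec<(String, usize)>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for value in values {
        *counts.entry(value).or_insert(0) += 1;
    }
    let mut ranked: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(value, count)| (value.to_string(), count))
        .collect();
    // Highest count first, then alphabetical by value.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(n);
    ranked
}
```

Called with the gh_language values of all repo chunks and `n = 10`, this would give the data for a top-languages list. The same function works for gh_labels or session_platform.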
Files
| File | Action |
|---|---|
| crates/ingest/sessions/claude.rs | Switch to embed_text_with_extra_payload with session metadata |
| crates/ingest/sessions/codex.rs | Same |
| crates/ingest/sessions/gemini.rs | Same |
| crates/vector/ops/commands/query.rs | Add --filter key=value metadata filter support |
| docs/SCHEMA.md | Add Qdrant payload fields section: all fields per source type |
Acceptance Criteria
- docs/SCHEMA.md documents all Qdrant payload fields per source type (GitHub, Reddit, YouTube, sessions, local files)
- Session chunks have structured extra payload: session_platform, session_project, session_date, session_turn_count
- Local file embed chunks have structured payload: embed_source, file_path, file_extension, file_modified_at
- axon query --filter key=value filters results by Qdrant payload field
- Cortex stats page shows metadata distribution
- cargo clippy clean, all tests pass