diff --git a/CLAUDE.md b/CLAUDE.md index 39aa414..99f56dc 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -3,6 +3,118 @@ Shared memory layer for AI-powered engineering teams. Built on [QMD](https://github.com/tobi/qmd). +## Memory + +You have Smriti. USE IT. + +### Start of Every Session + +Before writing a single line of code, run: + +```bash +smriti recall "[what we're about to work on]" +``` + +Use what comes back. If there's prior context on this topic, it changes how +you approach the work. Don't ignore it. + +### Recognize These Moments — Then Act Immediately + +No permission needed. No asking "should I save this?" Just do it. + +- Someone says "this works" or "let's go with this" +- An approach is chosen after comparing options +- A bug is fixed and the root cause is understood +- A constraint is discovered that will affect future work +- An architectural decision is made + +**WRONG**: Continue the conversation without saving. +**RIGHT**: Write the decision record, ingest it, tag it — then respond. + +### How to Save a Decision + +**Step 1 — Write the record:** + +```bash +cat > /tmp/smriti-note.md << 'EOF' +# [Title] + +## Decision +[One sentence — the actual choice made.] + +## Why +[The reasoning. What was considered. What was ruled out and why.] + +## Context +[What problem this solves. What breaks if someone reverses this later.] 
+EOF +``` + +**Step 2 — Ingest:** + +```bash +smriti ingest file /tmp/smriti-note.md \ + --title "[Title]" \ + --project smriti +``` + +**Step 3 — Tag:** + +```bash +smriti tag [session-id] [category] +``` + +### Categories + +| What happened | Tag | +|---------------|-----| +| Architecture choice, approach finalized | `decision/technical` | +| Library or tool selected | `decision/tooling` | +| Bug fixed, root cause identified | `bug/fix` | +| Structural design settled | `architecture/design` | +| Feature approach finalized | `feature/implementation` | +| Config or setup figured out | `project/config` | + +### Concrete Example + +User says: *"Okay let's use post-filtering for vector search with filters"* + +**WRONG**: +> "Got it, I'll use post-filtering." + +**RIGHT**: + +```bash +cat > /tmp/smriti-note.md << 'EOF' +# Vector search with filters: post-filter approach + +## Decision +Use post-filter strategy for applying metadata filters to vector search +results, rather than combining sqlite-vec queries with EXISTS subqueries. + +## Why +sqlite-vec's two-step query pattern hangs when combined with Smriti's +EXISTS subqueries on metadata tables. Post-filtering runs vector search +unfiltered (3x overfetch), then filters results against session metadata +in a separate batch lookup — no sqlite-vec changes needed. + +## Context +Affects smriti recall when --project, --category, or --agent flags are +used. Without this, filtered recall loses semantic search entirely and +falls back to BM25 only. +EOF + +smriti ingest file /tmp/smriti-note.md \ + --title "Vector search with filters: post-filter approach" \ + --project smriti + +smriti tag [session-id] decision/technical +``` + +Then respond to the user. + +--- + ## Quick Reference ```bash @@ -104,7 +216,7 @@ get a clean name like `openfga`. 4. Store message/meta/sidecars/costs (store gateway) 5. Aggregate results and continue on per-session errors (orchestrator) -See `INGEST_ARCHITECTURE.md` for details. 
+See `docs/internal/ingest-architecture.md` for details. ### Search diff --git a/README.md b/README.md index ad3ee91..b8515c7 100644 --- a/README.md +++ b/README.md @@ -299,7 +299,7 @@ works cross-project by default, scoped with `--project `. git-native today. Issue tracker integrations are on the roadmap. **Further reading:** See [docs/cli.md](./docs/cli.md) for the full command -reference, [INGEST_ARCHITECTURE.md](./INGEST_ARCHITECTURE.md) for the ingestion +reference, [docs/internal/ingest-architecture.md](./docs/internal/ingest-architecture.md) for the ingestion pipeline, and [CLAUDE.md](./CLAUDE.md) for the database schema and architecture. diff --git a/DEMO_RESULTS.md b/docs/internal/demo-results.md similarity index 100% rename from DEMO_RESULTS.md rename to docs/internal/demo-results.md diff --git a/IMPLEMENTATION_CHECKLIST.md b/docs/internal/implementation-checklist.md similarity index 100% rename from IMPLEMENTATION_CHECKLIST.md rename to docs/internal/implementation-checklist.md diff --git a/IMPLEMENTATION.md b/docs/internal/implementation.md similarity index 100% rename from IMPLEMENTATION.md rename to docs/internal/implementation.md diff --git a/INGEST_ARCHITECTURE.md b/docs/internal/ingest-architecture.md similarity index 100% rename from INGEST_ARCHITECTURE.md rename to docs/internal/ingest-architecture.md diff --git a/streamed-humming-curry.md b/docs/internal/ingest-refactoring.md similarity index 100% rename from streamed-humming-curry.md rename to docs/internal/ingest-refactoring.md diff --git a/PHASE1_IMPLEMENTATION.md b/docs/internal/phase1-implementation.md similarity index 100% rename from PHASE1_IMPLEMENTATION.md rename to docs/internal/phase1-implementation.md diff --git a/majestic-sauteeing-papert.md b/docs/internal/qmd-deep-dive.md similarity index 100% rename from majestic-sauteeing-papert.md rename to docs/internal/qmd-deep-dive.md diff --git a/RULES_QUICK_REFERENCE.md b/docs/internal/rules-quick-reference.md similarity index 100% rename 
from RULES_QUICK_REFERENCE.md rename to docs/internal/rules-quick-reference.md diff --git a/QUICKSTART.md b/docs/internal/segmentation-quickstart.md similarity index 100% rename from QUICKSTART.md rename to docs/internal/segmentation-quickstart.md diff --git a/issues.json b/issues.json deleted file mode 100644 index 58eb634..0000000 --- a/issues.json +++ /dev/null @@ -1 +0,0 @@ -[{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"# Smriti: Building Intelligent Memory for AI Agents\n\n## The Problem\nWhen Claude Code, Cline, or Aider run for months, they produce 1000s of sessions. But without proper categorization, that memory is just noise. You can't find \"that time we fixed the auth bug\" or \"our decision on Redis vs Memcached\" — it's all one big undifferentiated pile of text.\n\nMost teams treat categorization as an afterthought: hardcoded regex patterns, one-size-fits-all rules, no ability to adapt.\n\n## Our Approach: Categorization as First-Class Citizen\n\nWe've built **Smriti** — a unified memory layer for AI teams that makes categorization fast, accurate, and *evolving*.\n\n### ✅ What We Just Shipped (MVP)\n\n**3-Tier Rule System** — flexible, not rigid\n- **Tier 1 (Base)**: Language-specific rules (TypeScript, Python, Rust, Go)\n- **Tier 2 (Custom)**: Project-specific tweaks (git-tracked, team-shared)\n- **Tier 3 (Runtime)**: CLI overrides for experimentation\n\n**Language Detection** — automatic, no config needed\n- Detects your tech stack from filesystem markers\n- Identifies frameworks (Next.js, FastAPI, Axum, etc.)\n- Confidence scoring to know when we're guessing\n\n**Performance**\n- <50ms to categorize a message\n- Rules cached in memory (not re-parsing YAML every time)\n- GitHub rule cache with fallback (works offline)\n\n**27 Tests, 100% Pass Rate**\n- Language detection working on 5 languages\n- 3-tier merge logic verified\n- Backward compatible — existing projects work 
unchanged\n\n### 🚀 What's Coming (Phase 1.5 & 2)\n\n**Next 2 weeks**:\n- [ ] Language-specific rule sets (TypeScript, Python, Rust, Go, JavaScript)\n- [ ] `smriti init` command to auto-detect & set up project rules\n- [ ] `smriti rules` CLI for teams to add/validate custom rules\n- [ ] Framework-specific rules (Next.js, FastAPI patterns)\n\n**Months ahead**:\n- [ ] Community rule repository on GitHub\n- [ ] Auto-update checking (\"new rules available for TypeScript\")\n- [ ] A/B testing framework for rule accuracy\n- [ ] Entity extraction (people, projects, errors) for richer context\n\n### 💡 Why This Matters\n\n**For solo developers**: \"Find everything we discussed about authentication\" — instant, accurate\n\n**For teams**: Shared rules in git means everyone uses the same categorization schema. Knowledge transfer, not knowledge hoarding.\n\n**For AI agents**: Agents can search categorized memory, leading to better context and fewer hallucinations.\n\n### 🎯 Design Principles\n\n✓ **Not hardcoded** — YAML rules, easy to modify \n✓ **Evolving** — add/override rules without touching code \n✓ **Language-aware** — TypeScript rules ≠ Python rules \n✓ **Offline-first** — caches GitHub rules, works offline \n✓ **Testable** — 27 tests, clear precedence rules\n\n---\n\n**Status**: MVP complete, ready for real-world testing.\n\n**Related**: Issue #18 (Technical tracking) \n**Commit**: f15c532 (Phase 1 MVP implementation)\n\n**Building memory infrastructure for the agentic era.**\n\n#AI #DevTools #Memory #Categorization #Agents\n","comments":[{"id":"IC_kwDORM6Bzs7oi3Cz","author":{"login":"pankajmaurya"},"authorAssociation":"NONE","body":"Thanks for 
this","createdAt":"2026-02-14T08:45:22Z","includesCreatedEdit":false,"isMinimized":false,"minimizedReason":"","reactionGroups":[{"content":"ROCKET","users":{"totalCount":1}}],"url":"https://github.com/zero8dotdev/smriti/issues/19#issuecomment-3901452467","viewerDidAuthor":false}],"createdAt":"2026-02-14T08:20:40Z","labels":[{"id":"LA_kwDORM6Bzs8AAAACXowH7A","name":"documentation","description":"Improvements or additions to documentation","color":"0075ca"}],"number":19,"state":"OPEN","title":"📢 Progress Writeup: Rule-Based Engine MVP Complete","updatedAt":"2026-02-14T08:45:22Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## Overview\n\nImplement a flexible 3-tier rule system for message classification, replacing hardcoded regex patterns with YAML-based rules that support language-specific and project-specific customization.\n\n## Status\n\n### ✅ Phase 1: MVP (COMPLETE)\n- [x] Language detection (TypeScript, Python, Rust, Go, JavaScript)\n- [x] Framework detection (Next.js, FastAPI, Axum, Django, Actix)\n- [x] YAML rule loader with 3-tier merge logic\n- [x] Migrated 26 hardcoded rules to general.yml\n- [x] Pattern compilation and caching\n- [x] GitHub rule fetching with database cache\n- [x] Comprehensive test coverage (27 tests passing)\n- [x] Database schema extensions\n- [x] Backward compatibility maintained\n\n**Commit**: f15c532 - \"Implement Phase 1: 3-Tier Rule-Based Engine (MVP Complete)\"\n\n### 📋 Phase 1.5: Language-Specific Rules (Next)\n- [ ] Create TypeScript-specific rule set\n- [ ] Create JavaScript-specific rule set\n- [ ] Create Python-specific rule set\n- [ ] Create Rust-specific rule set\n- [ ] Create Go-specific rule set\n- [ ] Implement `smriti init` command with auto-detection\n- [ ] Implement `smriti rules add` command\n- [ ] Implement `smriti rules validate` command\n- [ ] Implement `smriti rules list` command\n\n### 📋 Phase 2: Auto-Update & Versioning\n- [ ] Implement `smriti 
rules update` command\n- [ ] Auto-check for rule updates on categorize\n- [ ] Add `--no-update` flag\n- [ ] Display changelog before update\n- [ ] Version tracking in database\n\n### 📋 Phase 4+: Community\n- [ ] GitHub community rule repository\n- [ ] Community-contributed rule sets\n- [ ] Plugin marketplace integration\n\n## Architecture\n\n### 3-Tier Rule System\n```\nTier 3 (Runtime Override) ← CLI flags, programmatic\n ↓ (highest precedence)\nTier 2 (Project Custom) ← .smriti/rules/custom.yml\n ↓ (overrides base)\nTier 1 (Base) ← general.yml (GitHub or local)\n (lowest precedence)\n```\n\n## Key Files\n- `src/detect/language.ts` - Language/framework detection\n- `src/categorize/rules/loader.ts` - YAML loader + 3-tier merge\n- `src/categorize/rules/github.ts` - GitHub fetcher + cache\n- `src/categorize/rules/general.yml` - 26 general rules\n- `PHASE1_IMPLEMENTATION.md` - Technical documentation\n- `RULES_QUICK_REFERENCE.md` - Developer guide\n\n## Test Results (Phase 1)\n- ✅ 27/27 new tests passing\n- ✅ 63 assertions verified\n- ✅ All existing categorization tests still working\n\n## Performance (Phase 1)\n- Language Detection: 20-50ms\n- Rule Loading: 50-100ms (cached)\n- Classification: 2-5ms per message\n\n## Related Issues\n- None yet","comments":[],"createdAt":"2026-02-14T08:10:57Z","labels":[{"id":"LA_kwDORM6Bzs8AAAACXowH-Q","name":"enhancement","description":"New feature or request","color":"a2eeef"},{"id":"LA_kwDORM6Bzs8AAAACXwf1zw","name":"phase-2","description":"Phase 2: New agent parsers","color":"1D76DB"}],"number":18,"state":"OPEN","title":"Rule-Based Engine: 3-Tier YAML Rule System","updatedAt":"2026-02-14T08:10:57Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## TL;DR\n\nFine-tuned [EmbeddingGemma-300M](https://huggingface.co/google/embeddinggemma-300m) — the embedding model powering QMD search — on 420 Smriti coding sessions. 
Generated 1,700 training triplets using Gemini 2.0 Flash, trained on a free-tier Colab T4 GPU after failing on local M3 Pro (MPS OOM). Result: **accuracy 87.3% → 91.5% (+4.2pp), margin +43% relative**. The model now understands domain terms like \"LoRA rank\", \"RRF fusion\", and \"OpenFGA\" instead of treating them as generic text.\n\n## The Idea\n\nQMD uses a generic 300M-parameter embedding model. It doesn't know what \"LoRA rank\" means, or that \"RRF\" is about search fusion, or that when you say \"auth\" you mean OpenFGA — not OAuth. `smriti recall` and `smriti search` suffer because of this vocabulary mismatch.\n\nFine-tuning on actual sessions teaches the model *our* vocabulary. We generate (query, relevant passage, hard negative) triplets from real sessions, then train the model to push relevant results closer together and irrelevant ones apart.\n\n## Timeline\n\n| When | What |\n|------|------|\n| **Feb 12, 4:44 PM** | Built the full pipeline: export sessions → generate triplets → validate → train → eval → convert GGUF. First commit [`29df52b`](https://github.com/zero8dotdev/smriti-getting-smarter/commit/29df52b). |\n| **Feb 12, evening** | Tried Ollama (`qwen3:8b`) for triplet generation. Too slow for 420 sessions — would take hours locally. |\n| **Feb 12–13** | Switched to Gemini 2.0 Flash API. Fast and cheap. Generated 2,069 raw triplets → 1,700 after validation/dedup. |\n| **Feb 13, morning** | Attempted local training on M3 Pro (18GB). OOM immediately with `seq_length: 512, batch_size: 8`. Reduced batch size, seq length, disabled fp16, switched loss function. Still OOM. |\n| **Feb 13, ~10:00 AM** | Pivoted to Google Colab (T4 GPU, 15GB VRAM, free tier) |\n| **Feb 13, 10:00–10:44 AM** | 6+ failed Colab runs. T4 OOM with initial settings. Progressively lowered seq_length (512→256→128), added gradient checkpointing, tuned mini_batch_size, set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. |\n| **Feb 13, 10:44 AM** | First successful training run. 
Commit [`6af8a2b`](https://github.com/zero8dotdev/smriti-getting-smarter/commit/6af8a2b). |\n| **Feb 13, shortly after** | Evaluation: accuracy 87.3% → 91.5%, margin +43% relative. |\n\n## What Failed & What Fixed It\n\n| Failure | Root Cause | Fix |\n|---------|-----------|-----|\n| Ollama triplet generation too slow | `qwen3:8b` running locally on CPU, 420 sessions | Switched to Gemini 2.0 Flash API |\n| MPS OOM on M3 Pro (18GB) | `seq_length: 512`, `batch_size: 8`, fp16 on MPS | Reduced to `seq_length: 256`, `batch_size: 2`, disabled fp16, added gradient accumulation |\n| Still OOM on MPS after reductions | MPS memory management fundamentally limited for training | Pivoted to Colab T4 |\n| T4 OOM on Colab (attempts 1–6) | `seq_length: 256`, no gradient checkpointing, mini_batch too large | `seq_length: 128`, gradient checkpointing, `mini_batch_size: 4`, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |\n\n## The Pipeline\n\n```\nsmriti DB (420 sessions)\n → export_sessions.py → sessions.jsonl (7.9 MB)\n → generate_triplets.py (Gemini 2.0 Flash) → triplets.jsonl (2,069 triplets)\n → validate_data.py → train.jsonl (1,700) + val.jsonl (165)\n → train.py (sentence-transformers + CachedMNRL loss) → fine-tuned model\n → eval.py → metrics comparison\n → convert_gguf.py → GGUF for QMD\n```\n\nEach triplet contains:\n- **Query**: 2–8 word search query (what a user would type into `smriti search`)\n- **Positive**: 50–300 word relevant passage from the session\n- **Hard negative**: A passage from the *same* conversation that's topically related but answers a different question\n\nTrain/val split is by session (not by triplet) to prevent data leakage.\n\n## Results\n\n```\n Base Model Fine-Tuned Change\nAccuracy 0.8727 0.9152 +0.0424 (+4.9%)\nMargin 0.1716 0.2452 +0.0736 (+42.9%)\nPositive Sim 0.5608 0.5226 -0.0382\nNegative Sim 0.3893 0.2774 -0.1119\n```\n\nBoth positive and negative similarity dropped, but **negative similarity dropped 3x harder** (0.39 → 0.28 vs 0.56 
→ 0.52). The model learned to push irrelevant results far apart while keeping relevant ones close. This is exactly what you want for retrieval — fewer false positives, cleaner separation.\n\n### Final Working Colab Config\n\n| Parameter | Value |\n|-----------|-------|\n| `max_seq_length` | 128 |\n| `per_device_train_batch_size` | 4 |\n| `gradient_accumulation_steps` | 16 (effective batch = 64) |\n| `mini_batch_size` (CachedMNRL) | 4 |\n| `num_train_epochs` | 3 |\n| `learning_rate` | 2e-5 |\n| `gradient_checkpointing` | true |\n| `fp16` | true |\n\n## What's Next\n\nThe end state isn't a separate repo — it's `smriti finetune`:\n\n- **`smriti finetune`** — Subcommand that retrains the embedding model on accumulated sessions. Run after a week of coding, on a cron, or as a post-ingest hook.\n- **`smriti finetune --incremental`** — Don't retrain from scratch. Keep the last checkpoint and continue on new sessions only. The model accumulates knowledge over time.\n- **`smriti finetune --team`** — Pull sessions from teammates via `smriti sync`, train a shared model. The team's collective vocabulary becomes the model's vocabulary.\n- **Reranker fine-tuning** — QMD uses a 0.6B reranker (Qwen3-Reranker). Same triplet data, different training objective. Would compound the embedding improvements.\n- **Automatic quality signals** — Use implicit signals from actual usage (clicked results = positive, reformulated queries = hard negatives) instead of synthetic LLM-generated triplets.\n- **Per-project adapters** — Train project-specific LoRA adapters (~8MB each) that QMD swaps based on active project.\n- **Scheduled retraining** — Weekly cron that runs `smriti finetune --incremental --deploy`. 
Search silently gets better every Monday.\n\n## Repo\n\nhttps://github.com/zero8dotdev/smriti-getting-smarter","comments":[],"createdAt":"2026-02-13T08:24:57Z","labels":[],"number":17,"state":"OPEN","title":"Fine-tuned EmbeddingGemma-300M on Smriti sessions — journey, results, and next steps","updatedAt":"2026-02-13T08:24:57Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## Overview\n\nAdded multi-layered secret detection system to prevent accidental credential commits and ensure repository security.\n\n## Components Implemented\n\n### 1. Local Pre-commit Hook\n- **Tool**: Gitleaks v8.18.0\n- **Trigger**: Runs on every `git commit`\n- **Config**: `.pre-commit-config.yaml` with auto-installation\n- **Status**: ✅ All tests pass\n\n### 2. Gitleaks Configuration\n- **File**: `.gitleaks.toml`\n- **Features**:\n - Detects JWTs, API keys, passwords, private keys\n - Allowlist for test/demo tokens in `.smriti/knowledge/` documentation\n - Regex patterns to ignore common test emails (@test.com, @acme.com)\n - Scans full git history\n\n### 3. GitHub Actions CI Pipeline\n- **File**: `.github/workflows/secret-scan.yml`\n- **Runs on**: Push to main/staging and all PRs\n- **Tools**:\n - Gitleaks (primary detection)\n - detect-secrets (secondary verification)\n- **Features**:\n - Automated scanning on every push\n - Comments on PRs with findings\n - Blocks merges if secrets detected\n\n### 4. Additional Hooks\nVia pre-commit framework:\n- Detect private keys in code\n- Check for merge conflicts\n- Validate YAML files\n- Prevent large file commits (>500KB)\n\n## Setup & Usage\n\n### Installation\nThe setup is automatic when developers clone the repo:\n```bash\npre-commit install # (auto-runs on first commit)\n```\n\n### Manual Scanning\n```bash\n# Scan current directory\ngitleaks detect --source . -c .gitleaks.toml\n\n# Scan git history\ngitleaks detect --source . 
-c .gitleaks.toml --verbose\n\n# Run all pre-commit hooks\npre-commit run --all-files\n```\n\n## Configuration Details\n\n### .gitleaks.toml\n- **Paths allowlist**: Excludes `.smriti/knowledge/` and `test/` directories\n- **Regex allowlist**: Ignores test email patterns\n- **Entropy detection**: Enabled for high-entropy strings\n\n### Pre-commit Stages\n- **Default**: Runs on commits (prevent push of secrets)\n- **CI**: GitHub Actions validate on push and PRs\n\n## Testing\n\n✅ All hooks validated:\n- Gitleaks: PASSED\n- Detect private key: PASSED \n- Merge conflict detection: PASSED\n- YAML validation: PASSED\n- File size limits: PASSED\n- Trailing whitespace: PASSED\n\nBaseline established for knowledge base files containing test tokens.\n\n## Security Benefits\n\n1. **Prevention**: Stops secrets from entering git history\n2. **Detection**: Multi-tool approach catches edge cases\n3. **Automation**: No manual intervention required\n4. **CI/CD Integration**: Repository-wide enforcement\n5. **Documentation**: Clear ignoring patterns for legitimate test data\n\n## Future Enhancements\n\n- [ ] Setup GitGuardian API integration for real-time alerts\n- [ ] Add SAST scanning (static analysis)\n- [ ] Email notifications on secret detection\n- [ ] Automated rotation of compromised credentials\n- [ ] Team policy configuration\n\n## Related\n\nImplements response to security alert about exposed credentials. 
Prevents similar incidents through automated scanning.","comments":[],"createdAt":"2026-02-12T05:42:37Z","labels":[],"number":16,"state":"OPEN","title":"Implement comprehensive secret scanning infrastructure","updatedAt":"2026-02-12T05:42:37Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## Overview\n\nThis branch implements a **3-stage prompt architecture** for the `smriti share` command that intelligently segments sessions into distinct knowledge units, generates category-specific documentation, and exports team knowledge to `.smriti/` directories.\n\n## Architecture Stages\n\n### Stage 1: Segment\n- **Purpose**: Analyze sessions and extract distinct knowledge units\n- **Process**: LLM analyzes session content, identifies topics, categories, and relevance scores\n- **Metadata Injection**: Tool usage, files modified, git operations, and errors are extracted and injected into prompts for better context\n- **Output**: `KnowledgeUnit[]` with categories, relevance (1-10), and entity tags\n\n### Stage 2: Document \n- **Purpose**: Generate polished markdown documentation for each unit\n- **Process**: Select category-specific templates and apply unit content\n- **Categories Supported**:\n - `bug/*` - Symptoms → Root Cause → Investigation → Fix → Prevention\n - `architecture/*` / `decision/*` - Context → Options → Decision → Consequences\n - `code/*` - Implementation → Key Decisions → Gotchas\n - `feature/*` - Requirements → Design → Implementation Notes\n - `topic/*` - Concept → Relevance → Examples → Resources\n - `project/*` - What Changed → Why → Steps → Verification\n- **Output**: Markdown files organized in `.smriti/knowledge//`\n\n### Stage 3: Defer\n- **Purpose**: Metadata enrichment (phase 2)\n- **Future**: Entity extraction, freshness detection, version tracking\n\n## Key Design Patterns\n\n1. **Graceful Degradation**: Stage 1 fails → fallback to single unit → Stage 2 still generates docs\n2. 
**Category Validation**: LLM suggestions validated against `smriti_categories` table\n3. **Unit-Level Deduplication**: Hash(content + category + entities) prevents re-sharing\n4. **Sequential Processing**: Units processed one-by-one (safety) not in parallel\n5. **Template Flexibility**: Checks `.smriti/prompts/` first before using built-in templates\n\n## Implementation Details\n\n### Files Created\n- `src/team/types.ts` - Type definitions\n- `src/team/segment.ts` - Stage 1 segmentation logic\n- `src/team/document.ts` - Stage 2 documentation generation\n- `src/team/prompts/stage1-segment.md` - Segmentation prompt\n- `src/team/prompts/stage2-*.md` (7 templates) - Category-specific templates\n- `test/team-segmented.test.ts` - Comprehensive test suite (14 tests)\n\n### Files Modified\n- `src/db.ts` - Extended `smriti_shares` table with `unit_id`, `relevance_score`, `entities`\n- `src/team/share.ts` - Added `shareSegmentedKnowledge()` function + flag routing\n- `src/index.ts` - Added CLI flags: `--segmented`, `--min-relevance`\n\n## Usage\n\n```bash\n# Legacy (unchanged)\nsmriti share --project myapp\n\n# New 3-stage pipeline\nsmriti share --project myapp --segmented\n\n# With custom relevance threshold (default: 6/10)\nsmriti share --project myapp --segmented --min-relevance 7\n```\n\n## Testing\n\n- 14 unit tests covering:\n - Graceful fallback logic\n - Unit validation and filtering\n - Relevance thresholding\n - Edge cases\n- All tests passing\n- Uses in-memory DB (no external dependencies)\n\n## Backward Compatibility\n\n✅ No breaking changes - legacy `smriti share` behavior unchanged. 
New flags are optional.\n\n## Future Phases\n\n- **Phase 2**: Entity extraction, freshness detection, tech version tracking\n- **Phase 3**: Relationship graphs, contradiction detection, `smriti conflicts` command\n\n## Related Issues\n\nRelated to discussion of knowledge organization and team sharing workflows.\n","comments":[],"createdAt":"2026-02-12T05:23:04Z","labels":[],"number":14,"state":"OPEN","title":"3-Stage Knowledge Segmentation Pipeline for smriti share","updatedAt":"2026-02-12T05:23:04Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## What is this?\n\n`smriti context` generates a compact project summary (~200-300 tokens) from your session history and injects it into `.smriti/CLAUDE.md`, which Claude Code auto-discovers. The idea is that new sessions start with awareness of recent work — hot files, git activity, recent sessions — instead of re-discovering everything from scratch.\n\n**We don't know yet if this actually saves tokens.** Our initial tests show mixed results, and we need data from real projects to understand where context injection matters.\n\n## How to test\n\n### Prerequisites\n\n```bash\nsmriti ingest claude # make sure sessions are ingested\n```\n\n### Step 1: Baseline session (no context)\n\n```bash\nmv .smriti/CLAUDE.md .smriti/CLAUDE.md.bak\n```\n\nStart a new Claude Code session, give it a task, let it finish, exit.\n\n### Step 2: Context session\n\n```bash\nmv .smriti/CLAUDE.md.bak .smriti/CLAUDE.md\nsmriti context\n```\n\nStart a new Claude Code session, give the **exact same task**, let it finish, exit.\n\n### Step 3: Compare\n\n```bash\nsmriti ingest claude\nsmriti compare --last\n```\n\n## What to share\n\nPost a comment here with:\n\n1. **The task prompt** you used (same for both sessions)\n2. **The `smriti compare` output** (copy-paste the table)\n3. **Project size** — rough number of files, whether you have a detailed `CLAUDE.md` in the repo\n4. 
**Your observations** — did the context-aware session behave differently? Fewer exploratory reads? Better first attempt?\n\n## What we've found so far\n\n| Task Type | Context Impact | Notes |\n|-----------|---------------|-------|\n| Knowledge questions (\"how does X work?\") | Minimal | Both sessions found the right files immediately from project CLAUDE.md |\n| Implementation tasks (\"add --since flag\") | Minimal | Small, well-scoped tasks don't need exploration |\n| Ambiguous/exploration tasks | Untested | Expected sweet spot — hot files guide Claude to the right area |\n| Large codebases (no project CLAUDE.md) | Untested | Expected sweet spot — context replaces missing documentation |\n\n## Good task prompts to try\n\nThese should stress-test whether context helps:\n\n- **Ambiguous bug fix**: \"There's a bug in the search results, fix it\" (forces exploration)\n- **Cross-cutting feature**: \"Add logging to all database operations\" (needs to find all DB touchpoints)\n- **Continuation task**: \"Continue the refactoring we started yesterday\" (tests session memory)\n- **Large codebase, no CLAUDE.md**: Any implementation task on a project without a detailed CLAUDE.md\n\n## Tips\n\n- Use `smriti compare --json` for machine-readable output\n- You can compare any two sessions: `smriti compare ` (supports partial IDs)\n- Run `smriti context --dry-run` to see what context your sessions will get","comments":[],"createdAt":"2026-02-11T11:14:43Z","labels":[{"id":"LA_kwDORM6Bzs8AAAACXowIDw","name":"help wanted","description":"Extra attention is needed","color":"008672"}],"number":13,"state":"OPEN","title":"Help wanted: A/B test smriti context on your projects","updatedAt":"2026-02-11T11:14:43Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## What\n\nTransform Smriti from flat text ingestion to a **structured, queryable memory pipeline** — where every tool call, file edit, git operation, error, and thinking 
block is parsed, typed, stored in sidecar tables, and available for analytics, search, and team sharing.\n\n## Why\n\nCurrently Smriti drops 80%+ of the structured data in AI coding sessions. A Claude Code transcript contains tool calls with typed inputs, file diffs, command outputs, git operations, token costs, and thinking blocks — but the flat text parser reduces all of this to a single string. This means:\n\n- **No file tracking**: Can't answer \"what files did I edit this week?\"\n- **No error analysis**: Can't find sessions where builds failed or tests broke\n- **No cost visibility**: No token/cost tracking across sessions or projects\n- **No git correlation**: Can't link sessions to commits, branches, or PRs\n- **No cross-agent view**: Different agents (Claude, Cline, Aider) can't share a unified memory\n- **No security layer**: Secrets in sessions get shared without redaction\n\nThis roadmap addresses all of these gaps across 5 phases.\n\n## Sub-Issues\n\n- #5 **[DONE]** Enriched Claude Code Parser — Structured block extraction, 13 block types, 6 sidecar tables\n- #6 Cline + Aider Agent Parsers — New agent support for unified cross-tool memory\n- #7 Auto-Ingestion Watch Daemon — `smriti watch` with fs.watch for real-time ingestion\n- #8 Enhanced Search & Analytics on Structured Data — Query sidecar tables, activity timelines, cost tracking\n- #9 Secret Redaction & Policy Engine — Detect and redact secrets before storage and sharing\n- #10 Telemetry & Metrics Collection — Local-only opt-in usage metrics\n- #11 Real User Testing & Performance Validation — Benchmarks, stress tests, security tests\n\n## Phase Overview\n\n| Phase | Deliverable | Status |\n|-------|------------|--------|\n| **Phase 1** | Enriched Claude Code Parser (#5) | **Done** — 13 block types, 6 sidecar tables, 142 tests |\n| **Phase 2** | Cline + Aider Parsers (#6) | Planned |\n| **Phase 3** | Watch Daemon (#7) + Search & Analytics (#8) | Planned |\n| **Phase 4** | Secret Redaction & Policy 
(#9) | Planned |\n| **Phase 5** | Telemetry (#10) + Testing & Perf (#11) | Planned |\n\n## Storage Inventory\n\nComplete map of every data type, where it lives, and whether it's indexed:\n\n| Data | Source | Table | Key Columns | Indexed? |\n|------|--------|-------|-------------|----------|\n| Session text (FTS) | All agents | `memory_fts` (QMD) | content | FTS5 full-text |\n| Session metadata | Ingestion | `smriti_session_meta` | session_id, agent_id, project_id | Yes (agent, project) |\n| Project registry | Path derivation | `smriti_projects` | id, path, description | PK |\n| Agent registry | Seed data | `smriti_agents` | id, parser, log_pattern | PK |\n| Tool usage | Block extraction | `smriti_tool_usage` | message_id, tool_name, success, duration_ms | Yes (session, tool_name) |\n| File operations | Block extraction | `smriti_file_operations` | message_id, operation, file_path, project_id | Yes (session, path) |\n| Commands | Block extraction | `smriti_commands` | message_id, command, exit_code, is_git | Yes (session, is_git) |\n| Git operations | Block extraction | `smriti_git_operations` | message_id, operation, branch, pr_url | Yes (session, operation) |\n| Errors | Block extraction | `smriti_errors` | message_id, error_type, message | Yes (session, type) |\n| Token costs | Metadata accumulation | `smriti_session_costs` | session_id, model, input/output/cache tokens, cost | PK |\n| Category tags (session) | Categorization | `smriti_session_tags` | session_id, category_id, confidence, source | Yes (category) |\n| Category tags (message) | Categorization | `smriti_message_tags` | message_id, category_id, confidence, source | Yes (category) |\n| Category taxonomy | Seed data | `smriti_categories` | id, name, parent_id | PK |\n| Share tracking | Team sharing | `smriti_shares` | session_id, content_hash, author | Yes (hash) |\n| Vector embeddings | `smriti embed` | `content_vectors` + `vectors_vec` (QMD) | content_hash, embedding | Virtual table |\n| Telemetry 
events | Opt-in collection | `~/.smriti/telemetry.json` | timestamp, event, data | N/A (JSONL file) |\n| Structured blocks | Block extraction | `memory_messages.metadata.blocks` (JSON) | MessageBlock[] | No (JSON blob) |\n| Message metadata | Parsing | `memory_messages.metadata` (JSON) | cwd, gitBranch, model, tokenUsage | No (JSON blob) |\n\n## Block Type Reference\n\nThe 13 `MessageBlock` types extracted during ingestion:\n\n| Block Type | Fields | Stored In |\n|-----------|--------|-----------|\n| `text` | text | FTS (via plainText) |\n| `thinking` | thinking, budgetTokens | JSON blob only |\n| `tool_call` | toolId, toolName, input | `smriti_tool_usage` |\n| `tool_result` | toolId, success, output, error, durationMs | Updates tool_usage success |\n| `file_op` | operation, path, diff, pattern | `smriti_file_operations` |\n| `command` | command, cwd, exitCode, stdout, stderr, isGit | `smriti_commands` |\n| `search` | searchType, pattern, path, url, resultCount | JSON blob only |\n| `git` | operation, branch, message, files, prUrl, prNumber | `smriti_git_operations` |\n| `error` | errorType, message, retryable | `smriti_errors` |\n| `image` | mediaType, path, dataHash | JSON blob only |\n| `code` | language, code, filePath, lineStart | JSON blob only |\n| `system_event` | eventType, data | Cost accumulation |\n| `control` | controlType, command | JSON blob only |\n\n## Real User Testing Plan\n\n| Scenario | What to Measure | Risk if Untested |\n|----------|----------------|-----------------|\n| Fresh install + first ingest | Time-to-first-search, error quality | Bad first impression, confusing errors |\n| 500+ sessions accumulated | Search latency, DB file size, `smriti status` accuracy | Performance cliff after months of use |\n| Multi-project workspace | Project ID derivation accuracy, cross-project search | Wrong project attribution for sessions |\n| Team sharing (2+ devs) | Sync conflicts, dedup accuracy, content hash stability | Duplicate or lost knowledge 
articles |\n| Long-running session (4+ hrs) | Memory during ingest, block count accuracy, cost tracking | OOM or missed data at end of session |\n| Rapid session creation | Watch daemon debouncing, no duplicate ingestion | Double-counting sessions |\n| Agent switch mid-task | Cross-agent file tracking, unified timeline | Gaps in activity log |\n| Secret in session | Detection rate, redaction completeness, share blocking | Leaked credentials in `.smriti/` |\n| Large JSONL file (50MB+) | Parse time, memory usage, incremental ingest | Crash or multi-minute ingest |\n| Corrupt/truncated files | Error messages, graceful skip, no data loss | Silent data corruption |\n\n## Configuration Reference\n\n| Env Var | Default | Phase | Description |\n|---------|---------|-------|-------------|\n| `QMD_DB_PATH` | `~/.cache/qmd/index.sqlite` | — | Database path |\n| `CLAUDE_LOGS_DIR` | `~/.claude/projects` | 1 | Claude Code logs |\n| `CODEX_LOGS_DIR` | `~/.codex` | — | Codex CLI logs |\n| `SMRITI_PROJECTS_ROOT` | `~/zero8.dev` | 1 | Projects root for ID derivation |\n| `OLLAMA_HOST` | `http://127.0.0.1:11434` | — | Ollama endpoint |\n| `QMD_MEMORY_MODEL` | `qwen3:8b-tuned` | — | Ollama model for synthesis |\n| `SMRITI_CLASSIFY_THRESHOLD` | `0.5` | — | LLM classification trigger |\n| `SMRITI_AUTHOR` | `$USER` | — | Git author for team sharing |\n| `SMRITI_WATCH_DEBOUNCE_MS` | `2000` | 3 | Watch daemon debounce interval |\n| `SMRITI_TELEMETRY` | `0` | 5 | Enable telemetry collection |\n\n## Current State\n\nPhase 1 is complete:\n- 13 structured block types defined in `src/ingest/types.ts`\n- Block extraction engine in `src/ingest/blocks.ts`\n- Enriched Claude parser in `src/ingest/claude.ts`\n- 6 sidecar tables in `src/db.ts` with indexes and insert helpers\n- 142 tests passing, 415 expect() calls across 9 test files","comments":[],"createdAt":"2026-02-11T10:22:11Z","labels":[{"id":"LA_kwDORM6Bzs8AAAACXowH-Q","name":"enhancement","description":"New feature or 
request","color":"a2eeef"},{"id":"LA_kwDORM6Bzs8AAAACXwf3mg","name":"epic","description":"Epic / parent issue","color":"B60205"}],"number":12,"state":"OPEN","title":"Structured Memory Pipeline — Full Roadmap","updatedAt":"2026-02-11T10:22:11Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## What\nA comprehensive testing and benchmarking plan that validates Smriti against real-world usage scenarios: large databases, concurrent access, cross-agent queries, and performance under load.\n\n## Why\nUnit tests verify correctness in isolation, but real usage involves hundreds of sessions, thousands of messages, multiple agents writing simultaneously, and databases that grow over months. We need to validate performance doesn't degrade and structured data stays consistent at scale.\n\n## Tasks\n\n### Correctness Testing\n- [ ] **Round-trip fidelity**: ingest → search → recall → share produces accurate, complete results\n- [ ] **Cross-agent dedup**: same session referenced by multiple agents doesn't create duplicates\n- [ ] **Sidecar consistency**: every tool_call block has a matching \\`smriti_tool_usage\\` row\n- [ ] **Category integrity**: hierarchical categories maintain parent-child relationships after bulk operations\n- [ ] **Share/sync round-trip**: \\`smriti share\\` → \\`smriti sync\\` on another machine restores all metadata\n\n### Performance Benchmarks\n- [ ] **Ingestion throughput**: time to ingest 100/500/1000 sessions\n- [ ] **Search latency**: FTS query time at 1k/10k/50k messages (target: < 50ms at 10k)\n- [ ] **Vector search latency**: embedding search at 1k/10k vectors (target: < 200ms at 10k)\n- [ ] **Sidecar query speed**: analytics queries on sidecar tables at scale\n- [ ] **Database size**: measure SQLite file size at 1k/10k/50k messages\n- [ ] **Memory usage**: peak RSS during ingestion of large sessions (target: < 256MB)\n- [ ] **Watch daemon overhead**: CPU/memory when idle vs during 
active session\n\n### Stress Testing\n- [ ] **Large session files**: JSONL files > 50MB (long coding sessions)\n- [ ] **Many small sessions**: 1000+ sessions with < 10 messages each\n- [ ] **Concurrent ingestion**: two agents writing to DB simultaneously\n- [ ] **Corrupt data handling**: malformed JSONL, truncated files, missing fields\n- [ ] **Disk space**: behavior when SQLite DB approaches filesystem limits\n\n### Security Testing\n- [ ] **Secret detection coverage**: test against curated list of real secret patterns\n- [ ] **Redaction completeness**: no secrets survive ingestion → search → share pipeline\n- [ ] **Path traversal**: crafted file paths in tool calls don't escape expected directories\n- [ ] **SQL injection**: category names, project IDs with special characters\n\n## Files\n- \\`test/benchmark.test.ts\\` — **new** Performance benchmarks\n- \\`test/stress.test.ts\\` — **new** Stress and edge case tests\n- \\`test/security.test.ts\\` — **new** Security validation tests\n- \\`test/e2e.test.ts\\` — **new** End-to-end round-trip tests\n- \\`test/fixtures/large/\\` — **new** Large synthetic test data\n- \\`scripts/generate-fixtures.ts\\` — **new** Test data generator\n\n## Acceptance Criteria\n- [ ] All correctness tests pass on a clean install\n- [ ] Ingestion throughput: ≥ 50 sessions/second\n- [ ] FTS search: < 50ms at 10k messages\n- [ ] Vector search: < 200ms at 10k vectors\n- [ ] No memory leaks during 1-hour watch daemon run\n- [ ] Zero secrets survive the full pipeline in security tests\n- [ ] Corrupt/malformed input produces clear error messages, never crashes\n\n## Real User Testing Plan\n\n| Scenario | What to Measure | Risk if Untested |\n|----------|----------------|-----------------|\n| Fresh install + first ingest | Time-to-first-search, error messages | Bad first impression |\n| 500+ sessions accumulated | Search latency, DB size, \\`smriti status\\` accuracy | Performance cliff |\n| Multi-project workspace | Project ID derivation 
accuracy, cross-project search | Wrong project attribution |\n| Team sharing (2+ developers) | Sync conflicts, dedup accuracy, content hash stability | Duplicate/lost knowledge |\n| Long-running session (4+ hours) | Memory during ingest, block count accuracy, cost tracking | OOM or missed data |\n| Rapid session creation | Watch daemon debouncing, no duplicate ingestion | Double-counting |\n| Agent switch mid-task | Cross-agent file operation tracking, timeline accuracy | Gaps in activity log |\n\n## Testing\n```bash\nbun test test/benchmark.test.ts # Performance benchmarks\nbun test test/stress.test.ts # Stress tests\nbun test test/security.test.ts # Security validation\nbun test test/e2e.test.ts # End-to-end round-trips\nbun run scripts/generate-fixtures.ts # Generate large test data\n```","comments":[],"createdAt":"2026-02-11T10:21:18Z","labels":[{"id":"LA_kwDORM6Bzs8AAAACXowH-Q","name":"enhancement","description":"New feature or request","color":"a2eeef"},{"id":"LA_kwDORM6Bzs8AAAACXwf2xw","name":"phase-5","description":"Phase 5: Telemetry & validation","color":"5319E7"}],"number":11,"state":"OPEN","title":"Real User Testing & Performance Validation","updatedAt":"2026-02-11T10:21:18Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## What\nOpt-in local telemetry that collects usage metrics to \\`~/.smriti/telemetry.json\\` — session counts, tool frequencies, search patterns, ingestion performance, and error rates. No network calls, fully local.\n\n## Why\nWithout telemetry, we're flying blind on how Smriti is actually used: which commands are popular, how large databases get, whether search is fast enough, and what errors users hit. 
Local-only collection respects privacy while enabling data-driven improvements.\n\n## Tasks\n- [ ] **Telemetry store**: append-only \\`~/.smriti/telemetry.json\\` (JSONL format)\n- [ ] **Automatic collection** (opt-in via \\`SMRITI_TELEMETRY=1\\` or \\`smriti telemetry --enable\\`):\n - Command invocations: which CLI commands are run, how often\n - Ingestion metrics: sessions ingested, messages processed, duration, errors\n - Search metrics: query count, result count, latency, filter usage\n - Database size: total sessions, messages, sidecar table row counts\n - Embedding metrics: vectors built, search latency\n- [ ] **\\`smriti telemetry\\`** command:\n - \\`smriti telemetry --enable\\` / \\`--disable\\` to toggle collection\n - \\`smriti telemetry --show\\` to view collected metrics\n - \\`smriti telemetry --clear\\` to delete collected data\n - \\`smriti telemetry --export\\` to dump as JSON for analysis\n- [ ] **Event structure**: \\`{ timestamp, event, data, version }\\`\n- [ ] **Rotation**: auto-rotate when file exceeds 10MB\n- [ ] **Privacy**: never collect message content, file paths, or search queries — only counts and durations\n- [ ] **Performance**: telemetry writes must not impact CLI latency (async append)\n\n## Files\n- \\`src/telemetry/collector.ts\\` — **new** Event collection and storage\n- \\`src/telemetry/events.ts\\` — **new** Event type definitions\n- \\`src/telemetry/report.ts\\` — **new** Telemetry reporting/export\n- \\`src/index.ts\\` — Add \\`telemetry\\` command, instrument existing commands\n- \\`src/config.ts\\` — Add \\`SMRITI_TELEMETRY\\` config\n- \\`test/telemetry.test.ts\\` — **new** Telemetry collection tests\n\n## Data We Collect\n\n| Metric | Example Value | Purpose |\n|--------|--------------|---------|\n| \\`command_invoked\\` | \\`{ command: \"search\", flags: [\"--agent\"] }\\` | Command popularity |\n| \\`ingest_completed\\` | \\`{ agent: \"claude-code\", sessions: 5, messages: 120, durationMs: 340 }\\` | Ingestion 
performance |\n| \\`search_executed\\` | \\`{ resultCount: 8, latencyMs: 12, hasFilters: true }\\` | Search performance |\n| \\`db_stats\\` | \\`{ sessions: 200, messages: 15000, toolUsage: 8500 }\\` | Database growth |\n| \\`error_occurred\\` | \\`{ command: \"ingest\", errorType: \"parse_error\" }\\` | Error tracking |\n| \\`embed_completed\\` | \\`{ vectors: 500, latencyMs: 2100 }\\` | Embedding performance |\n\n## Acceptance Criteria\n- [ ] Telemetry is off by default — requires explicit opt-in\n- [ ] \\`smriti telemetry --enable\\` starts collecting, \\`--disable\\` stops\n- [ ] \\`smriti telemetry --show\\` displays human-readable summary\n- [ ] No message content, file paths, or search queries are ever recorded\n- [ ] Telemetry writes don't add > 1ms to CLI command latency\n- [ ] File auto-rotates at 10MB\n- [ ] \\`smriti telemetry --clear\\` completely removes all collected data\n\n## Testing\n```bash\nbun test test/telemetry.test.ts # Collection + rotation tests\nSMRITI_TELEMETRY=1 smriti ingest claude # Verify metrics recorded\nsmriti telemetry --show # View collected data\nsmriti telemetry --clear # Verify deletion\n```","comments":[],"createdAt":"2026-02-11T10:21:13Z","labels":[{"id":"LA_kwDORM6Bzs8AAAACXowH-Q","name":"enhancement","description":"New feature or request","color":"a2eeef"},{"id":"LA_kwDORM6Bzs8AAAACXwf2xw","name":"phase-5","description":"Phase 5: Telemetry & validation","color":"5319E7"}],"number":10,"state":"OPEN","title":"Telemetry & Metrics Collection","updatedAt":"2026-02-11T10:21:13Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## What\nA configurable policy engine that detects and redacts secrets, PII, and sensitive data during ingestion and before team sharing, with configurable rules and audit logging.\n\n## Why\nAI coding sessions routinely contain API keys, database passwords, auth tokens, and internal URLs — either typed by the user or surfaced in tool outputs. 
Without redaction, \\`smriti share\\` could leak secrets into git-committed \\`.smriti/\\` knowledge files, and even local search results could expose credentials.\n\n## Tasks\n- [ ] **Built-in secret patterns**: AWS keys, GitHub tokens, JWT, API keys, private keys, database URLs, .env values\n- [ ] **PII detection**: email addresses, IP addresses, phone numbers (configurable)\n- [ ] **Redaction during ingestion**: scan \\`plainText\\` and block content before storage\n- [ ] **Redaction during sharing**: additional pass before \\`smriti share\\` writes to \\`.smriti/\\`\n- [ ] **Policy configuration**: \\`.smriti/policy.json\\` or env vars to customize rules\n - Enable/disable specific pattern categories\n - Add custom regex patterns\n - Allowlist specific values (e.g., public test keys)\n- [ ] **Audit log**: record what was redacted, when, in which session (without storing the secret)\n- [ ] **\\`smriti scan\\`** command: dry-run that reports potential secrets without redacting\n- [ ] **Pre-commit hook support**: \\`smriti scan --check .smriti/\\` for CI pipelines\n- [ ] **Redaction format**: \\`[REDACTED:aws-key]\\`, \\`[REDACTED:github-token]\\` — preserves context while removing value\n\n## Files\n- \\`src/policy/patterns.ts\\` — **new** Built-in secret detection patterns\n- \\`src/policy/redactor.ts\\` — **new** Redaction engine\n- \\`src/policy/config.ts\\` — **new** Policy configuration loader\n- \\`src/policy/audit.ts\\` — **new** Audit log writer\n- \\`src/ingest/claude.ts\\` — Hook redactor into ingestion pipeline\n- \\`src/team/share.ts\\` — Hook redactor into share pipeline\n- \\`src/index.ts\\` — Add \\`scan\\` command\n- \\`test/redactor.test.ts\\` — **new** Redaction tests\n- \\`test/fixtures/secrets/\\` — **new** Test fixtures with fake secrets\n\n## Acceptance Criteria\n- [ ] AWS access keys (\\`AKIA...\\`) are redacted to \\`[REDACTED:aws-key]\\` during ingestion\n- [ ] GitHub tokens (\\`ghp_\\`, \\`gho_\\`, \\`github_pat_\\`) are detected and 
redacted\n- [ ] \\`smriti scan\\` reports potential secrets without modifying data\n- [ ] Custom patterns in \\`.smriti/policy.json\\` are applied alongside built-ins\n- [ ] Redacted content is still searchable by surrounding context (not the secret itself)\n- [ ] Audit log records redaction events with session ID, pattern name, and timestamp\n- [ ] Zero false positives on common code patterns (hex colors, UUIDs, base64 test data)\n- [ ] \\`smriti share\\` refuses to export if unredacted secrets are detected (unless \\`--force\\`)\n\n## Testing\n```bash\nbun test test/redactor.test.ts # Pattern matching + redaction tests\nsmriti scan # Dry-run secret detection\nsmriti ingest claude # Verify redaction during ingestion\nsmriti share --project smriti # Verify redaction before export\n```","comments":[],"createdAt":"2026-02-11T10:21:03Z","labels":[{"id":"LA_kwDORM6Bzs8AAAACXowH-Q","name":"enhancement","description":"New feature or request","color":"a2eeef"},{"id":"LA_kwDORM6Bzs8AAAACXwf2WQ","name":"phase-4","description":"Phase 4: Security & policy","color":"FBCA04"}],"number":9,"state":"OPEN","title":"Secret Redaction & Policy Engine","updatedAt":"2026-02-11T10:21:03Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## What\nQuery APIs and CLI commands that leverage the sidecar tables (tool usage, file operations, commands, git operations, errors, costs) for analytics, filtering, and intelligent recall.\n\n## Why\nThe sidecar tables from Phase 1 store rich structured data but there's no way to query them yet. 
Developers should be able to ask \"what files did I edit today?\", \"show me all failed commands in project X\", or \"which sessions cost the most tokens\".\n\n## Tasks\n- [ ] **File activity queries**: \"what files were touched in session X\" / \"most-edited files this week\"\n- [ ] **Tool usage analytics**: tool frequency, success rates, average duration per tool\n- [ ] **Error analysis**: error type distribution, most common errors, sessions with highest error rate\n- [ ] **Git activity**: commits per session, PR creation timeline, branch activity\n- [ ] **Cost tracking**: token usage per session/project/day, cost trends, cache hit rates\n- [ ] **Search filters**: extend \\`smriti search\\` with \\`--tool\\`, \\`--file\\`, \\`--error-type\\`, \\`--git-op\\` flags\n- [ ] **\\`smriti stats\\`** command overhaul: show sidecar table summaries alongside existing stats\n- [ ] **\\`smriti activity\\`** command: timeline of file operations + commands for a session\n- [ ] **Recall enrichment**: include sidecar data in recall context (e.g., \"this session edited 5 files and ran 12 commands\")\n- [ ] JSON output for all analytics queries (\\`--format json\\`)\n\n## Files\n- \\`src/search/index.ts\\` — Add sidecar-aware search filters\n- \\`src/search/analytics.ts\\` — **new** Analytics query functions\n- \\`src/search/recall.ts\\` — Enrich recall with sidecar context\n- \\`src/index.ts\\` — Add \\`stats\\`, \\`activity\\` CLI commands\n- \\`src/format.ts\\` — Format analytics output (table, JSON, CSV)\n- \\`test/analytics.test.ts\\` — **new** Analytics query tests\n\n## Acceptance Criteria\n- [ ] \\`smriti search \"auth\" --tool Bash\\` returns only sessions where Bash tool was used\n- [ ] \\`smriti search \"auth\" --file \"src/auth.ts\"\\` returns sessions that touched that file\n- [ ] \\`smriti stats\\` shows tool usage, error rates, and cost summaries\n- [ ] \\`smriti activity \\` shows chronological timeline of operations\n- [ ] \\`smriti recall \"query\" 
--synthesize\\` includes sidecar context in synthesis\n- [ ] All analytics queries return results in < 100ms for databases with 10k+ messages\n\n## Testing\n```bash\nbun test test/analytics.test.ts # Analytics query tests\nsmriti stats # Overview with sidecar data\nsmriti activity # Session activity timeline\nsmriti search \"fix bug\" --tool Bash --format json\n```","comments":[],"createdAt":"2026-02-11T10:17:44Z","labels":[{"id":"LA_kwDORM6Bzs8AAAACXowH-Q","name":"enhancement","description":"New feature or request","color":"a2eeef"},{"id":"LA_kwDORM6Bzs8AAAACXwf2Ag","name":"phase-3","description":"Phase 3: Auto-ingestion & search","color":"D93F0B"}],"number":8,"state":"OPEN","title":"Enhanced Search & Analytics on Structured Data","updatedAt":"2026-02-11T10:17:44Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## What\nA \\`smriti watch\\` command that monitors agent log directories via \\`fs.watch()\\` and auto-ingests new/changed sessions in real-time.\n\n## Why\nCurrently ingestion is manual (\\`smriti ingest claude\\`). Developers forget to run it, or run it too late after context is cold. 
Auto-ingestion means Smriti always has the latest session data available for search and recall.\n\n## Tasks\n- [ ] Implement \\`smriti watch\\` CLI command with graceful start/stop\n- [ ] Use \\`fs.watch()\\` (or Bun's equivalent) to monitor \\`~/.claude/projects/\\` and other agent log dirs\n- [ ] Debounce file change events (JSONL files get appended to frequently during active sessions)\n- [ ] Incremental ingestion: track file size/mtime, only re-parse appended content\n- [ ] Handle session file rotation (new session creates new file)\n- [ ] PID file at \\`~/.smriti/watch.pid\\` for single-instance enforcement\n- [ ] \\`smriti watch --daemon\\` for background mode (detached process)\n- [ ] \\`smriti watch --stop\\` to kill running daemon\n- [ ] \\`smriti watch --status\\` to check if daemon is running\n- [ ] Optional auto-embed: trigger embedding generation after ingestion\n- [ ] Optional auto-categorize: trigger categorization after ingestion\n- [ ] Configurable debounce interval via \\`SMRITI_WATCH_DEBOUNCE_MS\\` (default: 2000)\n\n## Files\n- \\`src/watch.ts\\` — **new** Watch daemon implementation\n- \\`src/index.ts\\` — Add \\`watch\\` command to CLI\n- \\`src/config.ts\\` — Add watch-related config vars\n- \\`test/watch.test.ts\\` — **new** Watch daemon tests (using temp directories)\n\n## Acceptance Criteria\n- [ ] \\`smriti watch\\` starts monitoring and logs ingestion events\n- [ ] New Claude sessions appear in \\`smriti search\\` within seconds of creation\n- [ ] Appending to existing session files triggers incremental re-ingestion\n- [ ] Only one watch daemon runs at a time (PID file enforcement)\n- [ ] \\`smriti watch --stop\\` cleanly terminates the daemon\n- [ ] CPU usage stays below 1% when idle (no busy polling)\n- [ ] Handles agent log directory not existing (waits for creation)\n\n## Testing\n```bash\nbun test test/watch.test.ts # Unit tests with temp dirs\nsmriti watch # Manual: start watching\n# In another terminal, use Claude Code — sessions 
should auto-ingest\nsmriti watch --status # Check daemon status\nsmriti watch --stop # Stop cleanly\n```","comments":[],"createdAt":"2026-02-11T10:17:19Z","labels":[{"id":"LA_kwDORM6Bzs8AAAACXowH-Q","name":"enhancement","description":"New feature or request","color":"a2eeef"},{"id":"LA_kwDORM6Bzs8AAAACXwf2Ag","name":"phase-3","description":"Phase 3: Auto-ingestion & search","color":"D93F0B"}],"number":7,"state":"OPEN","title":"Auto-Ingestion Watch Daemon","updatedAt":"2026-02-11T10:17:19Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## What\nAdd ingestion parsers for Cline (VS Code extension) and Aider (terminal-based coding agent) conversation logs, producing the same `StructuredMessage` format as the Claude parser.\n\n## Why\nTeams using multiple AI agents lose cross-tool visibility. A developer might debug with Aider, implement with Claude Code, and review with Cline — all touching the same files. Without unified ingestion, Smriti only captures one agent's perspective.\n\n## Tasks\n- [ ] Research Cline log format (VS Code extension storage, `.cline/` or workspace-level)\n- [ ] Implement `parseClineSession()` → `StructuredMessage[]`\n- [ ] Map Cline tool calls to `MessageBlock` types (file edits, terminal commands, browser actions)\n- [ ] Research Aider log format (`.aider.chat.history.md`, `.aider.input.history`)\n- [ ] Implement `parseAiderSession()` → `StructuredMessage[]`\n- [ ] Extract Aider-specific data: `/commands`, edit format (diff/whole/architect), lint results\n- [ ] Add `cline` and `aider` to `smriti_agents` seed data\n- [ ] Session discovery for both agents (`discoverClineSessions()`, `discoverAiderSessions()`)\n- [ ] Register parsers in `src/ingest/index.ts` orchestrator\n- [ ] Test with real session files from both agents\n\n## Files\n- `src/ingest/cline.ts` — **new** Cline parser\n- `src/ingest/aider.ts` — **new** Aider parser\n- `src/ingest/index.ts` — Register new agents in 
ingest orchestrator\n- `src/db.ts` — Add `cline`/`aider` to `DEFAULT_AGENTS`\n- `test/cline.test.ts` — **new** Cline parser tests\n- `test/aider.test.ts` — **new** Aider parser tests\n- `test/fixtures/cline/` — **new** Sample Cline session files\n- `test/fixtures/aider/` — **new** Sample Aider session files\n\n## Acceptance Criteria\n- [ ] `smriti ingest cline` ingests Cline sessions with structured blocks\n- [ ] `smriti ingest aider` ingests Aider sessions with structured blocks\n- [ ] `smriti ingest all` includes both new agents\n- [ ] File operations, commands, and errors populate sidecar tables\n- [ ] Cross-agent search returns results from all three agents\n- [ ] No regressions in existing Claude parser tests\n\n## Testing\n```bash\nbun test test/cline.test.ts # Cline parser unit tests\nbun test test/aider.test.ts # Aider parser unit tests\nbun test # Full suite — no regressions\nsmriti ingest all # Real ingestion of all agents\nsmriti search \"fix auth\" --agent cline # Cross-agent search\n```","comments":[],"createdAt":"2026-02-11T10:17:14Z","labels":[{"id":"LA_kwDORM6Bzs8AAAACXowH-Q","name":"enhancement","description":"New feature or request","color":"a2eeef"},{"id":"LA_kwDORM6Bzs8AAAACXwf1zw","name":"phase-2","description":"Phase 2: New agent parsers","color":"1D76DB"}],"number":6,"state":"OPEN","title":"Cline + Aider Agent Parsers","updatedAt":"2026-02-11T10:17:14Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## What\nStructured block extraction from Claude Code JSONL transcripts — every tool call, file operation, git command, error, and thinking block is parsed into typed `MessageBlock` objects and stored in queryable sidecar tables.\n\n## Why\nPreviously Smriti ingested sessions as flat text, losing 80%+ of structured data: which files were edited, what commands ran, token costs, git operations, and error patterns. 
This phase makes that data queryable.\n\n## Tasks\n- [x] Define `StructuredMessage` and `MessageBlock` union type with 13 block types (`src/ingest/types.ts`)\n- [x] Implement block extraction from raw Claude API content blocks (`src/ingest/blocks.ts`)\n- [x] Git command detection and parsing (commit messages, branches, PR creation)\n- [x] `gh pr create` detection via `parseGhPrCommand()`\n- [x] Storage limits and truncation for all block types\n- [x] `flattenBlocksToText()` for backward-compatible FTS indexing\n- [x] System event parsing (turn_duration, pr-link, file-history-snapshot)\n- [x] Enriched `parseClaudeJsonlStructured()` parser alongside legacy `parseClaudeJsonl()`\n- [x] Sidecar table schema: `smriti_tool_usage`, `smriti_file_operations`, `smriti_commands`, `smriti_errors`, `smriti_git_operations`, `smriti_session_costs`\n- [x] Sidecar table population during ingestion pipeline\n- [x] Token/cost accumulation via `upsertSessionCosts()`\n- [x] Full test coverage for block extraction, git parsing, structured parsing, and sidecar inserts\n\n## Files\n- `src/ingest/types.ts` — `StructuredMessage`, `MessageBlock` union, `MessageMetadata`, storage limits\n- `src/ingest/blocks.ts` — `extractBlocks()`, `toolCallToBlocks()`, `parseGitCommand()`, `flattenBlocksToText()`\n- `src/ingest/claude.ts` — `parseClaudeJsonlStructured()`, enriched `ingestClaude()` with sidecar population\n- `src/ingest/index.ts` — Updated orchestrator types\n- `src/db.ts` — 6 new sidecar tables + indexes + insert helpers\n- `test/blocks.test.ts` — Block extraction tests\n- `test/structured-ingest.test.ts` — End-to-end structured parsing tests\n- `test/team.test.ts` — Updated for new schema\n\n## Acceptance Criteria\n- [x] All 13 block types extracted from real Claude JSONL transcripts\n- [x] Git commands parsed into structured `GitBlock` with operation, branch, message\n- [x] Tool calls decomposed into both generic `ToolCallBlock` + domain-specific blocks\n- [x] Sidecar tables populated 
atomically during ingestion\n- [x] Legacy `parseClaudeJsonl()` still works unchanged\n- [x] 142 tests passing, 415 expect() calls\n\n## Testing\n```bash\nbun test # All 142 tests pass\nbun test test/blocks.test.ts # Block extraction unit tests\nbun test test/structured-ingest.test.ts # Structured parsing integration\n```","comments":[],"createdAt":"2026-02-11T10:16:02Z","labels":[{"id":"LA_kwDORM6Bzs8AAAACXowH-Q","name":"enhancement","description":"New feature or request","color":"a2eeef"},{"id":"LA_kwDORM6Bzs8AAAACXwf1eQ","name":"phase-1","description":"Phase 1: Enriched ingestion","color":"0E8A16"},{"id":"LA_kwDORM6Bzs8AAAACXwf3Ng","name":"done","description":"Completed work","color":"0E8A16"}],"number":5,"state":"OPEN","title":"[DONE] Enriched Claude Code Parser","updatedAt":"2026-02-11T10:16:02Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"Ideas to explore:\n\n1. **Searchable auto-generated documentation** — Use ingested sessions to auto-generate searchable project documentation from the knowledge base.\n\n2. **Onboarding-driven prompt generation** — During onboarding, talk to the user to understand their team's ethos and coding philosophy, then auto-generate category-specific prompts that reflect those values.\n\n3. **Further token cost optimization** — Explore more aggressive deduplication, smarter context selection, and compression strategies to push token savings even further.\n\n4. **Open exploration** — What else can a persistent, searchable AI memory layer enable? Plugin system? IDE integrations beyond Claude Code? Cross-team knowledge graphs?\n\n---\n\n> I have to stop building anything on this and start reaching out to devs to try this out. Happy coding. Happy vibe coding, let ideas flow. 
See ya!","comments":[],"createdAt":"2026-02-10T18:40:13Z","labels":[],"number":4,"state":"OPEN","title":"Future ideas & possibilities","updatedAt":"2026-02-10T18:40:13Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## The question\n\nWhen smriti shares a session about a bug fix, should the resulting article look the same as one about an architecture decision? Or a code pattern?\n\nRight now, every session — regardless of category — goes through the same reflection prompt and produces the same 5-section structure. That works, but it means a bug investigation article emphasizes the same things as a design tradeoff article. They probably shouldn't.\n\n## What exists today\n\nThe `share --reflect` pipeline works like this:\n\n1. Sessions are categorized into one of 7 top-level categories (with 21 subcategories): `bug`, `code`, `architecture`, `decision`, `feature`, `project`, `topic`\n2. When sharing, **all categories** go through the same prompt template: `src/team/prompts/share-reflect.md`\n3. That prompt produces 5 fixed sections: **Summary**, **Changes**, **Decisions**, **Insights**, **Context**\n4. Projects can override the prompt by placing a custom `share-reflect.md` at `.smriti/prompts/share-reflect.md` — but that's a single override for the whole project, not per-category\n\nThe prompt loading in `reflect.ts` is straightforward — `loadPromptTemplate()` checks for a project-level override, then falls back to the built-in default. There's no category awareness in the resolution path.\n\n## The idea\n\nWhat if prompt templates were resolved per-category? Something like:\n\n```\n.smriti/prompts/\n├── share-reflect.md # default fallback (exists today)\n├── bug/\n│ └── share-reflect.md # bug-specific template\n├── architecture/\n│ └── share-reflect.md # architecture-specific template\n└── code/\n └── share-reflect.md # code-specific template\n```\n\nThe resolution order would be:\n\n1. 
`.smriti/prompts/{category}/share-reflect.md` — project + category override\n2. Built-in category default (shipped with smriti)\n3. `.smriti/prompts/share-reflect.md` — project-wide override\n4. Built-in default (what exists today)\n\n## Concrete examples\n\nHere's how different categories might benefit from different section structures:\n\n**Bug fix** (`bug/fix`):\n\n```markdown\n### Summary\n### Root Cause\n### Reproduction Steps\n### Fix Applied\n### Verification\n### Related Areas\n```\n\nThe emphasis is on *what went wrong and how to prevent it*. \"Decisions\" and \"Insights\" from the generic template don't guide the LLM toward root cause analysis.\n\n**Architecture decision** (`architecture/decision`):\n\n```markdown\n### Summary\n### Problem Statement\n### Options Considered\n### Decision & Rationale\n### Tradeoffs Accepted\n### Implications\n```\n\nHere the value is in *capturing alternatives that were rejected and why*. The generic \"Decisions\" section doesn't explicitly prompt for alternatives considered.\n\n**Code pattern** (`code/pattern`):\n\n```markdown\n### Summary\n### Pattern Description\n### When to Use\n### Usage Example\n### Gotchas\n```\n\nA code pattern article should be *reference material* — something you can skim and apply. The generic template's \"Changes\" and \"Context\" sections add noise here.\n\n## Possible directions\n\nA few ways this could work — not mutually exclusive:\n\n**1. Hierarchical prompt resolution**\nExtend `loadPromptTemplate()` to accept a category ID and walk up the hierarchy: `bug/fix` → `bug` → default. This is the minimal change — mostly just path resolution logic.\n\n**2. Category-specific section structures**\nShip built-in prompt templates for each top-level category. The `parseSynthesis()` function would need to become more flexible — instead of looking for hardcoded `### Summary`, `### Changes`, etc., it would parse whatever `###` sections the template defines.\n\n**3. 
Category-specific sanitization**\nDifferent categories might also benefit from different content filtering. A bug session might want to preserve error messages and stack traces that the current sanitizer strips. A code pattern might want to preserve more code blocks. This is a secondary concern but worth thinking about alongside prompt templates.\n\n**4. Template inheritance / composition**\nInstead of fully separate templates, allow templates to extend a base. E.g., a bug template could say \"use the default sections, but add Root Cause after Summary and rename Changes to Fix Applied.\" This is more complex but avoids template drift.\n\n## Open questions\n\nThese are the things I'm not sure about — would love input:\n\n- **Is per-category the right granularity?** Should it be per top-level category (`bug`), per subcategory (`bug/fix` vs `bug/investigation`), or something else entirely?\n- **Should sections vary or stay fixed?** There's a simplicity argument for keeping the same 5 sections but changing the *instructions within each section* per category. Versus fully different section structures per category.\n- **How should subcategories resolve?** If `bug/fix` doesn't have a template, should it fall back to `bug`, then to default? Or is one level enough?\n- **Built-in vs user-only?** Should smriti ship opinionated per-category templates, or just provide the mechanism for users to create their own?\n- **What about the parser?** `parseSynthesis()` currently looks for 5 specific section headers. If sections vary by category, the parser needs to become dynamic. 
What's the right abstraction?\n\n## Current extension points\n\nFor anyone who wants to prototype this, here's where things connect:\n\n- **Prompt loading**: `src/team/reflect.ts` → `loadPromptTemplate(projectSmritiDir?)` — this is where category-aware resolution would go\n- **Prompt template**: `src/team/prompts/share-reflect.md` — the `{{conversation}}` placeholder and section structure\n- **Synthesis parsing**: `src/team/reflect.ts` → `parseSynthesis(text)` — hardcoded section headers that would need to flex\n- **Category info**: `src/categorize/schema.ts` — category IDs and hierarchy\n- **Share entry point**: `src/team/share.ts` → `shareKnowledge()` — where category is known and could be passed to `synthesizeSession()`\n- **Session tags**: `smriti_session_tags` table — maps sessions to categories with confidence scores\n\nThe minimal prototype would be: pass the session's category ID into `loadPromptTemplate()`, check for `prompts/{category}/share-reflect.md` before the default, and see if the output quality improves for a few specific categories.","comments":[],"createdAt":"2026-02-10T18:00:48Z","labels":[{"id":"LA_kwDORM6Bzs8AAAACXowH-Q","name":"enhancement","description":"New feature or request","color":"a2eeef"},{"id":"LA_kwDORM6Bzs8AAAACXrxBJA","name":"discussion","description":"Open-ended discussion or RFC","color":"c2e0c6"}],"number":3,"state":"OPEN","title":"RFC: Per-category prompt templates for knowledge representation","updatedAt":"2026-02-10T18:00:48Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## Problem\n\nCustom categories are per-machine only. They live in each user's local SQLite `smriti_categories` table and never travel with the repo.\n\nWhen a team defines custom categories to organize their codebase (e.g., `client/web-ui`, `infra/k8s`, `ops/incident`), every teammate has to manually recreate them. 
Worse — if someone shares a session tagged with a custom category, `smriti sync` writes the tag into `smriti_session_tags` but the category doesn't exist in the importing user's `smriti_categories` table. The tag becomes an orphan: it exists in the tags table but can't be filtered, listed, or validated.\n\n### Current state of `.smriti/config.json`\n\nThe file already exists — `share.ts` creates it at lines 331-344:\n\n```json\n{\n \"version\": 1,\n \"allowedCategories\": [\"*\"],\n \"autoSync\": false\n}\n```\n\nBut it's **write-only**: `sync.ts` never reads it. It has no category definitions.\n\n## Proposal\n\nExtend `.smriti/config.json` to be the team's shared configuration file. It gets committed to git with the rest of `.smriti/` and is read by `smriti sync` to bootstrap the importing user's environment.\n\n### Config format\n\n```json\n{\n \"version\": 2,\n \"categories\": [\n {\n \"id\": \"client\",\n \"name\": \"Client-side\",\n \"description\": \"Frontend and client-side development\"\n },\n {\n \"id\": \"client/web-ui\",\n \"name\": \"Web UI\",\n \"parent\": \"client\"\n },\n {\n \"id\": \"client/mobile\",\n \"name\": \"Mobile\",\n \"parent\": \"client\"\n },\n {\n \"id\": \"infra\",\n \"name\": \"Infrastructure\"\n },\n {\n \"id\": \"infra/k8s\",\n \"name\": \"Kubernetes\",\n \"parent\": \"infra\"\n }\n ],\n \"allowedCategories\": [\"*\"],\n \"autoSync\": false\n}\n```\n\nOnly custom categories need to be listed — the 7 built-in top-level categories and 21 subcategories are always present (seeded in `db.ts`).\n\n## Implementation Plan\n\n### 1. 
Define config schema (`src/team/config.ts` — new file)\n\n```ts\ninterface SmritiConfig {\n version: number;\n categories?: CustomCategoryDef[];\n allowedCategories?: string[];\n autoSync?: boolean;\n}\n\ninterface CustomCategoryDef {\n id: string;\n name: string;\n parent?: string;\n description?: string;\n}\n```\n\nAdd functions:\n- `readConfig(projectPath: string): SmritiConfig` — reads and validates `.smriti/config.json`\n- `writeConfig(projectPath: string, config: SmritiConfig)` — writes config (used by share)\n- `mergeCategories(db: Database, categories: CustomCategoryDef[])` — idempotently ensures all listed categories exist in the local DB\n\n### 2. Update `share.ts` to export custom categories\n\nDuring `smriti share`, query `smriti_categories` for any categories **not** in the built-in `DEFAULT_CATEGORIES` list. Write them into the `categories` array in `config.json`.\n\n```ts\n// Pseudocode\nconst builtinIds = new Set(DEFAULT_CATEGORIES.flatMap(c => [c.id, ...c.children.map(ch => ch.id)]));\nconst custom = db.prepare(\n `SELECT id, name, parent_id, description FROM smriti_categories WHERE id NOT IN (${[...builtinIds].map(() => '?').join(',')})`\n).all(...builtinIds);\n\nconfig.categories = custom.map(c => ({\n id: c.id,\n name: c.name,\n parent: c.parent_id || undefined,\n description: c.description || undefined,\n}));\n```\n\nBump version to `2` when categories are present.\n\n### 3. 
Update `sync.ts` to import custom categories\n\nBefore importing knowledge files, read `.smriti/config.json` and call `mergeCategories()`:\n\n```ts\nconst config = readConfig(smritiDir);\nif (config.categories?.length) {\n mergeCategories(db, config.categories);\n}\n// Then proceed with existing file import...\n```\n\n`mergeCategories` should:\n- Sort categories so parents come before children (topological order)\n- For each category, call `createCategory()` if it doesn't already exist (use `INSERT OR IGNORE` semantics)\n- Skip categories that already exist with the same ID (idempotent)\n- Log newly created categories so the user sees what was added\n\n### 4. Add CLI command to manage team config\n\n```bash\n# Initialize .smriti/config.json in the current project\nsmriti config init\n\n# Add a custom category to the team config (writes to .smriti/config.json)\nsmriti config add-category <id> --name <name> [--parent <parent-id>] [--description <text>]\n\n# Show current team config\nsmriti config show\n```\n\n`smriti config add-category` should both:\n- Add the category to the local SQLite DB (so it's immediately usable)\n- Append it to `.smriti/config.json` (so it travels with git)\n\nThis gives teams a single command to define a shared custom category.\n\n### 5. Backward compatibility\n\n- `version: 1` configs (no `categories` field) continue to work — sync just skips category import\n- `version: 2` configs are forward-compatible — unknown fields are ignored\n- The existing `allowedCategories` and `autoSync` fields are preserved\n\n### 6. Update classifier to include custom categories (`src/categorize/classifier.ts`)\n\nCurrently `classifyByLLM()` sends only `ALL_CATEGORY_IDS` (built-in) in its prompt. After this change:\n- Query the DB for all categories (built-in + custom)\n- Include custom category IDs in the LLM prompt so Ollama can classify into them\n- Custom categories won't have rule-based patterns (no keyword rules), so they'll rely on LLM classification or manual tagging\n\n### 7. 
Tests\n\n| Test | File | What it verifies |\n|------|------|-----------------|\n| Config roundtrip | `test/team.test.ts` | Write config with categories → read it back → same data |\n| Sync imports categories | `test/team.test.ts` | Sync from a `.smriti/` with custom categories → categories exist in local DB |\n| Idempotent merge | `test/team.test.ts` | Sync twice with same config → no duplicates, no errors |\n| Share exports custom cats | `test/team.test.ts` | Add custom category → share → config.json contains it |\n| Parent ordering | `test/team.test.ts` | Config with child before parent → merge still works (topological sort) |\n| Version 1 compat | `test/team.test.ts` | Sync with v1 config (no categories) → no errors |\n\n## Files to Modify\n\n| File | Change |\n|------|--------|\n| `src/team/config.ts` | **New** — Config schema, read/write/merge functions |\n| `src/team/share.ts` | Export custom categories to config.json |\n| `src/team/sync.ts` | Read config.json and import categories before syncing files |\n| `src/index.ts` | Add `smriti config` subcommand |\n| `src/categorize/classifier.ts` | Include custom categories in LLM classification prompt |\n| `test/team.test.ts` | Config roundtrip, sync, idempotency, backward compat tests |\n\n## End-to-End Example\n\n```bash\n# Alice sets up custom categories for her team\nsmriti categories add client --name \"Client-side\"\nsmriti categories add client/web-ui --name \"Web UI\" --parent client\n\n# Alice shares — custom categories are written to .smriti/config.json\nsmriti share --project myapp\n\n# Alice commits\ngit add .smriti/ && git commit -m \"Share team knowledge\"\ngit push\n\n# Bob pulls and syncs\ngit pull\nsmriti sync --project myapp\n# Output:\n# Imported 2 custom categories: client, client/web-ui\n# Imported 5 sessions from .smriti/knowledge/\n\n# Bob can now filter by the team's custom categories\nsmriti list --category client\nsmriti search \"button styling\" --category 
client/web-ui\n```","comments":[],"createdAt":"2026-02-10T17:46:45Z","labels":[],"number":2,"state":"OPEN","title":"Add .smriti/config.json as team-shared config with custom categories","updatedAt":"2026-02-10T17:46:45Z"},{"author":{"id":"MDQ6VXNlcjc5MjY2NjE=","is_bot":false,"login":"ashu17706","name":"Ashutosh Tripathi"},"body":"## Problem\n\nWhen sessions are shared via `smriti share`, **all** category tags are serialized into the YAML frontmatter — the primary category as a scalar `category` field and all tags (including secondary ones) as a `tags` array:\n\n```yaml\n---\ncategory: project\ntags: [\"project\", \"project/dependency\", \"decision/tooling\"]\n---\n```\n\nHowever, when a teammate runs `smriti sync`, **only the primary `category` field is read**. The `tags` array is ignored entirely. This means secondary tags are silently lost during the roundtrip.\n\n### Example\n\nA session tagged with `project`, `project/dependency`, and `decision/tooling`:\n\n| Stage | Tags |\n|-------|------|\n| Before share | `project`, `project/dependency`, `decision/tooling` |\n| In frontmatter | `category: project` + `tags: [\"project\", \"project/dependency\", \"decision/tooling\"]` |\n| After sync | `project` only |\n\n## Goal\n\nMake serialization and deserialization symmetric — every tag written by `share` must be restored by `sync`.\n\n## Implementation Plan\n\n### 1. Fix `parseFrontmatter()` array parsing (`src/team/sync.ts`)\n\nThe current `parseFrontmatter()` is a naive key-value parser that treats every value as a plain string. 
It does not handle JSON-style arrays like `[\"project\", \"project/dependency\"]`.\n\n**Changes:**\n- After splitting on the first `:`, detect if the trimmed value starts with `[` and ends with `]`\n- If so, parse the array elements (split by `,`, trim whitespace and quotes from each element, drop empty elements)\n- Return the parsed array instead of the raw string\n\n```ts\n// Before\nmeta[key] = value.replace(/^[\"']|[\"']$/g, \"\");\n\n// After\nif (value.startsWith(\"[\") && value.endsWith(\"]\")) {\n meta[key] = value\n .slice(1, -1)\n .split(\",\")\n .map((s) => s.trim().replace(/^[\"']|[\"']$/g, \"\"))\n // Drop empty elements so that `[]` parses to an empty array, not [\"\"]\n .filter((s) => s.length > 0);\n} else {\n meta[key] = value.replace(/^[\"']|[\"']$/g, \"\");\n}\n```\n\n### 2. Restore all tags during sync (`src/team/sync.ts`)\n\nCurrently sync only calls `tagSession()` once for `meta.category`. After parsing `meta.tags` as an array, iterate and restore each tag.\n\n**Changes** (around lines 191-193 in `sync.ts`):\n\n```ts\n// Before\nif (meta.category) {\n tagSession(db, sessionId, meta.category, 1.0, \"team\");\n}\n\n// After\nif (Array.isArray(meta.tags) && meta.tags.length > 0) {\n for (const tag of meta.tags) {\n if (isValidCategory(db, tag)) {\n tagSession(db, sessionId, tag, 1.0, \"team\");\n }\n }\n} else if (meta.category) {\n // Fallback for older exports that only have the scalar field (or an empty tags array)\n tagSession(db, sessionId, meta.category, 1.0, \"team\");\n}\n```\n\nThis is backward-compatible: older shared files without a `tags` array still work via the `category` fallback.\n\n### 3. Validate tags on import\n\nUse `isValidCategory(db, tag)` (already exists in `src/categorize/schema.ts`) to skip any tag IDs that don't exist in the importing user's category tree. This prevents sync from crashing if the sharer had custom categories the importer hasn't added yet.\n\nOptionally log a warning: `\"Skipping unknown category: ops/incident\"` so the user knows to run `smriti categories add` if needed.\n\n### 4. 
Add tests (`test/team.test.ts`)\n\n- **Roundtrip test**: Create a session with multiple tags → share → sync into a fresh DB → assert all tags are present\n- **Backward compat test**: Sync a file with only `category:` (no `tags:` array) → assert primary tag is restored\n- **Invalid tag test**: Sync a file with a `tags` array containing an unknown category → assert valid tags are restored and invalid ones are skipped with a warning\n- **Frontmatter parser test**: Verify `parseFrontmatter()` correctly parses `tags: [\"a\", \"b/c\", \"d\"]` into a string array\n\n## Files to Modify\n\n| File | Change |\n|------|--------|\n| `src/team/sync.ts` | Update `parseFrontmatter()` to handle arrays; restore all tags from `meta.tags` |\n| `test/team.test.ts` | Add roundtrip, backward-compat, and invalid-tag tests |\n\n## Notes\n\n- No changes needed to `share.ts` — it already serializes all tags correctly\n- The `confidence` and `source` fields are not preserved in the roundtrip (hardcoded to `1.0` and `\"team\"` on import). This is acceptable — team-imported tags should be high-confidence by definition. Could be revisited separately if needed.","comments":[],"createdAt":"2026-02-10T17:40:27Z","labels":[],"number":1,"state":"OPEN","title":"Sync should restore all secondary category tags from frontmatter","updatedAt":"2026-02-10T17:40:27Z"}]