This document records the validation of Kanon's proof of concept. Each stage has automated tests (run with `pytest tests/test_validation.py -v`) and narrative findings.
For the full validation plan and criteria, see docs/raw/poc-validation-plan.md.
Hypothesis: Given structured knowledge entities, the system produces usable training documents traceable to their source entities.
| # | Test | Status | Evidence |
|---|---|---|---|
| 1.1 | Dry-run populates all sections without fallback repetition | ✅ | test_validation.py::test_dry_run_no_repeated_sections |
| 1.1b | All sections have content (no empty sections) | ✅ | test_validation.py::test_dry_run_all_sections_populated |
| 1.2 | LLM output contains only knowledge graph content | ⬜ | Manual review (requires API call) |
| 1.3 | Same asset generates differently for two audiences | ✅ | test_validation.py::test_audience_adaptation_dry_run |
| 1.4 | Multi-concept asset reflects relationships | ✅ | test_validation.py::test_multi_concept_generation |
| 1.5 | Food domain generates with no code changes | ✅ | test_validation.py::test_food_domain_generation |
| 1.5b | Food domain multi-concept | ✅ | test_validation.py::test_food_domain_multi_concept |
| 1.5c | Food domain prerequisites resolved | ✅ | test_validation.py::test_food_domain_prerequisites_resolved |
1.1 PASS (fixed). `_build_section` now has specific handlers for all template sections: verification uses facts as checkable claims plus task-completion checks, troubleshooting derives step-by-step diagnostic guides from tasks, exercises generates exercise prompts from tasks, and common_questions derives Q&A pairs from facts. No two sections produce identical content.
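The per-section handler dispatch described above can be sketched as follows. The function names and the subgraph shape are illustrative assumptions, not Kanon's actual API:

```python
# Hypothetical sketch of section-specific handlers replacing a shared
# fallback. Handler names and the subgraph dict shape are assumptions.
from typing import Callable

def build_verification(subgraph: dict) -> str:
    # Facts become checkable claims; tasks become completion checks.
    claims = [f"Verify: {fact}" for fact in subgraph.get("facts", [])]
    checks = [f"Confirm you can: {task}" for task in subgraph.get("tasks", [])]
    return "\n".join(claims + checks)

def build_troubleshooting(subgraph: dict) -> str:
    # Each task yields a step-by-step diagnostic guide.
    return "\n".join(
        f"If '{task}' fails, retrace each step and check its output."
        for task in subgraph.get("tasks", [])
    )

def build_common_questions(subgraph: dict) -> str:
    # Each fact becomes a Q&A pair.
    return "\n".join(
        f"Q: What is known about this topic? A: {fact}"
        for fact in subgraph.get("facts", [])
    )

SECTION_HANDLERS: dict[str, Callable[[dict], str]] = {
    "verification": build_verification,
    "troubleshooting": build_troubleshooting,
    "common_questions": build_common_questions,
}

def build_section(name: str, subgraph: dict) -> str:
    handler = SECTION_HANDLERS.get(name)
    if handler is None:
        # Gap is surfaced with a visible placeholder, never silently empty.
        return f"[No handler for section '{name}']"
    return handler(subgraph)
```

Because each handler transforms the subgraph differently, no two sections can emit identical text, which is what test 1.1 asserts.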
1.3 PASS (partial). At the dry-run level, audience adaptation only changes the targets metadata; the generated content is identical for both audiences, because the dry-run assembler doesn't use audience information to adapt tone or structure. LLM generation does handle this (confirmed by manual testing), but the dry-run path doesn't.
1.5 PASS. The food/recipe domain generates successfully with zero code changes. Entity models, graph loading, template rendering, and relationship traversal all work across domains. This validates that the ontology model is domain-agnostic.
Hypothesis: The system can trace every claim in a generated asset back to its source entity and flag claims that lack backing.
| # | Test | Status | Evidence |
|---|---|---|---|
| 2.1 | Asset lists all contributing source entities | ✅ | test_validation.py::test_asset_traceability |
| 2.1b | Food domain traceability | ✅ | test_validation.py::test_asset_traceability_food |
| 2.2 | Confidence scores change when entities change | ✅ | test_validation.py::test_confidence_reflects_changes |
| 2.3 | Stale facts produce lower confidence than fresh facts | ✅ | test_validation.py::test_stale_facts_lower_confidence |
| 2.3b | Assets below threshold flagged for review | ✅ | test_validation.py::test_needs_review_threshold |
| 2.4 | Coverage gaps are surfaced, not silently ignored | ✅ | test_validation.py::test_coverage_gaps_surfaced |
2.1 PASS (fixed). `_collect_evidence` now also finds facts that reference concepts in the subgraph via a reverse lookup, rather than relying on forward-edge traversal alone. The dry-run generator additionally injects these facts into the subgraph so section handlers (verification, troubleshooting, common_questions) can use them.
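The reverse lookup can be sketched as below. The fact shape (an `about` field pointing at a concept id) is an assumption for illustration, not Kanon's actual data model:

```python
# Illustrative reverse lookup: find facts that point back at subgraph
# concepts, which forward-edge traversal alone would miss.
def collect_evidence(subgraph_concepts: set, all_facts: list) -> list:
    return [fact for fact in all_facts
            if fact.get("about") in subgraph_concepts]

# Example with food-domain entities (hypothetical ids and text):
facts = [
    {"id": "f1", "about": "sourdough", "text": "Starter needs daily feeding"},
    {"id": "f2", "about": "espresso", "text": "Grind size controls extraction"},
]
matched = collect_evidence({"sourdough"}, facts)
```

Here only `f1` is returned: it references a subgraph concept even though no forward edge from the concept reaches it.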
2.2, 2.3 PASS. The confidence scoring engine correctly produces lower scores when evidence coverage is partial or evidence is stale. The math works.
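A minimal model of that scoring behavior is sketched below, assuming confidence is evidence coverage times mean evidence freshness. The exponential decay and the 90-day half-life are illustrative assumptions, not Kanon's actual formula:

```python
from datetime import datetime, timedelta, timezone

# Illustrative freshness decay: halves every `half_life_days`.
def freshness(last_verified: datetime, now: datetime,
              half_life_days: float = 90.0) -> float:
    age_days = (now - last_verified).total_seconds() / 86400
    return 0.5 ** (age_days / half_life_days)

# Illustrative confidence: coverage of backed claims x mean freshness.
def confidence(backed_claims: int, total_claims: int,
               verified_at: list, now: datetime) -> float:
    coverage = backed_claims / total_claims if total_claims else 0.0
    if not verified_at:
        return 0.0
    mean_fresh = sum(freshness(t, now) for t in verified_at) / len(verified_at)
    return coverage * mean_fresh
```

Under this model, partial coverage (2.2) and stale evidence (2.3) each pull the score down, matching the behavior the tests assert.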
2.4 PASS. Sections without matching graph content produce placeholder text rather than being silently empty.
Hypothesis: When source material changes, the system identifies what's affected and what needs to be updated.
| # | Test | Status | Evidence |
|---|---|---|---|
| 3.1 | Evidence change identifies all backed facts | ✅ | test_validation.py::test_drift_finds_stale_facts |
| 3.1b | Food domain drift detection | ✅ | test_validation.py::test_drift_finds_stale_facts_food |
| 3.2 | Stale facts propagate to affected assets | ✅ | test_validation.py::test_drift_propagates_to_assets |
| 3.2b | Food domain drift propagation to assets | ✅ | test_validation.py::test_drift_propagates_to_assets_food |
| 3.3 | Impact traces through concept dependencies | ✅ | test_validation.py::test_drift_cascading_impact |
| 3.4 | Confidence drops on drift, recovers on update | ✅ | test_validation.py::test_confidence_drift_lifecycle |
| 3.5 | Regenerated asset incorporates updated facts | ✅ | test_validation.py::test_regeneration_after_drift |
All Stage 3 tests PASS. Drift detection works end-to-end across both domains:
- Evidence changes correctly identify all facts backed by that evidence
- Impact propagates from facts through concepts to affected assets
- Confidence scores drop when evidence becomes stale, recover when refreshed
- Regenerated assets pick up updated content from modified entities
The full lifecycle works: evidence changes → stale facts found → assets flagged → content updated → asset regenerated with new content.
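The propagation step of that lifecycle can be sketched as a breadth-first walk over reverse dependencies. The edge maps below are assumed shapes for illustration, not Kanon's internals:

```python
from collections import deque

# Sketch: evidence change -> stale facts -> concepts -> assets, cascading
# through concept dependencies (test 3.3). Edge maps are assumptions.
def affected_assets(changed_evidence: str,
                    facts_by_evidence: dict,
                    concepts_by_fact: dict,
                    assets_by_concept: dict,
                    concept_deps: dict) -> set:
    stale_facts = facts_by_evidence.get(changed_evidence, [])
    queue = deque(c for f in stale_facts for c in concepts_by_fact.get(f, []))
    seen, assets = set(), set()
    while queue:
        concept = queue.popleft()
        if concept in seen:
            continue
        seen.add(concept)
        assets.update(assets_by_concept.get(concept, []))
        queue.extend(concept_deps.get(concept, []))  # cascade to dependents
    return assets
```

With a hypothetical food-domain graph where `baking` depends on `fermentation`, a change to the evidence behind a fermentation fact flags both the fermentation asset and the downstream baking asset.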
Test 1.5 and the Stage 2-3 tests, repeated against food-domain entities, confirm that the system generalizes beyond the Claude/AI training domain.
```shell
# All validation tests
pytest tests/test_validation.py -v

# Just one stage
pytest tests/test_validation.py -v -k "stage1"
pytest tests/test_validation.py -v -k "stage2"
pytest tests/test_validation.py -v -k "stage3"
```

| Stage | Pass | Fail | Not Run | Notes |
|---|---|---|---|---|
| Stage 1: Generation | 7 | 0 | 1 | LLM content-only test requires API call |
| Stage 2: Review | 6 | 0 | 0 | All pass |
| Stage 3: Drift | 7 | 0 | 0 | All pass across both domains |
- `_build_section` fallback (Stage 1.1): the verification and troubleshooting sections repeated the concept content_block. Fixed by adding section-specific handlers for verification, troubleshooting, exercises, common_questions, and learning_objectives.
- `_collect_evidence` traversal (Stage 2.1): the subgraph followed forward edges only, never reaching facts that point back to concepts. Fixed by adding a reverse lookup for facts referencing concepts in the subgraph.
- ⬜ Not yet run
- ✅ Pass
- ❌ Fail — see findings