feat(extraction): atomization for self-contained knowledge#14
Conversation
Add prompt-level and code-level enforcement to ensure extracted knowledge statements are independently interpretable without conversation context. Prompt changes: - ATOMIZATION_RULES constant prohibiting pronouns, relative time, ambiguous references, and first-person in extracted statements - Stronger reference_timestamp wording (REJECTED vs resolve) - Good/bad atomization examples in both extraction prompts Code-level validation (_pipeline.py): - Third-person check (statement must start with "User") - First-person pronoun rejection (I, my, me, we, our) - Unresolved relative time rejection (yesterday, last week, etc.) - Assistant-sourced pattern rejection (was advised/suggested) - Temporal backfill step between extract and validate New temporal module (src/memv/processing/temporal.py): - contains_relative_time: regex detection of unresolved patterns - parse_temporal_expression: relative-to-absolute via dateutil - backfill_temporal_fields: populate valid_at/invalid_at from temporal_info strings when LLM leaves them None Dependency: python-dateutil>=2.9.0 Tests: 179 total (48 new), all passing
- Use word-boundary regex (^User\b) instead of startswith("User")
- Expand _ASSISTANT_SOURCE_RE to catch told/instructed/encouraged/shown/given
- Use removesuffix("s") instead of rstrip("s") for unit normalization
- Add comment on intentional .match() anchoring in temporal parser
- Make _FIRST_PERSON_RE case-insensitive for my/me/we/our (not I)
- Compute preview lazily (after confidence gate)
- Use placeholder syntax in ATOMIZATION_RULES date examples
- Add tests: same-weekday edge case, backfill with relative date,
assistant-sourced "told" variant
- Fix test fixtures for word-boundary validation (User1 → User)
182 tests passing, lint and typecheck clean.
- Add negative lookahead/lookbehind for "/" in _FIRST_PERSON_RE to avoid matching "I" in "I/O" - Document ^User\b scope constraint in _validate_extraction docstring - Add control flow comment in backfill_temporal_fields
Regex can't distinguish semantic intent — "User was given a promotion" (valid) vs "User was told to use Docker" (assistant-sourced). Prompt- level EXCLUSIONS already handle this correctly at the right layer.
Add missing _ASSISTANT_SOURCE_RE code-level check with infinitive
requirement ("was advised to") to avoid false positives on legitimate
passive constructions. Fix _THIRD_PERSON_RE to accept lowercase "user".
Fix began? regex typo, remove later/earlier false positives, remove
bare "to" from _UNTIL_PATTERN, extract weekday constant.
Change test_rejects_first_person to use a statement that passes the third-person check so _FIRST_PERSON_RE is actually exercised.
Regex filters (_THIRD_PERSON_RE, _FIRST_PERSON_RE, _ASSISTANT_SOURCE_RE, contains_relative_time) only worked for English. Extraction quality is now enforced entirely at the prompt level via ATOMIZATION_RULES and EXCLUSIONS. _validate_extraction retains confidence >= 0.7 as the sole code-level gate. Prompts updated with nuanced source rules: extract assistant-communicated facts (dates, appointments) while rejecting suggestions/hypotheticals.
Code ReviewThe rationale here is solid — English-only regex filters are the wrong layer for quality enforcement, and the language-agnostic shift to prompt-level atomization rules is the right call. The temporal backfill module is well-tested and the code is clean. A few issues worth addressing before merge: [High]
|
- Change "will be REJECTED" to "are INVALID — resolve them or omit" (no code-level enforcement, prompt-only instruction) - Fix output format: "A concrete, self-contained fact from the conversation" instead of "USER explicitly stated" - Merge assistant-sourced exclusion into SOURCE RULES to remove tension between EXCLUSIONS list and source rules - Fix test statement: use atomized "User's dentist appointment was moved to March 15, 2024" instead of ambiguous "The appointment"
Code ReviewThe direction here is correct — language-agnostic quality enforcement at the prompt level is the right architectural call, and the temporal module is well-structured with solid test coverage. The existing automated review raised a few valid points; I'll add perspective on what's addressed and what isn't. [High]
|
- Rename test_accepts_non_user_subject to test_accepts_assistant_communicated_fact (name/docstring now matches what the test actually validates) - Assert which item survived confidence boundary (Vstorm, not tea)
Code ReviewThe architectural direction is correct — prompt-level atomization rules are more maintainable than language-specific regex guards. The temporal module is solid (32 tests, edge cases documented, [High]
|
- Align cold_start output format with warm extraction (both now say "A concrete, self-contained fact from the conversation") - Simplify backfill condition: temporal_info alone triggers backfill (function is already idempotent on non-None fields) - Add retrieve assertion to test_accepts_assistant_communicated_fact
Code ReviewThree automated reviews have already covered this PR thoroughly. I'll synthesize what remains open vs. resolved. [High]
|
Summary
temporal.py) that resolves relative dates to absolute and backfillsvalid_at/invalid_aton extracted knowledge_THIRD_PERSON_RE,_FIRST_PERSON_RE,_ASSISTANT_SOURCE_RE,contains_relative_time) — quality enforcement moved entirely to prompt level for language agnosticismTest plan
tests/test_atomization.py— confidence threshold, boundary cases, temporal backfill, prompt content assertionstests/test_temporal.py— 32 tests for relative time detection, temporal expression parsing, backfill logictests/test_extractor.py— cold start and warm extraction with atomization rulestests/test_memory_e2e.py— confidence filter e2e, full pipeline regressionuv run pytest— 183 tests pass