feat(extraction): atomization for self-contained knowledge #14

Merged
brgsk merged 11 commits into main from feat/atomization on Feb 21, 2026
Conversation

@brgsk
Member

@brgsk brgsk commented Feb 21, 2026

Summary

  • Add atomization rules enforcing self-contained knowledge statements (no pronouns, no relative time, no ambiguous references)
  • Add temporal parsing module (temporal.py) that resolves relative dates to absolute and backfills valid_at/invalid_at on extracted knowledge
  • Replace blanket "ONLY extract from USER lines" with nuanced source rules: extract assistant-communicated facts (dates, appointments) while rejecting suggestions/hypotheticals
  • Remove English-only regex filters (_THIRD_PERSON_RE, _FIRST_PERSON_RE, _ASSISTANT_SOURCE_RE, contains_relative_time) — quality enforcement moved entirely to prompt level for language agnosticism
  • Confidence >= 0.7 remains as sole code-level extraction gate

Test plan

  • tests/test_atomization.py — confidence threshold, boundary cases, temporal backfill, prompt content assertions
  • tests/test_temporal.py — 32 tests for relative time detection, temporal expression parsing, backfill logic
  • tests/test_extractor.py — cold start and warm extraction with atomization rules
  • tests/test_memory_e2e.py — confidence filter e2e, full pipeline regression
  • uv run pytest — 183 tests pass

Add prompt-level and code-level enforcement to ensure extracted
knowledge statements are independently interpretable without
conversation context.

Prompt changes:
- ATOMIZATION_RULES constant prohibiting pronouns, relative time,
  ambiguous references, and first-person in extracted statements
- Stronger reference_timestamp wording (REJECTED vs resolve)
- Good/bad atomization examples in both extraction prompts

Code-level validation (_pipeline.py):
- Third-person check (statement must start with "User")
- First-person pronoun rejection (I, my, me, we, our)
- Unresolved relative time rejection (yesterday, last week, etc.)
- Assistant-sourced pattern rejection (was advised/suggested)
- Temporal backfill step between extract and validate

New temporal module (src/memv/processing/temporal.py):
- contains_relative_time: regex detection of unresolved patterns
- parse_temporal_expression: relative-to-absolute via dateutil
- backfill_temporal_fields: populate valid_at/invalid_at from
  temporal_info strings when LLM leaves them None

Dependency: python-dateutil>=2.9.0
Tests: 179 total (48 new), all passing

- Use word-boundary regex (^User\b) instead of startswith("User")
- Expand _ASSISTANT_SOURCE_RE to catch told/instructed/encouraged/shown/given
- Use removesuffix("s") instead of rstrip("s") for unit normalization
- Add comment on intentional .match() anchoring in temporal parser
- Make _FIRST_PERSON_RE case-insensitive for my/me/we/our (not I)
- Compute preview lazily (after confidence gate)
- Use placeholder syntax in ATOMIZATION_RULES date examples
- Add tests: same-weekday edge case, backfill with relative date,
  assistant-sourced "told" variant
- Fix test fixtures for word-boundary validation (User1 → User)
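The removesuffix change matters because rstrip strips a character set, not a suffix. A quick illustration with a hypothetical unit-normalization input:

```python
# rstrip("s") removes ALL trailing "s" characters, mangling words ending in "ss"
assert "weeks".rstrip("s") == "week"       # looks correct by coincidence
assert "business".rstrip("s") == "busine"  # silently wrong

# removesuffix("s") (Python 3.9+) removes exactly one trailing "s", at most once
assert "weeks".removesuffix("s") == "week"
assert "business".removesuffix("s") == "busines"
```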

182 tests passing, lint and typecheck clean.

- Add negative lookahead/lookbehind for "/" in _FIRST_PERSON_RE to
  avoid matching "I" in "I/O"
- Document ^User\b scope constraint in _validate_extraction docstring
- Add control flow comment in backfill_temporal_fields
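The lookaround fix can be illustrated with a simplified stand-in for the pronoun pattern (hypothetical; the actual _FIRST_PERSON_RE in _pipeline.py was more complete before its later removal):

```python
import re

# Without guards, \bI\b matches the "I" inside "I/O"
naive = re.compile(r"\bI\b")
assert naive.search("User works on I/O-bound services")  # false positive

# Negative lookbehind/lookahead for "/" exclude slash-adjacent "I"
guarded = re.compile(r"(?<!/)\bI\b(?!/)")
assert guarded.search("User works on I/O-bound services") is None
assert guarded.search("I started using FastAPI") is not None
```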

Regex can't distinguish semantic intent — "User was given a promotion"
(valid) vs "User was told to use Docker" (assistant-sourced). Prompt-
level EXCLUSIONS already handle this correctly at the right layer.

Add missing _ASSISTANT_SOURCE_RE code-level check with infinitive
requirement ("was advised to") to avoid false positives on legitimate
passive constructions. Fix _THIRD_PERSON_RE to accept lowercase "user".
Fix began? regex typo, remove later/earlier false positives, remove
bare "to" from _UNTIL_PATTERN, extract weekday constant.

Change test_rejects_first_person to use a statement that passes
the third-person check so _FIRST_PERSON_RE is actually exercised.

Regex filters (_THIRD_PERSON_RE, _FIRST_PERSON_RE, _ASSISTANT_SOURCE_RE,
contains_relative_time) only worked for English. Extraction quality is now
enforced entirely at the prompt level via ATOMIZATION_RULES and EXCLUSIONS.

_validate_extraction retains confidence >= 0.7 as the sole code-level gate.

Prompts updated with nuanced source rules: extract assistant-communicated
facts (dates, appointments) while rejecting suggestions/hypotheticals.
@claude

claude bot commented Feb 21, 2026

Code Review

The rationale here is solid — English-only regex filters are the wrong layer for quality enforcement, and the language-agnostic shift to prompt-level atomization rules is the right call. The temporal backfill module is well-tested and the code is clean. A few issues worth addressing before merge:


[High] statement text can now contain unresolved relative time

Removing contains_relative_time from _validate_extraction closes the last code-level safety net against storing "User started using FastAPI yesterday" verbatim.

The pipeline flow is:

step 6: backfill_temporal_fields(item.temporal_info, ...)  # updates valid_at/invalid_at only
step 7: _validate_extraction(item)                         # now only checks confidence

backfill_temporal_fields resolves valid_at/invalid_at from temporal_info but never touches item.statement. If the LLM populates statement = "User started using FastAPI yesterday" but leaves temporal_info = None, the statement gets stored with "yesterday" in it as long as confidence ≥ 0.7.

The prompt says:

Statements with unresolved relative time ("yesterday", "last week") will be REJECTED.

This is now a false promise — the code no longer enforces it. Either re-add the contains_relative_time(stmt) guard (after backfill, so resolved statements pass), or update the prompt wording to remove the claim of code-level rejection.


[Medium] test_accepts_non_user_subject validates an atomization violation

statement="The appointment was moved to March 15"

"The appointment" is an ambiguous reference — exactly the pattern that ATOMIZATION_RULES prohibits:

PROHIBIT in output statements: Ambiguous references: "the project", "the tool", "that place", "the same thing"

The test is trying to confirm that non-User-prefixed subjects are now accepted (correct goal), but the chosen statement violates atomization. A cleaner example would be "User's dentist appointment was moved to March 15, 2024", which demonstrates both points without self-contradicting the rules being tested.


[Medium] Prompt inconsistency in extraction_prompt_with_prediction output format

The output format section still says:

- statement: A fact the USER explicitly stated (not assistant suggestions)

But the new SOURCE RULES directly above now say:

- DO extract factual info communicated by the assistant (dates, appointments, confirmations).

This will confuse the LLM. The output format description should be updated to match the broader source rule, something like: "A concrete, self-contained fact from the conversation."


[Low] EXCLUSIONS and SOURCE RULES are in tension

EXCLUSIONS still lists:

  • Assistant-sourced knowledge ("was advised to", "was suggested to", "was recommended to")

But the SOURCE RULES immediately below say to DO extract factual assistant-communicated info. The distinction (factual confirmation vs. suggestion) is subtle enough that an LLM might be confused by the two adjacent instructions pulling in opposite directions. Consider restructuring so the distinction is stated once clearly rather than appearing as a contradiction between sections.


Nit: test_cold_start_contains_strong_timestamp_wording asserts a now-misleading prompt string

assert "will be REJECTED" in prompt

The test passes (the string exists in the prompt), but the assertion is validating a claim that is no longer true at the code level. If the contains_relative_time guard is not restored, the prompt wording should change and this test should be updated accordingly.


What's working well

  • Temporal module is solid: 32 tests, edge cases covered (weekday arithmetic, match() vs search() distinction documented, began vs began? gotcha noted)
  • backfill_temporal_fields correctly preserves existing values and only fills None fields
  • The confidence boundary test (0.7 passes, 0.69 fails) is a good regression check
  • Moving quality enforcement to prompts via ATOMIZATION_RULES is the right architectural direction for multilingual support

- Change "will be REJECTED" to "are INVALID — resolve them or omit"
  (no code-level enforcement, prompt-only instruction)
- Fix output format: "A concrete, self-contained fact from the
  conversation" instead of "USER explicitly stated"
- Merge assistant-sourced exclusion into SOURCE RULES to remove
  tension between EXCLUSIONS list and source rules
- Fix test statement: use atomized "User's dentist appointment was
  moved to March 15, 2024" instead of ambiguous "The appointment"
@claude

claude bot commented Feb 21, 2026

Code Review

The direction here is correct — language-agnostic quality enforcement at the prompt level is the right architectural call, and the temporal module is well-structured with solid test coverage. The existing automated review raised a few valid points; I'll add perspective on what's addressed and what isn't.


[High] contains_relative_time removal leaves statement text unguarded

The existing bot review flagged this accurately, and it's still the most important issue. The pipeline flow is:

backfill_temporal_fields(item.temporal_info, ...)  # fills valid_at / invalid_at only
_validate_extraction(item)                         # confidence check only

backfill_temporal_fields never mutates item.statement. If the LLM outputs statement="User started using FastAPI yesterday" with temporal_info=None and confidence=0.8, it now passes all code-level gates and gets stored verbatim with "yesterday" in it — defeating the entire point of the temporal module.

The prompt says these are "INVALID — resolve them or omit", which is softer than the previous "will be REJECTED" and no longer reflects code reality. Two options:

  • Re-add the contains_relative_time(item.statement) guard in _validate_extraction (after backfill, so correctly resolved statements pass)
  • Or move the check into backfill_temporal_fields as a return value indicating failure, and reject from the caller

[Medium] Test name contradicts its own assertion

async def test_accepts_non_user_subject(tmp_path):
    """Statements not starting with 'User' are accepted."""
    ...
    statement="User's dentist appointment was moved to March 15, 2024"

The docstring says "not starting with 'User'" but the example statement begins with "User's". The statement subject is "User's dentist appointment" — it does start with "User". If the intent is to test that assistant-communicated facts are now accepted, the test name and docstring should say that instead. The example statement is perfectly valid; it's the name/docstring that misleads.


[Low] Prompt tension: EXCLUSIONS constant vs. new SOURCE RULES

The ATOMIZATION_RULES/EXCLUSIONS constants are separate from the inline function text — if the standalone EXCLUSIONS constant still lists "Assistant-sourced knowledge" as a blanket exclusion, LLM behaviour will be inconsistent between prompt calls that include EXCLUSIONS directly and those that use the updated inline source rules. Worth verifying the constants are in sync with the new intent.


[Addressed] Bot's prompt-inconsistency finding

The existing review flagged that extraction_prompt_with_prediction output format still said "A fact the USER explicitly stated". The diff does fix this:

-- statement: A fact the USER explicitly stated (not assistant suggestions)
+- statement: A concrete, self-contained fact from the conversation (see SOURCE RULES above)

That medium concern is resolved in this PR.


[Nit] Confidence boundary test uses two LLM items in one response

[
    ExtractedKnowledge(statement="User works at Vstorm", ..., confidence=0.7),
    ExtractedKnowledge(statement="User likes tea", ..., confidence=0.69),
]

This is good coverage for the 0.7 boundary but note it relies on ordering: both items go through _validate_extraction independently, so the test is sound. Consider adding an assertion that the stored item is specifically the Vstorm one (e.g. check result.knowledge[0].statement) rather than relying solely on count == 1 — it would confirm which item survived, not just that one did.


What's working well

  • Removing English-only regex filters is the right call for a multilingual-ready system — prompt-level atomization scales better
  • backfill_temporal_fields correctly preserves existing values (idempotent on non-None fields)
  • 32 temporal tests with weekday arithmetic edge cases is thorough
  • The new source rules distinguish "assistant-confirmed facts" from "assistant-suggestions" cleanly in intent — the implementation just needs the contains_relative_time guard restored to make it complete

- Rename test_accepts_non_user_subject to test_accepts_assistant_communicated_fact
  (name/docstring now matches what the test actually validates)
- Assert which item survived confidence boundary (Vstorm, not tea)
@claude

claude bot commented Feb 21, 2026

Code Review

The architectural direction is correct — prompt-level atomization rules are more maintainable than language-specific regex guards. The temporal module is solid (32 tests, edge cases documented, began vs began? gotcha preserved in CLAUDE.md). I've read through the full diff and current file state; a few items to address:


[High] contains_relative_time guard is gone but backfill_temporal_fields never touches item.statement

Both automated reviews flagged this and it remains unresolved. The pipeline flow is:

# step 6 (_pipeline.py:160-167)
for item in extracted:
    if item.temporal_info and (item.valid_at is None or item.invalid_at is None):
        item.valid_at, item.invalid_at = backfill_temporal_fields(...)

# step 7 (_pipeline.py:170)
extracted = [item for item in extracted if self._validate_extraction(item)]  # confidence only

backfill_temporal_fields resolves valid_at/invalid_at — it never mutates item.statement. If the LLM outputs statement="User started using FastAPI yesterday" with temporal_info=None and confidence=0.8, step 6 is skipped entirely (falsy temporal_info), step 7 passes (confidence >= 0.7), and "yesterday" lands in the database.

There's also a subtler case: even when temporal_info is present, backfill_temporal_fields only fills the datetime fields, not the statement text. A statement like "User's meeting was scheduled yesterday" with temporal_info="yesterday afternoon" passes step 7 with "yesterday" still in the statement.

The simplest fix is to restore the contains_relative_time(item.statement) check in _validate_extraction after backfill (so properly resolved statements pass):

def _validate_extraction(self, item: ExtractedKnowledge) -> bool:
    if item.confidence < 0.7:
        return False
    if contains_relative_time(item.statement):
        logger.debug("Rejected (unresolved relative time): %s", item.statement[:60])
        return False
    return True

[Medium] cold_start_extraction_prompt output format inconsistency (missed in existing reviews)

The PR correctly updated extraction_prompt_with_prediction's output format section:

- statement: A fact the USER explicitly stated (not assistant suggestions)
+ statement: A concrete, self-contained fact from the conversation (see SOURCE RULES above)

But cold_start_extraction_prompt output format (line 304) was not updated — it still says:

- statement: A clean, declarative fact about the user (third-person: "User...", not "I...")

This implies all statements must start with "User", which contradicts the new SOURCE RULES that allow assistant-communicated facts (dates, appointments). Both prompt functions now include the same SOURCE RULES via the EXCLUSIONS constant, so their output format descriptions should be identical.


[Low] Step 6 backfill condition is narrower than intended

if item.temporal_info and (item.valid_at is None or item.invalid_at is None):

The condition skips backfill entirely when both valid_at and invalid_at are already populated — but backfill_temporal_fields is already idempotent (preserves existing values). The outer guard is harmless but slightly misleading: if a future case adds a third temporal field, the guard would silently skip it. The condition could just be if item.temporal_info: since the function handles the non-None case safely.


What's resolved from earlier reviews

  • extraction_prompt_with_prediction output format — fixed in this PR
  • test_accepts_non_user_subject name/docstring contradiction — test was renamed to test_accepts_assistant_communicated_fact with a consistent statement
  • EXCLUSIONS vs SOURCE RULES tension — resolved; SOURCE RULES are now embedded in EXCLUSIONS with a clear positive/negative split
  • test_confidence_boundary retrieve assertion — already present (result.retrieved_knowledge[0].statement == "User works at Vstorm")

Nit

test_accepts_assistant_communicated_fact only asserts count == 1 without verifying the stored statement via memory.retrieve(...). test_confidence_boundary does this correctly — worth applying the same pattern here for consistency.

- Align cold_start output format with warm extraction (both now say
  "A concrete, self-contained fact from the conversation")
- Simplify backfill condition: temporal_info alone triggers backfill
  (function is already idempotent on non-None fields)
- Add retrieve assertion to test_accepts_assistant_communicated_fact
@brgsk brgsk merged commit 36f3a78 into main Feb 21, 2026
4 checks passed
@brgsk brgsk deleted the feat/atomization branch February 21, 2026 21:06
@claude

claude bot commented Feb 21, 2026

Code Review

Three automated reviews have already covered this PR thoroughly. I'll synthesize what remains open vs. resolved.


[High] contains_relative_time guard still missing — one issue remains unresolved

All three prior reviews flagged this and the current diff does not address it. The pipeline flow is:

# step 6: fills valid_at / invalid_at only — never touches item.statement
for item in extracted:
    if item.temporal_info:
        item.valid_at, item.invalid_at = backfill_temporal_fields(...)

# step 7: confidence only
extracted = [item for item in extracted if self._validate_extraction(item)]

If the LLM outputs statement="User started using FastAPI yesterday" with temporal_info=None and confidence=0.8, step 6 is skipped entirely (falsy temporal_info) and step 7 passes. "yesterday" lands in the database verbatim.

The minimal fix is to restore the check in _validate_extraction, applied after backfill so correctly resolved statements pass:

from memv.processing.temporal import backfill_temporal_fields, contains_relative_time

def _validate_extraction(self, item: ExtractedKnowledge) -> bool:
    """Filter low-confidence extractions and statements with unresolved relative time."""
    if item.confidence < 0.7:
        logger.debug("Rejected (low confidence %.2f): %s", item.confidence, item.statement[:60])
        return False
    if contains_relative_time(item.statement):
        logger.debug("Rejected (unresolved relative time): %s", item.statement[:60])
        return False
    return True

[Medium] No test covers the unresolved-relative-time rejection path

test_rejects_relative_time was removed alongside the code-level guard. If the guard is restored, the test should come back too. Without it, a future regression (e.g., someone tightening the prompt wording) could re-introduce stored "yesterday" statements silently. The test is straightforward to restore:

async def test_rejects_unresolved_relative_time(tmp_path):
    """Statement with unresolved relative time is rejected even at high confidence."""
    llm, embedder = MockLLM(), MockEmbedder()
    llm.set_responses("generate", [_episode_json()])
    llm.set_responses(
        "generate_structured",
        [_extraction([ExtractedKnowledge(statement="User started using FastAPI yesterday", knowledge_type="new", confidence=0.9)])],
    )
    memory = _make_memory(tmp_path, llm, embedder)
    async with memory:
        await memory.add_exchange("user1", "I started using FastAPI yesterday", "Nice!", timestamp=_ts())
        count = await memory.process("user1")
        assert count == 0

What's resolved from prior reviews

  • extraction_prompt_with_prediction output format — fixed ("A concrete, self-contained fact from the conversation")
  • cold_start_extraction_prompt output format — also fixed in this diff (third review flagged it; it is addressed)
  • Backfill condition — narrowed from if item.temporal_info and (...) to if item.temporal_info:, making it correctly idempotent
  • Test naming: test_accepts_non_user_subject renamed to test_accepts_assistant_communicated_fact with a consistent statement
  • test_confidence_boundary retrieve assertion — present and correct
  • EXCLUSIONS vs. SOURCE RULES tension — the inline source rules now give a clear positive/negative split

Nit (retracted): test_accepts_assistant_communicated_fact retrieve assertion

On a closer read of the diff, test_accepts_assistant_communicated_fact does include a retrieve assertion, matching the pattern test_confidence_boundary uses:

result = await memory.retrieve("dentist appointment", user_id="user1")
assert "March 15, 2024" in result.retrieved_knowledge[0].statement

No change needed here.


Summary

The PR is one fix away from being mergeable: restoring contains_relative_time(item.statement) in _validate_extraction (with its companion test). Everything else flagged in prior reviews has been addressed.
