
feat(benchmarks): LongMemEval harness + pipeline parallelization #15

Draft
brgsk wants to merge 5 commits into main from feat/benchmarks

Conversation

@brgsk (Member) commented Feb 22, 2026

Summary

  • Production-ready LongMemEval benchmark harness with concurrent question processing, JSONL checkpoint/resume, per-question timeouts
  • Pipeline parallelization: concurrent episode processing + concurrent segmentation batches (asyncio.Semaphore(10) + gather)
  • Extraction quality fix: scoped prompts to user-specific knowledge (2035→441 facts, -78%, same accuracy)
  • End-to-end runner (run.py) with --model/--stages/--max-concurrent flags
  • fast config preset for quick iteration (skips predict-calibrate + dedup)

Details

Harness (benchmarks/longmemeval/):

  • add.py / search.py: semaphore-bounded concurrency, JSONL checkpoint for crash recovery, asyncio.wait_for timeout
  • run.py: single entry point for add→search→evaluate pipeline
  • config.py: fast preset added alongside existing default, no_predict_calibrate, etc.
  • evaluate.py: LLM-judge with type-specific prompts (unchanged from prior work)

Pipeline parallelization (src/memv/):

  • _pipeline.py: sequential episode loop → asyncio.gather with Semaphore(10)
  • batch_segmenter.py: sequential batch loop → asyncio.gather with Semaphore(10)

Extraction quality (src/memv/processing/prompts.py):

  • KNOWLEDGE_CATEGORIES: scoped to "ABOUT THE USER"
  • EXCLUSIONS: added general/topical knowledge, educational content, topic summaries
  • Both extraction prompts: quality-over-quantity directive, concrete bad examples

Smoke test (3 questions)

  • 441 facts extracted (was 2035 before prompt fix)
  • 2/3 correct (66.7%)
  • 2.1 min total

Test plan

  • make benchmark-smoke passes (3 questions, fast config)
  • Checkpoint/resume works (kill + re-run skips completed questions)
  • Full 500-question baseline run (planned as follow-up)
  • Existing test suite passes (uv run pytest)

…parallelization

Benchmark harness (benchmarks/longmemeval/):
- Concurrent question processing via asyncio.Semaphore + gather
- JSONL checkpoint/resume for crash recovery (add.jsonl, search.jsonl)
- Per-question timeout with graceful error recording
- Single entry point runner (run.py) with --stages, --model, --config flags
- Config presets: default, fast (skips predict-calibrate + dedup)
- Makefile targets: benchmark, benchmark-smoke, benchmark-baseline

Pipeline parallelization (src/memv/):
- Concurrent episode processing in _pipeline.py (was sequential)
- Concurrent segmentation batches in batch_segmenter.py (was sequential)
- Both bounded by semaphore (max 10 concurrent)
- Reduces per-question time from ~10min to ~2.5min

Smoke test (3 questions, fast config): 2.8 min end-to-end.
Full 500-question run estimated at ~2 hours with --max-concurrent 20.
Extraction was producing ~634 facts per question including general knowledge
(radiation therapy, Bitcoin, cooking tips). Root cause: "Extract ALL concrete
facts" directive encouraged extracting every piece of information discussed.

Changes:
- KNOWLEDGE_CATEGORIES: scope to "ABOUT THE USER", add critical exclusion rule
- EXCLUSIONS: add general/topical knowledge, educational content, topic summaries
- cold_start_extraction_prompt: quality-over-quantity closing directive
- extraction_prompt_with_prediction: same quality-over-quantity directive
- Both prompts: add concrete bad examples matching observed problems

Result: 2035 → 441 facts (78% reduction) at same accuracy (2/3 on smoke test).
@claude (claude bot) commented Feb 22, 2026

Code Review

Overall this is solid benchmark infrastructure with good concurrency patterns and a meaningful prompt quality fix. A few things need attention before merging.


Critical: Concurrent episode processing breaks predict-calibrate semantics

src/memv/memory/_pipeline.py

The old comment read: "Process episodes sequentially to ensure each sees prior extractions." That guarantee was load-bearing. The predict-calibrate approach avoids re-extracting already-known facts by querying the KB before extraction. When episodes run concurrently, all of them query the same initial KB state — none can see what siblings extracted — so facts discovered by multiple concurrent episodes will all be extracted, and dedup cleans up after the fact.

This is a real semantic change to the core algorithm, not just a performance tweak. It likely increases extraction count (more duplicates entering before dedup) and may reduce the effectiveness of the predict step. The 441-fact result in the PR was measured after the prompt fix too, so the isolated effect of this change is unknown.

Options:

  • Document this explicitly as an intentional tradeoff (speed vs predict-calibrate fidelity)
  • Or cap MAX_CONCURRENT_EPISODES = 1 by default and only increase it in benchmarks where throughput matters more than dedup precision

Bug: Spurious # noqa: F401 on asyncio import

src/memv/memory/_pipeline.py, line 5

import asyncio  # noqa: F401  # used in process() for concurrent episode processing

asyncio is used directly in the same file (asyncio.Semaphore, asyncio.gather). The noqa suppression is wrong and the comment is misleading. Just import asyncio.


Bug: Evaluation errors silently counted as incorrect answers

benchmarks/longmemeval/evaluate.py

async def eval_with_semaphore(item: dict) -> tuple[bool, dict]:
    async with semaphore:
        if item.get("error"):
            return False, item   # counted in total_count, not correct_answers

Items that errored in the search stage (timeout, missing DB) are counted in the denominator but not the numerator, artificially deflating accuracy. These should be excluded from scoring or tracked separately in the output ("skipped": N). The same applies when evaluate_single raises and returns False — that's an LLM call failure being silently treated as a wrong answer.


Design: enable_episode_merging=False hardcoded in add.py, overriding config

benchmarks/longmemeval/add.py, line 108

memory = Memory(
    ...
    enable_episode_merging=False,   # always disabled, regardless of config
    enable_embedding_cache=True,
)

This silently overrides whatever the named config says about episode merging. The no_dedup and fast configs already set this to False, but if someone runs with --config default they'd expect the default behavior. Either remove the hardcoded override and let the config drive it, or document why benchmarks always disable merging.


Design: evaluate.py has no checkpoint/resume, unlike the other two stages

add.py and search.py both write JSONL checkpoints and support --no-resume. evaluate.py runs all evaluations and only writes at the end. A crash at question 499/500 loses everything. Adding the same JSONL checkpoint pattern here would make it consistent.


Duplication: JSONL helpers copy-pasted between add.py and search.py

_load_completed, _append_jsonl, and _load_all_results are identical in both files. A small benchmarks/longmemeval/_checkpoint.py module would eliminate the duplication and make it easier to add checkpointing to evaluate.py.
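A shared module along those lines might look like this (a sketch only; the helper names mirror the duplicated ones, and the `question_id` record key is an assumption about the checkpoint schema):

```python
import json
from pathlib import Path


def load_completed(path: Path, key: str = "question_id") -> set[str]:
    # IDs already checkpointed; callers skip these on resume
    if not path.exists():
        return set()
    with path.open() as f:
        return {json.loads(line)[key] for line in f if line.strip()}


def append_jsonl(path: Path, record: dict) -> None:
    # One record per line, append-only: a crash loses at most one record
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")


def load_all_results(path: Path) -> list[dict]:
    if not path.exists():
        return []
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]
```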


Missing: run.py doesn't expose --no-resume

add.py and search.py both accept --no-resume, but run.py's main() has no such flag and always passes resume=True (the default). If you want a clean re-run via run.py, you currently can't do it without editing code or running stages individually.


Minor

  • parse_longmemeval_date fallback (add.py, line 84): Falls back to datetime(2023, 1, 1, ...) on parse failure. A hard-coded year could skew temporal-reasoning questions. Using epoch or raising an exception would be safer.
  • time.strftime in evaluate.py: CLAUDE.md requires UTC datetimes (datetime.now(timezone.utc)). The evaluation timestamp uses time.strftime which gives local time.
  • Untyped parameters: question_data in add.py/search.py process_question functions has no type annotation. Typing it as LongMemEvalQuestion would help. (Not enforced by ty since it's outside src/, but still useful.)
  • The fast config comment says it "skips predict-calibrate + dedup" — it would help to note in config.py that this also means the benchmark result is not comparable to the default config.

The prompt quality fix is well-motivated and the 78% reduction in extracted facts looks correct directionally. The benchmark harness structure (JSONL checkpoints, semaphore-bounded concurrency, per-question timeout) is solid. The main thing to resolve before merging is the predict-calibrate concurrency issue — even if the decision is to accept the tradeoff, it needs to be documented explicitly.

@claude (claude bot) commented Feb 22, 2026

PR Review: feat(benchmarks): LongMemEval harness + pipeline parallelization

Overall this is solid infrastructure work — the harness design (JSONL checkpointing, semaphore-bounded concurrency, per-question timeouts, clean stage separation) is well-thought-out. A few issues worth addressing before merging, one of them significant.


Critical: Episode parallelization breaks predict-calibrate correctness

src/memv/memory/_pipeline.py

The original sequential loop had this comment:

# Process episodes sequentially to ensure each sees prior extractions

That comment was load-bearing. The predict-calibrate extractor queries existing KB to predict what an episode should contain, then only extracts what was unpredicted (novel). When episodes run concurrently, every episode sees the same stale KB snapshot — none of the concurrent siblings' extractions are committed yet. The result:

  • Each episode independently extracts facts that will be duplicated across siblings
  • The predict step becomes a no-op for intra-batch novelty — it can only suppress facts already in KB before this process() call
  • Knowledge dedup catches semantic overlap after the fact, but at the cost of wasted LLM calls and dedup overhead

The 78% fact reduction (2035 → 441) is attributed to the prompt fix, but it may be masking an increase in duplication from concurrent extraction. It's worth verifying: do any of the 441 extracted facts repeat across episodes that ran concurrently?

Recommendation: Keep batch_segmenter.py concurrent (those batches are independent — correct change). Revert _pipeline.py to sequential episode processing, or add an explicit note that this is an intentional accuracy/throughput trade-off accepted for benchmark purposes, and gate it behind a config flag rather than changing the core pipeline behavior unconditionally.

@claude (claude bot) commented Feb 22, 2026

Unnecessary # noqa: F401 on asyncio import

src/memv/memory/_pipeline.py, line 4

import asyncio  # noqa: F401  # used in process() for concurrent episode processing

asyncio is used in process(), so ruff won't flag it. The noqa suppression is redundant and will mislead future readers into thinking the import is unused. Remove it.


enable_episode_merging=False hardcoded in add.py overrides all configs

benchmarks/longmemeval/add.py, lines 107-108

memory = Memory(
    ...
    enable_episode_merging=False,  # <- hardcoded regardless of config_name
    enable_embedding_cache=True,
)

This kwarg overrides whatever the config preset says about episode merging for all configurations, including default. This means default and no_predict_calibrate presets silently lose episode merging. If disabling merging is intentional for the benchmark (e.g., for speed), it should either be baked into all presets or documented. fast already has it; the others don't expect it.


Code duplication across add.py and search.py

_load_completed, _append_jsonl, _load_all_results, and RESULTS_DIR are identically defined in both files. Extracting them to a benchmarks/longmemeval/utils.py would reduce maintenance surface.


Error items silently dilute accuracy in evaluate.py

benchmarks/longmemeval/evaluate.py, eval_with_semaphore

if item.get("error"):
    return False, item  # counted in total_count, not total_correct

Questions that errored during search are counted in the denominator (total_count) but scored as incorrect, which artificially deflates accuracy. Better options:

  1. Skip errored items and report them separately: n_errors=X, accuracy over N-X questions
  2. Or explicitly document that errors count as wrong

As-is, a run with many timeouts will show misleadingly low accuracy.


Things done well

  • JSONL checkpoint/resume is correctly implemented — _append_jsonl is called while holding the semaphore, so writes are serialized despite concurrent tasks. The resume logic is clean.
  • asyncio.wait_for per-question timeout is the right tool here.
  • dataset.py Pydantic model with field_validator for answer coercion is clean.
  • config.py presets give a clear ablation surface — the fast preset is a good addition for iteration speed.
  • batch_segmenter.py concurrent segmentation is correct: time-gap batches are independent, so concurrency here doesn't violate any ordering invariant.
  • Prompt quality fix (scoping to user-specific knowledge, concrete bad examples) is a meaningful improvement.

Minor

  • process_question in add.py and search.py has question_data without a type annotation — LongMemEvalQuestion would be appropriate.
  • embedding_client / llm_client parameters are typed implicitly as Any throughout the harness. Using the EmbeddingClient / LLMClient protocol types from memv.protocols would be more consistent with the rest of the codebase.
  • The PR checklist has "Existing test suite passes (uv run pytest)" unchecked — worth verifying before merging given the _pipeline.py behavior change.

- Remove spurious `# noqa: F401` on asyncio import in _pipeline.py
- Document concurrent episode processing tradeoff (stale KB, dedup catches)
- Extract JSONL checkpoint helpers to _checkpoint.py (was duplicated in add/search)
- Add checkpoint/resume to evaluate.py (was missing, crash at Q499 lost everything)
- Exclude errored items from accuracy scoring (were silently counted as incorrect)
- Remove hardcoded enable_episode_merging=False from add.py (let config drive it)
- Add --no-resume flag to run.py (was missing, always resumed)
- Add type annotations (LongMemEvalQuestion) to process_question functions
- Fix UTC timestamp in evaluate.py (was using local time via time.strftime)
- Add note to fast config that results aren't comparable to default
@claude (claude bot) commented Feb 22, 2026

Code Review

The existing automated reviews cover most of the important issues. This review focuses on a few items not yet addressed and adds some nuance to the critical episode-parallelization concern.


Critical (already flagged): Episode parallelization — SQLite write semantics matter here

The existing reviews correctly flag the predict-calibrate correctness concern. There's also a deeper question about whether the concurrent writes to the same SQLite database are actually safe.

_process_episode for all concurrent episodes writes to the same DB file (same Memory instance, same connection). Since aiosqlite uses a single background thread per connection, writes serialize through that thread, so there are no "database is locked" errors. But this means the actual concurrency benefit is only on the I/O-bound parts (LLM and embedding API calls); the SQLite reads/writes serialize anyway.

This is actually a good thing for correctness: the dedup check (read + conditional write) is safe. But it means the comment "concurrent episode processing" overstates the parallelism. What's actually parallelized is the LLM/embedding API calls while KB writes stay serialized. Worth clarifying in the code comment.


Bug: evaluate.py has no per-evaluation timeout

add.py and search.py both wrap their processing with asyncio.wait_for(..., timeout=timeout). evaluate.py's eval_with_semaphore has no timeout — a single hanging LLM judge call will stall that semaphore slot indefinitely:

async def eval_with_semaphore(item: dict) -> None:
    async with semaphore:
        # No asyncio.wait_for here
        is_correct = await evaluate_single(llm_client, ...)

A 60-second timeout per evaluation call would be consistent with the other stages.
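A sketch of what the timeout-wrapped version could look like (the 60-second value, the `is_correct` field, and the error shape are assumptions, not the actual evaluate.py code):

```python
import asyncio

EVAL_TIMEOUT_S = 60.0


async def eval_with_semaphore(semaphore, evaluate_single, item: dict,
                              timeout: float = EVAL_TIMEOUT_S) -> dict:
    async with semaphore:
        try:
            # Bound each judge call so a hung LLM request frees its slot
            item["is_correct"] = await asyncio.wait_for(
                evaluate_single(item), timeout=timeout
            )
        except asyncio.TimeoutError:
            # Record the failure rather than scoring it as a wrong answer
            item["is_correct"] = None
            item["error"] = "evaluation timeout"
        return item
```

Recording `is_correct: None` on timeout also keeps the item distinguishable from a genuinely wrong answer when accuracy is aggregated.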


Bug: --stages input not validated in run.py

stages=args.stages.split(","),

A typo like --stages add,serach,evaluate silently skips the search stage. Adding a validation step:

valid_stages = {"add", "search", "evaluate"}
stages = args.stages.split(",")
invalid = set(stages) - valid_stages
if invalid:
    parser.error(f"Unknown stages: {invalid}. Valid: {valid_stages}")

would catch this before anything runs.


Note on existing reviews vs. current diff

The second and third automated reviews flag code duplication between add.py and search.py, recommending extraction to a shared module. This PR already includes benchmarks/longmemeval/_checkpoint.py which does exactly that — both files import from it. Those review comments appear to have been generated against an earlier version of the diff.


Minor

  • no_segmentation config (config.py): segmentation_threshold=9999 is a magic number workaround to effectively disable segmentation. A comment explaining why 9999 (e.g., "effectively disables segmentation — threshold higher than any realistic session length") would help.
  • max_concurrent * 2 in run.py: The search stage is passed max_concurrent * 2 because it's lighter. If --max-concurrent 50 is passed, search gets 100 concurrent tasks. A cap or a separate --search-concurrent flag would prevent accidental rate-limit hammering on large runs.

What's working well

  • The _checkpoint.py abstraction is clean and the JSONL append-while-holding-semaphore pattern correctly prevents interleaved writes.
  • batch_segmenter.py concurrent segmentation is correctly applied — time-gap batches are independent so there's no ordering invariant to break there.
  • Prompt changes are focused and the concrete bad examples (radiation therapy, Bitcoin, knife-sharpening) directly address the problem of assistant-provided educational content leaking into the KB.
  • LongMemEvalQuestion with field_validator for answer coercion handles the dataset's mixed answer field types cleanly.

@claude (claude bot) commented Feb 22, 2026

Code Review

Good benchmark infrastructure overall. The checkpoint/resume pattern is solid, concurrency handling is correct, and the prompt quality changes are the most impactful part of this PR. A few things worth addressing:


Correction on existing review

The previous automated review flagged two bugs that don't exist:

  1. import asyncio # noqa: F401 — this comment is not in the diff; the import is just import asyncio with no suppression comment.
  2. "Evaluation errors silently counted as incorrect" — also incorrect. Errored items get is_correct: None, and the aggregation loop explicitly checks if scored.get("is_correct") is None: total_errors += 1; continue, so they're excluded from the accuracy denominator. The scoring logic is correct.

MAX_CONCURRENT_EPISODES should live in MemoryConfig, not as a module constant

src/memv/memory/_pipeline.py and src/memv/processing/batch_segmenter.py

MAX_CONCURRENT_EPISODES = 10
MAX_CONCURRENT_SEGMENTATIONS = 10

These values directly affect the predict-calibrate semantics in production (more concurrency = more stale-KB reads = more duplicates entering before dedup). Hardcoding them in src/ with no override path means all users get the same tradeoff. They belong in MemoryConfig alongside the other tuning knobs, with a note that higher values trade predict-calibrate fidelity for throughput.


Shared MemoryConfig instances in config.py

CONFIGS: dict[str, MemoryConfig] = {
    "default": MemoryConfig(),
    "fast": MemoryConfig(...),
    ...
}

def get_config(name: str) -> MemoryConfig:
    return CONFIGS[name]  # returns the same object every call

If any caller mutates the returned config (even accidentally), all subsequent calls to get_config(name) return the mutated version. Since MemoryConfig is a dataclass, it's mutable by default. Use dataclasses.replace(CONFIGS[name]) to return a copy, or define CONFIGS as factories.
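The `dataclasses.replace` fix can be sketched like this (the `MemoryConfig` fields below are illustrative stand-ins, not the real config):

```python
from dataclasses import dataclass, replace


@dataclass
class MemoryConfig:
    # Stand-in for the real config; fields here are illustrative
    segmentation_threshold: int = 8
    enable_episode_merging: bool = True


CONFIGS: dict[str, MemoryConfig] = {
    "default": MemoryConfig(),
    "fast": MemoryConfig(enable_episode_merging=False),
}


def get_config(name: str) -> MemoryConfig:
    # replace() with no overrides returns a fresh copy of the preset,
    # so caller mutations never leak back into CONFIGS
    return replace(CONFIGS[name])
```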


Missing type annotations on client parameters

Across add.py, search.py, evaluate.py, and run.py, the embedding_client and llm_client parameters are untyped:

async def process_question(
    ...
    embedding_client,   # should be: embedding_client: EmbeddingClient
    llm_client,         # should be: llm_client: LLMClient

The protocols are already defined in memv.protocols. Using them here would catch mismatched clients at the call site and make the intended interface explicit.


_make_clients() is duplicated three times

add.py, search.py, and run.py each define their own _make_clients() with slightly different signatures (the run.py version accepts a model argument, the others don't). Extract this into a shared benchmarks/longmemeval/_clients.py to keep them in sync.


--stages parsing silently drops stages with leading spaces

run.py

stages=args.stages.split(",")

--stages "add, search, evaluate" produces ["add", " search", " evaluate"] — the space-prefixed names won't match the if "add" in stages checks. Either strip whitespace after split or use nargs="+" with --stages add search evaluate.
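A parse that strips whitespace and validates in one pass might look like this (a sketch; the valid-stage set matches the three stages named in this PR):

```python
VALID_STAGES = {"add", "search", "evaluate"}


def parse_stages(raw: str) -> list[str]:
    # Strip whitespace so "add, search" and "add,search" parse identically
    stages = [s.strip() for s in raw.split(",") if s.strip()]
    invalid = set(stages) - VALID_STAGES
    if invalid:
        raise ValueError(
            f"Unknown stages: {sorted(invalid)}; valid: {sorted(VALID_STAGES)}"
        )
    return stages
```

This also covers the earlier review's `--stages add,serach,evaluate` typo case, which would now fail loudly instead of silently skipping a stage.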


process_with_guard return type annotation is misleading

add.py and search.py

async def process_with_guard(...) -> dict | None:

There's no code path that returns None — both exception branches assign to result and return it. The | None annotation is vestigial. Should be -> dict.


enable_embedding_cache=True in add.py but not search.py

add.py explicitly passes enable_embedding_cache=True to Memory(...). search.py omits it, falling back to MemoryConfig.enable_embedding_cache (which defaults to True anyway, but the inconsistency is surprising). Either be explicit in both or omit it in both.


Prompt changes look good

The prompts.py changes are well-targeted. Scoping all categories to "ABOUT THE USER", adding concrete bad examples for general knowledge, and explicitly allowing an empty return for topical conversations addresses a real over-extraction problem. The 78% reduction in fact count with the same smoke-test accuracy is a strong signal.

