feat(benchmarks): LongMemEval harness + pipeline parallelization #15
Conversation
…parallelization

Benchmark harness (benchmarks/longmemeval/):
- Concurrent question processing via asyncio.Semaphore + gather
- JSONL checkpoint/resume for crash recovery (add.jsonl, search.jsonl)
- Per-question timeout with graceful error recording
- Single entry point runner (run.py) with --stages, --model, --config flags
- Config presets: default, fast (skips predict-calibrate + dedup)
- Makefile targets: benchmark, benchmark-smoke, benchmark-baseline

Pipeline parallelization (src/memv/):
- Concurrent episode processing in _pipeline.py (was sequential)
- Concurrent segmentation batches in batch_segmenter.py (was sequential)
- Both bounded by semaphore (max 10 concurrent)
- Reduces per-question time from ~10min to ~2.5min

Smoke test (3 questions, fast config): 2.8 min end-to-end. Full 500-question run estimated at ~2 hours with --max-concurrent 20.
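The checkpoint/resume plus bounded-concurrency pattern the commit describes can be sketched as below. This is a minimal sketch, not the actual `add.py`/`search.py` code; `run_questions`, the `question_id` field, and the timeout default are illustrative assumptions.

```python
import asyncio
import json
from pathlib import Path

async def run_questions(questions, process, checkpoint_path,
                        max_concurrent=10, timeout_s=600.0):
    """Process questions concurrently; skip ones already checkpointed."""
    checkpoint = Path(checkpoint_path)
    done = set()
    if checkpoint.exists():  # resume: collect ids already recorded
        with checkpoint.open() as f:
            done = {json.loads(line)["question_id"] for line in f if line.strip()}

    sem = asyncio.Semaphore(max_concurrent)   # bound concurrency
    write_lock = asyncio.Lock()               # serialize checkpoint appends

    async def run_one(q):
        async with sem:
            try:
                result = await asyncio.wait_for(process(q), timeout=timeout_s)
                record = {"question_id": q["question_id"], "result": result}
            except Exception as exc:  # record the error instead of crashing the run
                record = {"question_id": q["question_id"], "error": repr(exc)}
        async with write_lock:  # one JSONL line per completed question
            with checkpoint.open("a") as f:
                f.write(json.dumps(record) + "\n")

    pending = [q for q in questions if q["question_id"] not in done]
    await asyncio.gather(*(run_one(q) for q in pending))
```

On restart, any question with a line in the JSONL file (success or error) is skipped, which is what makes a crash at question 499 cheap to recover from.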
Extraction was producing ~634 facts per question, including general knowledge (radiation therapy, Bitcoin, cooking tips). Root cause: the "Extract ALL concrete facts" directive encouraged extracting every piece of information discussed.

Changes:
- KNOWLEDGE_CATEGORIES: scope to "ABOUT THE USER", add critical exclusion rule
- EXCLUSIONS: add general/topical knowledge, educational content, topic summaries
- cold_start_extraction_prompt: quality-over-quantity closing directive
- extraction_prompt_with_prediction: same quality-over-quantity directive
- Both prompts: add concrete bad examples matching observed problems

Result: 2035 → 441 facts (78% reduction) at the same accuracy (2/3 on smoke test).
Code Review

Overall this is solid benchmark infrastructure with good concurrency patterns and a meaningful prompt quality fix. A few things need attention before merging.

Critical: Concurrent episode processing breaks predict-calibrate semantics
The old comment read: "Process episodes sequentially to ensure each sees prior extractions." That guarantee was load-bearing. The predict-calibrate approach avoids re-extracting already-known facts by querying the KB before extraction. When episodes run concurrently, all of them query the same initial KB state — none can see what siblings extracted — so facts discovered by multiple concurrent episodes will all be extracted, and dedup cleans up after the fact.

This is a real semantic change to the core algorithm, not just a performance tweak. It likely increases extraction count (more duplicates entering before dedup) and may reduce the effectiveness of the predict step. The 441-fact result in the PR was measured after the prompt fix too, so the isolated effect of this change is unknown.

Options:
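Whichever option is chosen, the stale-snapshot effect itself is easy to reproduce. A minimal, hypothetical sketch (not the real extractor): two concurrent "episodes" both query the KB before either commits, so both extract the same fact and dedup has to catch it afterwards.

```python
import asyncio

async def extract_episode(kb: set, episode_facts: list, results: list):
    known = set(kb)            # predict step: snapshot the KB as it is *now*
    await asyncio.sleep(0)     # yield, simulating the LLM call; siblings run here
    novel = [f for f in episode_facts if f not in known]
    results.append(novel)
    kb.update(novel)           # the commit only lands after the "LLM call" returns

async def main():
    kb = {"user lives in Berlin"}
    results = []
    # Both episodes mention the same new fact. Run concurrently, each sees a
    # KB snapshot without it, so both extract it and dedup must clean up later.
    await asyncio.gather(
        extract_episode(kb, ["user has a dog"], results),
        extract_episode(kb, ["user has a dog"], results),
    )
    return results

duplicated = asyncio.run(main())
# Run sequentially instead, the second episode would have extracted nothing.
```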
Bug: Spurious `# noqa: F401`
PR Review: feat(benchmarks): LongMemEval harness + pipeline parallelization

Overall this is solid infrastructure work — the harness design (JSONL checkpointing, semaphore-bounded concurrency, per-question timeouts, clean stage separation) is well thought out. A few issues worth addressing before merging, one of them significant.

Critical: Episode parallelization breaks predict-calibrate correctness
The original sequential loop had this comment: `# Process episodes sequentially to ensure each sees prior extractions`

That comment was load-bearing. The predict-calibrate extractor queries the existing KB to predict what an episode should contain, then only extracts what was unpredicted (novel). When episodes run concurrently, every episode sees the same stale KB snapshot — none of the concurrent siblings' extractions are committed yet. The result:
The 78% fact reduction (2035 → 441) is attributed to the prompt fix, but it may be masking an increase in duplication from concurrent extraction. It's worth verifying: do any of the 441 extracted facts repeat across episodes that ran concurrently?

Recommendation: Keep
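The verification suggested above could be scripted. A sketch, assuming a hypothetical per-fact JSONL where each line carries `fact` and `episode_id` fields (the actual schema in this repo may differ):

```python
import json
from collections import defaultdict

def cross_episode_duplicates(facts_path):
    """Return facts extracted by more than one episode, mapped to those episodes."""
    episodes_by_fact = defaultdict(set)
    with open(facts_path) as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            # normalize lightly so trivially identical facts match
            episodes_by_fact[rec["fact"].strip().lower()].add(rec["episode_id"])
    return {fact: eps for fact, eps in episodes_by_fact.items() if len(eps) > 1}
```

Any fact appearing under multiple episode ids is a candidate duplicate that predict-calibrate would have suppressed under sequential processing.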
Unnecessary
- Remove spurious `# noqa: F401` on asyncio import in _pipeline.py
- Document concurrent episode processing tradeoff (stale KB, dedup catches)
- Extract JSONL checkpoint helpers to _checkpoint.py (was duplicated in add/search)
- Add checkpoint/resume to evaluate.py (was missing, crash at Q499 lost everything)
- Exclude errored items from accuracy scoring (were silently counted as incorrect)
- Remove hardcoded enable_episode_merging=False from add.py (let config drive it)
- Add --no-resume flag to run.py (was missing, always resumed)
- Add type annotations (LongMemEvalQuestion) to process_question functions
- Fix UTC timestamp in evaluate.py (was using local time via time.strftime)
- Add note to fast config that results aren't comparable to default
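The `_checkpoint.py` extraction in the list above might look roughly like this. A sketch: `load_completed`, `append_record`, and the default key are assumed names, not necessarily what the PR uses.

```python
"""Shared JSONL checkpoint helpers, deduplicated out of add.py / search.py."""
import json
from pathlib import Path

def load_completed(path, key="question_id"):
    """Set of ids already recorded, so a rerun can skip finished work."""
    p = Path(path)
    if not p.exists():
        return set()
    with p.open() as f:
        return {json.loads(line)[key] for line in f if line.strip()}

def append_record(path, record):
    """Append one JSON object per line; each completed item is one append."""
    with Path(path).open("a") as f:
        f.write(json.dumps(record) + "\n")
```

Keeping both halves in one module means evaluate.py can adopt the same resume behavior (also listed above) without a third copy of the logic.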
Code Review

The existing automated reviews cover most of the important issues. This review focuses on a few items not yet addressed and adds some nuance to the critical episode-parallelization concern.

Critical (already flagged): Episode parallelization — SQLite write semantics matter here

The existing reviews correctly flag the predict-calibrate correctness concern. There's also a deeper question about whether the concurrent writes to the same SQLite database are actually safe.
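The safe shape being discussed — KB writes serialized while LLM/embedding calls overlap — can be sketched with an `asyncio.Lock` standing in for SQLite's single-writer behavior. All names here are illustrative, not the repo's actual API.

```python
import asyncio

class KBWriter:
    """Serialize KB mutations while the expensive LLM/embedding calls overlap."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.facts = []

    async def add_if_new(self, fact):
        # read + conditional write as one atomic step, like SQLite's single writer
        async with self._lock:
            if fact in self.facts:
                return False
            self.facts.append(fact)
            return True

async def fake_llm_extract(episode):
    await asyncio.sleep(0)  # stand-in for the real API call; runs concurrently
    return episode["facts"]

async def process_episode(kb, episode):
    extracted = await fake_llm_extract(episode)  # parallel part
    for fact in extracted:
        await kb.add_if_new(fact)                # serialized part
```

Even with overlapping episodes, the dedup check cannot race: two concurrent `add_if_new("b")` calls resolve to exactly one insert.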
This is actually a good thing for correctness: the dedup check (read + conditional write) is safe. But it means the comment "concurrent episode processing" overstates the parallelism. What's actually parallelized is the LLM/embedding API calls, while KB writes stay serialized. Worth clarifying in the code comment.

Bug:
Code Review

Good benchmark infrastructure overall. The checkpoint/resume pattern is solid, concurrency handling is correct, and the prompt quality changes are the most impactful part of this PR. A few things worth addressing.

Correction on existing review

The previous automated review flagged two bugs that don't exist:
Summary
- Concurrent question processing (`asyncio.Semaphore(10)` + `gather`)
- Single entry point runner (`run.py`) with `--model`/`--stages`/`--max-concurrent` flags
- `fast` config preset for quick iteration (skips predict-calibrate + dedup)

Details
Harness (`benchmarks/longmemeval/`):

- `add.py`/`search.py`: semaphore-bounded concurrency, JSONL checkpoint for crash recovery, `asyncio.wait_for` timeout
- `run.py`: single entry point for the add → search → evaluate pipeline
- `config.py`: `fast` preset added alongside existing `default`, `no_predict_calibrate`, etc.
- `evaluate.py`: LLM judge with type-specific prompts (unchanged from prior work)

Pipeline parallelization (`src/memv/`):

- `_pipeline.py`: sequential episode loop → `asyncio.gather` with `Semaphore(10)`
- `batch_segmenter.py`: sequential batch loop → `asyncio.gather` with `Semaphore(10)`

Extraction quality (`src/memv/processing/prompts.py`):

- `KNOWLEDGE_CATEGORIES`: scoped to "ABOUT THE USER"
- `EXCLUSIONS`: added general/topical knowledge, educational content, topic summaries

Smoke test (3 questions)
Test plan
- `make benchmark-smoke` passes (3 questions, fast config)
- Tests pass (`uv run pytest`)