MemBenchmark is an early-stage research project on how to evaluate memory systems for modern AI agents.
This repository currently represents the first, survey stage of the project; the reports here are the canonical, finalized survey outputs.
The current research focus is:
- modern agent-memory benchmark methods, tasks, and metrics
- open-source benchmark tooling and evaluation harnesses
- modern agent memory system designs and implementation patterns
The target agent class is modern acting agents, not only chat assistants. That includes:
- coding agents
- workflow/research agents
- web agents
- GUI agents
- tool-using autonomous systems
See modern_agent_memory_benchmarking_survey.md
This report covers:
- benchmark methods for modern acting-agent memory
- open-source benchmark tools and software
- gap analysis for what remains missing
See modern_agent_memory_systems_survey.md
This report covers:
- major memory architecture families
- representative memory systems and frameworks
- write / retrieve / update / forget behavior
- failure modes
- implications for future benchmark design
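To make the write / retrieve / update / forget surface concrete, here is a minimal hypothetical sketch of those four operations. All names (`AgentMemory`, `MemoryRecord`) are illustrative assumptions, not taken from any surveyed system, and the substring retrieval stands in for whatever real retrieval mechanism (embeddings, indexes) a production system would use:

```python
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    key: str
    content: str

class AgentMemory:
    """Hypothetical in-process memory illustrating the four core operations."""

    def __init__(self) -> None:
        self._store: dict[str, MemoryRecord] = {}

    def write(self, key: str, content: str) -> None:
        self._store[key] = MemoryRecord(key, content)

    def retrieve(self, query: str) -> list[MemoryRecord]:
        # Naive substring match stands in for real retrieval (e.g. embeddings).
        return [r for r in self._store.values()
                if query.lower() in r.content.lower()]

    def update(self, key: str, content: str) -> None:
        if key in self._store:
            self._store[key].content = content

    def forget(self, key: str) -> None:
        self._store.pop(key, None)
```

A benchmark can exercise each operation in isolation (does `forget` actually remove influence on later retrievals?) rather than only measuring end-task accuracy.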
There is now meaningful progress in both:
- benchmark design for modern agent memory
- memory system design itself
But there still appears to be a real gap in:
- neutral, reusable benchmarking for modern acting agents, especially coding / workflow / research agents
That is the current opening for MemBenchmark.
This repo is currently in the research and design phase.
Planned next steps:
- move from survey to a MemBenchmark v0 design memo
- define target agent type, benchmark dimensions, unit under test, and minimal benchmark interface
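As a rough illustration of what a "unit under test" and "minimal benchmark interface" could look like, here is a hypothetical sketch. None of this is the actual v0 design; `MemoryUnderTest`, `MemoryTask`, and `run_task` are placeholder names for the idea that a benchmark should depend only on a narrow protocol, not on any particular memory system:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class MemoryTask:
    task_id: str
    setup_events: list[str]   # events the memory must ingest first
    probe: str                # question or action that exercises memory
    expected: str             # reference answer for scoring

class MemoryUnderTest(Protocol):
    """The minimal surface a memory system must expose to be benchmarked."""
    def ingest(self, event: str) -> None: ...
    def answer(self, probe: str) -> str: ...

def run_task(memory: MemoryUnderTest, task: MemoryTask) -> bool:
    # Feed the setup events, then score the probe against the reference.
    for event in task.setup_events:
        memory.ingest(event)
    return memory.answer(task.probe).strip() == task.expected
```

Keeping the interface this small is what would make the benchmark neutral: any memory system that can ingest events and answer probes can be plugged in, regardless of its internal architecture.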