Evaluating MemoryLake on the LoCoMo long-term conversational memory benchmark.
LoCoMo (Long-term Conversational Memory) is a rigorous benchmark created by Snap Research to evaluate how well AI systems remember and reason over very long conversations.
Each conversation in LoCoMo spans:
- ~300 turns across up to 35 sessions
- ~9,000 tokens on average
- Long time horizons where facts evolve across sessions
LoCoMo tests four distinct reasoning abilities:
| Type | Description |
|---|---|
| Single-hop | Direct factual recall — "When did Alice visit Paris?" |
| Multi-hop | Chained reasoning across multiple facts — requires connecting several pieces of information |
| Temporal | Time-aware reasoning — understanding when things happened, ordering events, duration |
| Open-domain | Broad world knowledge combined with conversation context |
These four categories reflect the real-world demands of a personal memory assistant: not just storing facts, but reasoning about them over time.
MemoryLake is an AI memory service that gives LLM applications long-term, persistent memory. It stores, retrieves, and reasons over historical context so that AI assistants can remember users across sessions.
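MemoryLake's actual SDK is not part of this repository, so as a rough illustration only, here is a hypothetical sketch of the store/retrieve loop that any such memory service provides. Every name in it (`InMemoryStore`, `store`, `retrieve`) is a placeholder, not the real MemoryLake API:

```python
# Hypothetical sketch of a long-term-memory loop. None of these names come
# from the actual MemoryLake SDK; they are placeholders illustrating the
# store -> retrieve -> answer pattern that memory services provide.

class InMemoryStore:
    """Toy stand-in for a persistent memory service."""

    def __init__(self):
        self.records = []  # list of (session_id, turn_text)

    def store(self, session_id: str, turn: str) -> None:
        self.records.append((session_id, turn))

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        # Real services rank by semantic similarity; this toy version uses
        # naive keyword overlap purely for illustration.
        query_tokens = set(query.lower().split())
        scored = sorted(
            self.records,
            key=lambda r: len(query_tokens & set(r[1].lower().split())),
            reverse=True,
        )
        return [turn for _, turn in scored[:k]]


memory = InMemoryStore()
memory.store("session-01", "Alice said she visited Paris in May 2023.")
memory.store("session-07", "Alice mentioned she is planning a trip to Rome.")

# Retrieved turns are then prepended to the answer LLM's prompt.
context = memory.retrieve("When did Alice visit Paris?")
```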
MemoryLake was tested across 10 conversation records from the LoCoMo dataset. Scores are F1 (%); a sketch of the metric follows the table.
| Record | Single-hop | Multi-hop | Temporal | Open-domain | Overall |
|---|---|---|---|---|---|
| R1 | 100.00 | 84.38 | 100.00 | 100.00 | 96.71 |
| R2 | 95.45 | 100.00 | 100.00 | N/A | 97.53 |
| R3 | 96.51 | 96.77 | 96.30 | 87.50 | 96.05 |
| R4 | 97.30 | 94.59 | 92.50 | 72.73 | 94.47 |
| R5 | 96.26 | 83.87 | 80.77 | 85.71 | 91.01 |
| R6 | 96.77 | 86.67 | 87.50 | 42.86 | 89.43 |
| R7 | 98.80 | 100.00 | 85.29 | 76.92 | 94.00 |
| R8 | 95.76 | 85.71 | 97.62 | 100.00 | 95.29 |
| R9 | 94.52 | 97.30 | 84.85 | 92.31 | 92.95 |
| R10 | 96.55 | 93.75 | 84.38 | 100.00 | 93.67 |
| Avg | 96.79 | 91.84 | 91.28 | 85.42 | 94.03 |
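For reference, here is a minimal sketch of the token-overlap F1 commonly used to score QA benchmarks like LoCoMo. This assumes the standard SQuAD-style definition; this repo's own scorer may normalize answers differently:

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    # Lowercase, strip punctuation, and split on whitespace
    # (SQuAD-style normalization; article removal omitted for brevity).
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()


def token_f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Alice visited Paris in May 2023", "May 2023"))  # 0.5
```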
MemoryLake ranks #1 overall on the LoCoMo benchmark among the systems compared below.
| System | Single-hop | Multi-hop | Temporal | Open-domain | Overall |
|---|---|---|---|---|---|
| MemoryLake 🥇 | 96.79% | 91.84% | 91.28% | 85.42% | 94.03% |
| EverMemOS | 96.08% | 91.13% | 89.72% | 70.83% | 92.32% |
| Full-context | 94.93% | 90.43% | 87.95% | 71.88% | 91.21% |
| Zep | 90.84% | 81.91% | 77.26% | 75.00% | 85.22% |
| MemOS | 85.37% | 79.43% | 75.08% | 64.58% | 80.76% |
| M-flow | N/A | N/A | N/A | N/A | 76.50% |
| MemU | 74.91% | 72.34% | 43.61% | 54.17% | 66.67% |
| Mem0 | 68.97% | 61.70% | 58.26% | 50.00% | 64.20% |
Key highlights:
- MemoryLake leads in all four question categories
- MemoryLake's temporal reasoning score (91.28%) is 1.56 pp above EverMemOS and 3.33 pp above the full-context baseline, in a category where most systems struggle
- MemoryLake's open-domain score (85.42%) is the highest by a wide margin: 10.42 pp above Zep, the next-best system
- MemoryLake outperforms even the full-context (no compression) baseline while using far fewer tokens
Competitor data sources: the EverMind-AI/EverMemOS evaluation and the M-flow benchmarks. All competitor systems used GPT-4.1-mini as the answer LLM to ensure a fair comparison. M-flow reported only an LLM-judge overall score (76.5%) and did not disclose a per-category breakdown.
Charts were generated by `plot_results.py`.
| Source | Link |
|---|---|
| LoCoMo benchmark | snap-research.github.io/locomo |
| Competitor results | EverMind-AI/EverMemOS/evaluation |
| M-flow results | FlowElement-ai/mflow-benchmarks |
| MemoryLake product | memorylake.ai |
Raw per-question evaluation results are available in the `locomo_result_*.json` files in this repository.
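To aggregate the raw outputs yourself, something like the following works. It assumes each JSON file holds a list of per-question entries with `category` and `f1` fields; the actual schema may differ, so check one file first and adjust the keys accordingly:

```python
import glob
import json
from collections import defaultdict

# Assumed schema: each file is a list of {"category": ..., "f1": ...} dicts.
# Inspect one file before relying on these field names.
totals, counts = defaultdict(float), defaultdict(int)
for path in glob.glob("locomo_result_*.json"):
    with open(path) as f:
        for entry in json.load(f):
            totals[entry["category"]] += entry["f1"]
            counts[entry["category"]] += 1

for category in sorted(totals):
    print(f"{category}: {100 * totals[category] / counts[category]:.2f}%")
```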