CLI-first benchmark toolkit for OpenClaw memory-layer plugins.
This project provides a reproducible, non-interactive evaluation workflow for memory plugins such as:
- `openclaw-mem`
- `memory-core` (OpenClaw built-in memory)
- `memory-lancedb`
- `memu-engine-for-OpenClaw`
- future OpenClaw memory-layer plugins
Public SaaS memory providers already have benchmark visibility. The OpenClaw community needs a neutral, plugin-focused benchmark harness with:
- deterministic retrieval metrics
- optional end-to-end answer/judge metrics
- reproducible run manifests
- machine-readable outputs for CI and community comparison
- Track A: Retrieval benchmark (deterministic)
  - Hit@K, Recall@K, Precision@K, MRR, nDCG
  - latency (p50/p95)
- Track B: End-to-end benchmark (optional)
  - retrieval → answer model → judge model
  - accuracy + cost/latency metadata
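As a reference for what Track A measures, here is a minimal sketch of the rank-based metrics under binary relevance. The function names and cutoff conventions are illustrative, not this toolkit's actual implementation:

```python
import math

def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant item appears in the top-k results, else 0.0."""
    return float(any(doc in relevant_ids for doc in ranked_ids[:k]))

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that were retrieved in the top-k."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / k

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant item (0.0 if none found)."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG with the standard log2 position discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked_ids[:k], start=1)
              if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant_ids)) + 1))
    return dcg / ideal if ideal else 0.0
```

For example, `mrr(["a", "b", "c"], {"b"})` returns `0.5` because the first relevant item sits at rank 2.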
- CLI and non-interactive by default
- Reproducibility first (versioned manifests, pinned config)
- Plugin neutrality (same protocol for all adapters)
- Transparent reporting (JSON outputs + docs)
```bash
uv sync
uv run openclaw-memory-bench doctor
```
```bash
# Generate manifest
uv run openclaw-memory-bench plan --provider openclaw-mem --benchmark locomo

# Run deterministic retrieval benchmark using the example dataset
uv run openclaw-memory-bench run-retrieval \
  --provider openclaw-mem \
  --dataset examples/mini_retrieval.json \
  --top-k 5

# Run against isolated memory-core profile (no main-system pollution)
uv run openclaw-memory-bench run-retrieval \
  --provider memory-core \
  --dataset examples/mini_retrieval.json \
  --top-k 5 \
  --memory-core-profile membench-memory-core
```
```bash
# Prepare canonical dataset (download + convert)
uv run openclaw-memory-bench prepare-dataset \
  --benchmark longmemeval \
  --limit 50 \
  --out data/datasets/longmemeval-50.json

# Writes dataset + sidecar metadata:
# - data/datasets/longmemeval-50.json
# - data/datasets/longmemeval-50.json.meta.json
```

The run command writes a JSON report under `artifacts/<run-id>/retrieval-report.json` by default.
Reports now embed a reproducibility manifest (`report.manifest`) containing toolkit version, git commit, dataset hash/meta, provider config (sanitized), and runtime metadata.
For dataset schema, see docs/dataset-format.md.
See PRELIMINARY_RESULTS.md for the currently available early comparison artifacts and their caveats.
To avoid misleading comparisons, benchmark providers should be read as follows:
- `openclaw-mem` provider in this repo = standalone sidecar engine run (`openclaw-mem` CLI ingest/search on benchmark-managed sqlite files).
  - It is not automatically combined with `memory-core` or `memory-lancedb` in current leaderboard numbers.
- `memory-core` provider = OpenClaw built-in backend (`openclaw memory index/search`) under an isolated profile.
- `memory-lancedb` provider = canonical memory tool path (`memory_store`/`memory_recall`/`memory_forget`) via Gateway invoke.
Current reports are primarily independent-provider comparisons. A full combination matrix (e.g., sidecar + backend pairings) is tracked as follow-up work.
```bash
uv run openclaw-memory-bench run-retrieval \
  --provider memory-lancedb \
  --dataset data/datasets/longmemeval-50.json \
  --top-k 10 \
  --session-key main
```

This adapter uses `memory_store` + `memory_recall` + `memory_forget` via Gateway invoke.
```bash
uv run openclaw-memory-bench run-retrieval \
  --provider memu-engine \
  --dataset data/datasets/longmemeval-50.json \
  --top-k 10 \
  --skip-ingest \
  --gateway-url http://127.0.0.1:18789
```

For `memu-engine`, the default ingest mode is `noop` (pre-ingested search). Use `--memu-ingest-mode memory_store` only if your memory slot exposes `memory_store`.
- `--limit N` run first N questions
- `--sample-size N --sample-seed S` deterministic seeded subset sampling
- `--fail-fast` stop on first question failure
- `--db-root <dir>` per-container sqlite storage root for `openclaw-mem`
- `--openclaw-mem-cmd ...` override adapter command base when needed
- `--memory-core-profile <name>` isolated OpenClaw profile for `memory-core`
- `--skip-ingest` run search-only against existing memory state
- `--preindex-once` ingest/index selected dataset once, then run per-question search
- `--gateway-url` / `--gateway-token` for gateway-backed providers (`memu-engine`, `memory-lancedb`)
- `--memu-ingest-mode noop|memory_store` for memu ingestion strategy
- `--lancedb-recall-limit-factor N` candidate pool multiplier before container filtering
- `--memory-core-index-retries N` + `--memory-core-timeout-sec N` for timeout resilience
- `--memory-core-max-messages-per-session`, `--memory-core-max-message-chars`, `--memory-core-max-chars-per-session` for long-session ingest stabilization
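The deterministic subsetting behind `--sample-size`/`--sample-seed` can be sketched as follows. This is an illustrative sketch of why a fixed seed yields reproducible subsets, not the toolkit's actual sampling code:

```python
import random

def sample_questions(questions: list, sample_size: int, seed: int) -> list:
    """Seeded, order-preserving subset: the same seed always selects
    the same questions, so runs are directly comparable."""
    if sample_size >= len(questions):
        return list(questions)
    rng = random.Random(seed)  # isolated RNG, unaffected by global state
    picked = sorted(rng.sample(range(len(questions)), sample_size))
    return [questions[i] for i in picked]
```

Calling this twice with the same `seed` returns identical subsets, which is what makes sampled benchmark runs reproducible across machines.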
```bash
scripts/run_two_plugin_baseline.sh \
  --profile configs/run-profiles/two-plugin-baseline.json
```

This orchestrator emits both provider reports and merged compare artifacts under `artifacts/full-benchmark/<run-group>/`.
See docs/PHASE_A_EXECUTION.md for fallback behavior and fast pilot mode.
```bash
scripts/run_memory_core_vs_openclaw_mem.sh \
  --dataset examples/dual_language_mini.json \
  --top-k 5
```

Artifacts are written under `artifacts/sidecar-compare/<run-group>/`.
This path is isolated from the main OpenClaw system via an independent memory-core profile (`membench-memory-core`) and per-run `openclaw-mem` sqlite roots.
```bash
scripts/run_memory_triplet_comprehensive.sh \
  --benchmark longmemeval \
  --dataset-limit 100 \
  --question-limit 100 \
  --top-k 10
```

Artifacts are written under `artifacts/comprehensive-triplet/<run-group>/`.
The triplet orchestrator now supports provider-level watchdogs, progress logs, and fail-fast behavior so one stuck provider does not silently block the whole run:
```bash
scripts/run_memory_triplet_comprehensive.sh \
  --benchmark longmemeval \
  --dataset-limit 30 \
  --question-limit 10 \
  --top-k 5 \
  --provider-timeout-sec 900 \
  --progress-log artifacts/comprehensive-triplet/debug-progress.log
```

Useful flags:
- `--provider-timeout-sec <sec>`: hard wall-time timeout per provider run (default `1500`)
- `--fail-fast-provider`: stop remaining providers after first provider failure
- `--progress-log <path>`: timestamped progress log (default `<run-group>/progress.log`)
When a provider fails or times out, the run still emits `compare-*.json` / `compare-*.md` with a structured failure status under `provider_status`, instead of aborting with only partial artifacts.
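A per-provider wall-time watchdog of this shape can be sketched with `subprocess`. This is a simplified sketch under assumed semantics, not the orchestrator script's actual mechanism:

```python
import subprocess
import time

def run_provider(cmd: list[str], timeout_sec: int) -> dict:
    """Run one provider command under a hard wall-time limit and
    return a structured status instead of raising."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_sec)
        status = "ok" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "timeout"  # provider killed after exceeding the limit
    return {"cmd": cmd, "status": status,
            "elapsed_sec": round(time.monotonic() - start, 3)}

def run_all(providers: dict[str, list[str]], timeout_sec: int,
            fail_fast: bool = False) -> dict:
    """Run every provider, collecting a provider_status map so one
    stuck or failing provider never silently blocks the whole run."""
    provider_status = {}
    for name, cmd in providers.items():
        result = run_provider(cmd, timeout_sec)
        provider_status[name] = result
        if fail_fast and result["status"] != "ok":
            break  # --fail-fast-provider style: skip remaining providers
    return provider_status
```

The key design point mirrored here is that a timeout or nonzero exit becomes data in the report rather than an exception that aborts the comparison.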
- `openclaw-mem`: retrieval-track adapter implemented (MVP, CLI-driven)
- `memory-core`: retrieval-track adapter implemented (isolated `--profile` mode)
- `memory-lancedb`: gateway-backed adapter implemented for canonical memory tools (`memory_store`/`memory_recall`/`memory_forget`)
- `memu-engine`: gateway-backed adapter implemented for `memory_search` (ingest modes: `noop`/`memory_store`)
- Canonical dataset conversion command available (`prepare-dataset`)
- docs/PROJECT_PLAN_AND_TODOS.md
- docs/FULL_BENCHMARK_PLAN.md (two-plugin full report execution plan)
- docs/PHASE_A_EXECUTION.md (locked profile + one-shot baseline runner)
- `src/openclaw_memory_bench/cli.py` — main CLI
- `src/openclaw_memory_bench/protocol.py` — provider adapter protocol
- `src/openclaw_memory_bench/adapters/` — plugin adapters
- `docs/decisions/` — architecture decisions
- `docs/devlog/` — implementation progress logs
MIT License. See LICENSE.
This toolkit is inspired by the benchmark design ideas from:
When specific code-level adaptations are introduced, they will be explicitly documented in ACKNOWLEDGEMENTS.md with file-level references.