Run timestamp: 2026-03-06T00:50:38Z
This benchmark compares four memory compression strategies for AI conversation context:
| # | Strategy | Description |
|---|---|---|
| 1 | NoCompression | Raw conversation text (baseline) |
| 2 | RandomDrop | Random token drop at 40% retention (LLMLingua-2 proxy) |
| 3 | RuleCompressor | claw-compactor Layers 1-5: deterministic rules, zero LLM calls |
| 4 | Engram | LLM Observer + Reflector: Layer 6 semantic compression |
Averages across all samples.
| Compressor | Ratio↓ | Saved% | ROUGE-L↑ | IR-F1↑ | Latency(ms) | LLM Calls |
|---|---|---|---|---|---|---|
| Engram | 0.125 | 87.5% | 0.038 | 0.414 | 34533 | 2.0 |
| RandomDrop | 0.785 | 21.5% | 0.852 | 0.911 | 0 | 0.0 |
| RuleCompressor | 0.910 | 9.0% | 0.923 | 0.958 | 6 | 0.0 |
| NoCompression | 1.000 | 0.0% | 1.000 | 1.000 | 0 | 0.0 |
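
Each row in the summary table is the arithmetic mean of the five per-sample rows reported for that compressor. As a quick sanity check, the Engram row can be reproduced from the per-sample values. This is only an illustrative sketch: the variable names are hypothetical, and the rounding (3 decimals for the ratio, whole milliseconds for latency) is an assumption that happens to match the table.

```python
from statistics import mean

# Per-sample Engram values copied from the per-sample tables
# (DevOps, trading, ML training, mixed topics, sysadmin).
engram_ratio = [0.122, 0.132, 0.155, 0.098, 0.118]
engram_latency_ms = [43494, 38832, 18681, 38377, 33280]

# Assumed rounding: 3 decimals for ratio, 1 for Saved%, whole ms for latency.
avg_ratio = round(mean(engram_ratio), 3)
avg_saved_pct = round((1 - avg_ratio) * 100, 1)
avg_latency_ms = round(mean(engram_latency_ms))

print(avg_ratio, avg_saved_pct, avg_latency_ms)  # 0.125 87.5 34533
```
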
DevOps troubleshooting session: Docker / CI pipeline
Original tokens: 4,404
| Compressor | Ratio | Saved% | ROUGE-L | IR-F1 | Latency(ms) | LLM Calls |
|---|---|---|---|---|---|---|
| Engram | 0.122 | 87.8% | 0.029 | 0.452 | 43494 | 2 |
| RandomDrop | 0.847 | 15.3% | 0.907 | 0.947 | 0 | 0 |
| RuleCompressor | 0.882 | 11.8% | 0.904 | 0.966 | 8 | 0 |
| NoCompression | 1.000 | 0.0% | 1.000 | 1.000 | 0 | 0 |
Quantitative trading strategy discussion session
Original tokens: 3,460
| Compressor | Ratio | Saved% | ROUGE-L | IR-F1 | Latency(ms) | LLM Calls |
|---|---|---|---|---|---|---|
| Engram | 0.132 | 86.8% | 0.046 | 0.407 | 38832 | 2 |
| RandomDrop | 0.763 | 23.7% | 0.847 | 0.846 | 0 | 0 |
| RuleCompressor | 0.862 | 13.8% | 0.887 | 0.947 | 5 | 0 |
| NoCompression | 1.000 | 0.0% | 1.000 | 1.000 | 0 | 0 |
Short ML model training session
Original tokens: 1,856
| Compressor | Ratio | Saved% | ROUGE-L | IR-F1 | Latency(ms) | LLM Calls |
|---|---|---|---|---|---|---|
| Engram | 0.155 | 84.5% | 0.055 | 0.384 | 18681 | 2 |
| RandomDrop | 0.784 | 21.6% | 0.836 | 0.889 | 0 | 0 |
| RuleCompressor | 0.940 | 6.0% | 0.941 | 0.947 | 3 | 0 |
| NoCompression | 1.000 | 0.0% | 1.000 | 1.000 | 0 | 0 |
Mixed topics long session: system architecture, DB tuning, security
Original tokens: 4,597
| Compressor | Ratio | Saved% | ROUGE-L | IR-F1 | Latency(ms) | LLM Calls |
|---|---|---|---|---|---|---|
| Engram | 0.098 | 90.2% | 0.026 | 0.407 | 38377 | 2 |
| RandomDrop | 0.740 | 26.0% | 0.826 | 0.889 | 0 | 0 |
| RuleCompressor | 0.933 | 6.7% | 0.939 | 0.966 | 8 | 0 |
| NoCompression | 1.000 | 0.0% | 1.000 | 1.000 | 0 | 0 |
System administration and network configuration session
Original tokens: 3,248
| Compressor | Ratio | Saved% | ROUGE-L | IR-F1 | Latency(ms) | LLM Calls |
|---|---|---|---|---|---|---|
| Engram | 0.118 | 88.2% | 0.035 | 0.420 | 33280 | 2 |
| RandomDrop | 0.793 | 20.7% | 0.843 | 0.983 | 0 | 0 |
| RuleCompressor | 0.932 | 6.8% | 0.944 | 0.966 | 6 | 0 |
| NoCompression | 1.000 | 0.0% | 1.000 | 1.000 | 0 | 0 |
| Metric | Definition | Better |
|---|---|---|
| Compression Ratio | compressed_tokens / original_tokens; lower means more compact | ↓ Lower |
| Saved% | (1 - ratio) × 100; percentage of tokens eliminated | ↑ Higher |
| ROUGE-L | LCS-based recall/precision/F1 between compressed and original | ↑ Higher |
| IR-F1 | Information Retention F1: keyword overlap between original and compressed | ↑ Higher |
| Latency | Wall-clock compression time in milliseconds | ↓ Lower |
| LLM Calls | Number of LLM API calls required | ↓ Lower |
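
The two quality metrics can be sketched in a few lines of pure Python. This is not the benchmark's own code: whitespace tokenization, the `STOPWORDS` list, the keyword regex, and every function name are assumptions made for illustration. Only the LCS-based F1 and the top-30 keyword overlap with stopword filtering come from this report.

```python
import re
from collections import Counter

# Hypothetical stopword list; the benchmark's actual list is not given.
STOPWORDS = frozenset(
    {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on", "with"}
)


def lcs_length(a, b):
    """Longest-common-subsequence length via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f1(precision, recall):
    """Harmonic mean of precision and recall, 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rouge_l_f1(original, compressed):
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    ref, hyp = original.split(), compressed.split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    return f1(lcs / len(hyp), lcs / len(ref))


def top_keywords(text, k=30):
    """Top-k most frequent non-stopword terms, lowercased."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 1)
    return {w for w, _ in counts.most_common(k)}


def ir_f1(original, compressed, k=30):
    """F1 of the overlap between the two top-k keyword sets."""
    ref, hyp = top_keywords(original, k), top_keywords(compressed, k)
    if not ref or not hyp:
        return 0.0
    overlap = len(ref & hyp)
    return f1(overlap / len(hyp), overlap / len(ref))
```

Both metrics measure lexical overlap, so a compressor that paraphrases rather than deletes will score low on them even when it reduces the text substantially.
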
- Best compression ratio: Engram (0.125, 87.5% savings)
- Best ROUGE-L (text fidelity): NoCompression (F1=1.000)
- Best IR-F1 (information retention): NoCompression (F1=1.000)
- Best latency (fastest): NoCompression and RandomDrop (0 ms avg)
Strategy Trade-offs:
NoCompression: Zero compression, perfect fidelity. Useful as ground truth only.
RandomDrop: Modest savings (21.5% on average), but random loss degrades quality unpredictably.
Cannot target important information, so it acts as an adversarial baseline.
RuleCompressor: Light compression (9.0% average savings) via deterministic rules. Negligible latency (6 ms average), zero LLM cost.
Safe and predictable, but limited by rule expressiveness.
Engram (LLM): Highest compression (87.5% average savings). Observer extracts key events;
Reflector distills to long-term context. Requires LLM calls (about 34.5 s average latency) and scores
lowest on the overlap metrics (ROUGE-L 0.038, IR-F1 0.414), but aims for intent-aware compression that preserves critical information.
For production AI conversation memory compression:
- Short-term memory (< 5 min old): Skip compression; use raw messages
- Medium-term (5 min to 2 hr): Apply RuleCompressor at zero LLM cost (6-14% savings measured in this benchmark)
- Long-term (> 2hr): Apply Engram (Observer + Reflector) for 60-90% savings
- Never use RandomDrop in production: information loss is uncontrolled
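
The age-based tiers above can be expressed as a small routing function. This is a minimal sketch, not part of the benchmark: `choose_strategy` and the constant names are hypothetical, and treating the 5-minute and 2-hour boundaries as belonging to the medium tier is an assumption.

```python
from datetime import timedelta

# Tier boundaries taken from the recommendations above.
SHORT_TERM = timedelta(minutes=5)
MEDIUM_TERM = timedelta(hours=2)


def choose_strategy(age: timedelta) -> str:
    """Return the name of the compressor to apply to a message of this age."""
    if age < SHORT_TERM:
        return "NoCompression"   # keep raw messages
    if age <= MEDIUM_TERM:
        return "RuleCompressor"  # deterministic rules, zero LLM cost
    return "Engram"              # LLM Observer + Reflector


print(choose_strategy(timedelta(minutes=30)))  # RuleCompressor
```

RandomDrop is deliberately absent from the return values, in line with the recommendation not to use it in production.
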
- Token counts use a CJK-aware heuristic (4 chars/token for ASCII, 1.5 chars/token for CJK)
- ROUGE-L implemented in pure Python using LCS dynamic programming
- IR-F1 uses top-30 keyword extraction with stopword filtering
- RandomDrop uses fixed seed (42) for reproducibility
- EngramCompressor uses an LLM proxy at `http://localhost:8403`, model `claude-code/sonnet`
- All test data is synthetic / fully anonymized; no real user data
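
The token heuristic, the ratio and Saved% arithmetic, and a seeded random-drop baseline might look roughly like the sketch below. All function names are hypothetical. The Unicode ranges treated as CJK, rounding up to whole tokens, and dropping whitespace-separated tokens independently with a per-token keep probability are assumptions; the report specifies only the chars/token figures, the 40% retention setting, and the seed.

```python
import math
import random


def is_cjk(ch: str) -> bool:
    """Rough CJK test; the exact ranges used by the benchmark are unknown."""
    code = ord(ch)
    return (
        0x4E00 <= code <= 0x9FFF     # CJK Unified Ideographs
        or 0x3040 <= code <= 0x30FF  # Hiragana and Katakana
        or 0xAC00 <= code <= 0xD7AF  # Hangul syllables
    )


def estimate_tokens(text: str) -> int:
    """Estimate tokens at 4 chars/token for non-CJK and 1.5 chars/token for CJK."""
    cjk = sum(1 for ch in text if is_cjk(ch))
    other = len(text) - cjk
    return math.ceil(other / 4 + cjk / 1.5)


def compression_stats(original_tokens: int, compressed_tokens: int):
    """Return (ratio, saved_pct) as defined in the metric table."""
    if original_tokens == 0:
        return 1.0, 0.0
    ratio = compressed_tokens / original_tokens
    return ratio, (1 - ratio) * 100


def random_drop(text: str, retention: float = 0.4, seed: int = 42) -> str:
    """Keep each whitespace token with probability `retention`, reproducibly."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return " ".join(t for t in text.split() if rng.random() < retention)


print(estimate_tokens("a" * 40))     # 10
print(compression_stats(1000, 125))  # (0.125, 87.5)
```

Using a private `random.Random(seed)` instance rather than `random.seed` keeps the baseline reproducible without affecting other code that draws from the global generator.
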