EcoClaw-Bench is the benchmark and experiment workspace for EcoClaw, focused on token-efficient continual agents under long-history settings.
The current live method includes:
- stable-prefix
- reduction
- tool-result persistence
- task-state estimation
- decoupled + fifo eviction
This repository currently does not claim a live compaction runtime path. Compaction is intentionally excluded from the method description and result claims below.
In continual single-session agents, history grows with every task:
- prompt size keeps increasing
- old completed tasks still occupy context
- cache reuse becomes less stable as history structure drifts
- token cost scales poorly over long sessions
Our goal is not just to delete history aggressively.
The goal is to reduce token cost while preserving enough structure for later tasks to remain solvable.
The current EcoClaw runtime relies on three active components:
Stable-prefix keeps the reusable prefix more stable across turns and improves upstream cache reuse.
It does not directly remove history, but it strongly affects:
- cache-hit stability
- input-token variance
- long-session cost behavior
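To make the cache-hit effect concrete, here is a minimal sketch of how prefix stability between two consecutive requests could be measured. It assumes a prompt is a list of message strings and that upstream caching can only reuse the longest identical leading run. The function `shared_prefix_len` and the sample messages are hypothetical illustrations, not EcoClaw's actual runtime code.

```python
from typing import List


def shared_prefix_len(prev: List[str], curr: List[str]) -> int:
    """Count the leading messages that are identical in both requests."""
    count = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        count += 1
    return count


# Hypothetical prompts: a volatile value placed first breaks reuse immediately.
unstable_prev = ["system: time=10:01", "tools: read, exec", "user: task 1"]
unstable_curr = ["system: time=10:02", "tools: read, exec", "user: task 1", "user: task 2"]

# The same content with the volatile value moved behind the stable part.
stable_prev = ["system", "tools: read, exec", "user: task 1", "meta: time=10:01"]
stable_curr = ["system", "tools: read, exec", "user: task 1", "meta: time=10:02", "user: task 2"]

print(shared_prefix_len(unstable_prev, unstable_curr))  # 0
print(shared_prefix_len(stable_prev, stable_curr))      # 3
```

In this toy case nothing is deleted, yet the second layout keeps three leading messages reusable instead of none.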
Reduction performs request-level local slimming on expensive prompt content such as:
- repeated reads
- oversized tool payloads
- HTML / exec outputs
- oversized persisted tool results
Reduction addresses local prompt bloat before it becomes long-term history cost.
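As an illustration of request-level slimming, the sketch below collapses repeated tool payloads into a short reference and truncates oversized ones. The record layout, the `MAX_CHARS` budget, and the function name `reduce_tool_results` are assumptions made for this example; the real reduction rules in EcoClaw may differ.

```python
import hashlib
from typing import Dict, List

MAX_CHARS = 200  # hypothetical per-result character budget


def reduce_tool_results(
    results: List[Dict[str, str]], max_chars: int = MAX_CHARS
) -> List[Dict[str, str]]:
    """Replace repeated payloads with a reference and truncate oversized ones."""
    seen: Dict[str, int] = {}  # payload digest -> index of first occurrence
    reduced = []
    for index, result in enumerate(results):
        content = result["content"]
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            # Repeated read: point back at the earlier copy instead of resending it.
            content = f"[same output as result #{seen[digest]}]"
        else:
            seen[digest] = index
            if len(content) > max_chars:
                dropped = len(content) - max_chars
                content = content[:max_chars] + f"\n[truncated {dropped} chars]"
        reduced.append({"tool": result["tool"], "content": content})
    return reduced


results = [
    {"tool": "read", "content": "def main(): pass\n"},
    {"tool": "exec", "content": "x" * 1000},
    {"tool": "read", "content": "def main(): pass\n"},  # repeated read
]
for item in reduce_tool_results(results):
    print(item["tool"], len(item["content"]))
```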
EcoClaw maintains a task-aware canonical history with:
- active
- completed
- evictable
Old cold tasks are removed from canonical history once they become evictable.
The most stable version we found is:
decoupled + fifo
meaning:
- the estimator only predicts task progression
- eviction timing is handled by a separate FIFO promotion rule
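The split between estimator and eviction rule can be sketched as follows. This is a simplified model under stated assumptions: the estimator is reduced to a `mark_completed` call, and the FIFO promotion rule is modeled as "keep at most `keep_completed` completed tasks, promote the oldest beyond that". The class `CanonicalHistory`, its methods, and the `keep_completed` window are hypothetical names, not the project's actual interfaces.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class Task:
    task_id: str
    messages: List[str] = field(default_factory=list)
    state: str = "active"  # active -> completed -> evictable


class CanonicalHistory:
    def __init__(self, keep_completed: int = 1) -> None:
        self.keep_completed = keep_completed  # hypothetical FIFO window size
        self.tasks: Dict[str, Task] = {}      # dicts preserve insertion order
        self.completed: Deque[str] = deque()  # task ids in completion order

    def add_message(self, task_id: str, message: str) -> None:
        task = self.tasks.setdefault(task_id, Task(task_id))
        task.messages.append(message)

    def mark_completed(self, task_id: str) -> None:
        """Estimator output: reports task progression only, never eviction."""
        task = self.tasks[task_id]
        if task.state == "active":
            task.state = "completed"
            self.completed.append(task_id)

    def promote_and_evict(self) -> List[str]:
        """FIFO rule: the oldest completed tasks beyond the window are evicted."""
        evicted = []
        while len(self.completed) > self.keep_completed:
            oldest = self.completed.popleft()
            self.tasks[oldest].state = "evictable"
            del self.tasks[oldest]  # evictable tasks leave canonical history
            evicted.append(oldest)
        return evicted

    def render(self) -> List[str]:
        """Messages that would still be replayed in the next prompt."""
        return [m for task in self.tasks.values() for m in task.messages]


history = CanonicalHistory(keep_completed=1)
for tid in ["t1", "t2", "t3"]:
    history.add_message(tid, f"{tid}: work")
history.mark_completed("t1")
history.mark_completed("t2")
print(history.promote_and_evict())  # ['t1']
print(history.render())             # ['t2: work', 't3: work']
```

Because `mark_completed` never removes anything, a noisy estimator can at worst mislabel progression; when content actually disappears is decided only by the FIFO window.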
The method works because it reduces cost at multiple levels rather than relying on a single compression trick.
Many tokens are spent on content that is not truly useful for future reasoning:
- repeated file reads
- long tool outputs
- bulky HTML
- oversized message payloads
Reduction trims these local costs before they dominate future prompts.
The baseline mostly behaves like one ever-growing dialogue stream.
EcoClaw instead keeps a task-aware canonical history:
- active task content stays
- recently completed task content may stay briefly
- older completed tasks become evictable
- evictable tasks are removed from canonical history
This produces structural savings instead of cosmetic text shortening.
One of the main findings from our experiments is that it is unstable to let the same small model decide both:
- whether a task is completed
- whether it should already be evicted
The decoupled + fifo design narrows the estimator’s job and makes eviction timing much easier to control.
We currently use two evaluation settings.
Directory:
Characteristics:
- each task runs independently
- no shared long history
- useful for measuring single capability changes
We mainly use this setting to study:
- stable-prefix alone
- reduction alone
- stability + reduction
This answers:
- how much prompt optimization helps without continual-history pressure
Directory:
Characteristics:
- all tasks run in one continuous session
- history accumulates over time
- this is the setting that exposes the real history-management problem
This setting has two main branches:
- reduction baseline line
- eviction line
Within continual evaluation, we run both:
- top-10
- full

where:
- top-10 is mainly for fast debugging and curve inspection
- full is used for final score/token comparisons
The most complete and reliable results currently come from:
- PinchBench
Another benchmark, ClawEval, has not yet been fully summarized in this README. So the claims below should be interpreted as:
- PinchBench: current main result line
- ClawEval: to be added later
Reference baseline:
- run 10154
- score: 81.8% (18.8 / 23.0)
- total tokens: 2,140,641
We ran a turnbatch ablation for continual full evaluation:
- turnbatch=1 -> 10167
- turnbatch=2 -> 10168
- turnbatch=3 -> 10169
- turnbatch=4 -> 10170
- turnbatch=5 -> 10171
The best current efficiency/quality tradeoff is:
- run 10169
- configuration: decoupled + fifo + turnbatch=3
- score: 85.3% (19.6 / 23.0)
- total tokens: 1,139,456
Compared with baseline 10154:
- token reduction: 1,001,185
- relative reduction: about 46.8%
- score improvement: 81.8% -> 85.3%
This is the main result we currently stand behind.
We also observed:
- run 10170
- turnbatch=4
- score: 86.0%
- total tokens: 1,990,529
So:
- turnbatch=4 is a useful accuracy-oriented reference
- but it is not the best efficiency operating point
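A quick comparison of the two operating points, using only the scores and token totals reported above, shows what the extra accuracy costs. The dictionary layout is just a convenience for this sketch.

```python
best = {"run": 10169, "score": 85.3, "tokens": 1_139_456}      # turnbatch=3
accurate = {"run": 10170, "score": 86.0, "tokens": 1_990_529}  # turnbatch=4

extra_tokens = accurate["tokens"] - best["tokens"]
extra_score = accurate["score"] - best["score"]
token_ratio = accurate["tokens"] / best["tokens"]

print(extra_tokens)           # 851073
print(f"{extra_score:.1f}")   # 0.7
print(f"{token_ratio:.2f}x")  # 1.75x
```

On these numbers, turnbatch=4 gains 0.7 score points over turnbatch=3 while using roughly 1.75 times as many tokens.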
Relative to the baseline, EcoClaw reduces token cost in four concrete ways:
- old tasks no longer remain in canonical history indefinitely
- oversized tool outputs are slimmed or persisted before dominating future prompts
- stable-prefix handling improves cache reuse continuity
- decoupled eviction avoids the prompt churn and cache-locality damage seen in the earlier coupled variant
So the gains do not come from a single compression trick. They come from combining:
- cache stability
- local reduction
- task-aware eviction
Included in this README:
- stable-prefix
- reduction
- tool-result persistence
- task-state estimator
- decoupled FIFO eviction
Not included in this README:
- live compaction runtime
- compaction result claims
- future lifecycle-aware reduction redesigns
Key directories:
- experiments/
  benchmark harness, task definitions, and experiment scripts
- scripts/
  commonly used launch scripts
- results/
  structured benchmark outputs
- save/
  archived runs and generated artifacts
- docs/
  bench-side notes, bug reports, and cleanup records
For a quick understanding of the project:
- this README
- architecture notes under EcoClaw/docs/architecture/
- docs/experiments/estimator-eviction-decoupling.md
- the main scripts under scripts/
At this stage, the main conclusion is straightforward:
- naive continual history replay is too expensive
- pure coupled eviction is unstable
- decoupled + fifo is more controllable
- stable-prefix + reduction + task-aware eviction can cut total tokens by nearly half on full continual runs while maintaining or improving benchmark score
That is the current core value proposition of EcoClaw.