- Repository: ExperienceEngine
- Primary hosts exercised in this phase:
  - `Codex` for day-to-day real product use
  - `OpenClaw` for strict baseline and scenario evaluation
- Date range: 2026-03-17 to 2026-03-19
  - primary evidence from
- Primary task families: `build_debug`, `test_debug` - repeated repository verification work
Why ExperienceEngine was relevant here:
- the repository repeatedly runs the same verification tasks during active development
- the same debugging and validation patterns appear across multiple sessions
- this makes it a good self-hosted environment for testing whether experience intervention produces net value instead of extra prompt noise
At the start of this pass:
- ExperienceEngine was already wired into `Codex`, `Claude Code`, and `OpenClaw`
- the core learning loop was functioning, but product work was still focused on making the outputs more reusable:
- Experience Pack
- runtime delivery path
- deploy/status visibility
- repeated validation tasks in the repository created a realistic stream of similar build/test work
This made the repository a good fit for testing two things at once:
- whether interventions stayed useful during active development
- whether the resulting experience could be promoted into reusable assets
This pass used ExperienceEngine in the repository itself rather than in a synthetic demo repo.
Main surfaces used:
- `ee doctor codex`
- `ee inspect --last`
- `ee helped`
- OpenClaw baseline and high-confidence scenario evaluation
Evidence-bearing artifacts were generated through the evaluation pipeline:
- benchmark report
- evaluation bundle
- case study
- evidence package
No published Pack was active in the current Codex scope at the time of the latest doctor snapshot:
- Enabled packs: 0
- Published packs: 0
- Compiled targets: 0
That means the observed behavior in this case study primarily reflects the runtime learning loop itself, not a Pack-controlled deployment.
From `ee doctor codex` on 2026-03-19:
- Distillation mode: llm
- Distillation source: explicit_provider
- Evaluation mode: live
- Holdout rate: 0.2
- Raw task records: 132
- Task runs: 25
- Formal experience nodes: 18
This indicates that the repository is no longer in a cold-start state. ExperienceEngine is operating with durable experience already present, and the LLM distillation path depends on an explicitly configured provider API.
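A holdout rate of 0.2 implies that roughly one in five tasks is withheld from intervention so outcomes can be compared against a no-injection baseline. The sketch below illustrates one common way such a split is implemented; the hash-based assignment is an assumption for illustration, not ExperienceEngine's documented mechanism:

```python
import hashlib

HOLDOUT_RATE = 0.2  # matches the doctor snapshot above


def is_holdout(task_id: str, rate: float = HOLDOUT_RATE) -> bool:
    """Deterministically assign a task to the holdout set.

    Hashing the task id (rather than sampling randomly per run) keeps the
    assignment stable, so repeated runs of the same task stay in the same
    arm of the comparison.
    """
    digest = hashlib.sha256(task_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


# Holdout tasks receive no injected experience; their outcomes act as the
# baseline for judging whether intervention actually helps.
sample = [f"build_debug-{i}" for i in range(1000)]
holdout_share = sum(is_holdout(t) for t in sample) / len(sample)
# holdout_share lands near 0.2 for any reasonably large sample
```

The deterministic split matters for a repository like this one, where the same verification tasks recur across sessions: per-run random sampling would let a single task drift between arms and blur the comparison.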
From `ee inspect --last`:
- task family: `build_debug`
- intervention: `inject`
- automatic feedback: `helped`
- automatic feedback reason: `success_outcome`
- injected nodes: 3
- scorecard risk: medium
- recommendation: use hints selectively and confirm with focused verification
Why it matched:
- exact task-family match
- nodes already active and above the evidence threshold
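The inspect record above implies a simple automatic-feedback rule: an injection followed by a successful task outcome is scored `helped` with reason `success_outcome`. A minimal sketch of that rule, assuming a hypothetical record shape (the field names mirror the `ee inspect --last` output, but the class and the failure-side label are illustrative, not ExperienceEngine's internal schema):

```python
from dataclasses import dataclass


@dataclass
class InterventionRecord:
    # Field names mirror the `ee inspect --last` output; the class itself
    # is a hypothetical stand-in for the real record format.
    task_family: str
    intervention: str
    injected_nodes: int
    scorecard_risk: str
    outcome_success: bool


def auto_feedback(record: InterventionRecord) -> tuple:
    """Derive automatic feedback the way the inspect output implies:
    a successful outcome after an injection is scored `helped`."""
    if record.intervention == "inject" and record.outcome_success:
        return ("helped", "success_outcome")
    if record.intervention == "inject":
        # assumed counterpart label; not confirmed by the source
        return ("harmed", "failure_outcome")
    return ("neutral", "no_intervention")


record = InterventionRecord("build_debug", "inject", 3, "medium", True)
# auto_feedback(record) → ("helped", "success_outcome"),
# matching the inspect record shown above
```

Automatic feedback of this kind is what lets explicit `ee helped` / `ee harmed` signals stay optional rather than mandatory for every run.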
From:
Observed outcome:
- Verdict: healthy
- Delivery rate: 1
- Helpful rate: 1
- Harmful rate: 0
- Net helpful rate: 1
- Suggested mode: live
Mode comparison:
- `live: decisions=4 delivered=4 suppressed=0 helpful=4 harmed=0 net=1 verdict=healthy`
This is the clearest current evidence that the repository can sustain repeated intervention without degrading into noisy retrieval.
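The rates in the mode comparison follow directly from the raw counts. A minimal sketch of the arithmetic, with formulas inferred from the report (the `verdict` rule is an assumption; the source only shows that zero harm and full helpfulness yield `healthy`):

```python
def mode_stats(decisions: int, delivered: int, helpful: int, harmed: int) -> dict:
    """Recompute the mode-comparison rates from raw counts.

    Each outcome rate is taken relative to delivered interventions, and
    `net` subtracts the harmful rate from the helpful rate. The verdict
    threshold below is a guess, not ExperienceEngine's documented rule.
    """
    delivery_rate = delivered / decisions if decisions else 0.0
    helpful_rate = helpful / delivered if delivered else 0.0
    harmful_rate = harmed / delivered if delivered else 0.0
    net = helpful_rate - harmful_rate
    verdict = "healthy" if net > 0 and harmful_rate == 0 else "review"
    return {
        "delivery_rate": delivery_rate,
        "helpful_rate": helpful_rate,
        "harmful_rate": harmful_rate,
        "net_helpful_rate": net,
        "verdict": verdict,
    }


# The `live` row above: decisions=4 delivered=4 helpful=4 harmed=0
stats = mode_stats(decisions=4, delivered=4, helpful=4, harmed=0)
# → all rates 1.0, harmful_rate 0.0, verdict "healthy"
```

Reproducing the report's numbers (delivery, helpful, and net rates of 1) from these counts is a useful sanity check when comparing future mode snapshots.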
Three things were clearly valuable in this pass:
- Repeated build/test verification tasks produced reusable strategy nodes instead of only one-off session context.
- `ee inspect --last` made intervention reasons and node provenance visible enough to judge whether the guidance was legitimate.
- The OpenClaw scenario outputs gave a stable benchmark surface that could be compared over time instead of relying on subjective impressions.
In practice, the useful pattern was not “more memory.” It was:
- detect recurring verification work
- inject a small amount of already-proven guidance
- confirm the result quickly
- keep or retire the experience based on evidence
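The four-step pattern above can be sketched as code. All thresholds and field names here are hypothetical illustrations of the idea, not ExperienceEngine's actual parameters; only the cap of three injected nodes echoes the `ee inspect --last` record earlier:

```python
from collections import Counter

RECURRENCE_MIN = 3   # hypothetical: sightings needed to call a family "recurring"
RETIRE_BELOW = 0.5   # hypothetical: evidence floor before a node is retired
MAX_INJECTED = 3     # echoes the injected-node count seen in `ee inspect --last`


def recurring_families(task_log: list) -> set:
    """Step 1: detect task families seen often enough to justify intervention."""
    counts = Counter(task["family"] for task in task_log)
    return {family for family, n in counts.items() if n >= RECURRENCE_MIN}


def select_nodes(nodes: list, family: str) -> list:
    """Step 2: inject only a small amount of already-proven guidance."""
    proven = [n for n in nodes
              if n["family"] == family and not n.get("retired")
              and n["helped"] / max(n["total"], 1) > RETIRE_BELOW]
    return proven[:MAX_INJECTED]


def record_outcome(node: dict, helped: bool) -> None:
    """Steps 3-4: confirm the result, then keep or retire on evidence."""
    node["helped"] += int(helped)
    node["total"] += 1
    node["retired"] = node["helped"] / node["total"] <= RETIRE_BELOW
```

The point of the sketch is the shape of the loop: injection is gated on recurrence and prior evidence, and every outcome immediately updates the evidence that gates the next injection.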
The main friction observed in this case study was not runtime correctness. It was product-shape friction:
- most durable evidence still lives in generated artifacts and CLI output rather than a dedicated review UI
- the current scope relied on runtime learning and intervention rather than any separate static asset flow
This means the runtime loop is ahead of the asset adoption loop.
Current decision for this repository:
- keep `live` mode
- continue using `Codex` as the primary real product host
- continue collecting explicit feedback through `ee inspect --last` and `ee helped` / `ee harmed`
- start promoting stable node clusters into Packs only when the task family is clearly recurring
- treat OpenClaw scenario artifacts as the stricter benchmark signal for whether intervention quality remains healthy
Primary artifact set used for this case study: