Agent Performance Report — Week of 2026-03-31 #23642

2026-03-31T04:47:00Z

github-actions[bot]
bot Mar 31, 2026

Executive Summary

Workflows in ecosystem: 178 (copilot: 118, claude: 40, codex: 19, gemini: 1)
Quality score: 79/100 (↓3 from 82 last week)
Effectiveness score: 73/100 (↓2 from 75 last week)
Health score: 72/100 (→ stable, per WHM)
P1 issues tracked: 3 (Smoke Codex: [aw] Smoke Codex failed #23431; Smoke Update Cross-Repo PR: [aw] Smoke Update Cross-Repo PR failed #23193; Smoke Create Cross-Repo PR: [WHM] Smoke Create Cross-Repo PR - Persistent Schedule Failures (Systemic Bug) #23447)
Top performers: Issue Monster, Schema Consistency Checker, GitHub Remote MCP Auth Test, Smoke Copilot
Needs immediate attention: Documentation Unbloat (100% failure), Contribution Check (poor control), Smoke Claude (persistent PR failures)

Performance Rankings

Top Performing Agents 🏆

Schema Consistency Checker — lean, read-only, efficient (37s, no errors). Model-appropriate task decomposition.
Issue Monster — consistent schedule success, fast execution (~26s), reliable safe outputs.
GitHub Remote MCP Auth Test — success with 467k tokens / 18 turns; validates MCP infra effectively.
Smoke Copilot — success with 12 safe items, broad PR coverage.
CLI Version Checker — produced actionable issue [ca] Update CLI/MCP versions: Playwright MCP 0.0.69, MCP Gateway v0.2.10 #23637; accurate version tracking across 4 tools.

Agents Needing Improvement 📉

Documentation Unbloat (Effectiveness: ~0% this window)
- 100% failure rate (1 run, 1 failure, 1.7m — fast/early failure)
- No output produced
- Needs investigation: whether the failure is structural or transient
- Action: New issue created ([agent-perf] Documentation Unbloat: 100% failure rate — needs investigation #23640)
Contribution Check (Quality: 65/100)
- Latest run concluded failure despite producing 3 safe outputs (including report [Contribution Check Report] Contribution Check — 2026-03-31 #23623)
- 44 turns, 1.36M tokens — flagged resource_heavy_for_domain, poor_agentic_control, partially_reducible
- At 44 turns this is the heaviest copilot run in the current window
- Output quality is high (clear, well-structured report), but the runaway turn count suggests prompt or tool-use control issues
Smoke Claude (Effectiveness: ~50%)
- PR-triggered run #23779290925 failed: 39 turns, 1.51M tokens, 16 write actions, write_heavy
- Schedule run succeeded (32 turns, 988k tokens) — PR vs schedule behavioral divergence
- Already tracked in [aw] Smoke Claude failed #23067 (protected files block) and [observability escalation] Smoke test workflows repeatedly exceed resource and control thresholds (Smoke Claude, Smoke Copilot) #23528
- Pattern persists: resource_heavy + partially_reducible across every recent run
Release Workflow (Reliability: ~33% success in 7-day window)
- 2 of 3 runs failed as early/infra failures before agent started
- Successful run (#23779110984) was heavy: 20.1m, 863k tokens, 15 turns, partially_reducible
- Early failures may be external (env/trigger) rather than agent quality

Inactive or Stale Concerns

Smoke Codex ([aw] Smoke Codex failed #23431): ALL run types failing since 2026-03-28 — OpenAI API access restriction. P1.
Smoke Update Cross-Repo PR ([aw] Smoke Update Cross-Repo PR failed #23193) and Smoke Create Cross-Repo PR ([WHM] Smoke Create Cross-Repo PR - Persistent Schedule Failures (Systemic Bug) #23447): schedule failures from push_repo_memory → post-setup-scripts bug. Systemic root cause identified, fix pending.

Quality Analysis

Output Quality Distribution

Tier	Count	Workflows
Excellent (80–100)	~6	Issue Monster, Schema Consistency Checker, GitHub Remote MCP Auth Test, Smoke Copilot, CLI Version Checker, Agent Persona Explorer
Good (60–79)	~3	Contribution Check (output good, execution poor), Release (when it completes), Smoke Claude (schedule)
Fair (40–59)	~1	Smoke Claude (PR — write_heavy, uncontrolled)
Poor (<40)	~1	Documentation Unbloat (non-functional this period)

Common Quality Issues Observed

Poor agentic control — Contribution Check (44 turns), Smoke Claude (39 turns PR): agents not self-limiting when task is clear
Partially reducible patterns — CLI Version Checker, Release, Smoke Claude: ~70–90% of turns are data-gathering that could move to deterministic pre-steps
Schedule vs. PR behavioral divergence — Smoke Claude behaves differently on PR triggers (more exploratory, write-heavy) vs. scheduled runs. Context injection should normalize this.
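
One way to picture the context injection mentioned above is a prompt builder that gives every trigger the same constraints and adds only a short note about the event. This is a minimal sketch, not the workflow's actual prompt: `SHARED_CONSTRAINTS`, `EVENT_NOTES`, and `build_prompt` are hypothetical names, and all wording in the strings is illustrative. The only real identifier is `GITHUB_EVENT_NAME`, the environment variable GitHub Actions sets for the triggering event.

```python
import os

# Hypothetical wording throughout; none of these strings come from the real workflow.
SHARED_CONSTRAINTS = (
    "Run only the listed smoke checks. "
    "Do not explore the repository beyond what each check requires. "
    "Limit write actions to the outputs the checks explicitly call for."
)

# Short per-event notes (assumed); the constraints above stay identical for every trigger.
EVENT_NOTES = {
    "pull_request": "Trigger: pull_request. Treat the PR as read-only context.",
    "schedule": "Trigger: schedule. There is no PR context for this run.",
}


def build_prompt(task_prompt: str, event_name: str) -> str:
    """Prefix the task prompt with identical constraints plus a short event note."""
    note = EVENT_NOTES.get(event_name, f"Trigger: {event_name}.")
    return "\n\n".join([SHARED_CONSTRAINTS, note, task_prompt])


if __name__ == "__main__":
    # GITHUB_EVENT_NAME is set by GitHub Actions; default to "schedule" when run locally.
    event = os.environ.get("GITHUB_EVENT_NAME", "schedule")
    print(build_prompt("Run the smoke checks.", event))
```

Keeping the constraint block identical across events means any remaining PR vs. schedule difference can be attributed to the event itself rather than to the prompt.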

Effectiveness Analysis

Resource Efficiency — Flagged Runs

Workflow	Engine	Turns	Tokens	Profile	Assessment
Contribution Check	copilot	44	1,360k	heavy	resource_heavy + poor_control + partially_reducible
Smoke Claude (PR)	claude	39	1,507k	heavy	resource_heavy + partially_reducible
Smoke Claude (schedule)	claude	32	988k	heavy	resource_heavy + partially_reducible
Release	copilot	15	863k	heavy	resource_heavy + partially_reducible
CLI Version Checker	claude	23	1,207k	heavy	resource_heavy + partially_reducible
GitHub Remote MCP Auth Test	copilot	18	467k	heavy	resource_heavy + partially_reducible

Ecosystem-wide: 6 out of ~10 agentic runs in the last 7 days flagged as resource-heavy.

Cost & Efficiency Insights

Smoke Claude schedule run: ~$0.83 per run, expensive for a smoke test. The run repeats daily, so the cost accumulates quickly.
CLI Version Checker: ~$0.77 for checking 4 tool versions — most data-gathering is deterministic and could be pre-steps.
Contribution Check: 44 turns with poor_agentic_control is a reliability risk beyond just cost.
Agent Persona Explorer: flagged model_downgrade_available — could use a smaller model for equivalent output quality.
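
To make the per-run figures above concrete, the sketch below projects them linearly over longer periods. It assumes exactly one run per day for each workflow. The report describes Smoke Claude's schedule as daily but does not state the CLI Version Checker cadence, so that part is an assumption, and PR-triggered runs are not counted. `projected_cost` is a hypothetical helper.

```python
# Per-run costs quoted in this report (USD, approximate).
COST_PER_RUN = {
    "Smoke Claude (schedule)": 0.83,
    "CLI Version Checker": 0.77,
}

# Assumption: one run per day for each workflow.
RUNS_PER_DAY = 1


def projected_cost(cost_per_run: float, days: int, runs_per_day: int = RUNS_PER_DAY) -> float:
    """Linear projection: cost per run times the number of runs in the period."""
    return round(cost_per_run * runs_per_day * days, 2)


if __name__ == "__main__":
    for name, cost in COST_PER_RUN.items():
        monthly = projected_cost(cost, 30)
        yearly = projected_cost(cost, 365)
        print(f"{name}: ${monthly:.2f} per 30 days, ${yearly:.2f} per year")
```

Under these assumptions the Smoke Claude schedule run alone comes to about $24.90 per 30 days and about $302.95 per year.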

Behavioral Patterns

Productive Patterns ✅

CLI Version Checker: external-facing research → precise, actionable issue output (issue [ca] Update CLI/MCP versions: Playwright MCP 0.0.69, MCP Gateway v0.2.10 #23637). Well-structured.
Schema Consistency Checker: lean + read-only — exemplary posture for a validation workflow.
Contribution Check: produces high-quality structured reports despite the execution quality issues.

Problematic Patterns ⚠️

Resource-heavy smoke tests: Smoke Claude and Smoke Copilot (already flagged in [observability escalation] Smoke test workflows repeatedly exceed resource and control thresholds (Smoke Claude, Smoke Copilot) #23528). Smoke tests should be lean and directed, not exploratory and write-heavy.
Partially reducible data-gathering: 5+ workflows spending majority of turns on deterministic lookups (version checks, repo scans, PR listings) that could be pre-steps.
PR vs. schedule divergence: Smoke Claude behaves significantly differently depending on trigger event. Prompt conditioning needs to normalize this.
Documentation Unbloat silent failure: Fails fast without actionable error output — needs better failure signaling.

Recommendations

High Priority

Diagnose Documentation Unbloat failures — 100% failure rate, no output, fast fail. Check if this is a dependency issue, auth issue, or structural workflow bug. Issue: [agent-perf] Documentation Unbloat: 100% failure rate — needs investigation #23640.
Add turn limit / self-stopping criteria to Contribution Check — 44 turns with poor_agentic_control is unsustainable. Add explicit instructions like "stop after reviewing X PRs" or implement a turn budget. Expected improvement: reduce turns to <20, eliminate failure conclusion.
Continue resolving P1 systemic issues:
- Smoke Codex API access ([aw] Smoke Codex failed #23431) — coordinate with OpenAI/infra
- push_repo_memory → post-setup-scripts bug ([aw] Smoke Update Cross-Repo PR failed #23193, [WHM] Smoke Create Cross-Repo PR - Persistent Schedule Failures (Systemic Bug) #23447) — add git checkout HEAD restore step

Medium Priority

Move data-gathering to deterministic pre-steps for CLI Version Checker, GitHub Remote MCP Auth Test, Release — reduce agent turns by ~70%, cut costs significantly. Reference: Deterministic & Agentic Patterns guide.
Model downgrade for Agent Persona Explorer — flagged model_downgrade_available. Switch to a smaller model to reduce cost with equivalent output.
Normalize Smoke Claude PR vs. schedule behavior — add explicit event-context prompting to prevent write_heavy posture on PR triggers.
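
As a rough illustration of the ~70% estimate above, the sketch below applies that reduction to the turn counts from the Resource Efficiency table. Treating ~70% as a flat rate for every workflow is an assumption; the real reducible share will differ per workflow. `estimated_turns` is a hypothetical helper.

```python
# Turn counts taken from the Resource Efficiency table in this report.
OBSERVED_TURNS = {
    "CLI Version Checker": 23,
    "GitHub Remote MCP Auth Test": 18,
    "Release": 15,
}

# The report's estimate, applied as a flat percentage to every workflow (an assumption).
REDUCTION_PERCENT = 70


def estimated_turns(turns: int, reduction_percent: int = REDUCTION_PERCENT) -> int:
    """Turns left after removing the reducible share, rounded up to a whole turn."""
    removed = (turns * reduction_percent) // 100
    return turns - removed


if __name__ == "__main__":
    for workflow, turns in OBSERVED_TURNS.items():
        print(f"{workflow}: {turns} turns -> about {estimated_turns(turns)} turns")
```

With these inputs the three workflows drop to about 7, 6, and 5 turns respectively.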

Low Priority

Establish turn budget guardrails for all copilot-engine workflows: recommended max 20 turns for smoke/check patterns, 30 turns for heavier analysis workflows.
Review Release workflow pre-flight stability — 2/3 recent runs failed before agent started. May be trigger or env issue rather than agent quality.
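
The turn budgets recommended above could be checked with something like the following sketch. The budgets (20 and 30) and the turn counts come from this report. The category assigned to each workflow is an assumption made for illustration, and `Run`, `TURN_BUDGETS`, and `over_budget` are hypothetical names.

```python
from dataclasses import dataclass

# Budgets recommended in this report.
TURN_BUDGETS = {"smoke_or_check": 20, "analysis": 30}


@dataclass(frozen=True)
class Run:
    workflow: str
    engine: str
    turns: int
    category: str  # assumed classification; the report does not assign categories


# Copilot-engine runs from the Resource Efficiency table; categories are assumptions.
RUNS = [
    Run("Contribution Check", "copilot", 44, "smoke_or_check"),
    Run("GitHub Remote MCP Auth Test", "copilot", 18, "smoke_or_check"),
    Run("Release", "copilot", 15, "analysis"),
]


def over_budget(runs: list[Run]) -> list[tuple[str, int, int]]:
    """Return (workflow, turns, budget) for each run that exceeds its category budget."""
    return [
        (run.workflow, run.turns, TURN_BUDGETS[run.category])
        for run in runs
        if run.turns > TURN_BUDGETS[run.category]
    ]


if __name__ == "__main__":
    for workflow, turns, budget in over_budget(RUNS):
        print(f"{workflow}: {turns} turns exceeds budget of {budget}")
```

With the categories assumed here, only Contribution Check (44 turns against a budget of 20) is flagged.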

Coverage Analysis

Well-Covered Areas

Smoke / connectivity testing (copilot, claude, codex, gemini, MCP auth)
PR contribution review and triage
Version/dependency tracking (CLI Version Checker, Lockfile Stats)
Schema and syntax validation
Documentation maintenance

Coverage Gaps / Opportunities

Documentation Unbloat non-functional — documentation quality maintenance has a gap while this is broken
Codex smoke testing fully down (API restriction) — no recent signal from codex engine health
Agent efficiency optimization — no dedicated workflow monitoring resource usage trends and recommending prompt improvements

Trends

Metric	This Week	Last Week	Δ
Quality score	79/100	82/100	↓3
Effectiveness score	73/100	75/100	↓2
Health score	72/100	72/100	→
Ecosystem size	178 workflows	177 workflows	+1
P1 issues	3	3	→

Score decline this week driven by: Documentation Unbloat 0% success, Contribution Check failure with poor control, Smoke Claude PR failure.

Actions Taken This Run

Created improvement issue for Documentation Unbloat recurring failures
Identified Contribution Check turn-count regression (44 turns, poor_agentic_control)
Noted model_downgrade_available opportunity for Agent Persona Explorer
Confirmed that issue [observability escalation] Smoke test workflows repeatedly exceed resource and control thresholds (Smoke Claude, Smoke Copilot) #23528 (Observability Kit) is already tracking the Smoke Claude/Copilot resource issues
Updated repo memory with current assessment

References

§23775795700 — Contribution Check (failure, 44 turns)
§23779290925 — Smoke Claude PR (failure)
§23780267414 — CLI Version Checker (success, resource-heavy)

Analysis period: 2026-03-24 to 2026-03-31 | Next report: ~2026-04-07

AI generated by Agent Performance Analyzer - Meta-Orchestrator · history

expires on Apr 1, 2026, 4:47 AM UTC

2026-04-01T05:27:46Z

github-actions[bot]
bot Apr 1, 2026
Author

This discussion was automatically closed because it expired on 2026-04-01T04:47:00.222Z.

Closed by Workflow
