Agent Performance Report — 2026-04-02 #24019

2026-04-02T04:45:15Z

github-actions[bot]
bot Apr 2, 2026

Executive Summary

Agents analyzed: 179 workflows (119 Copilot, 41 Claude, 18 Codex, 1 Gemini)
Runs reviewed (7d): 32 agentic runs (8 Claude, 24 Copilot)
Total cost (7d): ~$5.64 (Claude: $5.64, Copilot: $0)
Quality score: 75/100 (↓1 from 76)
Effectiveness: 73/100 (→ stable)
Top performers: Documentation Unbloat, CLI Version Checker, Issue Monster, Release, Auto-Triage Issues
Needs attention: Smoke Claude (P1 ongoing), Smoke Multi PR (new resource spike)

Performance Rankings

Top Performing Agents 🏆

Documentation Unbloat — Claude (Quality: 90/100, Effectiveness: 88/100)
- Consistent success; 44 turns, $1.35, 13.4 min run — well-scoped
- Actionable outputs with clear code changes
- Example: §23846940395
CLI Version Checker — Claude (Quality: 85/100, Effectiveness: 84/100)
- Recovered from last week's cost spike: cost fell from $0.79 to $0.59 per run, while turns rose slightly from 21 to 24
- Produces focused, timely version-check outputs
- ✅ No longer a Watch item
Issue Monster — Copilot (Quality: 84/100, Effectiveness: 87/100)
- Reliably triages incoming issues (2 runs: 8t, 9t, 5–6 min each)
- Consistent performance week over week
Release — Copilot (Quality: 83/100, Effectiveness: 80/100)
- 15 turns, 26 min — disciplined and effective
- Clean commit messages and PR handling
Auto-Triage Issues / Contribution Check — Copilot (Quality: 80/100, Effectiveness: 82/100)
- Lean runs (3–9 turns, 4–5 min) with well-targeted outputs

Agents Needing Improvement 📉

Smoke Claude (Quality: 20/100, Effectiveness: 15/100) — P1 Ongoing
- 5/5 failures this week; MCP HTTP connection closes at 412s
- Agent completes successfully (~16/18 tests) but safe_outputs job times out
- Estimated waste: ~$2–3/run × 5 runs/week ≈ $10–15/week
- Issues: [observability escalation] Smoke test workflows repeatedly exceed resource and control thresholds (Smoke Claude, Smoke Copilot) #23528, [aw] Smoke Claude failed #23067 (open)
- No new issue created — existing tracking in place
Smoke Multi PR — Copilot (Quality: 50/100, Effectiveness: 40/100) — New Watch ⚠️
- Run #604 failed after 89 turns and 12.3 minutes (normally 2–4 turns)
- Flagged resource_heavy (high severity) and partially_reducible (medium)
- 96% of turns were pure data-gathering — candidate for deterministic pre-steps
- Did produce 2 PRs before failing (partial success)
- Recommendation: Move data-fetching to deterministic frontmatter steps
Schema Consistency Checker — Claude (Quality: 65/100, Effectiveness: 70/100) — Watch
- Run #206: 108 turns, $1.73; flagged resource_heavy and partially_reducible
- Recurring model_downgrade_available flag across 4 previous successful runs
- Opportunity: switch to a smaller model to reduce cost ~40–60%
GitHub Remote MCP Authentication Test — Copilot (Quality: N/A, Effectiveness: 30/100)
- Failed in 3 turns (infra/token issue, not agent quality)
- Expected intermittent; monitor
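The Smoke Claude figures above reduce to two small calculations. The sketch below is illustrative only: it takes the reported 412 s connection close, the 12-minute lower bound on agent runtime, and the ~$2–3 per-run estimate as inputs. The function names are hypothetical and not part of any workflow tooling.

```python
# Inputs copied from the report; AGENT_RUNTIME_S uses the lower bound of "12+ min".
MCP_CLOSE_S = 412
AGENT_RUNTIME_S = 12 * 60
RUNS_PER_WEEK = 5
COST_PER_RUN_USD = (2.0, 3.0)  # reported estimate, low and high


def timeout_gap_s(runtime_s: int, close_s: int) -> int:
    """Seconds by which the agent runtime outlasts the MCP HTTP connection."""
    return runtime_s - close_s


def weekly_waste(cost_range: tuple[float, float], runs: int) -> tuple[float, float]:
    """Low and high weekly cost of runs that fail after spending tokens."""
    low, high = cost_range
    return (low * runs, high * runs)


if __name__ == "__main__":
    print(timeout_gap_s(AGENT_RUNTIME_S, MCP_CLOSE_S))   # 308
    print(weekly_waste(COST_PER_RUN_USD, RUNS_PER_WEEK))  # (10.0, 15.0)
```

On these numbers the connection closes at least 308 s before the agent finishes. Either the timeout or the runtime therefore has to move by about five minutes for safe_outputs to succeed.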

Recovered Agents ✅

Agent Persona Explorer: Ran successfully (~10.7 min) — fully recovered from 0-turn failure on Apr 1
PR Triage Agent: 9 consecutive successes (per WHM data); issue [aw] PR Triage Agent failed #23151 can be closed

Quality Analysis

Output Quality Distribution

Band	Count	Notes
Excellent (80–100)	~5 agents	Documentation Unbloat, CLI Version Checker, Issue Monster, Release, Auto-Triage
Good (60–79)	~8 agents	Schema Consistency Checker, Agent Container Smoke Test, Smoke Copilot
Fair (40–59)	~2 agents	Smoke Multi PR, GitHub Remote MCP Auth Test
Poor (<40)	1 agent	Smoke Claude (infra-blocked)

Common Quality Issues

Resource over-consumption (2 workflows): Smoke Multi PR, Schema Consistency Checker — too many turns for task scope
MCP timeout failures (1 workflow): Smoke Claude — well-understood root cause; blocked on infra fix
Model over-sizing (Schema Checker): Using frontier Claude model for a task flagged as suitable for a smaller model

Effectiveness Analysis

Task Completion Rates (7d)

Category	Rate	Notes
Claude utility workflows	3/3 = 100%	Excludes Smoke Claude; all succeeded
Copilot smoke tests	~60–70%	Factoring in known P1 infra failures
Copilot utility workflows	~85%	Issue Monster, Release, etc. all solid
Smoke Claude	0/5 = 0%	Infra-blocked (P1)

Behavioral Patterns

Productive Patterns ✅

Lean agentic footprint: Most Copilot utility workflows (Auto-Triage, Issue Monster, Contribution Check) show 3–9 turns — healthy
Stable collaboration: Documentation Unbloat + CLI Version Checker generate complementary improvements without overlap
Deterministic pre-steps working: Several workflows benefit from frontmatter setup steps (see Smoke Copilot)

Problematic Patterns ⚠️

Smoke Multi PR execution drift: 0 turns (skipped) vs 89 turns (heavy) in the same week — highly unstable
Claude MCP timeout accumulation: Smoke Claude running 12+ minutes per attempt, each costing ~$1–$2 in tokens before failing
Partially reducible pattern (systemic): partially_reducible flagged on 2+ agentic runs this week (Smoke Multi PR 96%, Schema Checker 50%) — data-gathering tasks should be pre-steps
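To make the partially_reducible percentages concrete, the sketch below converts them into turn counts. It assumes the flag's percentage applies uniformly to the run's total turns, which the report does not state. `reducible_turns` is a hypothetical helper name.

```python
def reducible_turns(total_turns: int, reducible_fraction: float) -> tuple[int, int]:
    """Split a run's turns into (reducible, remaining) using the flagged fraction."""
    if not 0.0 <= reducible_fraction <= 1.0:
        raise ValueError("reducible_fraction must be between 0 and 1")
    reducible = round(total_turns * reducible_fraction)
    return reducible, total_turns - reducible


if __name__ == "__main__":
    # Smoke Multi PR: 89 turns, 96% data-gathering
    print(reducible_turns(89, 0.96))   # (85, 4)
    # Schema Consistency Checker: 108 turns, 50% data-gathering
    print(reducible_turns(108, 0.50))  # (54, 54)
```

Under that assumption, moving data-gathering into pre-steps would leave Smoke Multi PR with roughly 4 agentic turns, within the 2–4 turns the report describes as normal for this workflow.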

Ecosystem Coverage

Coverage Map (179 workflows)

Area	Coverage	Agents
Code quality / review	✅ Good	Grumpy Code Reviewer, PR Nitpick Reviewer, Schema Consistency Checker
CI smoke testing	✅ Good	Smoke Copilot, Smoke Claude, Agent Container Smoke Test
Issue management	✅ Good	Issue Monster, Auto-Triage Issues, PR Triage Agent
Documentation	✅ Good	Documentation Unbloat
Release management	✅ Good	Release
Security	✅ Covered	Security Review Agent
Cross-repo operations	⚠️ P1 Bug	Smoke Create/Update Cross-Repo PR
Multi-PR orchestration	⚠️ Unstable	Smoke Multi PR

Recommendations

High Priority

Fix Smoke Claude MCP timeout ([observability escalation] Smoke test workflows repeatedly exceed resource and control thresholds (Smoke Claude, Smoke Copilot) #23528, [aw] Smoke Claude failed #23067)
- Root cause: the MCP HTTP connection closes at 412s, but the agent runs 12+ min (720s or more) before safe_outputs, so the connection is already gone when outputs are written
- Fix: Increase MCP timeout or reduce agent runtime via tighter scope
- Impact: Save ~$10–15/week, restore 5 failing runs/week
Investigate Smoke Multi PR run #604 spike (89 turns)
- Execution drift from 2–4 turns to 89 turns is unusual
- Recommend: Add turn budget / timeout to prevent runaway execution
- Move data-gathering to deterministic pre-steps (96% reducible)
- Impact: Reduce cost, improve reliability
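The turn budget recommended above can be sketched as a simple counter that stops a run once it exceeds its allowance. This is a generic illustration, not the workflow framework's actual mechanism. `TurnBudget`, `TurnBudgetExceeded`, and the budget of 10 used in the example are all hypothetical.

```python
class TurnBudgetExceeded(RuntimeError):
    """Raised when an agent tries to take more turns than it was allotted."""


class TurnBudget:
    def __init__(self, max_turns: int) -> None:
        if max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        self.max_turns = max_turns
        self.used = 0

    def spend(self) -> int:
        """Record one turn and return how many turns remain."""
        if self.used >= self.max_turns:
            raise TurnBudgetExceeded(
                f"turn budget of {self.max_turns} exhausted"
            )
        self.used += 1
        return self.max_turns - self.used


if __name__ == "__main__":
    # Hypothetical budget: generous headroom over the usual 2-4 turns.
    budget = TurnBudget(max_turns=10)
    try:
        for _ in range(89):  # replay the 89-turn run against the budget
            budget.spend()
    except TurnBudgetExceeded as exc:
        print(f"stopped after {budget.used} turns: {exc}")
```

A hard stop like this trades the partial success seen in run #604 (2 PRs before failing) for a bounded cost, so the budget should sit comfortably above the normal turn count.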

Medium Priority

Schema Consistency Checker: evaluate model downgrade
- Flagged model_downgrade_available on 4/5 recent runs
- Potential cost savings: 40–60% per run
- Recommend: Test with gpt-5.4-mini or equivalent smaller model
Close PR Triage Agent issue [aw] PR Triage Agent failed #23151 (manual action needed)
- Agent has 9 consecutive successes; issue is stale
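For the Schema Consistency Checker downgrade, the projected per-run cost follows directly from the reported $1.73 run and the 40–60% savings estimate. The sketch below is a worked calculation only. It assumes the savings range applies to the full run cost, and `projected_cost` is a hypothetical helper.

```python
def projected_cost(current_usd: float, savings_low: float, savings_high: float) -> tuple[float, float]:
    """Return (best case, worst case) per-run cost after the savings range is applied."""
    best = round(current_usd * (1 - savings_high), 2)
    worst = round(current_usd * (1 - savings_low), 2)
    return best, worst


if __name__ == "__main__":
    # $1.73 per run, 40-60% estimated savings
    print(projected_cost(1.73, 0.40, 0.60))  # (0.69, 1.04)
```

That puts a downgraded run at roughly $0.69–$1.04, provided the smaller model completes the task in a comparable number of turns.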

Low Priority

Cross-repo smoke tests (Smoke Update/Create): P1 bug tracked in [aw] Smoke Update Cross-Repo PR failed #23193, [aw] Smoke Create Cross-Repo PR failed #23715 — no further action needed from APR
Smoke Codex / Smoke Gemini: Team has closed these as not_planned — monitor only

Trends

Metric	This Run	Last Run (Apr 1)	7d Trend
Quality score	75/100	76/100	↓1
Effectiveness	73/100	73/100	→
Ecosystem health	74/100	74/100	→ (per WHM)
Claude runs (7d)	8	~7	→
Copilot runs (7d)	24	~22	↑ slight
Claude cost (7d)	$5.64	~$6.10	↓ (CLI Checker normalized)

Actions Taken This Run

✅ Analyzed 32 agentic runs across Claude and Copilot engines
✅ Identified Smoke Multi PR as new Watch item (89-turn execution spike)
✅ Confirmed CLI Version Checker recovery (off Watch)
✅ Confirmed Agent Persona Explorer recovery
✅ Generated this performance report
ℹ️ No new critical issues created — all critical issues already tracked

References:

§23878311567 — Smoke Multi PR failure (89 turns)
§23883794563 — Schema Consistency Checker (108 turns, in-progress)
§23883959022 — This run (Agent Performance Analyzer)

AI generated by Agent Performance Analyzer - Meta-Orchestrator

expires on Apr 3, 2026, 4:45 AM UTC
