Resource over-consumption (2 workflows): Smoke Multi PR, Schema Consistency Checker — too many turns for task scope
MCP timeout failures (1 workflow): Smoke Claude — well-understood root cause; blocked on infra fix
Model over-sizing (Schema Checker): Using frontier Claude model for a task flagged as suitable for a smaller model
Effectiveness Analysis
Task Completion Rates (7d)

| Category | Rate | Notes |
| --- | --- | --- |
| Claude utility workflows | 3/3 = 100% | Ex-Smoke Claude: all succeeded |
| Copilot smoke tests | ~60–70% | Factoring known P1 infra failures |
| Copilot utility workflows | ~85% | Issue Monster, Release, etc. all solid |
| Smoke Claude | 0/5 = 0% | Infra-blocked (P1) |
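The exact rates in the table can be reproduced with a small helper. This is a minimal sketch: `completion_rate` is a hypothetical name, and the blended figure is an illustration of why Smoke Claude is reported separately, not a number from the report itself.

```python
def completion_rate(succeeded: int, total: int) -> float:
    """Return the completion rate as a percentage (0.0 when there are no runs)."""
    if total == 0:
        return 0.0
    return 100.0 * succeeded / total


# Counts taken from the table: Claude utility workflows 3/3, Smoke Claude 0/5.
claude_utility = completion_rate(3, 3)
smoke_claude = completion_rate(0, 5)

# Illustration only: folding Smoke Claude into the Claude utility bucket
# would mask three healthy workflows behind one infra-blocked one.
blended = completion_rate(3 + 0, 3 + 5)

print(f"Claude utility: {claude_utility:.1f}%")
print(f"Smoke Claude:   {smoke_claude:.1f}%")
print(f"Blended:        {blended:.1f}%")
```

Keeping the infra-blocked workflow in its own row keeps the utility-workflow rate meaningful.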
Behavioral Patterns
Productive Patterns ✅
Lean agentic footprint: Most Copilot utility workflows (Auto-Triage, Issue Monster, Contribution Check) show 3–9 turns — healthy
Stable collaboration: Documentation Unbloat + CLI Version Checker generate complementary improvements without overlap
Deterministic pre-steps working: Several workflows benefit from frontmatter setup steps (see Smoke Copilot)
Problematic Patterns ⚠️
Smoke Multi PR execution drift: 0 turns (skipped) vs 89 turns (heavy) in the same week — highly unstable
Claude MCP timeout accumulation: Smoke Claude running 12+ minutes per attempt, each costing ~$1–$2 in tokens before failing
Partially reducible pattern (systemic): partially_reducible flagged on 2+ agentic runs this week (Smoke Multi PR 96%, Schema Checker 50%) — data-gathering tasks should be pre-steps
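Two of the patterns above come down to simple arithmetic, sketched here. The function names and the `HEALTHY_MAX_TURNS` cutoff are hypothetical; the cutoff borrows the upper end of the 3–9 turn range called healthy above. The spend estimate assumes that each of the five failed Smoke Claude runs in the 7d table cost the quoted ~$1–$2, which the report does not state directly.

```python
from typing import Sequence, Tuple

# Assumption: upper end of the 3-9 turn range described as healthy above.
HEALTHY_MAX_TURNS = 9


def turn_spread(turns: Sequence[int]) -> int:
    """Difference between the heaviest and lightest run, in turns."""
    return max(turns) - min(turns)


def wasted_spend(failed_attempts: int, cost_low: float, cost_high: float) -> Tuple[float, float]:
    """Low and high estimate of token spend on attempts that failed."""
    return failed_attempts * cost_low, failed_attempts * cost_high


# Smoke Multi PR: 0 turns (skipped) and 89 turns (heavy) in the same week.
spread = turn_spread([0, 89])
is_unstable = spread > HEALTHY_MAX_TURNS

# Smoke Claude: assumed 5 failed attempts at roughly $1 to $2 each.
low, high = wasted_spend(5, 1.0, 2.0)

print(f"Turn spread: {spread} (unstable: {is_unstable})")
print(f"Estimated wasted spend: ${low:.2f} to ${high:.2f}")
```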
Executive Summary
Performance Rankings
Top Performing Agents 🏆
Documentation Unbloat — Claude (Quality: 90/100, Effectiveness: 88/100)
CLI Version Checker — Claude (Quality: 85/100, Effectiveness: 84/100)
Issue Monster — Copilot (Quality: 84/100, Effectiveness: 87/100)
Release — Copilot (Quality: 83/100, Effectiveness: 80/100)
Auto-Triage Issues / Contribution Check — Copilot (Quality: 80/100, Effectiveness: 82/100)
Agents Needing Improvement 📉
Smoke Claude (Quality: 20/100, Effectiveness: 15/100) — P1 Ongoing
safe_outputs job times out
Smoke Multi PR — Copilot (Quality: 50/100, Effectiveness: 40/100) — New Watch ⚠️
resource_heavy (high severity) and partially_reducible (medium)
Schema Consistency Checker — Claude (Quality: 65/100, Effectiveness: 70/100) — Watch
model_downgrade_available flag across 4 previous successful runs
GitHub Remote MCP Authentication Test — Copilot (Quality: N/A, Effectiveness: 30/100)
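The split between the two ranking lists can be approximated from the published scores. The sketch below is an assumption-laden illustration: `AgentScore`, `needs_improvement`, and the cutoff of 80 are hypothetical, chosen only because that cutoff happens to reproduce the grouping above. The report does not say how agents were actually classified.

```python
from typing import NamedTuple, Optional


class AgentScore(NamedTuple):
    name: str
    quality: Optional[int]  # None where the report lists N/A
    effectiveness: int


# Scores copied from the rankings above.
SCORES = [
    AgentScore("Documentation Unbloat", 90, 88),
    AgentScore("CLI Version Checker", 85, 84),
    AgentScore("Issue Monster", 84, 87),
    AgentScore("Release", 83, 80),
    AgentScore("Auto-Triage Issues / Contribution Check", 80, 82),
    AgentScore("Smoke Claude", 20, 15),
    AgentScore("Smoke Multi PR", 50, 40),
    AgentScore("Schema Consistency Checker", 65, 70),
    AgentScore("GitHub Remote MCP Authentication Test", None, 30),
]

# Hypothetical cutoff, not taken from the report.
WATCH_THRESHOLD = 80


def needs_improvement(agent: AgentScore, threshold: int = WATCH_THRESHOLD) -> bool:
    """Flag an agent when any available score falls below the threshold."""
    available = [s for s in (agent.quality, agent.effectiveness) if s is not None]
    return min(available) < threshold


# Worst effectiveness first, so the most urgent agent leads the list.
flagged = sorted(
    (a for a in SCORES if needs_improvement(a)),
    key=lambda a: a.effectiveness,
)

for agent in flagged:
    print(f"{agent.name}: quality={agent.quality}, effectiveness={agent.effectiveness}")
```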
Recovered Agents ✅
Quality Analysis
Output Quality Distribution
Common Quality Issues
Ecosystem Coverage
Coverage Map (179 workflows)
Recommendations
High Priority
Fix Smoke Claude MCP timeout (#23528 "[observability escalation] Smoke test workflows repeatedly exceed resource and control thresholds (Smoke Claude, Smoke Copilot)"; #23067 "[aw] Smoke Claude failed")
Investigate Smoke Multi PR run #604 spike (89 turns)
Medium Priority
Schema Consistency Checker: evaluate model downgrade
model_downgrade_available on 4/5 recent runs
gpt-5.4-mini or equivalent smaller model
Close PR Triage Agent issue [aw] PR Triage Agent failed #23151 (manual action needed)
Low Priority
not_planned — monitor only
Trends
Actions Taken This Run
References: