[prompt-clustering] Copilot Agent Prompt Clustering Analysis — 904 Tasks, 8 Clusters #23582
Note: this discussion has been marked as outdated. A newer analysis is available at Discussion #23775.
Summary
Analysis of 904 Copilot agent task prompts from the github/gh-aw repository (dataset window: January 21 – February 7, 2026). Prompts were vectorized with TF-IDF and clustered with cosine-similarity K-means (k=8).
Cluster Overview
Cluster Details
Cluster 1: MCP & CLI Tooling
207 tasks (22.9%) | Success: 57.0% | Avg commits: 3.9 | Avg files: 29.2
Top keywords: mcp, command, version, copilot, github, cli, server
Tasks related to MCP server upgrades (Sentry, gh-aw-mcpg, etc.), CLI command additions, and Copilot agent configuration. The largest cluster (22.9% of all tasks) but the lowest success rate (57%), suggesting MCP/version-update tasks are frequently iterated on or abandoned mid-stream.
Representative PRs:
Cluster 2: Code Quality & Testing
151 tasks (16.7%) | Success: 67.5% | Avg commits: 3.4 | Avg files: 13.0
Top keywords: code, test, files, quality, validation, schema, lines
Tasks fixing code validation issues, TypeScript type errors, schema mismatches, and test regressions. High volume (16.7%) with moderate success (67.5%) — well-understood bug-fix patterns that often require multiple iterations.
Representative PRs:
Cluster 3: Agentic Workflow Maintenance
114 tasks (12.6%) | Success: 81.6% | Avg commits: 4.4 | Avg files: 18.5
Top keywords: agentic, agentic workflow, agentic workflows, workflows, workflow, create, file
Tasks maintaining the agentic workflow infrastructure itself: updating issue templates, failure reports, compiling workflow YAML, and wiring CI hooks. Highest success rate (81.6%) — highly specific and well-scoped tasks.
Representative PRs:
@copilot to workflow sync issues when agent token available
Cluster 4: Safe-Outputs & Project Infrastructure
102 tasks (11.3%) | Success: 76.5% | Avg commits: 4.7 | Avg files: 20.0
Top keywords: safe, project, safe outputs, outputs, safe output, output, create
Tasks configuring the safe-outputs MCP container (Dockerfile, git installation, node:lts images) and project-level infrastructure. Solid 76.5% success with the most commits per task (avg 4.7).
Representative PRs:
Cluster 5: Workflow Failure Investigation
100 tasks (11.1%) | Success: 78.0% | Avg commits: 3.0 | Avg files: 8.1
Top keywords: workflow, failure, agent, section, report, failed, debug
Tasks investigating and fixing specific agent run failures — debugging daily-cli-performance runs, ANSI escape sequences in YAML, and analyzing failure reports. 78% success with low file churn (avg 8.1 files).
Representative PRs:
Cluster 6: Investigation & Debugging
94 tasks (10.4%) | Success: 72.3% | Avg commits: 3.3 | Avg files: 26.2
Top keywords: reference, why, investigate, debug, review, see, comment
Investigation-heavy tasks where the agent is given context from prior issues/PRs and asked to debug or review. Wide file churn (avg 26.2 files). 72.3% success rate.
Representative PRs:
Cluster 7: Campaign & Security Automation
87 tasks (9.6%) | Success: 56.3% | Avg commits: 4.3 | Avg files: 8.7
Top keywords: campaign, docs, security, alert, dependabot, project, prs
Tasks building and evolving the dependabot campaign system, security alert processing, and PR review automation. Second-lowest success rate (56.3%) — complex multi-component tasks with higher failure risk.
Representative PRs:
Cluster 8: CI Job Analysis & Go/TypeScript Fixes
49 tasks (5.4%) | Success: 75.5% | Avg commits: 3.4 | Avg files: 22.8
Top keywords: job, analyze workflow, job url, workflow, analyze, failing, logs
Smallest cluster (5.4%) — analyzing failing CI jobs by URL, fixing Go lint errors, and TypeScript type fixes. 75.5% success, highest additions per task (avg 1722) due to larger refactors.
Representative PRs:
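The per-cluster metrics quoted above (success rate, avg commits, avg files) are straightforward aggregations over a PR-level export. A minimal sketch with pandas — the column names (`cluster`, `merged`, `commits`, `files`) are assumed for illustration and may not match the actual dataset schema:

```python
# Sketch: per-cluster summary stats from a PR-level export.
# Column names are hypothetical, not the real dataset schema.
import pandas as pd

prs = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 2],
    "merged":  [True, False, True, True, False],
    "commits": [3, 5, 2, 4, 3],
    "files":   [10, 30, 5, 12, 8],
})

summary = prs.groupby("cluster").agg(
    tasks=("merged", "size"),
    success_rate=("merged", "mean"),  # fraction of PRs that merged
    avg_commits=("commits", "mean"),
    avg_files=("files", "mean"),
)
print(summary)
```

Treating "merged" as success and taking the mean of the boolean column yields the success-rate percentages reported per cluster.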
Sample Data Table (100 most recent PRs)
@playwright/mcp version is already updated
Key Findings & Recommendations
MCP/CLI tasks have the lowest success rate (57%) despite being the largest category (23%). Consider breaking large MCP upgrade tasks into smaller atomic steps and adding automated compile+test gates before merging.
Agentic Workflow Maintenance is the highest-success category (81.6%) — highly repetitive, well-defined tasks with explicit success criteria. Use this as a template for improving prompt quality in other categories.
Investigation tasks (Cluster 6) have high file churn (avg 26 files, 72.3% success) — providing more targeted context (specific file paths, error stacks, reproduction steps) in prompts would likely improve outcomes and reduce scope creep.
Campaign & Security Automation tasks have 56.3% success — multi-component changes are risky. Splitting into smaller focused PRs (schema changes, workflow changes separately) could improve merge rates.
Average 3–5 commits per task across all clusters — agents rarely solve problems in one shot. Adding structured acceptance criteria and expected outputs to prompts may reduce iteration counts.