Error Clusters Identified: 4 (1 new critical: API rate limit burst)
The dominant finding today is a rate limit burst at 12:13 UTC caused by ~30 workflows completing concurrently — all triggered by the same daily schedule at 12:00 UTC — exhausting the GitHub App installation rate limit and causing 7 of today's 10 failures in a 41-second window.
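#### Cluster 1: API Rate Limit Exceeded — Concurrent Burst 🔴 NEW / HIGH PRIORITY
- **Affected Job Types**: `add_comment`, `create_issue`, `create_pull_request_review_comment`, `update_pull_request`
- **Time Window**: All failures within 12:13:13–12:13:54 UTC (41 seconds)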
**Sample Error**:
```
##[error]Failed to add comment: API rate limit exceeded for installation.
Request ID: 1C80:176867:2287A2E:8AC3D6F:69CE5D58
Timestamp: 2026-04-02 12:13:13 UTC
```
**Root Cause**: All three workflows (Workflow Health Manager, Smoke Codex, Smoke Claude) were triggered by the same daily schedule at 12:00 UTC. They all completed their agent phases at ~12:13 UTC and simultaneously issued GitHub API calls in their safe outputs jobs, saturating the GitHub App installation rate limit.
**Impact**:
- Smoke Codex: 0/2 safe outputs succeeded — no test result recorded
- Smoke Claude: 6/11 failed — key outputs lost
- Workflow Health Manager: 1/3 failed — add_comment to dashboard issue #23881 lost; synthetic body update also failed
**Affected Runs**: [§23899445141](https://github.com/github/gh-aw/actions/runs/23899445141), [§23899414677](https://github.com/github/gh-aw/actions/runs/23899414677), [§23899414690](https://github.com/github/gh-aw/actions/runs/23899414690)
---
#### Cluster 2: push_to_pull_request_branch — Files Outside Allowed List 🟡 RECURRING (Day 2)
- **Count**: 1 failure (+ 2 cascading cancellations)
- **Affected Workflow**: Smoke Claude ([§23899414690](https://github.com/github/gh-aw/actions/runs/23899414690))
- **Affected Job Types**: `push_to_pull_request_branch`
**Sample Error**:
```
##[error]Cannot push to pull request branch: patch modifies files outside
the allowed-files list (tmp-smoke-test-pr-push-23899414690.txt). Add the
files to the allowed-files configuration field or remove them from the patch.
```
**Root Cause**: Smoke Claude agent creates run-specific test files (varying names: `.smoke`, `.json`, `.txt` suffixes) that are not in the `allowed_files` whitelist. Config only permits `.github/smoke-claude-push-test.md`. Pattern observed for 3rd time (2× on 2026-04-01, 1× today).
**Cascading Effect**: When `push_to_pull_request_branch` fails, all remaining safe outputs are cancelled.
---
#### Cluster 3: dispatch_workflow — Branch Not Found 🟡 NEW
- **Count**: 1 failure
- **Affected Workflow**: Smoke Copilot ([§23899414681](https://github.com/github/gh-aw/actions/runs/23899414681))
**Sample Error**:
```
##[error]Failed to dispatch workflow "haiku-printer":
No ref found for: refs/heads/copilot/review-job-steps-consolidation
```
**Root Cause**: Agent dispatched a workflow targeting a branch that no longer exists (it was likely merged or deleted). 10 of 11 safe outputs in this run succeeded, so this was an isolated failure.
---
#### Cluster 4: Minor Warnings (Non-Fatal) 🟢
Three warnings were handled gracefully without hard failures:
- **`add_labels`: label not found** (run §23901267724): the `constraint-solving` and `problem-of-the-day` labels do not exist in the repository. Safe output continued.
- **Cannot reply to review comment** (run §23901289775): not in PR context; PR review submission silently skipped.
- **Unresolvable PR review comment line** (run §23899414681): retried successfully as a body-only review.
### Root Cause Analysis
#### API Rate Limit (Primary Concern)
The GitHub App installation rate limit is shared across all workflow runs in the repository. When many workflows are triggered by the same cron schedule (12:00 UTC daily), they complete their agent phases within a similar timeframe (~10–15 minutes) and then hit the GitHub API simultaneously. Today's burst hit at 12:13 UTC across at least 30 concurrent runs.

The safe outputs handler implements 1-retry logic for `update_pull_request`, but not for `add_comment`, `create_issue`, or `create_pull_request_review_comment`.
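
To illustrate how spreading start times could reduce the overlap described above, here is a minimal sketch that derives a stable per-workflow minute offset from the workflow name. The function name `staggered_cron` and the 30-minute window are assumptions made for illustration; nothing here reflects how the actual scheduler assigns times.

```python
import hashlib


def staggered_cron(workflow_name: str, base_hour: int = 12, window_minutes: int = 30) -> str:
    """Return a daily cron expression (UTC) with a stable per-workflow minute.

    Hypothetical helper: the offset is derived from a hash of the workflow
    name, so it stays the same on every call and differs between workflows
    in most cases.
    """
    if not 0 <= base_hour <= 23:
        raise ValueError("base_hour must be between 0 and 23")
    if not 1 <= window_minutes <= 60:
        raise ValueError("window_minutes must be between 1 and 60")
    digest = hashlib.sha256(workflow_name.encode("utf-8")).digest()
    # Map the first four hash bytes onto the allowed minute window.
    offset = int.from_bytes(digest[:4], "big") % window_minutes
    # Standard five-field cron: minute hour day-of-month month day-of-week.
    return f"{offset} {base_hour} * * *"


if __name__ == "__main__":
    for name in ("Workflow Health Manager", "Smoke Codex", "Smoke Claude"):
        print(f"{name}: {staggered_cron(name)}")
```

Hash-based offsets can still collide for some pairs of workflows, so this only spreads load on average; an explicit per-workflow offset table would give tighter control.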
#### Smoke Claude Push Configuration (Structural Issue)
The Smoke Claude workflow is intended to test `push_to_pull_request_branch`, but the agent consistently generates run-specific filenames rather than the expected `.github/smoke-claude-push-test.md`. This pattern has occurred 3 times across 2 days and represents a test configuration mismatch, not an intermittent error.
### Recommendations
#### Critical Issues (Immediate Action Required)
**Add retry logic for API rate limit errors in safe output handler**
- **Priority**: High
- **Root Cause**: `add_comment`, `create_issue`, and `create_pull_request_review_comment` have no retry on rate limit
- **Recommended Action**: Add exponential backoff with 2–3 retries for 429/rate-limit errors. `update_pull_request` already has 1 retry; extend that pattern to all handlers.

**Fix Smoke Claude `push_to_pull_request_branch`: allowed files mismatch**
- **Priority**: High (recurring 3× in 2 days)
- **Root Cause**: Agent creates run-specific filenames; `allowed_files` only permits `.github/smoke-claude-push-test.md`
- **Recommended Action (Option A)**: Update the Smoke Claude agent prompt to explicitly instruct it to write only to `.github/smoke-claude-push-test.md` for push tests.
- **Recommended Action (Option B)**: Update the `allowed_files` configuration to accept a pattern like `tmp-smoke-test-pr-push-*.txt` or a more permissive list.
#### Medium Priority Issues
**Stagger schedules to reduce concurrent API burst**
- **Priority**: Medium
- **Current State**: Multiple workflows share the same daily schedule (12:00 UTC), causing concurrent safe output API calls
- **Recommended Action**: Stagger non-critical workflow schedules by 5–30 minutes to spread API load, or add per-workflow jitter in the scheduler.

**Fix Smoke Copilot `dispatch_workflow`: use stable branch**
- **Priority**: Medium
- **Root Cause**: `dispatch_workflow` targets a copilot-generated branch that gets deleted after merge
- **Recommended Action**: Smoke Copilot should dispatch workflows against a long-lived branch (e.g., `main`) or the current PR head branch, not an ephemeral copilot branch.
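
The stable-branch recommendation could be approximated by checking the ref before dispatching and falling back to a long-lived branch. The sketch below is an assumption-laden illustration: `resolve_dispatch_ref` and the injected `branch_exists` callback are hypothetical names, and the real existence check (for example, a Git refs lookup) is left to the caller.

```python
from typing import Callable


def resolve_dispatch_ref(
    requested_branch: str,
    branch_exists: Callable[[str], bool],
    fallback_branch: str = "main",
) -> str:
    """Pick a branch that still exists before dispatching a workflow.

    Hypothetical helper: `branch_exists` is supplied by the caller and is
    expected to return True when the branch is present in the repository.
    """
    if branch_exists(requested_branch):
        return requested_branch
    if branch_exists(fallback_branch):
        # The requested branch is gone (e.g. merged and deleted); use the stable one.
        return fallback_branch
    raise LookupError(
        f"neither {requested_branch!r} nor fallback {fallback_branch!r} exists"
    )


if __name__ == "__main__":
    existing = {"main"}
    ref = resolve_dispatch_ref(
        "copilot/review-job-steps-consolidation", existing.__contains__
    )
    print(f"dispatching against: {ref}")
```

Injecting the existence check keeps the helper testable without network access and leaves the choice of API call to the handler.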
#### Low Priority Issues
**Create missing labels for Constraint Solving workflow**
- The `constraint-solving` and `problem-of-the-day` labels don't exist; create them or remove them from the agent instructions.
### Work Item Plans
#### Work Item 1: Add Rate-Limit Retry Logic to Safe Output Handler
- **Type**: Bug Fix
- **Priority**: High
- **Description**: When the GitHub App installation rate limit is hit during a safe output operation, the handler should retry with exponential backoff instead of failing immediately. Today 7 safe output messages were lost due to a single 41-second rate limit burst.
- **Acceptance Criteria**:
  - `add_comment`, `create_issue`, `create_pull_request_review_comment` retry on rate limit (HTTP 429 / "rate limit exceeded")
  - Retry uses exponential backoff (e.g., 5s, 15s) with max 3 attempts
  - Rate-limit retries are logged clearly (warning level)
  - Smoke tests pass with no new failures from rate limiting
- **Technical Approach**: Extend the existing retry pattern from `update_pull_request` to all handlers that make POST/PUT API calls
- **Estimated Effort**: Small–Medium
- **Dependencies**: None
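
As a rough illustration of the retry behaviour described in Work Item 1, the following sketch wraps an arbitrary operation in exponential backoff. It is not the handler's actual code: `RateLimitError`, `call_with_backoff`, and the logger name are hypothetical, and the defaults (3 attempts, 5s then 15s) are taken from the example delays in the acceptance criteria.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("safe_outputs")  # hypothetical logger name


class RateLimitError(Exception):
    """Hypothetical error raised when the API reports a rate limit (e.g. HTTP 429)."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 5.0,
    factor: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on RateLimitError with exponential backoff.

    With the defaults the waits are 5s and then 15s; if the final attempt
    is also rate limited, the error is re-raised to the caller.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RateLimitError as err:
            if attempt == max_attempts:
                raise
            delay = base_delay * factor ** (attempt - 1)
            # Warning level, so rate-limit retries are visible in job logs.
            logger.warning(
                "rate limited (attempt %d/%d): %s; retrying in %.0fs",
                attempt, max_attempts, err, delay,
            )
            sleep(delay)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so that tests can record the delays instead of waiting. Only rate-limit errors are retried; other failures propagate immediately.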
#### Work Item 2: Fix Smoke Claude Push Test Configuration
- **Type**: Bug Fix / Config
- **Priority**: High
- **Description**: Smoke Claude consistently generates run-specific filenames for its push test, but `allowed_files` only permits `.github/smoke-claude-push-test.md`. This has failed 3 times in 2 days and cascades into cancellations of subsequent safe outputs.
- **Acceptance Criteria**:
  - Smoke Claude push test runs succeed without "files outside allowed-files list" error
  - No cascading cancellations from `push_to_pull_request_branch` failures
  - Fix is stable across 5 consecutive runs
- **Technical Approach**: Update the Smoke Claude agent prompt to explicitly use the whitelisted filename, or update `allowed_files` to accept `tmp-smoke-test-pr-push-*.txt`
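
To show what accepting a pattern in `allowed_files` might look like, here is a small sketch using shell-style globs from Python's `fnmatch` module. The matching semantics of the real `allowed_files` field are not stated in this report, so glob matching is an assumption, and `files_outside_allowed` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def files_outside_allowed(
    patch_files: Iterable[str], allowed_patterns: Iterable[str]
) -> List[str]:
    """Return the patch files that match none of the allowed patterns.

    Assumption: patterns are shell-style globs. Note that with fnmatch,
    `*` also matches `/`, so a pattern can span directory separators.
    """
    patterns = list(allowed_patterns)
    return [
        path
        for path in patch_files
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    patch = ["tmp-smoke-test-pr-push-23899414690.txt"]
    current = [".github/smoke-claude-push-test.md"]
    widened = current + ["tmp-smoke-test-pr-push-*.txt"]
    print("rejected with current config:", files_outside_allowed(patch, current))
    print("rejected with widened config:", files_outside_allowed(patch, widened))
```

A narrow pattern such as `tmp-smoke-test-pr-push-*.txt` would cover the `.txt` case seen today but not the `.smoke` or `.json` suffixes noted in Cluster 2, which is one reason the prompt-based fix may be the more robust option.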
### Historical Context
#### 5-Day Trend (2026-03-29 to 2026-04-02)
Failure types on 2026-04-02: API rate limit burst (×7), files-outside-allowed-list (×1), branch-not-found (×1)
#### Trends
- **Error rate trend**: Spike today (10 failures vs. prior 4-day average of ~3). The primary driver is the new rate-limit burst pattern, a systemic concern for high-concurrency schedule days.
- **Recurring issue (most concerning)**: `push_to_pull_request_branch` files-outside-allowed-list, now at 3 occurrences in 2 days with no fix yet.
- **Resolved patterns (positive)**: `invalid_event_context` (last seen Mar 30), `add_labels_no_issue_number` (last seen Mar 30), `missing remote branch` (last seen Mar 31); all appear resolved.
### Overall Metrics
- **Overall Safe Output Message Success Rate**: 80.8%
- **Most Reliable Job Types**: `submit_pull_request_review`, `create_discussion`, `post_slack_message` (0 failures)
- **Most Problematic Job Types**: `create_issue` (Smoke Codex: 0% success rate today), `create_pull_request_review_comment` (affected by rate limit)
- **Primary Failure Mode**: API rate limiting (70% of today's failures)
### Next Steps
- […] (`constraint-solving`, `problem-of-the-day`) for Constraint Solving workflow
- […] `dispatch_workflow` branch-not-found for recurrence