🏥 CI Failure Investigation - Daily Perf Improver Timeout Loop

# 🏥 CI Failure Investigation - Run #17337145012

## Summary
**CRITICAL**: The Daily Perf Improver workflow is experiencing a systematic timeout pattern where the "Execute Claude Code Action" step gets stuck in "in_progress" status, causing complete workflow failure and blocking all subsequent runs.

## Failure Details
- **Run**: [17337145012](https://github.com/fsprojects/Furnace/actions/runs/17337145012)
- **Commit**: 7d01ba1ef5d97480e1f06260e085bef18e7639f2
- **Trigger**: workflow_dispatch
- **Failed Job**: daily-perf-improver (ID: 49225314587)
- **Failed Step**: "Execute Claude Code Action" (Step 11) - **stuck in "in_progress"**
- **Timeout Configured**: 30 minutes

## Root Cause Analysis

### Primary Root Cause: Claude Code Action Timeout Loop
The `anthropics/claude-code-base-action@v0.0.56` consistently gets stuck in "in_progress" status without completing or properly timing out.

### Historical Pattern Evidence
**CRITICAL PATTERN**: Multiple consecutive runs show the same failure mode:
- **Run #8** (17337145012): **FAILURE** - timeout at Execute Claude Code Action ❌
- **Run #9** (17337213781): **CANCELLED** - likely same timeout issue ❌  
- **Run #10** (17337179553): **CANCELLED** - likely same timeout issue ❌
- **Run #11** (17337247622): **IN_PROGRESS** - still running when investigated ⏳
- **Run #12** (17337280807): **PENDING** - queued behind stuck runs ⏳

This demonstrates a **100% failure rate** across the last 5 workflow runs.

## Investigation Findings

### Successfully Created Issues Previously
The workflow HAS been working correctly before, as evidenced by successful creation of:
- Issue #45: "Daily Perf Improver: Research and Plan"
- Issue #46: "Daily Test Coverage Improver: Research and Plan"  
- PRs #50 and #51: Performance and test coverage improvements

This indicates the timeout issue is a **recent development**.

### Contributing Factors
1. **Extremely Complex Prompt**: 300+ line prompt with multi-step performance improvement workflow
2. **Resource-Intensive Tasks**: F# build processes, benchmarking, and performance analysis
3. **Docker MCP Configuration**: Using `ghcr.io/github/github-mcp-server:sha-45e90ae` which may have connection/timeout issues
4. **Action Version**: `anthropics/claude-code-base-action@v0.0.56` may have timeout handling bugs

### Failed Jobs and Errors
- **Logs Inaccessible**: Job logs return 404 error via API - likely due to ongoing/stuck process
- **Step Status**: "Execute Claude Code Action" shows "in_progress" instead of "failed" or "timed out"
- **Timeout Handling**: 30-minute timeout appears to not be working correctly

## Recommended Actions

### Immediate Actions (HIGH Priority)
- [ ] **Cancel all stuck workflow runs** manually to clear the queue
- [ ] **Temporarily disable the workflow** to prevent resource waste
- [ ] **Upgrade Claude Code Action** to latest version (check for v0.1.x releases)
- [ ] **Reduce timeout** from 30 minutes to 10-15 minutes for faster failure detection

### Configuration Changes (MEDIUM Priority)  
- [ ] **Simplify the prompt** - break the 300+ line prompt into smaller, focused chunks
- [ ] **Add timeout guards** - implement step-level timeouts within the prompt itself
- [ ] **Test MCP configuration** - verify Docker GitHub MCP server connectivity
- [ ] **Add workflow concurrency limits** - prevent multiple runs from interfering

### Long-term Improvements (LOW Priority)
- [ ] **Split workflow into phases** - separate research, planning, and implementation phases
- [ ] **Add health checks** - implement periodic status reporting within long-running steps
- [ ] **Resource monitoring** - add memory and CPU usage monitoring
- [ ] **Fallback mechanisms** - implement graceful degradation when timeouts occur

## Prevention Strategies

### Workflow Robustness
1. **Timeout Cascading**: Implement multiple timeout layers (step-level, job-level, workflow-level)
2. **Progress Reporting**: Add periodic status updates within long-running Claude operations
3. **Resource Limits**: Implement memory and CPU constraints to prevent resource exhaustion
4. **Retry Logic**: Add exponential backoff retry for transient failures

### AI Team Self-Improvement
Add these prompting instructions to instructions.md for AI coding agents:

```markdown
## Workflow Timeout Prevention
- Keep AI workflow prompts under 100 lines when possible
- Implement progress reporting every 5-10 minutes in long-running operations
- Add explicit timeout guards within AI agent prompts
- Test workflow steps individually before combining into complex multi-step processes
- Use step-level timeouts (5-15 minutes) rather than relying solely on job-level timeouts
- Monitor Docker-based MCP server health and implement connection retries
```

## Historical Context

This is the **FIRST reported instance** of this specific timeout pattern. Previous workflow runs were successful, indicating:
- The workflow configuration was previously functional
- This represents a **regression** rather than an initial configuration issue  
- The issue may be related to recent changes in the Claude Code Action or infrastructure

## Impact Assessment

- **Severity**: 🔴 **CRITICAL** - Workflow completely non-functional
- **Scope**: All Daily Perf Improver workflow runs failing
- **Resource Impact**: Consuming GitHub Actions minutes without producing results
- **Development Impact**: Blocking automated performance improvements

## Technical Details

### Workflow File
`.github/workflows/daily-perf-improver.lock.yml` - Line 347-423 contains the failing step

### Timeout Configuration
```yaml
timeout_minutes: 30
```

### MCP Server Configuration
```yaml
"ghcr.io/github/github-mcp-server:sha-45e90ae"
```

### Action Version
```yaml
anthropics/claude-code-base-action@v0.0.56
```

---

> AI-generated content by [CI Failure Doctor](https://github.com/fsprojects/Furnace/actions/runs/17337269297) may contain mistakes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🏥 CI Failure Investigation - Daily Perf Improver Timeout Loop #52

🏥 CI Failure Investigation - Run #17337145012

Summary

Failure Details

Root Cause Analysis

Primary Root Cause: Claude Code Action Timeout Loop

Historical Pattern Evidence

Investigation Findings

Successfully Created Issues Previously

Contributing Factors

Failed Jobs and Errors

Recommended Actions

Immediate Actions (HIGH Priority)

Configuration Changes (MEDIUM Priority)

Long-term Improvements (LOW Priority)

Prevention Strategies

Workflow Robustness

AI Team Self-Improvement

Historical Context

Impact Assessment

Technical Details

Workflow File

Timeout Configuration

MCP Server Configuration

Action Version

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

🏥 CI Failure Investigation - Daily Perf Improver Timeout Loop #52

Description

🏥 CI Failure Investigation - Run #17337145012

Summary

Failure Details

Root Cause Analysis

Primary Root Cause: Claude Code Action Timeout Loop

Historical Pattern Evidence

Investigation Findings

Successfully Created Issues Previously

Contributing Factors

Failed Jobs and Errors

Recommended Actions

Immediate Actions (HIGH Priority)

Configuration Changes (MEDIUM Priority)

Long-term Improvements (LOW Priority)

Prevention Strategies

Workflow Robustness

AI Team Self-Improvement

Historical Context

Impact Assessment

Technical Details

Workflow File

Timeout Configuration

MCP Server Configuration

Action Version

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions