Skip to content

🏥 CI Failure Investigation - Daily Perf Improver Timeout Loop #52

@github-actions

Description

@github-actions

🏥 CI Failure Investigation - Run #17337145012

Summary

CRITICAL: The Daily Perf Improver workflow is experiencing a systematic timeout pattern where the "Execute Claude Code Action" step gets stuck in "in_progress" status, causing complete workflow failure and blocking all subsequent runs.

Failure Details

  • Run: 17337145012
  • Commit: 7d01ba1
  • Trigger: workflow_dispatch
  • Failed Job: daily-perf-improver (ID: 49225314587)
  • Failed Step: "Execute Claude Code Action" (Step 11) - stuck in "in_progress"
  • Timeout Configured: 30 minutes

Root Cause Analysis

Primary Root Cause: Claude Code Action Timeout Loop

The anthropics/claude-code-base-action@v0.0.56 consistently gets stuck in "in_progress" status without completing or properly timing out.

Historical Pattern Evidence

CRITICAL PATTERN: Multiple consecutive runs show the same failure mode:

This demonstrates a 100% failure rate across the last 5 workflow runs.

Investigation Findings

Successfully Created Issues Previously

The workflow HAS been working correctly before, as evidenced by successful creation of:

This indicates the timeout issue is a recent development.

Contributing Factors

  1. Extremely Complex Prompt: 300+ line prompt with multi-step performance improvement workflow
  2. Resource-Intensive Tasks: F# build processes, benchmarking, and performance analysis
  3. Docker MCP Configuration: Using ghcr.io/github/github-mcp-server:sha-45e90ae which may have connection/timeout issues
  4. Action Version: anthropics/claude-code-base-action@v0.0.56 may have timeout handling bugs

Failed Jobs and Errors

  • Logs Inaccessible: Job logs return 404 error via API - likely due to ongoing/stuck process
  • Step Status: "Execute Claude Code Action" shows "in_progress" instead of "failed" or "timed out"
  • Timeout Handling: 30-minute timeout appears to not be working correctly

Recommended Actions

Immediate Actions (HIGH Priority)

  • Cancel all stuck workflow runs manually to clear the queue
  • Temporarily disable the workflow to prevent resource waste
  • Upgrade Claude Code Action to latest version (check for v0.1.x releases)
  • Reduce timeout from 30 minutes to 10-15 minutes for faster failure detection

Configuration Changes (MEDIUM Priority)

  • Simplify the prompt - break the 300+ line prompt into smaller, focused chunks
  • Add timeout guards - implement step-level timeouts within the prompt itself
  • Test MCP configuration - verify Docker GitHub MCP server connectivity
  • Add workflow concurrency limits - prevent multiple runs from interfering

Long-term Improvements (LOW Priority)

  • Split workflow into phases - separate research, planning, and implementation phases
  • Add health checks - implement periodic status reporting within long-running steps
  • Resource monitoring - add memory and CPU usage monitoring
  • Fallback mechanisms - implement graceful degradation when timeouts occur

Prevention Strategies

Workflow Robustness

  1. Timeout Cascading: Implement multiple timeout layers (step-level, job-level, workflow-level)
  2. Progress Reporting: Add periodic status updates within long-running Claude operations
  3. Resource Limits: Implement memory and CPU constraints to prevent resource exhaustion
  4. Retry Logic: Add exponential backoff retry for transient failures

AI Team Self-Improvement

Add these prompting instructions to instructions.md for AI coding agents:

## Workflow Timeout Prevention
- Keep AI workflow prompts under 100 lines when possible
- Implement progress reporting every 5-10 minutes in long-running operations
- Add explicit timeout guards within AI agent prompts
- Test workflow steps individually before combining into complex multi-step processes
- Use step-level timeouts (5-15 minutes) rather than relying solely on job-level timeouts
- Monitor Docker-based MCP server health and implement connection retries

Historical Context

This is the FIRST reported instance of this specific timeout pattern. Previous workflow runs were successful, indicating:

  • The workflow configuration was previously functional
  • This represents a regression rather than an initial configuration issue
  • The issue may be related to recent changes in the Claude Code Action or infrastructure

Impact Assessment

  • Severity: 🔴 CRITICAL - Workflow completely non-functional
  • Scope: All Daily Perf Improver workflow runs failing
  • Resource Impact: Consuming GitHub Actions minutes without producing results
  • Development Impact: Blocking automated performance improvements

Technical Details

Workflow File

.github/workflows/daily-perf-improver.lock.yml - Line 347-423 contains the failing step

Timeout Configuration

timeout_minutes: 30

MCP Server Configuration

"ghcr.io/github/github-mcp-server:sha-45e90ae"

Action Version

anthropics/claude-code-base-action@v0.0.56

AI-generated content by CI Failure Doctor may contain mistakes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions