@AliOsm AliOsm commented Jan 24, 2026

I hit the error described in #32 multiple times, so I combined two fixes:

  • Retries (All tools)
  • Streaming (Claude Code only)

This fix worked for me: with it, Ralph has kept working with Claude Code for 26 iterations so far.


greptile-apps bot commented Jan 24, 2026

Greptile Overview

Greptile Summary

This PR addresses the Claude Code hanging issue by implementing a stream-based monitoring approach with automatic hang detection and retry logic for transient errors.

Key Changes:

  • Added --retries and --hang-timeout CLI options for configurable resilience
  • Implemented run_claude_with_stream() function that uses --output-format stream-json to detect when Claude completes work (via "type":"result" message) and terminates hung processes after a timeout
  • Added run_with_retry() function with exponential backoff for transient network/API errors
  • Both Amp and Claude Code now benefit from retry logic for common errors (ECONNRESET, ETIMEDOUT, rate limits, 5xx errors)
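A minimal sketch of what retry with exponential backoff might look like in bash. This is not the PR's actual code: the function names `run_with_retry` and `is_retryable_error` and the variable `INITIAL_RETRY_DELAY` appear in this thread, but the bodies, the default values, the error pattern list, and the `RETRY_OUTPUT` variable are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical defaults; the PR's real values are not shown in this thread.
MAX_RETRIES=${MAX_RETRIES:-3}
INITIAL_RETRY_DELAY=${INITIAL_RETRY_DELAY:-1}

# Return 0 if the captured output looks like a transient failure.
# The pattern list is an assumption based on the errors named in the summary.
is_retryable_error() {
  printf '%s\n' "$1" | grep -Eqi 'ECONNRESET|ETIMEDOUT|rate limit|HTTP 5[0-9][0-9]'
}

# Run a command, retrying with exponential backoff on transient errors.
# Leaves the last attempt's combined stdout/stderr in RETRY_OUTPUT
# (a hypothetical variable) and returns that attempt's exit code.
run_with_retry() {
  local attempt=1
  local delay=$INITIAL_RETRY_DELAY
  local exit_code=0
  RETRY_OUTPUT=""

  while :; do
    RETRY_OUTPUT=$("$@" 2>&1)
    exit_code=$?

    # Stop on success, on a non-retryable error, or when attempts run out.
    if ! is_retryable_error "$RETRY_OUTPUT" || [ "$attempt" -ge "$MAX_RETRIES" ]; then
      return "$exit_code"
    fi

    echo "Attempt $attempt hit a retryable error; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))   # double the wait before each new attempt
  done
}
```

Usage would be along the lines of `run_with_retry some_command arg1 arg2`, then reading `$RETRY_OUTPUT`.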

Issues Found:

  • Race condition in process monitoring loop (ralph.sh:136) where Claude could exit between the kill -0 check and result detection
  • Unused exit_code variable suggests the retry logic doesn't differentiate between failures and successes with warnings

Confidence Score: 4/5

  • This PR is safe to merge with minor issues that don't affect core functionality
  • The implementation successfully addresses the reported hanging issue with a practical solution. The race condition identified is a minor timing issue that's unlikely to cause problems in practice since the result message typically arrives well before process exit. The unused exit_code variable is a code cleanliness issue but doesn't impact functionality.
  • No files require special attention - the logic is sound and addresses the core issue effectively

Important Files Changed

| Filename | Overview |
| --- | --- |
| ralph.sh | Adds retry logic and stream-based hang detection for Claude Code to prevent hanging issues; includes new CLI options for configurable retries and hang timeout |

Sequence Diagram

sequenceDiagram
    participant Ralph as ralph.sh
    participant RetryFn as run_with_retry
    participant StreamFn as run_claude_with_stream
    participant Claude as claude process
    participant Monitor as Output Monitor
    participant Killer as Timeout Killer

    Ralph->>RetryFn: Execute iteration
    RetryFn->>RetryFn: Set attempt = 1
    
    alt Tool is Claude
        RetryFn->>StreamFn: Call run_claude_with_stream
        StreamFn->>Claude: Spawn with stream-json output
        StreamFn->>Monitor: Start monitoring loop
        
        loop While process alive
            Monitor->>Monitor: Check kill -0
            Monitor->>StreamFn: Grep for type:result
            
            alt Result detected
                Monitor->>Monitor: Set result_received=true
                Monitor->>Killer: Spawn timeout killer
                Monitor-->>StreamFn: Break loop
            end
        end
        
        alt Claude exits cleanly
            Claude-->>StreamFn: Exit normally
            StreamFn->>Killer: Kill timeout process
        else Claude hangs
            Killer->>Claude: Kill after timeout
        end
        
        StreamFn->>StreamFn: Extract result from stream
        StreamFn-->>RetryFn: Return output
    else Tool is Amp
        RetryFn->>RetryFn: Run amp directly
    end
    
    RetryFn->>RetryFn: Check is_retryable_error
    
    alt Retryable error detected
        RetryFn->>RetryFn: Increment attempt and backoff
        RetryFn->>RetryFn: Retry up to MAX_RETRIES
    else Success or non-retryable
        RetryFn-->>Ralph: Return output
    end
    
    Ralph->>Ralph: Check for COMPLETE signal
    
    alt Complete
        Ralph->>Ralph: Exit success
    else More work
        Ralph->>Ralph: Next iteration
    end
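
The monitoring branch of the diagram could be sketched roughly as below. It is an illustration under assumptions, not the PR's `run_claude_with_stream()`: the helper name `run_with_stream_watch`, the `HANG_TIMEOUT` variable, and its default are hypothetical, and the command to run is passed in as arguments so the sketch does not depend on any particular CLI. It polls the captured output for a `"type":"result"` line and, once one is seen, gives the process a grace period before killing it.

```shell
#!/usr/bin/env bash
HANG_TIMEOUT=${HANG_TIMEOUT:-5}   # hypothetical default, in seconds

# Hypothetical helper: run "$@" in the background, capture its output in a
# temp file, and kill it if it is still alive HANG_TIMEOUT seconds after a
# result message has been seen. Prints the captured output.
run_with_stream_watch() {
  local out pid waited=0 result_seen=false
  out=$(mktemp)
  "$@" >"$out" 2>&1 &
  pid=$!

  while kill -0 "$pid" 2>/dev/null; do
    if ! $result_seen && grep -q '"type":"result"' "$out"; then
      result_seen=true
    fi
    if $result_seen; then
      if [ "$waited" -ge "$HANG_TIMEOUT" ]; then
        kill "$pid" 2>/dev/null   # work is done but the process is hung
        break
      fi
      waited=$((waited + 1))
    fi
    sleep 1
  done
  wait "$pid" 2>/dev/null   # reap the child whether it exited or was killed

  cat "$out"
  rm -f "$out"
}
```

Polling once per second keeps the sketch simple; the timeout granularity is therefore about one second.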

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 3 comments


# Temp file for stream output (cleaned up on exit)
STREAM_OUTPUT=$(mktemp)
trap "rm -f $STREAM_OUTPUT" EXIT

Wrap $STREAM_OUTPUT in quotes to handle paths with spaces safely

Suggested change
- trap "rm -f $STREAM_OUTPUT" EXIT
+ trap "rm -f \"$STREAM_OUTPUT\"" EXIT
Path: ralph.sh
Line: 105:105
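
As a side note on the quoting issue, one common alternative is to single-quote the trap body so the variable is expanded (and quoted) only when the trap fires. The sketch below is an assumption-laden demo rather than the PR's code: `demo_cleanup` is a hypothetical helper that creates a temp file whose name contains a space inside a subshell and relies on the EXIT trap to remove it.

```shell
#!/usr/bin/env bash
# Hypothetical demo: create a temp file with a space in its name under the
# given directory, print its path, and let the EXIT trap delete it when the
# subshell ends.
demo_cleanup() {
  local dir=$1
  (
    STREAM_OUTPUT=$(mktemp "$dir/stream output.XXXXXX")
    # Single quotes defer expansion until the trap runs, and the inner
    # double quotes keep the path as one argument to rm.
    trap 'rm -f "$STREAM_OUTPUT"' EXIT
    echo "$STREAM_OUTPUT"
  )
}
```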


# Monitor the output file for the result message
# The key insight: "type":"result" is emitted BEFORE the hang occurs
while kill -0 $claude_pid 2>/dev/null; do

Race condition: process could exit between kill -0 check and grep, causing the result check to be skipped. The loop will exit if the process dies naturally before detecting the result, potentially missing successful completions.

Path: ralph.sh
Line: 136:136
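
One possible way to close this gap is to check the output file once more after the loop ends, so a result written just before the process exited is still detected. The sketch below assumes a hypothetical helper `wait_for_result` that takes a PID and an output file; it is not taken from ralph.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll while the process is alive, then do one final
# check. Returns 0 if a result message was found in the output file.
wait_for_result() {
  local pid=$1 out=$2 result_received=false

  while kill -0 "$pid" 2>/dev/null; do
    if grep -q '"type":"result"' "$out" 2>/dev/null; then
      result_received=true
      break
    fi
    sleep 1
  done

  # Final check after the loop: covers a process that wrote its result and
  # exited between two polls, or before the first poll ran.
  if ! $result_received && grep -q '"type":"result"' "$out" 2>/dev/null; then
    result_received=true
  fi

  $result_received
}
```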

local attempt=1
local delay=$INITIAL_RETRY_DELAY
local output=""
local exit_code=0

Variable exit_code is set but never checked - the retry logic only uses pattern matching via is_retryable_error, ignoring non-zero exit codes from successful runs with warnings

Path: ralph.sh
Line: 184:184
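
A sketch of how the exit code could feed into the retry decision alongside pattern matching. The helper `classify_attempt`, its three outcomes, and the pattern list are hypothetical and only illustrate the idea: a zero exit code is treated as success even if the output mentions a transient error, and pattern matching is consulted only for non-zero exits.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one attempt from its exit code and output.
# Prints "success", "retry", or "fail".
classify_attempt() {
  local exit_code=$1 output=$2

  if [ "$exit_code" -eq 0 ]; then
    echo success          # trust a clean exit, even with warnings in the output
  elif printf '%s\n' "$output" | grep -Eqi 'ECONNRESET|ETIMEDOUT|rate limit'; then
    echo retry            # non-zero exit with a transient-looking error
  else
    echo fail             # non-zero exit, nothing that looks transient
  fi
}
```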
