Agent Error Recovery System

The agent error recovery system automatically detects tool call errors and lets the LLM correct them over multiple conversation turns.

Overview

When an agent encounters errors (JSON parsing errors, tool execution failures, timeouts), the system automatically:

Detects the error
Generates helpful feedback for the LLM
Allows the LLM to retry with corrections
Tracks retry attempts and enforces limits
Provides clear progress indicators to users

Features

Automatic JSON parsing error detection: Catches malformed tool calls
Tool execution error handling: Handles timeouts and exceptions
Multi-turn retry mechanism: Up to 3 retries per error type by default (configurable, see Configuration)
Concise feedback prompts: Optimized to reduce LLM thinking time
Progress indicators: Real-time status updates to frontend
Configurable limits: Retry counts and timeouts via environment variables
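
As an illustration of the tool execution error handling listed above, here is a minimal sketch that wraps a tool coroutine with a timeout and turns failures into error records. The function name `run_tool_with_timeout` and the shape of the returned dictionary are assumptions made for this example, not the framework's actual API.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def run_tool_with_timeout(
    tool: Callable[..., Awaitable[Any]],
    timeout_s: float = 30.0,  # mirrors the documented AGENT_TOOL_TIMEOUT default
    **kwargs: Any,
) -> Dict[str, Any]:
    """Run a tool coroutine and convert timeouts/exceptions into error records."""
    try:
        result = await asyncio.wait_for(tool(**kwargs), timeout=timeout_s)
        return {"ok": True, "result": result}
    except asyncio.TimeoutError:
        # The tool did not finish in time; wait_for has already cancelled it.
        return {
            "ok": False,
            "error_type": "timeout",
            "message": f"Tool timed out after {timeout_s}s",
        }
    except Exception as exc:
        # Any other failure is reported back so the LLM can adjust its call.
        return {
            "ok": False,
            "error_type": "exception",
            "message": f"{type(exc).__name__}: {exc}",
        }
```

Returning a record instead of raising keeps the retry logic in one place: the caller can inspect `ok` and decide whether to feed the message back to the LLM.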

Configuration

Environment variables (optional, with defaults):

# Maximum retry attempts for parsing errors (default: 3)
export AGENT_MAX_PARSE_RETRIES=3

# Maximum retry attempts for execution errors (default: 3)
export AGENT_MAX_EXECUTION_RETRIES=3

# Tool execution timeout in seconds (default: 30.0)
export AGENT_TOOL_TIMEOUT=30.0

# Enable/disable error recovery (default: true)
export AGENT_ENABLE_ERROR_RECOVERY=true
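
The variables above could be read on the Python side roughly as follows. This is a sketch only: the `RecoveryConfig` class, the `load_recovery_config` function, and the set of strings accepted as "true" are assumptions for illustration; the framework may parse these values differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RecoveryConfig:
    """Hypothetical container for the documented settings and their defaults."""
    max_parse_retries: int = 3
    max_execution_retries: int = 3
    tool_timeout: float = 30.0
    enabled: bool = True


def _env_bool(value: str) -> bool:
    # Assumption: these spellings count as true; everything else is false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_recovery_config(env: Optional[Mapping[str, str]] = None) -> RecoveryConfig:
    """Build a config from environment variables, falling back to the defaults."""
    env = os.environ if env is None else env
    return RecoveryConfig(
        max_parse_retries=int(env.get("AGENT_MAX_PARSE_RETRIES", "3")),
        max_execution_retries=int(env.get("AGENT_MAX_EXECUTION_RETRIES", "3")),
        tool_timeout=float(env.get("AGENT_TOOL_TIMEOUT", "30.0")),
        enabled=_env_bool(env.get("AGENT_ENABLE_ERROR_RECOVERY", "true")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the process environment.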

Streaming Message Types

The error recovery system sends these message types to the frontend:

1. `retry_attempt`

Indicates that a retry is in progress. The example text reads "Error detected, retrying (attempt 1 of 3)":

("🔄 **检测到错误，正在重试** (第 1/3 次)\n", "retry_attempt")

2. `error_feedback`

Detailed error feedback for display:

(feedback.to_prompt(), "error_feedback")
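
The document does not show the feedback object itself. A minimal sketch of what an object with a `to_prompt()` method might look like is given below; the `ErrorFeedback` fields, the wording of the prompt, and the truncation strategy are all assumptions. The 600-character cap is taken from the expected test output listed under Testing.

```python
from dataclasses import dataclass

MAX_PROMPT_CHARS = 600  # the standalone tests expect feedback under 600 chars


@dataclass
class ErrorFeedback:
    """Hypothetical feedback record; field names are illustrative only."""
    error_type: str
    message: str
    hint: str = ""

    def to_prompt(self) -> str:
        """Render a short prompt telling the LLM what failed and what to do."""
        lines = [f"[{self.error_type}] {self.message}"]
        if self.hint:
            lines.append(f"Hint: {self.hint}")
        lines.append("Please resend the tool call with the problem fixed.")
        prompt = "\n".join(lines)
        # Truncate overly long error text so the prompt stays concise.
        if len(prompt) >= MAX_PROMPT_CHARS:
            prompt = prompt[: MAX_PROMPT_CHARS - 4] + "..."
        return prompt
```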

3. `info`

General information messages. The example text reads "Round 2 of the conversation":

("💭 **第 2 轮对话**\n", "info")

4. `error`

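Terminal error messages, sent when recovery gives up. The example text reads "Too many tool call format errors; unable to continue.":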

("⛔ 工具调用格式错误次数过多，无法继续。\n", "error")

Frontend Integration

TypeScript Types

// Add to frontend/src/types/streaming.ts
export type StreamMessageType = 
  | 'content'
  | 'thinking'
  | 'tool_call'
  | 'tool_result'
  | 'tool_error'
  | 'retry_attempt'    // NEW
  | 'error_feedback'   // NEW
  | 'info'             // NEW
  | 'error';

export interface StreamMessage {
  content: string;
  type: StreamMessageType;
}

Message Handler

// In your WebSocket message handler
function handleStreamMessage(message: StreamMessage) {
  switch (message.type) {
    case 'retry_attempt':
      // Show retry indicator (e.g., spinner with retry count)
      showRetryIndicator(message.content);
      break;
      
    case 'error_feedback':
      // Display error feedback (can be collapsible)
      showErrorFeedback(message.content);
      break;
      
    case 'info':
      // Show info message (e.g., round number)
      showInfoMessage(message.content);
      break;
      
    case 'error':
      // Show terminal error
      showError(message.content);
      break;
      
    // ... existing cases
  }
}

UI Components (Suggested)

// RetryIndicator.tsx
// Spinner is assumed to be an existing component in your codebase.
function RetryIndicator({ message }: { message: string }) {
  return (
    <div className="retry-indicator">
      <Spinner />
      <span>{message}</span>
    </div>
  );
}

// ErrorFeedback.tsx
import { useState } from 'react';

function ErrorFeedback({ content }: { content: string }) {
  const [collapsed, setCollapsed] = useState(false);
  
  return (
    <div className="error-feedback">
      <button onClick={() => setCollapsed(!collapsed)}>
        {collapsed ? 'Show' : 'Hide'} error details
      </button>
      {!collapsed && <pre>{content}</pre>}
    </div>
  );
}

Error Recovery Flow

User Query
    ↓
Round 1: LLM generates tool call
    ↓
Parse tool call → Error detected!
    ↓
Send retry_attempt message to frontend
    ↓
Generate error_feedback
    ↓
Round 2: LLM receives feedback and retries
    ↓
Parse tool call → Success!
    ↓
Execute tool → Get result
    ↓
Round 3: LLM generates final answer
    ↓
Done
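
The flow above can be sketched as a small generator that yields `(content, type)` tuples in the same shape as the streaming messages. This is a simplified illustration, not the code in `base_agent.py`: the `ask_llm` callable, the `{"tool": ..., "args": ...}` call format, and the English message texts are assumptions, and only parse errors are retried here.

```python
import json
from typing import Callable, Dict, Iterator, List, Tuple

StreamItem = Tuple[str, str]  # (content, message_type)


def run_with_recovery(
    ask_llm: Callable[[List[str]], str],     # hypothetical: history in, reply out
    tools: Dict[str, Callable[..., object]],
    query: str,
    max_parse_retries: int = 3,
) -> Iterator[StreamItem]:
    """Ask the LLM for a tool call, retrying with feedback on parse errors."""
    history = [query]
    parse_failures = 0
    while True:
        reply = ask_llm(history)
        try:
            call = json.loads(reply)
            name, args = call["tool"], call.get("args", {})
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            parse_failures += 1
            if parse_failures > max_parse_retries:
                # Retry budget exhausted: emit a terminal error and stop.
                yield ("Too many malformed tool calls; giving up.\n", "error")
                return
            yield (
                f"Error detected, retrying ({parse_failures}/{max_parse_retries})\n",
                "retry_attempt",
            )
            feedback = f"Your tool call could not be parsed: {exc}. Resend valid JSON."
            yield (feedback, "error_feedback")
            history.append(feedback)  # the next round sees the feedback
            continue
        # Parsing succeeded: execute the tool, then ask for the final answer.
        result = tools[name](**args)
        yield (str(result), "tool_result")
        history.append(f"Tool {name} returned: {result}")
        yield (ask_llm(history), "content")
        return
```

Execution errors (unknown tool names, exceptions, timeouts) would need their own counter and feedback path, governed by the separate execution retry limit.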

Logging

All error recovery events are logged with structured data:

logger.warning(
    "[RECOVERY] Found 1 parse errors",
    extra={"agent_id": str(agent_id)}
)

Log prefixes:

[RECOVERY]: Error recovery events
[TOOL-LOOP]: Legacy implementation events
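
To make the `extra` field visible in log output, the formatter has to reference it. The sketch below uses only the standard `logging` module; the logger name `agent.recovery`, the format string, and the helper names are assumptions for illustration.

```python
import io
import logging


def build_recovery_logger(stream) -> logging.Logger:
    """Create a logger whose formatter prints the agent_id passed via `extra`."""
    logger = logging.getLogger("agent.recovery")  # hypothetical logger name
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s agent=%(agent_id)s %(message)s")
    )
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(logging.WARNING)
    return logger


def log_parse_errors(logger: logging.Logger, agent_id, error_count: int) -> None:
    """Log a recovery event with the [RECOVERY] prefix and structured agent_id."""
    logger.warning(
        "[RECOVERY] Found %d parse error(s)",
        error_count,
        extra={"agent_id": str(agent_id)},
    )


buffer = io.StringIO()
log_parse_errors(build_recovery_logger(buffer), agent_id="agent-42", error_count=1)
print(buffer.getvalue(), end="")
# prints: WARNING agent=agent-42 [RECOVERY] Found 1 parse error(s)
```

Note that a formatter referencing `%(agent_id)s` fails to format records logged without that `extra` key, so either every call must supply it or a logging filter should set a default.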

Metrics

Error recovery statistics are returned in the response, shown here as a Python dictionary:

{
  "success": True,
  "state": ConversationState(...),
  "error_recovery_stats": {
    "total_errors": 2,
    "recovered_errors": 2,
    "retry_attempts": 3
  }
}
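
A caller might condense these statistics into a single line for dashboards or logs. The helper below is a sketch; `summarize_recovery` is a hypothetical name, and treating a response with no errors as a 100% recovery rate is a choice made for this example.

```python
def summarize_recovery(response: dict) -> str:
    """Turn the error_recovery_stats block into a one-line summary."""
    stats = response.get("error_recovery_stats", {})
    total = stats.get("total_errors", 0)
    recovered = stats.get("recovered_errors", 0)
    retries = stats.get("retry_attempts", 0)
    # With no errors at all, report a full recovery rate rather than divide by zero.
    rate = recovered / total if total else 1.0
    return f"{recovered}/{total} errors recovered ({rate:.0%}) in {retries} retries"


example = {
    "success": True,
    "error_recovery_stats": {
        "total_errors": 2,
        "recovered_errors": 2,
        "retry_attempts": 3,
    },
}
print(summarize_recovery(example))
# prints: 2/2 errors recovered (100%) in 3 retries
```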

Testing

Run standalone tests:

cd backend
python3 test_error_recovery_standalone.py

Expected output:

All data structures work correctly
Error feedback is concise (< 600 chars)
Retry logic functions properly

Troubleshooting

Issue: LLM takes too long to respond after error

Solution: The concise feedback prompts are designed to reduce LLM thinking time. If responses are still slow:

Check LLM model performance
Consider reducing max_parse_retries to 2
Add timeout for LLM streaming (future enhancement)
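
The streaming timeout is listed as a future enhancement and is not implemented in the source. One possible shape for it, assuming the LLM client exposes its output as an async iterator of text chunks, is a per-chunk timeout wrapper like the sketch below. The name `stream_with_timeout` is hypothetical.

```python
import asyncio
from typing import AsyncIterator


async def stream_with_timeout(
    chunks: AsyncIterator[str],
    per_chunk_timeout_s: float,
) -> AsyncIterator[str]:
    """Yield chunks from `chunks`, raising asyncio.TimeoutError if one stalls."""
    iterator = chunks.__aiter__()
    while True:
        try:
            # Wait for the next chunk, but no longer than the given timeout.
            chunk = await asyncio.wait_for(
                iterator.__anext__(), timeout=per_chunk_timeout_s
            )
        except StopAsyncIteration:
            return  # the underlying stream finished normally
        yield chunk
```

The timeout applies to each chunk rather than to the whole response, so a long but steadily streaming answer is not cut off. The caller decides what to do with the resulting `asyncio.TimeoutError`, for example counting it as an execution error.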

Issue: Frontend doesn't show retry indicators

Solution: Ensure the frontend handles the new message types:

retry_attempt
error_feedback
info

Issue: Too many retries

Solution: Adjust environment variables:

export AGENT_MAX_PARSE_RETRIES=2
export AGENT_MAX_EXECUTION_RETRIES=2

References

Spec: .kiro/specs/agent-error-recovery/
Code: backend/agent_framework/base_agent.py
Tests: backend/test_error_recovery_standalone.py
