The agent error recovery system automatically detects and corrects tool call errors through multi-turn conversations with the LLM.
When an agent encounters errors (JSON parsing errors, tool execution failures, timeouts), the system automatically:
- Detects the error
- Generates helpful feedback for the LLM
- Allows the LLM to retry with corrections
- Tracks retry attempts and enforces limits
- Provides clear progress indicators to users
Key features:
- Automatic JSON parsing error detection: Catches malformed tool calls
- Tool execution error handling: Handles timeouts and exceptions
- Multi-turn retry mechanism: Up to 3 retries per error type
- Concise feedback prompts: Optimized to reduce LLM thinking time
- Progress indicators: Real-time status updates to frontend
- Configurable limits: Retry counts and timeouts via environment variables
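As a rough illustration of the configurable limits and timeout handling listed above, the sketch below reads the documented environment variables (with their documented defaults) and wraps an async tool call in a timeout. The names `RecoveryConfig`, `load_recovery_config`, and `run_tool_with_timeout` are hypothetical and are not taken from `base_agent.py`; treating only the string `true` (case-insensitive) as enabled is also an assumption.

```python
import asyncio
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryConfig:
    """Hypothetical container for the documented settings."""
    max_parse_retries: int = 3
    max_execution_retries: int = 3
    tool_timeout: float = 30.0
    enabled: bool = True


def load_recovery_config(env=None) -> RecoveryConfig:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return RecoveryConfig(
        max_parse_retries=int(env.get("AGENT_MAX_PARSE_RETRIES", "3")),
        max_execution_retries=int(env.get("AGENT_MAX_EXECUTION_RETRIES", "3")),
        tool_timeout=float(env.get("AGENT_TOOL_TIMEOUT", "30.0")),
        # Assumption: only "true" (any case) enables recovery.
        enabled=env.get("AGENT_ENABLE_ERROR_RECOVERY", "true").strip().lower() == "true",
    )


async def run_tool_with_timeout(tool, args: dict, timeout: float) -> dict:
    """Run an async tool; report timeouts and exceptions as data instead of raising."""
    try:
        result = await asyncio.wait_for(tool(**args), timeout=timeout)
        return {"ok": True, "result": result}
    except asyncio.TimeoutError:
        return {"ok": False, "error": f"tool timed out after {timeout}s"}
    except Exception as exc:
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

Returning the failure as a dictionary rather than raising keeps the error available for building feedback to the LLM, which is the behavior this document describes; the exact shape of that dictionary is illustrative only.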
Environment variables (optional, with defaults):
# Maximum retry attempts for parsing errors (default: 3)
export AGENT_MAX_PARSE_RETRIES=3
# Maximum retry attempts for execution errors (default: 3)
export AGENT_MAX_EXECUTION_RETRIES=3
# Tool execution timeout in seconds (default: 30.0)
export AGENT_TOOL_TIMEOUT=30.0
# Enable/disable error recovery (default: true)
export AGENT_ENABLE_ERROR_RECOVERY=true
The error recovery system sends these message types to the frontend:
Indicates a retry is happening:
("🔄 **检测到错误,正在重试** (第 1/3 次)\n", "retry_attempt")  # "Error detected, retrying (attempt 1 of 3)"
Detailed error feedback for display:
(feedback.to_prompt(), "error_feedback")
General information messages:
("💭 **第 2 轮对话**\n", "info")  # "Conversation round 2"
Terminal error messages:
("⛔ 工具调用格式错误次数过多,无法继续。\n", "error")  # "Too many malformed tool calls; cannot continue."
// Add to frontend/src/types/streaming.ts
export type StreamMessageType =
  | 'content'
  | 'thinking'
  | 'tool_call'
  | 'tool_result'
  | 'tool_error'
  | 'retry_attempt'   // NEW
  | 'error_feedback'  // NEW
  | 'info'            // NEW
  | 'error';
export interface StreamMessage {
  content: string;
  type: StreamMessageType;
}
// In your WebSocket message handler
function handleStreamMessage(message: StreamMessage) {
  switch (message.type) {
    case 'retry_attempt':
      // Show retry indicator (e.g., spinner with retry count)
      showRetryIndicator(message.content);
      break;
    case 'error_feedback':
      // Display error feedback (can be collapsible)
      showErrorFeedback(message.content);
      break;
    case 'info':
      // Show info message (e.g., round number)
      showInfoMessage(message.content);
      break;
    case 'error':
      // Show terminal error
      showError(message.content);
      break;
    // ... existing cases
  }
}
// RetryIndicator.tsx
function RetryIndicator({ message }: { message: string }) {
  // Spinner is assumed to be an existing component in your UI library
  return (
    <div className="retry-indicator">
      <Spinner />
      <span>{message}</span>
    </div>
  );
}
// ErrorFeedback.tsx
import { useState } from 'react';
function ErrorFeedback({ content }: { content: string }) {
  // Details are shown by default; the button toggles them
  const [collapsed, setCollapsed] = useState(false);
  return (
    <div className="error-feedback">
      <button onClick={() => setCollapsed(!collapsed)}>
        {collapsed ? 'Show' : 'Hide'} error details
      </button>
      {!collapsed && <pre>{content}</pre>}
    </div>
  );
}
User Query
↓
Round 1: LLM generates tool call
↓
Parse tool call → Error detected!
↓
Send retry_attempt message to frontend
↓
Generate error_feedback
↓
Round 2: LLM receives feedback and retries
↓
Parse tool call → Success!
↓
Execute tool → Get result
↓
Round 3: LLM generates final answer
↓
Done
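The flow above can be sketched as a small loop. This is a minimal illustration, not the code in `base_agent.py`: `run_with_parse_recovery` and the `llm` callable (which takes the message history and returns raw tool-call text) are hypothetical, the message strings are placeholders, and the way the statistics are counted here is one plausible reading of the fields rather than the documented semantics.

```python
import json


def run_with_parse_recovery(llm, query, max_parse_retries=3):
    """Ask the LLM for a tool call, retrying with feedback when the JSON is malformed."""
    history = [query]
    messages = []  # (content, type) tuples that would be streamed to the frontend
    stats = {"total_errors": 0, "recovered_errors": 0, "retry_attempts": 0}

    while True:
        raw = llm(history)
        try:
            call = json.loads(raw)
        except json.JSONDecodeError as exc:
            stats["total_errors"] += 1
            if stats["retry_attempts"] >= max_parse_retries:
                # Retry budget exhausted: emit a terminal error and stop.
                messages.append(("Too many malformed tool calls; cannot continue.", "error"))
                return {"success": False, "tool_call": None,
                        "messages": messages, "error_recovery_stats": stats}
            stats["retry_attempts"] += 1
            attempt = stats["retry_attempts"]
            messages.append((f"Error detected, retrying ({attempt}/{max_parse_retries})", "retry_attempt"))
            feedback = f"Invalid JSON: {exc.msg} at position {exc.pos}. Resend the tool call."
            messages.append((feedback, "error_feedback"))
            history.append(feedback)  # the next round sees the feedback
            continue

        # Parsing succeeded, so every earlier error counts as recovered.
        stats["recovered_errors"] = stats["total_errors"]
        return {"success": True, "tool_call": call,
                "messages": messages, "error_recovery_stats": stats}
```

Tool execution and the final-answer round are left out to keep the sketch focused on the parse-retry path.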
All error recovery events are logged with structured data:
logger.warning(
    "[RECOVERY] Found 1 parse errors",
    extra={"agent_id": str(agent_id)}
)
Log prefixes:
- [RECOVERY]: Error recovery events
- [TOOL-LOOP]: Legacy implementation events
Error recovery statistics are returned in the response:
{
    "success": True,
    "state": ConversationState(...),
    "error_recovery_stats": {
        "total_errors": 2,
        "recovered_errors": 2,
        "retry_attempts": 3
    }
}
Run standalone tests:
cd backend
python3 test_error_recovery_standalone.py
Expected output:
- All data structures work correctly
- Error feedback is concise (< 600 chars)
- Retry logic functions properly
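To make the "concise feedback" expectation concrete, here is a hedged sketch of a feedback object whose `to_prompt()` output always stays under 600 characters. Only the `to_prompt()` method name comes from this document; the class name `ToolErrorFeedback`, its fields, and the truncation strategy are assumptions for illustration.

```python
from dataclasses import dataclass

MAX_FEEDBACK_CHARS = 600  # the standalone tests expect feedback under 600 characters


@dataclass
class ToolErrorFeedback:
    """Hypothetical feedback record; field names are illustrative."""
    error_type: str   # e.g. "parse_error" or "execution_error"
    message: str      # raw error detail, possibly very long
    attempt: int
    max_attempts: int

    def to_prompt(self) -> str:
        header = f"[{self.error_type}] attempt {self.attempt}/{self.max_attempts}\n"
        footer = "\nFix the tool call and resend it."
        # Reserve room for the header and footer, and stay strictly below the limit.
        budget = MAX_FEEDBACK_CHARS - 1 - len(header) - len(footer)
        detail = self.message
        if len(detail) > budget:
            detail = detail[: budget - 3] + "..."
        return header + detail + footer
```

Truncating the error detail instead of the instructions keeps the actionable part of the prompt intact even when a tool returns a long traceback.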
Solution for slow LLM responses: The optimized feedback prompts should reduce thinking time. If responses are still slow:
- Check LLM model performance
- Consider reducing max_parse_retries to 2
- Add a timeout for LLM streaming (future enhancement)
Solution: Ensure the frontend handles the new message types:
- retry_attempt
- error_feedback
- info
Solution: Adjust environment variables:
export AGENT_MAX_PARSE_RETRIES=2
export AGENT_MAX_EXECUTION_RETRIES=2
- Spec: .kiro/specs/agent-error-recovery/
- Code: backend/agent_framework/base_agent.py
- Tests: backend/test_error_recovery_standalone.py