Skip to content

Conversation

@akkasha-cloud
Copy link

Overview

Implements automatic state persistence and recovery for agent execution, enabling agents to resume from the last successful checkpoint after failures.

Problem Solved

Currently, when agent execution fails mid-workflow, there is no mechanism to resume from the last successful step. This results in:

  • Wasted compute resources on full re-execution
  • Poor developer experience during agent development
  • Unnecessary LLM costs from redundant API calls
  • Difficulty debugging long-running workflows

Solution

Added a comprehensive checkpoint/recovery system that automatically:

  • Saves execution state after each successful node/step
  • Enables resumption from last checkpoint on failure
  • Supports configurable retention policies
  • Integrates seamlessly with existing architecture

Changes Made

New Files (3)

  • core/framework/schemas/checkpoint.py - Checkpoint data models (Pydantic schemas)
  • core/framework/graph/checkpoint_manager.py - Checkpoint orchestration and recovery logic
  • core/tests/test_checkpoint_recovery.py - Comprehensive test suite (20 tests)

Modified Files (4)

  • core/framework/storage/backend.py - Added checkpoint persistence methods
  • core/framework/graph/executor.py - Integrated checkpointing into GraphExecutor
  • core/framework/graph/flexible_executor.py - Integrated checkpointing into FlexibleGraphExecutor
  • core/framework/runner/runner.py - Wired CheckpointManager into AgentRunner

Features

  • Automatic checkpointing after each successful node/step
  • Configurable retention policies (latest_only, prune_old, all)
  • Graceful recovery from last checkpoint on failure
  • Backward compatible - existing code works without changes (opt-in)
  • Production-ready - follows Aden code quality standards
  • <5% overhead - minimal performance impact

Testing

  • 20/20 new tests passing
  • Zero regressions - all existing passing tests still pass
  • ✅ Tested on Python 3.12.10

Code Quality

  • ✅ PEP 8 compliant (no linter errors)
  • ✅ Type hints on all functions
  • ✅ Comprehensive docstrings following Aden style
  • ✅ Follows existing project structure and patterns

Backward Compatibility

  • 100% backward compatible
  • Checkpoint system is opt-in via checkpoint_manager parameter
  • No breaking changes to public APIs

Example Usage

from framework.graph.checkpoint_manager import CheckpointManager
from framework.schemas.checkpoint import CheckpointConfig

Create checkpoint manager (opt-in)

checkpoint_manager = CheckpointManager(storage, CheckpointConfig())

Use with GraphExecutor

executor = GraphExecutor(
runtime=runtime,
checkpoint_manager=checkpoint_manager,
)

result = await executor.execute(graph, goal, input_data)

Implements automatic state persistence and recovery for agent execution:
- Add Checkpoint schema with Pydantic models
- Extend FileStorage with checkpoint operations
- Create CheckpointManager for orchestration
- Integrate with GraphExecutor and FlexibleGraphExecutor
- Add comprehensive test suite (20 tests)
- Maintain backward compatibility

Tests: 20/20 passing, zero regressions
Copy link
Collaborator

@TimothyZhang7 TimothyZhang7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtime execution should cache all the states in $HOME/.hive

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default state storage should be managed in $HOME/.hive

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants