@danyalwajid

Description
Implement a checkpoint-based recovery system for FlexibleGraphExecutor. The executor auto-saves workflow state after each successful step and can resume from the last checkpoint after a failure, so long-running agent workflows do not lose progress.
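
To illustrate the save-after-each-step and resume behavior described above, here is a minimal sketch. It is not the code in this PR: `run_steps`, the `next_step`/`memory` keys, and the in-memory `store` dict are hypothetical stand-ins chosen for illustration, and the real executor persists checkpoints to disk rather than to a dict.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for a checkpoint store: run_id -> saved state.
CheckpointStore = Dict[str, dict]


def run_steps(
    run_id: str,
    steps: List[Callable[[dict], dict]],
    store: CheckpointStore,
    resume: bool = False,
) -> dict:
    """Run steps in order, saving a checkpoint after each successful step."""
    saved = store.get(run_id) if resume else None
    start = saved["next_step"] if saved else 0
    memory = dict(saved["memory"]) if saved else {}

    for index in range(start, len(steps)):
        # If the step raises, the checkpoint write below is skipped, so the
        # store still points at the last step that completed successfully.
        memory = steps[index](memory)
        store[run_id] = {"next_step": index + 1, "memory": dict(memory)}
    return memory
```

Calling `run_steps(..., resume=True)` after a failure skips every step that already completed and re-runs only the step that failed and the ones after it.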

Type of Change
Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Related Issues
Fixes #(issue number)

Changes Made
Add Checkpoint and CheckpointMetadata Pydantic schemas (framework/schemas/checkpoint.py)
Add CheckpointStorage for filesystem-based persistence using JSON files (framework/storage/checkpoint_storage.py)
Add CheckpointManager high-level API for checkpoint lifecycle management (framework/runtime/checkpoint.py)
Integrate checkpoint hooks into FlexibleGraphExecutor with auto-save after each successful step
Add resume_from_checkpoint parameter to execute_plan() for recovery support
Add checkpoint configuration options to ExecutorConfig (enabled, path, auto_cleanup)
Add comprehensive unit tests for checkpoint system (25 tests)
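
For reviewers who want a mental model of the storage layer listed above, the following is a rough sketch of filesystem-based JSON persistence. All names here (`CheckpointSketch`, `JsonCheckpointStore`, the `step_NNNNNN.json` file layout) are assumptions made for illustration and are not taken from the PR. It also uses a stdlib dataclass in place of the Pydantic schemas so the example has no third-party dependencies.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class CheckpointSketch:
    """Hypothetical stand-in for the PR's Checkpoint schema."""

    run_id: str
    step_index: int
    memory: dict = field(default_factory=dict)


class JsonCheckpointStore:
    """Hypothetical store: one JSON file per checkpoint, grouped by run."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def save(self, checkpoint: CheckpointSketch) -> Path:
        run_dir = self.root / checkpoint.run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        # Zero-padding keeps lexicographic file order equal to step order.
        path = run_dir / f"step_{checkpoint.step_index:06d}.json"
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(asdict(checkpoint)))
        # Write to a temp file, then rename, so a crash mid-write does not
        # leave a truncated file under the final checkpoint name.
        tmp.replace(path)
        return path

    def load_latest(self, run_id: str) -> Optional[CheckpointSketch]:
        run_dir = self.root / run_id
        files = sorted(run_dir.glob("step_*.json")) if run_dir.is_dir() else []
        if not files:
            return None
        return CheckpointSketch(**json.loads(files[-1].read_text()))
```

Keeping each run in its own directory is one simple way to get the isolation between runs that the tests below exercise. The actual `CheckpointStorage` may organize files differently.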
Testing
Describe the tests you ran to verify your changes:

tests/test_checkpoint.py - 25 passed in 1.83s

Tests cover:

  • Checkpoint schema creation, serialization, deserialization
  • CheckpointMetadata schema and status values
  • CheckpointStorage: save, load, load_latest, get_metadata, update_status, cleanup
  • CheckpointManager: save, load_latest, can_resume, on_execution_complete, disabled mode
  • Integration: full save/recovery flow, isolation between multiple runs, large memory states

Unit tests pass (cd core && pytest tests/)
Lint passes (cd core && ruff check .)
Manual testing performed
Test Results
Checklist
My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Screenshots (if applicable)
N/A - Backend feature, no UI changes.
