feat(checkpoint): add checkpoint system for workflow recovery #136
+1,211
−50
Description
Implement a checkpoint-based recovery system for FlexibleGraphExecutor. The executor auto-saves workflow state after each successful step and can resume from the last checkpoint after a failure, so long-running agent workflows do not lose progress.
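The save-then-resume behaviour described above can be illustrated with a minimal sketch. It is not the code in this PR: `InMemoryCheckpoints`, `run_steps`, and the two demo steps are hypothetical names used only for illustration, and the checkpoint is held in memory rather than on disk.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, object]
Step = Callable[[State], State]


class InMemoryCheckpoints:
    """Hypothetical stand-in for a checkpoint manager (not the PR's class)."""

    def __init__(self) -> None:
        self.latest: Optional[dict] = None

    def save(self, next_step: int, state: State) -> None:
        # Copy the state so later mutations cannot alter the saved checkpoint.
        self.latest = {"next_step": next_step, "state": dict(state)}


def run_steps(steps: List[Step], checkpoints: InMemoryCheckpoints,
              resume: bool = False) -> State:
    """Run steps in order, saving a checkpoint after each successful step."""
    start, state = 0, {}
    if resume and checkpoints.latest is not None:
        start = checkpoints.latest["next_step"]
        state = dict(checkpoints.latest["state"])
    for index in range(start, len(steps)):
        # If a step raises, the checkpoint from the previous step is untouched.
        state = steps[index](state)
        checkpoints.save(index + 1, state)
    return state


# Demo steps (hypothetical): the second one fails on its first attempt only.
calls: List[str] = []
attempts = {"count": 0}


def load(state: State) -> State:
    calls.append("load")
    return {**state, "rows": 3}


def flaky(state: State) -> State:
    calls.append("flaky")
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("transient failure")
    return {**state, "done": True}
```

In this sketch, a first call to `run_steps([load, flaky], checkpoints)` raises after `load` has been checkpointed; calling it again with `resume=True` skips `load` and re-runs only `flaky`.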
Type of Change
Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Related Issues
Fixes #(issue number)
Changes Made
Add Checkpoint and CheckpointMetadata Pydantic schemas (framework/schemas/checkpoint.py)
Add CheckpointStorage for filesystem-based persistence using JSON files (framework/storage/checkpoint_storage.py)
Add CheckpointManager high-level API for checkpoint lifecycle management (framework/runtime/checkpoint.py)
Integrate checkpoint hooks into FlexibleGraphExecutor with auto-save after each successful step
Add resume_from_checkpoint parameter to execute_plan() for recovery support
Add checkpoint configuration options to ExecutorConfig (enabled, path, auto_cleanup)
Add comprehensive unit tests for checkpoint system (25 tests)
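For reviewers who want a mental model of the schema and storage layers listed above, here is a rough sketch of JSON-file persistence. It makes several assumptions: it uses standard-library dataclasses where the PR uses Pydantic, and the names `SketchCheckpoint` and `SketchCheckpointStorage`, the fields, and the one-directory-per-run layout are all hypothetical and may differ from framework/storage/checkpoint_storage.py.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Dict, Optional


@dataclass
class SketchCheckpoint:
    """Hypothetical checkpoint record; the PR's schema may have other fields."""
    run_id: str
    step_index: int
    state: Dict[str, Any] = field(default_factory=dict)


class SketchCheckpointStorage:
    """Stores one JSON file per checkpoint under <root>/<run_id>/."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _run_dir(self, run_id: str) -> Path:
        return self.root / run_id

    def save(self, checkpoint: SketchCheckpoint) -> Path:
        run_dir = self._run_dir(checkpoint.run_id)
        run_dir.mkdir(parents=True, exist_ok=True)
        # Zero-padded names keep lexical order equal to step order.
        target = run_dir / f"{checkpoint.step_index:06d}.json"
        tmp = target.with_suffix(".tmp")
        tmp.write_text(json.dumps(asdict(checkpoint)), encoding="utf-8")
        # Rename last so a crash mid-write never leaves a partial .json file.
        tmp.replace(target)
        return target

    def load_latest(self, run_id: str) -> Optional[SketchCheckpoint]:
        run_dir = self._run_dir(run_id)
        if not run_dir.is_dir():
            return None
        files = sorted(run_dir.glob("*.json"))
        if not files:
            return None
        data = json.loads(files[-1].read_text(encoding="utf-8"))
        return SketchCheckpoint(**data)

    def cleanup(self, run_id: str) -> int:
        """Delete all checkpoints for a run and return how many were removed."""
        run_dir = self._run_dir(run_id)
        files = list(run_dir.glob("*.json")) if run_dir.is_dir() else []
        for path in files:
            path.unlink()
        return len(files)
```

Writing to a temporary file and renaming it is one way to avoid a half-written checkpoint being picked up on resume; whether the PR takes this approach is not stated in the description.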
Testing
Describe the tests you ran to verify your changes:
tests/test_checkpoint.py - 25 passed in 1.83s
Tests cover:
Unit tests pass (cd core && pytest tests/)
Lint passes (cd core && ruff check .)
Manual testing performed
Test Results
Checklist
My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Screenshots (if applicable)
N/A - Backend feature, no UI changes.