Summary
Create a robust versioning system for graphs that enables zero-downtime updates by ensuring running workflows continue with their original graph version while new triggers use the latest version.
Problem Statement
Currently, Exosphere's graph templates are mutable entities identified only by (namespace, name). When a graph is updated via the upsert endpoint, the existing template is directly overwritten. This creates several critical issues:
- Running workflows may break - If a graph is updated while workflows are executing, the in-flight states may reference nodes or inputs that no longer exist
- No version history - There's no audit trail of graph changes
- No rollback capability - If a bad graph version is deployed, there's no quick way to revert
- Coupling between deployment and execution - Updates require coordination to ensure no workflows are running
Goals
- Zero-downtime updates: Deploy new graph versions without affecting running workflows
- Version isolation: Each workflow run is pinned to a specific graph version
- Simple updates: Developers can push updates without worrying about in-flight executions
- Auditability: Complete history of graph versions with timestamps and metadata
- Rollback support: Quickly revert to a previous known-good version
Non-Goals
- Real-time migration of running workflows to new versions
- Automatic schema migration between versions
- Version branching (git-like branches for graphs)
Proposed Solution
High-Level Architecture
flowchart TB
subgraph "Current State"
GT1[GraphTemplate<br/>name + namespace]
S1[State] --> |references| GT1
end
subgraph "Proposed State"
GTV[GraphTemplateVersion<br/>name + namespace + version]
GTL[GraphTemplate<br/>name + namespace<br/>latest_version pointer]
GTL --> |points to| GTV
S2[State] --> |pinned to| GTV
GTV1[Version 1]
GTV2[Version 2]
GTV3[Version 3 - Latest]
GTL --> GTV3
end
The core idea is to:
- Make graph definitions immutable by storing each update as a new version
- Keep a lightweight pointer (`GraphTemplate`) to the active version
- Pin each workflow run to the version it was triggered with
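The three points above can be sketched with an in-memory store. This is only an illustration of the idea, not the actual state-manager models: `InMemoryGraphStore`, `Run`, and the simplified `nodes` tuple are hypothetical names introduced here, while `GraphTemplate`, `GraphTemplateVersion`, and `latest_version` follow the diagram.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

GraphKey = Tuple[str, str]  # (namespace, name)


@dataclass(frozen=True)
class GraphTemplateVersion:
    """Immutable snapshot of a graph definition (frozen, so it cannot be mutated)."""
    namespace: str
    name: str
    version: int
    nodes: Tuple[str, ...]  # simplified assumption: node identifiers only


@dataclass
class GraphTemplate:
    """Lightweight pointer to the latest version of a graph."""
    namespace: str
    name: str
    latest_version: int = 0


@dataclass
class Run:
    """Hypothetical workflow run, pinned to the version it was triggered with."""
    run_id: str
    graph_version: GraphTemplateVersion


class InMemoryGraphStore:
    """Hypothetical stand-in for the database layer."""

    def __init__(self) -> None:
        self.pointers: Dict[GraphKey, GraphTemplate] = {}
        self.versions: Dict[Tuple[str, str, int], GraphTemplateVersion] = {}

    def upsert(self, namespace: str, name: str, nodes: List[str]) -> GraphTemplateVersion:
        # Never overwrite: store a new version, then advance the pointer.
        pointer = self.pointers.setdefault((namespace, name), GraphTemplate(namespace, name))
        new_version = GraphTemplateVersion(
            namespace, name, pointer.latest_version + 1, tuple(nodes)
        )
        self.versions[(namespace, name, new_version.version)] = new_version
        pointer.latest_version = new_version.version
        return new_version

    def trigger(self, namespace: str, name: str, run_id: str) -> Run:
        # Resolve the pointer once, at trigger time, and pin the run to that version.
        pointer = self.pointers[(namespace, name)]
        pinned = self.versions[(namespace, name, pointer.latest_version)]
        return Run(run_id, pinned)


store = InMemoryGraphStore()
store.upsert("demo", "etl", ["fetch", "load"])
run_a = store.trigger("demo", "etl", "run-a")
store.upsert("demo", "etl", ["fetch", "transform", "load"])
run_b = store.trigger("demo", "etl", "run-b")
```

Here `run_a` keeps seeing the two-node graph after the second upsert, while `run_b` picks up the new version.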
Edge Cases and Considerations
| Scenario | Handling |
|---|---|
| Secrets rotation | Creates new version even if nodes are identical (different hash) |
| Validation failure | Version stored but not set as `latest_valid_version` |
| Concurrent upserts | Use optimistic locking on version counter with retry |
| Long-running workflows | Versions retained while any run is active |
| Rollback to invalid version | Rejected - only valid versions can be activated |
| Delete graph | Soft delete - marks inactive, retains versions for audit |
| Trigger with specific version | Future enhancement - allow explicit version in trigger request |
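A few of the rows above (secrets rotation, validation failure, concurrent upserts, rollback to an invalid version) can be sketched together. This is a minimal in-memory sketch under stated assumptions: `GraphRecord`, `compare_and_append`, `upsert_with_retry`, and `activate` are hypothetical names, the SHA-256-over-JSON hash is one possible choice, and in a real deployment the counter check and write would have to be a single atomic database operation rather than plain Python.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


class VersionConflict(Exception):
    """Raised when the version counter moved since the caller read it."""


def content_hash(nodes: dict, secrets: dict) -> str:
    # Assumption: the hash covers nodes and secrets, so rotating a secret
    # yields a different hash even when the nodes are identical.
    payload = json.dumps({"nodes": nodes, "secrets": secrets}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class GraphRecord:
    counter: int = 0
    latest_valid_version: Optional[int] = None
    versions: Dict[int, dict] = field(default_factory=dict)


def compare_and_append(record: GraphRecord, expected_counter: int,
                       nodes: dict, secrets: dict, is_valid: bool) -> int:
    """Append a version only if the counter still equals what the caller read."""
    if record.counter != expected_counter:
        raise VersionConflict(f"expected {expected_counter}, found {record.counter}")
    version = expected_counter + 1
    record.versions[version] = {"hash": content_hash(nodes, secrets), "valid": is_valid}
    record.counter = version
    if is_valid:
        # Invalid versions are stored but never become latest_valid_version.
        record.latest_valid_version = version
    return version


def upsert_with_retry(record: GraphRecord, nodes: dict, secrets: dict,
                      is_valid: bool, max_attempts: int = 3) -> int:
    """Optimistic locking: re-read the counter and retry on conflict."""
    for _ in range(max_attempts):
        expected = record.counter
        try:
            return compare_and_append(record, expected, nodes, secrets, is_valid)
        except VersionConflict:
            continue
    raise VersionConflict(f"gave up after {max_attempts} attempts")


def activate(record: GraphRecord, version: int) -> None:
    """Rollback: only a stored, valid version can be activated."""
    entry = record.versions.get(version)
    if entry is None or not entry["valid"]:
        raise ValueError(f"version {version} is missing or invalid")
    record.latest_valid_version = version


record = GraphRecord()
v1 = upsert_with_retry(record, {"a": 1}, {"key": "old"}, is_valid=True)
v2 = upsert_with_retry(record, {"a": 1}, {"key": "new"}, is_valid=False)
```

After these two calls, the second version exists with a different hash, but `latest_valid_version` still points at the first, and `activate(record, 2)` is rejected.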
Security Considerations
- Version history retains the (encrypted) secrets of every stored version, so it may contain sensitive information
- Audit log of who created each version
- Access control for rollback/activate operations
- Consider separate permissions for "deploy" vs "rollback"
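One way the "deploy" vs "rollback" split could look is sketched below. The permission strings and the `require` helper are hypothetical and only illustrate the separation. They do not describe an existing Exosphere API.

```python
from enum import Enum
from typing import FrozenSet


class Permission(Enum):
    # Hypothetical permission names, chosen for illustration only.
    DEPLOY = "graph:deploy"
    ROLLBACK = "graph:rollback"


def require(granted: FrozenSet[Permission], needed: Permission) -> None:
    """Raise PermissionError unless the caller holds the needed permission."""
    if needed not in granted:
        raise PermissionError(f"missing permission: {needed.value}")


# A developer may deploy new versions but not roll back;
# an operator holds both permissions.
developer = frozenset({Permission.DEPLOY})
operator = frozenset({Permission.DEPLOY, Permission.ROLLBACK})

require(developer, Permission.DEPLOY)    # passes
require(operator, Permission.ROLLBACK)   # passes
```

With this split, `require(developer, Permission.ROLLBACK)` raises `PermissionError`, so rollback can be restricted independently of deploy.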
Observability
New Metrics
- `graph_versions_total{namespace, graph_name}` - Total versions per graph
- `graph_version_active{namespace, graph_name, version}` - Currently active version
- `runs_by_version{namespace, graph_name, version}` - Runs per version
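As a rough illustration of what these metrics would report, the sketch below derives them from plain tuples using only the standard library. The function names mirror the metric names, but the input shapes and the convention of exporting the active version as a gauge with value 1 are assumptions. A real implementation would presumably go through whatever metrics client the project uses.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

VersionKey = Tuple[str, str, int]  # (namespace, graph_name, version)
GraphKey = Tuple[str, str]         # (namespace, graph_name)


def graph_versions_total(versions: Iterable[VersionKey]) -> Counter:
    """Count stored versions per (namespace, graph_name)."""
    return Counter((namespace, name) for namespace, name, _ in versions)


def graph_version_active(pointers: Dict[GraphKey, int]) -> Dict[VersionKey, int]:
    """Emit a gauge of 1 for the currently active version of each graph."""
    return {(namespace, name, version): 1
            for (namespace, name), version in pointers.items()}


def runs_by_version(runs: Iterable[VersionKey]) -> Counter:
    """Count runs per (namespace, graph_name, version); one tuple per run."""
    return Counter(runs)


stored = [("demo", "etl", 1), ("demo", "etl", 2), ("demo", "report", 1)]
active = {("demo", "etl"): 2, ("demo", "report"): 1}
runs = [("demo", "etl", 1), ("demo", "etl", 2), ("demo", "etl", 2)]

totals = graph_versions_total(stored)
gauges = graph_version_active(active)
per_version = runs_by_version(runs)
```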
Dashboard Enhancements
- Version history timeline view
- Diff viewer between versions
- Active runs per version indicator
- One-click rollback button
Open Questions
- Should we support triggering a specific version (not just latest)?
- What's the default retention policy for versions?
- Should version comparison show semantic diff or raw JSON diff?
- Do we need version tags/labels (e.g., "production", "staging")?
References
- Current `GraphTemplate` model: `state-manager/app/models/db/graph_template_model.py`
- Current `State` model: `state-manager/app/models/db/state.py`
- Upsert controller: `state-manager/app/controller/upsert_graph_template.py`
- Trigger controller: `state-manager/app/controller/trigger_graph.py`
Goals
Design and plan a simple, effective system for supporting zero-downtime deployment of graph templates. The solution should:
- Allow new versions of graphs to be created and deployed without affecting running workflows.
- Ensure each workflow run is pinned to the graph version it started with.
- Make previous versions easily accessible for rollback or audit.
- Keep the implementation as lightweight and low-risk as possible, minimizing changes to existing workflow execution logic.
- Clearly document the approach so it can be easily implemented and reviewed.