
exploration: Create a feature plan for supporting Graph versioning and updates #639

@NiveditJain

Summary

Create a robust versioning system for graphs that enables zero-downtime updates by ensuring running workflows continue with their original graph version while new triggers use the latest version.


Problem Statement

Currently, Exosphere's graph templates are mutable entities identified only by (namespace, name). When a graph is updated via the upsert endpoint, the existing template is directly overwritten. This creates several critical issues:

  1. Running workflows may break - If a graph is updated while workflows are executing, the in-flight states may reference nodes or inputs that no longer exist
  2. No version history - There's no audit trail of graph changes
  3. No rollback capability - If a bad graph version is deployed, there's no quick way to revert
  4. Coupling between deployment and execution - Updates require coordination to ensure no workflows are running

Goals

  • Zero-downtime updates: Deploy new graph versions without affecting running workflows
  • Version isolation: Each workflow run is pinned to a specific graph version
  • Simple updates: Developers can push updates without worrying about in-flight executions
  • Auditability: Complete history of graph versions with timestamps and metadata
  • Rollback support: Quickly revert to a previous known-good version

Non-Goals

  • Real-time migration of running workflows to new versions
  • Automatic schema migration between versions
  • Version branching (git-like branches for graphs)

Proposed Solution

High-Level Architecture

flowchart TB
    subgraph "Current State"
        GT1[GraphTemplate<br/>name + namespace]
        S1[State] --> |references| GT1
    end
  
    subgraph "Proposed State"
        GTV[GraphTemplateVersion<br/>name + namespace + version]
        GTL[GraphTemplate<br/>name + namespace<br/>latest_version pointer]
    
        GTL --> |points to| GTV
        S2[State] --> |pinned to| GTV
    
        GTV1[Version 1] 
        GTV2[Version 2]
        GTV3[Version 3 - Latest]
    
        GTL --> GTV3
    end

The core idea is to:

  1. Make graph definitions immutable by storing each update as a new version
  2. Keep a lightweight pointer (GraphTemplate) to the active version
  3. Pin each workflow run to the version it was triggered with
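The three steps above can be sketched with a small in-memory model. This is only an illustration under assumptions: the class names GraphTemplateVersion and GraphTemplate come from the diagram, but the field names, the Run class, and InMemoryGraphStore are hypothetical and do not reflect the existing state-manager models.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class GraphTemplateVersion:
    """Immutable snapshot of one graph definition (field names are assumed)."""
    namespace: str
    name: str
    version: int
    nodes: tuple
    created_at: datetime


@dataclass
class GraphTemplate:
    """Lightweight pointer to the active version of a graph."""
    namespace: str
    name: str
    latest_version: int = 0


@dataclass(frozen=True)
class Run:
    """Hypothetical workflow run, pinned to the version it was triggered with."""
    run_id: str
    namespace: str
    graph_name: str
    graph_version: int


class InMemoryGraphStore:
    """Hypothetical stand-in for the database, for illustration only."""

    def __init__(self):
        self.templates = {}  # (namespace, name) -> GraphTemplate
        self.versions = {}   # (namespace, name, version) -> GraphTemplateVersion

    def upsert(self, namespace, name, nodes):
        # Step 1: never overwrite; store every update as a new version.
        template = self.templates.setdefault(
            (namespace, name), GraphTemplate(namespace, name)
        )
        new_version = template.latest_version + 1
        snapshot = GraphTemplateVersion(
            namespace, name, new_version, tuple(nodes), datetime.now(timezone.utc)
        )
        self.versions[(namespace, name, new_version)] = snapshot
        # Step 2: move the pointer to the new version.
        template.latest_version = new_version
        return snapshot

    def trigger(self, namespace, name, run_id):
        # Step 3: pin the run to whatever version is latest right now.
        template = self.templates[(namespace, name)]
        return Run(run_id, namespace, name, template.latest_version)

    def graph_for(self, run):
        # A run always resolves its own pinned version, not the latest one.
        return self.versions[(run.namespace, run.graph_name, run.graph_version)]
```

In this sketch, a run triggered before an upsert keeps resolving the older node list through graph_for, while runs triggered afterwards pick up the new version.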

Edge Cases and Considerations

| Scenario | Handling |
| --- | --- |
| Secrets rotation | Creates new version even if nodes are identical (different hash) |
| Validation failure | Version stored but not set as latest_valid_version |
| Concurrent upserts | Use optimistic locking on version counter with retry |
| Long-running workflows | Versions retained while any run is active |
| Rollback to invalid version | Rejected - only valid versions can be activated |
| Delete graph | Soft delete - marks inactive, retains versions for audit |
| Trigger with specific version | Future enhancement - allow explicit version in trigger request |
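For the concurrent-upserts case, one possible shape of "optimistic locking on version counter with retry" is a compare-and-set loop. The sketch below is an assumption, not the planned implementation: VersionCounter, VersionConflict, and allocate_version are hypothetical names, and the in-process lock only stands in for the atomicity a database would provide for a single conditional update.

```python
import threading


class VersionConflict(Exception):
    """Raised when a version number could not be allocated after all retries."""


class VersionCounter:
    """Hypothetical stand-in for the stored version counter of one graph."""

    def __init__(self):
        self._value = 0
        # Models the atomicity of one conditional update in a real database.
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        # Succeeds only if nobody changed the counter since it was read.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def allocate_version(counter, max_retries=5):
    """Reserve the next version number, retrying if another upsert won the race."""
    for _ in range(max_retries):
        current = counter.read()
        if counter.compare_and_set(current, current + 1):
            return current + 1
    raise VersionConflict(f"gave up after {max_retries} attempts")
```

No lock is held between the read and the write, so concurrent upserts never block each other; the loser of a race re-reads the counter and tries again.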

Security Considerations

  • Version history may contain sensitive information, because each stored version keeps its secrets (encrypted)
  • Audit log of who created each version
  • Access control for rollback/activate operations
  • Consider separate permissions for "deploy" vs "rollback"
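A rough sketch of how a separate rollback permission could combine with the rule that only valid versions can be activated. Everything here is hypothetical: the Permission enum, VersionRecord, RollbackError, and activate_version are illustrative names, and the issue does not define an access-control model.

```python
from dataclasses import dataclass
from enum import Enum


class Permission(Enum):
    DEPLOY = "deploy"      # may push new graph versions
    ROLLBACK = "rollback"  # may re-activate an older version


@dataclass(frozen=True)
class VersionRecord:
    """Hypothetical metadata kept per version, including who created it."""
    version: int
    is_valid: bool
    created_by: str


class RollbackError(Exception):
    """Raised when the requested version cannot be activated."""


def activate_version(versions, target, actor_permissions):
    """Return the version number to activate, or raise if not allowed.

    versions: dict mapping version number -> VersionRecord
    actor_permissions: set of Permission values held by the caller
    """
    if Permission.ROLLBACK not in actor_permissions:
        raise PermissionError("rollback permission required")
    record = versions.get(target)
    if record is None:
        raise RollbackError(f"version {target} does not exist")
    if not record.is_valid:
        # Only versions that passed validation can be activated.
        raise RollbackError(f"version {target} failed validation")
    return record.version
```

Keeping the permission check ahead of the lookup means a caller without rollback rights learns nothing about which versions exist.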

Observability

New Metrics

  • graph_versions_total{namespace, graph_name} - Total versions per graph
  • graph_version_active{namespace, graph_name, version} - Currently active version
  • runs_by_version{namespace, graph_name, version} - Runs per version
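To make the label sets concrete, here is a small sketch that derives the three metrics from plain records and renders them as `name{label="value"} number` lines. The helper names and input shapes are assumptions, label values are not escaped, and reporting the active version as a sample with value 1 is one possible convention rather than something the issue specifies.

```python
from collections import Counter


def format_sample(metric, labels, value):
    """Render one sample as metric{label="value",...} number."""
    label_text = ",".join(f'{key}="{val}"' for key, val in labels.items())
    return f"{metric}{{{label_text}}} {value}"


def collect_metrics(version_keys, active_versions, run_keys):
    """Build metric lines from hypothetical in-memory records.

    version_keys: iterable of (namespace, graph_name, version), one per stored version
    active_versions: dict mapping (namespace, graph_name) -> active version
    run_keys: iterable of (namespace, graph_name, version), one per run
    """
    lines = []

    # graph_versions_total: number of stored versions per graph.
    totals = Counter((ns, name) for ns, name, _ in version_keys)
    for (ns, name), count in sorted(totals.items()):
        labels = {"namespace": ns, "graph_name": name}
        lines.append(format_sample("graph_versions_total", labels, count))

    # graph_version_active: one sample per graph, labelled with the active version.
    for (ns, name), version in sorted(active_versions.items()):
        labels = {"namespace": ns, "graph_name": name, "version": version}
        lines.append(format_sample("graph_version_active", labels, 1))

    # runs_by_version: number of runs pinned to each version.
    runs = Counter(run_keys)
    for (ns, name, version), count in sorted(runs.items()):
        labels = {"namespace": ns, "graph_name": name, "version": version}
        lines.append(format_sample("runs_by_version", labels, count))

    return lines
```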

Dashboard Enhancements

  • Version history timeline view
  • Diff viewer between versions
  • Active runs per version indicator
  • One-click rollback button

Open Questions

  1. Should we support triggering a specific version (not just latest)?
  2. What's the default retention policy for versions?
  3. Should version comparison show semantic diff or raw JSON diff?
  4. Do we need version tags/labels (e.g., "production", "staging")?

References

  • Current GraphTemplate model: state-manager/app/models/db/graph_template_model.py
  • Current State model: state-manager/app/models/db/state.py
  • Upsert controller: state-manager/app/controller/upsert_graph_template.py
  • Trigger controller: state-manager/app/controller/trigger_graph.py

Goals

Design and plan a simple, effective system for supporting zero-downtime deployment of graph templates. The solution should:

  • Allow new versions of graphs to be created and deployed without affecting running workflows.
  • Ensure each workflow run is pinned to the graph version it started with.
  • Make previous versions easily accessible for rollback or audit.
  • Keep the implementation as lightweight and low-risk as possible, minimizing changes to existing workflow execution logic.
  • Clearly document the approach so it can be easily implemented and reviewed.
