
exploration: Study impact and effort for streaming to workers #638

@NiveditJain

Description

Objective

  1. Identify and quantify bottlenecks in the current polling-based architecture
  2. Determine if streaming is the right solution
  3. If yes, outline the least-effort path to implementation

Part 1: Current Polling Architecture

How It Works

Workers (Runtimes) poll the State Manager every poll_interval (default: 1 second) asking for work:

```mermaid
sequenceDiagram
    participant RT as Runtime
    participant SM as State Manager
    participant DB as MongoDB

    loop Every 1 second
        RT->>SM: POST /states/enqueue
        SM->>DB: Query for CREATED states
        SM->>DB: Update status to QUEUED
        SM-->>RT: Return states (or empty array)
    end
```
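The loop above can be sketched as follows. Here `fetch_states` stands in for the real `POST /states/enqueue` call made by `_enqueue` in `python-sdk/exospherehost/runtime.py`; the function names are illustrative, not the SDK's actual API:

```python
import time

def poll_loop(fetch_states, handle_state, poll_interval=1.0, max_polls=None):
    """Repeatedly ask the State Manager for work, then sleep.

    fetch_states() models POST /states/enqueue: it returns a (possibly
    empty) list of states whose status was atomically flipped to QUEUED.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        states = fetch_states()          # one HTTP round-trip per iteration
        for state in states:
            handle_state(state)          # dispatch to a worker/executor
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_interval)    # idle cost is paid even when states == []
    return polls
```

Note that the request is issued every interval whether or not the queue has work, which is the root of the wasted-request and database-load bottlenecks analyzed below.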

Key Configuration

| Parameter | Default | Purpose |
| --- | --- | --- |
| poll_interval | 1s | Time between polls |
| batch_size | 16 | States fetched per poll |
| workers | 4 | Parallel executors per runtime |

Part 2: Bottleneck Analysis

Bottleneck 1: Latency

Problem: States wait up to poll_interval before being picked up.

| Scenario | Latency |
| --- | --- |
| Best case (state created just before poll) | ~0ms |
| Worst case (state created just after poll) | ~1000ms |
| Average | ~500ms |
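These figures follow directly from a uniform-arrival assumption: a state created at a random point within the interval waits, on average, half the interval before the next poll picks it up.

```python
def expected_pickup_latency_ms(poll_interval_ms: float) -> dict:
    """Pickup latency assuming state-creation times are uniformly
    distributed within the polling interval."""
    return {
        "best_ms": 0.0,                      # created just before a poll
        "worst_ms": poll_interval_ms,        # created just after a poll
        "average_ms": poll_interval_ms / 2,  # expectation under uniform arrivals
    }
```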

Impact: For latency-sensitive workflows (real-time processing, user-facing operations), 500ms average delay may be unacceptable.


Bottleneck 2: Wasted Requests

Problem: Runtimes poll even when no work is available.

Analysis (per runtime):

  • Polls per minute: 60
  • If queue is empty 80% of the time: 48 wasted requests/minute
  • With 100 runtimes: 4,800 wasted requests/minute
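The arithmetic above, spelled out (the 80% idle figure is the scenario's assumption, not a measurement):

```python
def wasted_requests_per_minute(num_runtimes, poll_interval_s=1.0, idle_fraction=0.8):
    """Polls that return an empty array, per minute, across all runtimes."""
    polls_per_min = 60 / poll_interval_s           # per runtime
    return num_runtimes * polls_per_min * idle_fraction
```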

Impact: Unnecessary network traffic and State Manager load.


Bottleneck 3: Database Load

Problem: Each poll triggers a MongoDB query, even when idle.

Current query pattern:

```python
# Atomically claim one CREATED state for this runtime's nodes
# and mark it QUEUED in the same operation
find_one_and_update(
    {"status": "CREATED", "node_name": {"$in": nodes}},
    {"$set": {"status": "QUEUED"}}
)
```

Analysis:

  • Queries per minute per runtime: 60
  • With 100 runtimes: 6,000 queries/minute (regardless of actual work)

Impact: MongoDB load scales with runtime count, not workload.


Bottleneck 4: Scalability Ceiling

Problem: Adding more runtimes linearly increases polling load.

Total requests/sec = num_runtimes / poll_interval

| Runtimes | Requests/sec to State Manager |
| --- | --- |
| 10 | 10 |
| 100 | 100 |
| 1000 | 1000 |

Impact: State Manager becomes a bottleneck before compute is saturated.


Bottleneck Summary

| Bottleneck | Severity | Affects |
| --- | --- | --- |
| Latency (~500ms avg) | Medium | Real-time workloads |
| Wasted requests | Low | Network/cost |
| Database load | Medium | MongoDB at scale |
| Scalability ceiling | High | Large deployments |

Part 3: Should We Move to Streaming?

Decision Matrix

| Factor | Polling | Streaming | Winner |
| --- | --- | --- | --- |
| Latency | ~500ms | ~10ms | Streaming |
| Idle resource usage | High | Near zero | Streaming |
| Implementation complexity | Simple | Moderate | Polling |
| Proxy/firewall compatibility | Excellent | Good (SSE) / Moderate (WS) | Polling |
| Failure recovery | Stateless, trivial | Requires reconnect logic | Polling |
| Scalability | O(n) requests | O(1) connections | Streaming |

When Polling is Sufficient

  • Small deployments (< 50 runtimes)
  • Batch/async workloads where 500ms latency is acceptable
  • Environments with restrictive proxies
  • Teams preferring operational simplicity

When Streaming is Needed

  • Large deployments (100+ runtimes)
  • Real-time or low-latency requirements
  • High state throughput (1000+ states/minute)
  • Cost-sensitive environments (reduce unnecessary requests)

Recommendation

Yes, streaming is worth pursuing for the following reasons:

  1. Scalability: Polling load grows linearly with runtime count regardless of workload; streaming load tracks actual work
  2. Latency: 50x improvement (500ms to 10ms) enables new use cases
  3. Efficiency: Eliminates waste in idle scenarios
  4. Industry standard: Most distributed task systems use push-based models

However, polling should remain as a fallback for compatibility.


Part 4: Implementation Options

Three viable streaming approaches exist. The choice depends on current needs vs. future extensibility.


Option A: Server-Sent Events (SSE) - Unidirectional

How it works: Server pushes states to workers over HTTP. Workers respond via existing REST endpoints.

```mermaid
flowchart LR
    SM[State Manager] -->|SSE: push states| RT[Runtime]
    RT -->|HTTP POST: /executed| SM
    RT -->|HTTP POST: /errored| SM
```
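A minimal sketch of the consuming side. The wire format is standard SSE (`data:` lines terminated by a blank line); the `/states/stream` endpoint named in the docstring is hypothetical, and a real client would iterate over an HTTP library's streaming response rather than a list of lines:

```python
import json

def iter_sse_events(lines):
    """Parse an SSE text stream into decoded JSON events.

    `lines` is any iterable of text lines, e.g. the chunked body of a
    hypothetical GET /states/stream. Per the SSE format, an event's
    payload is the concatenated `data:` lines, and a blank line
    terminates the event.
    """
    data_parts = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            data_parts.append(line[5:].lstrip())
        elif line == "" and data_parts:
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # comment lines (":" prefix) and other fields are ignored here
```

The runtime would then POST to /executed or /errored for each received state, exactly as it does today.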

Pros:

  • Simpler implementation
  • Works through most proxies/load balancers
  • Built-in browser reconnection (if dashboard uses it)
  • Lower server resource usage

Cons:

  • One-way only - responses still require HTTP calls
  • Cannot stream execution output back
  • Adding bidirectional later requires new infrastructure

Option B: WebSockets - Bidirectional

How it works: A full-duplex connection; both state delivery and responses flow over the same connection.

```mermaid
flowchart LR
    SM[State Manager] <-->|WebSocket: states + responses| RT[Runtime]
```

Pros:

  • Single persistent connection for everything
  • Can stream execution results/logs back in real-time
  • Lower latency for responses (no HTTP overhead)
  • Future-proof for additional use cases

Cons:

  • More complex implementation
  • Some proxies require special configuration
  • Need to implement ping/pong keepalive
  • Higher per-connection memory on server
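The reconnect logic listed in the cons is the main new moving part. A common shape is exponential backoff with full jitter; this sketch is library-agnostic (the actual connection would use e.g. `websockets` or `aiohttp`), and the defaults are illustrative:

```python
import random

def backoff_delays(base=0.5, cap=30.0, attempts=6, jitter=random.random):
    """Yield reconnect delays: exponential growth, capped, with full jitter.

    jitter() returns a float in [0, 1); full jitter avoids a thundering
    herd when many runtimes reconnect at once.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield ceiling * jitter()
```

On a successful message the attempt counter resets; ping/pong keepalive (commonly every 20-30s) is what detects a dead connection so this backoff can kick in.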

Option C: Message Queue / Durable Streams

How it works: State Manager publishes states to a message broker. Workers consume from the queue. Decouples producers from consumers entirely.

```mermaid
flowchart LR
    SM[State Manager] -->|Publish| MQ[(Message Queue)]
    MQ -->|Consume| RT1[Runtime 1]
    MQ -->|Consume| RT2[Runtime 2]
    MQ -->|Consume| RT3[Runtime 3]
    RT1 -->|HTTP POST| SM
```
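The key property bought here is consumer-group semantics: each state is delivered to exactly one worker in the group, with no custom claiming logic in the State Manager. A toy in-memory model of that contract (real deployments would get it from Redis Streams' `XREADGROUP`, Kafka consumer groups, or NATS JetStream pull consumers):

```python
from collections import defaultdict
from itertools import cycle

def distribute(states, worker_ids):
    """Model a consumer group: each state goes to exactly one worker.

    Brokers balance differently (round-robin, partition ownership,
    pull demand), but the once-per-group delivery contract is the same.
    """
    assignment = defaultdict(list)
    workers = cycle(worker_ids)
    for state in states:
        assignment[next(workers)].append(state)
    return dict(assignment)
```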

Technology Options:

| Technology | Latency | Durability | Complexity | Managed Options |
| --- | --- | --- | --- | --- |
| Redis Streams | ~1-5ms | Configurable (AOF/RDB) | Low | AWS ElastiCache, Redis Cloud |
| Apache Kafka | ~5-10ms | Strong (replicated log) | High | Confluent, AWS MSK |
| NATS JetStream | ~1-2ms | Strong (replicated) | Medium | Synadia Cloud |

Pros:

  • Decoupling: State Manager doesn't need to track worker connections
  • Built-in durability: Messages persist if workers are temporarily down
  • Consumer groups: Built-in work distribution (no custom claiming logic)
  • Horizontal scaling: Add workers without any State Manager changes
  • Replay capability: Can replay messages for recovery
  • Battle-tested: Production-grade queueing semantics

Cons:

  • Additional infrastructure: New component to deploy and operate
  • Cost: Managed services have costs; self-hosted requires expertise
  • Latency variance: Network hop to broker adds some latency
  • Overkill for small deployments: Adds complexity for simple setups

Message Queue: Technology Recommendations

If starting fresh:

  • NATS JetStream - Lowest latency, simple to operate, lightweight
  • Good for: Most Exosphere deployments

If already using Redis:

  • Redis Streams - No new infrastructure, familiar tooling
  • Good for: Teams with Redis expertise

If enterprise/high-throughput:

  • Apache Kafka - Industry standard for high-volume streaming
  • Good for: 10K+ states/second, strict ordering requirements

Next Steps

  1. Decide: Should we move to streaming?

    • Is ~500ms latency acceptable for our use cases?
    • Are we hitting scalability limits with polling?
    • Do users need real-time responsiveness?
  2. If yes, gather requirements:

    • Do we need bidirectional streaming (logs, progress) in next 6-12 months?
    • Do we expect 100+ runtimes at scale?
    • Is durability/replay a requirement?
    • Do we already have Redis/Kafka/NATS infrastructure?
  3. Based on requirements, shortlist options:

    • Small scale, no durability needed → SSE or WebSocket
    • Large scale or durability needed → Message Queue
  4. Validate infrastructure prerequisites for shortlisted options

  5. Determine the simplest path forward based on constraints and team expertise


References

Codebase:

  • Current polling code: python-sdk/exospherehost/runtime.py (_enqueue method)
  • State query logic: state-manager/app/controller/enqueue_states.py

