
exploration: Study impact and effort for streaming to workers #638

@NiveditJain

Description

Objective

  1. Identify and quantify bottlenecks in the current polling-based architecture
  2. Determine if streaming is the right solution
  3. If yes, outline the least-effort path to implementation

Part 1: Current Polling Architecture

How It Works

Workers (Runtimes) poll the State Manager every poll_interval (default: 1 second) asking for work:

```mermaid
sequenceDiagram
    participant RT as Runtime
    participant SM as State Manager
    participant DB as MongoDB

    loop Every 1 second
        RT->>SM: POST /states/enqueue
        SM->>DB: Query for CREATED states
        SM->>DB: Update status to QUEUED
        SM-->>RT: Return states (or empty array)
    end
```
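The loop above can be sketched as follows. Here `fetch_states` stands in for the real `POST /states/enqueue` call made by `_enqueue` in `python-sdk/exospherehost/runtime.py`; the function names are illustrative, not the SDK's actual API:

```python
import time

def poll_loop(fetch_states, handle_state, poll_interval=1.0, max_polls=None):
    """Repeatedly ask the State Manager for work, then sleep.

    fetch_states() models POST /states/enqueue: it returns a (possibly
    empty) list of states whose status was atomically flipped to QUEUED.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        states = fetch_states()          # one HTTP round-trip per iteration
        for state in states:
            handle_state(state)          # dispatch to a worker/executor
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_interval)    # idle cost is paid even when states == []
    return polls
```

Note that the request is issued every interval whether or not the queue has work, which is the root of the wasted-request and database-load bottlenecks analyzed below.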

Key Configuration

| Parameter | Default | Purpose |
| --- | --- | --- |
| poll_interval | 1s | Time between polls |
| batch_size | 16 | States fetched per poll |
| workers | 4 | Parallel executors per runtime |

Part 2: Bottleneck Analysis

Bottleneck 1: Latency

Problem: States wait up to poll_interval before being picked up.

| Scenario | Latency |
| --- | --- |
| Best case (state created just before poll) | ~0ms |
| Worst case (state created just after poll) | ~1000ms |
| Average | ~500ms |
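These figures follow directly from a uniform-arrival assumption: a state created at a random point within the interval waits, on average, half the interval before the next poll picks it up.

```python
def expected_pickup_latency_ms(poll_interval_ms: float) -> dict:
    """Pickup latency assuming state-creation times are uniformly
    distributed within the polling interval."""
    return {
        "best_ms": 0.0,                      # created just before a poll
        "worst_ms": poll_interval_ms,        # created just after a poll
        "average_ms": poll_interval_ms / 2,  # expectation under uniform arrivals
    }
```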

Impact: For latency-sensitive workflows (real-time processing, user-facing operations), 500ms average delay may be unacceptable.


Bottleneck 2: Wasted Requests

Problem: Runtimes poll even when no work is available.

Analysis (per runtime):

  • Polls per minute: 60
  • If queue is empty 80% of the time: 48 wasted requests/minute
  • With 100 runtimes: 4,800 wasted requests/minute
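The arithmetic above, spelled out (the 80% idle figure is the scenario's assumption, not a measurement):

```python
def wasted_requests_per_minute(num_runtimes, poll_interval_s=1.0, idle_fraction=0.8):
    """Polls that return an empty array, per minute, across all runtimes."""
    polls_per_min = 60 / poll_interval_s           # per runtime
    return num_runtimes * polls_per_min * idle_fraction
```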

Impact: Unnecessary network traffic and State Manager load.


Bottleneck 3: Database Load

Problem: Each poll triggers a MongoDB query, even when idle.

Current query pattern:

```python
# Atomically claim one CREATED state for this runtime's nodes
# and mark it QUEUED in the same operation
find_one_and_update(
    {"status": "CREATED", "node_name": {"$in": nodes}},
    {"$set": {"status": "QUEUED"}}
)
```

Analysis:

  • Queries per minute per runtime: 60
  • With 100 runtimes: 6,000 queries/minute (regardless of actual work)

Impact: MongoDB load scales with runtime count, not workload.


Bottleneck 4: Scalability Ceiling

Problem: Adding more runtimes linearly increases polling load.

Total requests/sec = num_runtimes / poll_interval

| Runtimes | Requests/sec to State Manager |
| --- | --- |
| 10 | 10 |
| 100 | 100 |
| 1000 | 1000 |

Impact: State Manager becomes a bottleneck before compute is saturated.


Bottleneck Summary

| Bottleneck | Severity | Affects |
| --- | --- | --- |
| Latency (~500ms avg) | Medium | Real-time workloads |
| Wasted requests | Low | Network/cost |
| Database load | Medium | MongoDB at scale |
| Scalability ceiling | High | Large deployments |

Part 3: Should We Move to Streaming?

Decision Matrix

| Factor | Polling | Streaming | Winner |
| --- | --- | --- | --- |
| Latency | ~500ms | ~10ms | Streaming |
| Idle resource usage | High | Near zero | Streaming |
| Implementation complexity | Simple | Moderate | Polling |
| Proxy/firewall compatibility | Excellent | Good (SSE) / Moderate (WS) | Polling |
| Failure recovery | Stateless, trivial | Requires reconnect logic | Polling |
| Scalability | O(n) requests | O(1) connections | Streaming |

When Polling is Sufficient

  • Small deployments (< 50 runtimes)
  • Batch/async workloads where 500ms latency is acceptable
  • Environments with restrictive proxies
  • Teams preferring operational simplicity

When Streaming is Needed

  • Large deployments (100+ runtimes)
  • Real-time or low-latency requirements
  • High state throughput (1000+ states/minute)
  • Cost-sensitive environments (reduce unnecessary requests)

Recommendation

Yes, streaming is worth pursuing for the following reasons:

  1. Scalability: Polling load grows linearly with runtime count regardless of workload; streaming load tracks actual work
  2. Latency: 50x improvement (500ms to 10ms) enables new use cases
  3. Efficiency: Eliminates waste in idle scenarios
  4. Industry standard: Most distributed task systems use push-based models

However, polling should remain as a fallback for compatibility.


Part 4: Implementation Options

Three viable streaming approaches exist. The choice depends on current needs vs. future extensibility.


Option A: Server-Sent Events (SSE) - Unidirectional

How it works: Server pushes states to workers over HTTP. Workers respond via existing REST endpoints.

```mermaid
flowchart LR
    SM[State Manager] -->|SSE: push states| RT[Runtime]
    RT -->|HTTP POST: /executed| SM
    RT -->|HTTP POST: /errored| SM
```
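A minimal sketch of the consuming side. The wire format is standard SSE (`data:` lines terminated by a blank line); the `/states/stream` endpoint named in the docstring is hypothetical, and a real client would iterate over an HTTP library's streaming response rather than a list of lines:

```python
import json

def iter_sse_events(lines):
    """Parse an SSE text stream into decoded JSON events.

    `lines` is any iterable of text lines, e.g. the chunked body of a
    hypothetical GET /states/stream. Per the SSE format, an event's
    payload is the concatenated `data:` lines, and a blank line
    terminates the event.
    """
    data_parts = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            data_parts.append(line[5:].lstrip())
        elif line == "" and data_parts:
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # comment lines (":" prefix) and other fields are ignored here
```

The runtime would then POST to /executed or /errored for each received state, exactly as it does today.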

Pros:

  • Simpler implementation
  • Works through most proxies/load balancers
  • Built-in browser reconnection (if dashboard uses it)
  • Lower server resource usage

Cons:

  • One-way only - responses still require HTTP calls
  • Cannot stream execution output back
  • Adding bidirectional later requires new infrastructure

Option B: WebSockets - Bidirectional

How it works: A full-duplex connection; both state delivery and responses flow over the same connection.

```mermaid
flowchart LR
    SM[State Manager] <-->|WebSocket: states + responses| RT[Runtime]
```

Pros:

  • Single persistent connection for everything
  • Can stream execution results/logs back in real-time
  • Lower latency for responses (no HTTP overhead)
  • Future-proof for additional use cases

Cons:

  • More complex implementation
  • Some proxies require special configuration
  • Need to implement ping/pong keepalive
  • Higher per-connection memory on server
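The reconnect logic listed in the cons is the main new moving part. A common shape is exponential backoff with full jitter; this sketch is library-agnostic (the actual connection would use e.g. `websockets` or `aiohttp`), and the defaults are illustrative:

```python
import random

def backoff_delays(base=0.5, cap=30.0, attempts=6, jitter=random.random):
    """Yield reconnect delays: exponential growth, capped, with full jitter.

    jitter() returns a float in [0, 1); full jitter avoids a thundering
    herd when many runtimes reconnect at once.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield ceiling * jitter()
```

On a successful message the attempt counter resets; ping/pong keepalive (commonly every 20-30s) is what detects a dead connection so this backoff can kick in.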

Option C: Message Queue / Durable Streams

How it works: State Manager publishes states to a message broker. Workers consume from the queue. Decouples producers from consumers entirely.

```mermaid
flowchart LR
    SM[State Manager] -->|Publish| MQ[(Message Queue)]
    MQ -->|Consume| RT1[Runtime 1]
    MQ -->|Consume| RT2[Runtime 2]
    MQ -->|Consume| RT3[Runtime 3]
    RT1 -->|HTTP POST| SM
```
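The key property bought here is consumer-group semantics: each state is delivered to exactly one worker in the group, with no custom claiming logic in the State Manager. A toy in-memory model of that contract (real deployments would get it from Redis Streams' `XREADGROUP`, Kafka consumer groups, or NATS JetStream pull consumers):

```python
from collections import defaultdict
from itertools import cycle

def distribute(states, worker_ids):
    """Model a consumer group: each state goes to exactly one worker.

    Brokers balance differently (round-robin, partition ownership,
    pull demand), but the once-per-group delivery contract is the same.
    """
    assignment = defaultdict(list)
    workers = cycle(worker_ids)
    for state in states:
        assignment[next(workers)].append(state)
    return dict(assignment)
```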

Technology Options:

| Technology | Latency | Durability | Complexity | Managed Options |
| --- | --- | --- | --- | --- |
| Redis Streams | ~1-5ms | Configurable (AOF/RDB) | Low | AWS ElastiCache, Redis Cloud |
| Apache Kafka | ~5-10ms | Strong (replicated log) | High | Confluent, AWS MSK |
| NATS JetStream | ~1-2ms | Strong (replicated) | Medium | Synadia Cloud |

Pros:

  • Decoupling: State Manager doesn't need to track worker connections
  • Built-in durability: Messages persist if workers are temporarily down
  • Consumer groups: Built-in work distribution (no custom claiming logic)
  • Horizontal scaling: Add workers without any State Manager changes
  • Replay capability: Can replay messages for recovery
  • Battle-tested: Production-grade queueing semantics

Cons:

  • Additional infrastructure: New component to deploy and operate
  • Cost: Managed services have costs; self-hosted requires expertise
  • Latency variance: Network hop to broker adds some latency
  • Overkill for small deployments: Adds complexity for simple setups

Message Queue: Technology Recommendations

If starting fresh:

  • NATS JetStream - Lowest latency, simple to operate, lightweight
  • Good for: Most Exosphere deployments

If already using Redis:

  • Redis Streams - No new infrastructure, familiar tooling
  • Good for: Teams with Redis expertise

If enterprise/high-throughput:

  • Apache Kafka - Industry standard for high-volume streaming
  • Good for: 10K+ states/second, strict ordering requirements

Next Steps

  1. Decide: Should we move to streaming?

    • Is ~500ms latency acceptable for our use cases?
    • Are we hitting scalability limits with polling?
    • Do users need real-time responsiveness?
  2. If yes, gather requirements:

    • Do we need bidirectional streaming (logs, progress) in next 6-12 months?
    • Do we expect 100+ runtimes at scale?
    • Is durability/replay a requirement?
    • Do we already have Redis/Kafka/NATS infrastructure?
  3. Based on requirements, shortlist options:

    • Small scale, no durability needed → SSE or WebSocket
    • Large scale or durability needed → Message Queue
  4. Validate infrastructure prerequisites for shortlisted options

  5. Determine the simplest path forward based on constraints and team expertise


References

Codebase:

  • Current polling code: python-sdk/exospherehost/runtime.py (_enqueue method)
  • State query logic: state-manager/app/controller/enqueue_states.py

