Description
Objective
- Identify and quantify bottlenecks in the current polling-based architecture
- Determine if streaming is the right solution
- If yes, outline the least-effort path to implementation
Part 1: Current Polling Architecture
How It Works
Workers (Runtimes) poll the State Manager every poll_interval (default: 1 second) asking for work:
```mermaid
sequenceDiagram
    participant RT as Runtime
    participant SM as State Manager
    participant DB as MongoDB
    loop Every 1 second
        RT->>SM: POST /states/enqueue
        SM->>DB: Query for CREATED states
        SM->>DB: Update status to QUEUED
        SM-->>RT: Return states (or empty array)
    end
```
Key Configuration
| Parameter | Default | Purpose |
|---|---|---|
| poll_interval | 1s | Time between polls |
| batch_size | 16 | States fetched per poll |
| workers | 4 | Parallel executors per runtime |
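For illustration, a minimal sketch of this polling loop from the runtime side, assuming the `POST /states/enqueue` endpoint in the diagram returns a JSON array of claimed states. The request payload, base URL, and `execute_state` are placeholders for the sketch, not the SDK's actual API:

```python
import time

import requests

STATE_MANAGER_URL = "http://state-manager:8000"  # assumed base URL
POLL_INTERVAL = 1.0   # poll_interval: time between polls
BATCH_SIZE = 16       # batch_size: states fetched per poll


def execute_state(state: dict) -> None:
    """Placeholder for handing the state to one of the parallel executors."""
    print("executing", state.get("id"))


def poll_forever(node_names: list[str]) -> None:
    """Repeatedly ask the State Manager for CREATED states, then sleep."""
    while True:
        resp = requests.post(
            f"{STATE_MANAGER_URL}/states/enqueue",
            json={"nodes": node_names, "batch_size": BATCH_SIZE},
            timeout=10,
        )
        for state in resp.json():  # empty list when there is no work
            execute_state(state)
        # Any state created during this sleep waits up to POLL_INTERVAL
        # before the next poll picks it up.
        time.sleep(POLL_INTERVAL)
```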
Part 2: Bottleneck Analysis
Bottleneck 1: Latency
Problem: States wait up to poll_interval before being picked up.
| Scenario | Latency |
|---|---|
| Best case (state created just before poll) | ~0ms |
| Worst case (state created just after poll) | ~1000ms |
| Average (uniform arrivals over the interval) | ~500ms |
Impact: For latency-sensitive workflows (real-time processing, user-facing operations), 500ms average delay may be unacceptable.
Bottleneck 2: Wasted Requests
Problem: Runtimes poll even when no work is available.
Analysis (per runtime):
- Polls per minute: 60
- If queue is empty 80% of the time: 48 wasted requests/minute
- With 100 runtimes: 4,800 wasted requests/minute
Impact: Unnecessary network traffic and State Manager load.
Bottleneck 3: Database Load
Problem: Each poll triggers a MongoDB query, even when idle.
Current query pattern:
```python
find_one_and_update(
    {"status": "CREATED", "node_name": {"$in": nodes}},
    {"$set": {"status": "QUEUED"}}
)
```
Analysis:
- Queries per minute per runtime: 60
- With 100 runtimes: 6,000 queries/minute (regardless of actual work)
Impact: MongoDB load scales with runtime count, not workload.
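For context, a rough sketch of how the claim above might be issued on each poll from an async controller using Motor. The connection string, database/collection names, and batching loop are assumptions; only the filter and update come from the snippet above:

```python
from motor.motor_asyncio import AsyncIOMotorClient
from pymongo import ReturnDocument

# Assumed connection string and database/collection names, for illustration only.
db = AsyncIOMotorClient("mongodb://localhost:27017")["exosphere"]


async def claim_states(nodes: list[str], batch_size: int) -> list[dict]:
    """Atomically claim up to batch_size CREATED states for the given nodes."""
    claimed: list[dict] = []
    for _ in range(batch_size):
        # One round trip per claim attempt; on an idle queue the first call
        # returns None, but the poll has still cost a query.
        state = await db.states.find_one_and_update(
            {"status": "CREATED", "node_name": {"$in": nodes}},
            {"$set": {"status": "QUEUED"}},
            return_document=ReturnDocument.AFTER,
        )
        if state is None:
            break
        claimed.append(state)
    return claimed
```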
Bottleneck 4: Scalability Ceiling
Problem: Adding more runtimes linearly increases polling load.
Total requests/sec = num_runtimes / poll_interval
| Runtimes | Requests/sec to State Manager |
|---|---|
| 10 | 10 |
| 100 | 100 |
| 1000 | 1000 |
Impact: State Manager becomes a bottleneck before compute is saturated.
Bottleneck Summary
| Bottleneck | Severity | Affects |
|---|---|---|
| Latency (~500ms avg) | Medium | Real-time workloads |
| Wasted requests | Low | Network/cost |
| Database load | Medium | MongoDB at scale |
| Scalability ceiling | High | Large deployments |
Part 3: Should We Move to Streaming?
Decision Matrix
| Factor | Polling | Streaming | Winner |
|---|---|---|---|
| Latency | ~500ms | ~10ms | Streaming |
| Idle resource usage | High | Near zero | Streaming |
| Implementation complexity | Simple | Moderate | Polling |
| Proxy/firewall compatibility | Excellent | Good (SSE) / Moderate (WS) | Polling |
| Failure recovery | Stateless, trivial | Requires reconnect logic | Polling |
| Scalability | O(n) requests/sec, even when idle | One persistent connection per runtime, near-zero idle traffic | Streaming |
When Polling is Sufficient
- Small deployments (< 50 runtimes)
- Batch/async workloads where 500ms latency is acceptable
- Environments with restrictive proxies
- Teams preferring operational simplicity
When Streaming is Needed
- Large deployments (100+ runtimes)
- Real-time or low-latency requirements
- High state throughput (1000+ states/minute)
- Cost-sensitive environments (reduce unnecessary requests)
Recommendation
Yes, streaming is worth pursuing for the following reasons:
- Scalability: polling load grows linearly with runtime count even when idle; streaming removes that per-poll overhead
- Latency: a ~50x improvement (~500ms to ~10ms) enables new use cases
- Efficiency: eliminates wasted requests and queries when queues are idle
- Industry standard: Most distributed task systems use push-based models
However, polling should remain as a fallback for compatibility.
Part 4: Implementation Options
Two viable streaming approaches exist. The choice depends on current needs vs. future extensibility.
Option A: Server-Sent Events (SSE) - Unidirectional
How it works: Server pushes states to workers over HTTP. Workers respond via existing REST endpoints.
```mermaid
flowchart LR
    SM[State Manager] -->|SSE: push states| RT[Runtime]
    RT -->|HTTP POST: /executed| SM
    RT -->|HTTP POST: /errored| SM
```
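A minimal sketch of the runtime side under this option, consuming an SSE stream with `httpx` and reporting results over plain HTTP. The `/states/stream` path, the per-state response endpoints, and the payload shapes are assumptions, not existing endpoints:

```python
import json

import httpx

STATE_MANAGER_URL = "http://state-manager:8000"  # assumed base URL


def run(state: dict) -> dict:
    """Placeholder executor."""
    return {"status": "ok", "state_id": state["id"]}


def consume_states() -> None:
    """Read pushed states from an SSE stream; report results over plain HTTP."""
    with httpx.Client(timeout=None) as client:
        with client.stream("GET", f"{STATE_MANAGER_URL}/states/stream") as resp:
            for line in resp.iter_lines():
                if not line.startswith("data:"):
                    continue  # skip keepalive comments and blank separators
                state = json.loads(line[len("data:"):])
                try:
                    result = run(state)
                    client.post(
                        f"{STATE_MANAGER_URL}/states/{state['id']}/executed",
                        json=result,
                    )
                except Exception as exc:
                    client.post(
                        f"{STATE_MANAGER_URL}/states/{state['id']}/errored",
                        json={"error": str(exc)},
                    )
```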
Pros:
- Simpler implementation
- Works through most proxies/load balancers
- Built-in browser reconnection (if dashboard uses it)
- Lower server resource usage
Cons:
- One-way only - responses still require HTTP calls
- Cannot stream execution output back
- Adding bidirectional later requires new infrastructure
Option B: WebSockets - Bidirectional
How it works: A full-duplex connection; both state delivery and responses flow over the same connection.
```mermaid
flowchart LR
    SM[State Manager] <-->|WebSocket: states + responses| RT[Runtime]
```
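A minimal sketch of the runtime side using the `websockets` library; the `/ws/runtime` path and message format are assumptions:

```python
import asyncio
import json

import websockets

STATE_MANAGER_WS = "ws://state-manager:8000/ws/runtime"  # assumed endpoint


async def consume_states() -> None:
    """Receive states and send results back over one duplex connection."""
    # ping_interval/ping_timeout provide the keepalive mentioned in the cons below.
    async with websockets.connect(
        STATE_MANAGER_WS, ping_interval=20, ping_timeout=20
    ) as ws:
        async for raw in ws:
            state = json.loads(raw)
            result = {"state_id": state["id"], "status": "executed"}
            # The response flows back over the same connection: no extra HTTP call.
            await ws.send(json.dumps(result))


if __name__ == "__main__":
    asyncio.run(consume_states())
```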
Pros:
- Single persistent connection for everything
- Can stream execution results/logs back in real-time
- Lower latency for responses (no HTTP overhead)
- Future-proof for additional use cases
Cons:
- More complex implementation
- Some proxies require special configuration
- Need to implement ping/pong keepalive
- Higher per-connection memory on server
Option C: Message Queue / Durable Streams
How it works: State Manager publishes states to a message broker. Workers consume from the queue. Decouples producers from consumers entirely.
```mermaid
flowchart LR
    SM[State Manager] -->|Publish| MQ[(Message Queue)]
    MQ -->|Consume| RT1[Runtime 1]
    MQ -->|Consume| RT2[Runtime 2]
    MQ -->|Consume| RT3[Runtime 3]
    RT1 -->|HTTP POST| SM
```
Technology Options:
| Technology | Latency | Durability | Complexity | Managed Options |
|---|---|---|---|---|
| Redis Streams | ~1-5ms | Configurable (AOF/RDB) | Low | AWS ElastiCache, Redis Cloud |
| Apache Kafka | ~5-10ms | Strong (replicated log) | High | Confluent, AWS MSK |
| NATS JetStream | ~1-2ms | Strong (replicated) | Medium | Synadia Cloud |
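As a concrete instance of this option, a hedged sketch using Redis Streams (first row above) with a consumer group; the stream, group, consumer, and field names are assumptions:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)
STREAM, GROUP, CONSUMER = "exosphere:states", "runtimes", "runtime-1"  # assumed names

# Create the consumer group once; ignore the error if it already exists.
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass

# Producer side (State Manager): publish a state instead of waiting to be polled.
r.xadd(STREAM, {"state": json.dumps({"id": "abc123", "node_name": "my-node"})})

# Consumer side (Runtime): block until work arrives instead of polling MongoDB.
while True:
    batches = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=16, block=5000)
    for _stream, messages in batches:
        for msg_id, fields in messages:
            state = json.loads(fields[b"state"])
            # ... execute the state, then acknowledge it ...
            r.xack(STREAM, GROUP, msg_id)
```

The consumer group tracks delivered-but-unacknowledged messages per consumer, which is what provides the built-in work distribution and replay capability listed below.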
Pros:
- Decoupling: State Manager doesn't need to track worker connections
- Built-in durability: Messages persist if workers are temporarily down
- Consumer groups: Built-in work distribution (no custom claiming logic)
- Horizontal scaling: Add workers without any State Manager changes
- Replay capability: Can replay messages for recovery
- Battle-tested: Production-grade queueing semantics
Cons:
- Additional infrastructure: New component to deploy and operate
- Cost: Managed services have costs; self-hosted requires expertise
- Latency variance: Network hop to broker adds some latency
- Overkill for small deployments: Adds complexity for simple setups
Message Queue: Technology Recommendations
If starting fresh:
- NATS JetStream - Lowest latency, simple to operate, lightweight
- Good for: Most Exosphere deployments
If already using Redis:
- Redis Streams - No new infrastructure, familiar tooling
- Good for: Teams with Redis expertise
If enterprise/high-throughput:
- Apache Kafka - Industry standard for high-volume streaming
- Good for: 10K+ states/second, strict ordering requirements
Next Steps
1. Decide: should we move to streaming?
   - Is ~500ms latency acceptable for our use cases?
   - Are we hitting scalability limits with polling?
   - Do users need real-time responsiveness?
2. If yes, gather requirements:
   - Do we need bidirectional streaming (logs, progress) in the next 6-12 months?
   - Do we expect 100+ runtimes at scale?
   - Is durability/replay a requirement?
   - Do we already have Redis/Kafka/NATS infrastructure?
3. Based on requirements, shortlist options:
   - Small scale, no durability needed → SSE or WebSocket
   - Large scale or durability needed → Message Queue
4. Validate infrastructure prerequisites for the shortlisted options.
5. Determine the simplest path forward based on constraints and team expertise.
References
Codebase:
- Current polling code:
- Current polling code: `python-sdk/exospherehost/runtime.py` (`_enqueue` method)
- State query logic: `state-manager/app/controller/enqueue_states.py`
Option A - SSE:
Option B - WebSocket:
Option C - Message Queues:
Common (MongoDB Change Streams):