Note: This analysis is based on the agent branch.
cc @subindevs
Background
The device layer architecture has evolved organically and now has several inefficiencies that introduce latency and confusion. This issue documents the current state and its problems, and provides a refactoring roadmap.
Why Current Design Exists (Benefits to Preserve)
- Process Isolation - RPyC keeps application-layer crashes from taking down the hardware control process
- Bluesky Templates - Plan generation provides a safe scaffold for agentic access
- Stage Limits - Ophyd devices enforce physical safety limits
- Restricted MMCore Access - Agents never get raw hardware access
Current Architecture
Startup Sequence (Production)
1. start_server.py (RPyC server with MMCore on port 18861)
2. start_services.bat (launches simple_server.py + sam_server.py)
3. launch_copilot.py (orchestrator)
Components
| Component | Port | Role |
| --- | --- | --- |
| `start_server.py` | 18861 | RPyC MMCore server - initializes MMCore, creates RunEngine + devices, but unused by copilot |
| `backend/simple_server.py` | 60610 | HTTP microscope API - connects to MMCore via RPyC, creates separate RunEngine + devices, actually used by copilot |
| `backend/sam_server.py` | 18862 | RPyC SAM detection server |
| `launch_copilot.py` | - | Orchestrator - connects to simple_server (HTTP) and SAM server (RPyC) |
| `queue_server_startup.py` | - | Bluesky queueserver - unused in current startup |
Data Flow (Current - Multiple Hops)
Physical Hardware
↓
MMCore (in start_server.py via RPyC)
↓
RPyC bridge to simple_server.py ← HOP 1
↓
Ophyd Device (in simple_server.py)
↓
RunEngine executes plan
↓
JSON serialize numpy array
↓
HTTP response to copilot ← HOP 2 (full data transfer)
↓
Copilot stores to DataStore
↓
VizServer retrieves from DataStore
Problems
1. RPyC/HTTP Latency for Large Image Stacks
- Volume acquisition returns entire numpy arrays through HTTP as JSON
- Pickle/JSON serialization + network transfer introduces significant latency
- Becomes worse with larger volumes (e.g., 200 slices × 2048 × 2048)
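A quick, dependency-free sketch makes the serialization overhead concrete. The sizes here are hypothetical and far smaller than a real 200 × 2048 × 2048 volume, but the blow-up factor is representative:

```python
import json
import numpy as np

# Hypothetical tiny volume: 4 slices of 64x64 uint16 (real volumes are far larger).
volume = np.full((4, 64, 64), 1234, dtype=np.uint16)

raw_bytes = volume.nbytes                       # binary size (on disk / shared memory)
json_bytes = len(json.dumps(volume.tolist()))   # size after JSON text encoding

print(f"raw: {raw_bytes} B, json: {json_bytes} B, "
      f"blow-up: {json_bytes / raw_bytes:.1f}x")
```

Every pixel becomes several ASCII characters plus separators, so the JSON payload is a multiple of the binary size before network transfer even starts, and the cost scales linearly with voxel count.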
2. Duplicate Hardware Initialization
- start_server.py creates MMCore + RunEngine + devices
- backend/simple_server.py creates a separate RunEngine + devices
- queue_server_startup.py creates yet another set
- Multiple independent instances; unclear which is canonical
3. Device Wiring Confusion
- start_server.py has a RunEngine and devices, but they are not used
- simple_server.py is the control server actually used by copilot
- Unclear which system controls the hardware
4. Data Persistence Location
- Data flows: Control layer → HTTP → Client → DataStore
- Should be: Control layer → Direct disk write → Pass UID via HTTP
- Affects: DataStore, VizServer, ImageManager
Target Architecture
Startup (2 scripts)
1. start_device_layer.py (all hardware: MMCore + devices + HTTP + SAM)
2. launch_copilot.py (application layer)
Data Flow (Target - Direct Writes)
Physical Hardware
↓
MMCore (in start_device_layer.py)
↓
Ophyd Device + RunEngine
↓
Write volume to disk: {storage}/volumes/{uid}.tif
↓
HTTP response: {uid, path, shape, dtype, session_id} ← METADATA ONLY
↓
Copilot registers UID in DataStore index
↓
VizServer reads from disk by UID
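The write-then-return-metadata step above can be sketched as follows. `save_volume()` and the storage root are hypothetical names for illustration; the issue targets `.tif` files (e.g. via tifffile), but `.npy` is used here to keep the sketch dependency-free:

```python
import tempfile
import uuid
from pathlib import Path

import numpy as np

# Hypothetical storage root; the target layout is {storage}/volumes/{uid}.tif.
STORAGE = Path(tempfile.gettempdir()) / "gently_storage_demo"

def save_volume(volume: np.ndarray, session_id: str) -> dict:
    """Write the volume to disk and return a metadata-only payload."""
    uid = uuid.uuid4().hex
    vol_dir = STORAGE / "volumes"
    vol_dir.mkdir(parents=True, exist_ok=True)
    path = vol_dir / f"{uid}.npy"
    np.save(path, volume)  # direct disk write; pixel data never crosses HTTP
    # This dict is the entire HTTP response body in the target flow.
    return {
        "uid": uid,
        "path": str(path),
        "shape": list(volume.shape),
        "dtype": str(volume.dtype),
        "session_id": session_id,
    }

meta = save_volume(np.zeros((2, 8, 8), dtype=np.uint16), session_id="demo")
print(meta["shape"], meta["dtype"])
```

The response size is now constant regardless of volume size; copilot and VizServer resolve the UID to a path and read from disk themselves.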
Refactoring Tasks
Task 1: Consolidate Hardware Control Layer into Single Script
Create start_device_layer.py that initializes MMCore, the Ophyd devices + RunEngine, the HTTP microscope API, and the SAM detection server in one process (replacing start_server.py, backend/simple_server.py, and start_services.bat).
Task 2: Direct Data Writing from Control Layer
- Write volumes to disk from the control layer and return {uid, path, shape, dtype, session_id} over HTTP
- Pass session_id from copilot to the control layer when acquiring
Task 3: Update DataStore Integration
- Update gently/core/data_store.py to index files written by the control layer
- Add a register_external_file(uid, path, metadata) method
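A hedged sketch of what `register_external_file()` could look like. The real DataStore API in `gently/core/data_store.py` is not shown in this issue, so the class below is a minimal stand-in for the index it would keep:

```python
class DataStore:
    """Minimal stand-in for gently/core/data_store.py (assumed API)."""

    def __init__(self) -> None:
        self._index: dict[str, dict] = {}  # uid -> file metadata

    def register_external_file(self, uid: str, path: str, metadata: dict) -> None:
        """Index a file written directly by the control layer (no data copy)."""
        self._index[uid] = {"path": path, **metadata}

    def lookup(self, uid: str) -> dict:
        """Resolve a UID to the on-disk file and its metadata."""
        return self._index[uid]

store = DataStore()
store.register_external_file(
    "abc123",
    "/data/volumes/abc123.tif",
    {"shape": [200, 2048, 2048], "dtype": "uint16"},
)
print(store.lookup("abc123")["path"])
```

The key design point is that the DataStore only indexes; the control layer remains the sole writer of pixel data.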
Task 4: Update VizServer
- Update gently/visualization/server.py to serve data by UID
- Ensure /api/images/{uid} endpoints work with the new storage
Task 5: Update Copilot/Orchestrator
- Update gently/agent/queue_server_client.py to work with UID-based responses
- Update ImageManager to handle the new data flow
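On the consumer side, a sketch of how a client such as ImageManager might handle a UID-based response: `load_by_uid()` is a hypothetical helper, and `.npy` again stands in for `.tif` to keep the example dependency-free:

```python
import tempfile
from pathlib import Path

import numpy as np

def load_by_uid(response: dict) -> np.ndarray:
    """Load the acquired volume referenced by a metadata-only response."""
    arr = np.load(response["path"])
    # Sanity-check the file on disk against the advertised metadata.
    assert list(arr.shape) == response["shape"]
    assert str(arr.dtype) == response["dtype"]
    return arr

# Round-trip demo with a tiny stand-in volume.
path = Path(tempfile.gettempdir()) / "demo_uid.npy"
np.save(path, np.ones((2, 4, 4), dtype=np.uint16))
vol = load_by_uid({"path": str(path), "shape": [2, 4, 4], "dtype": "uint16"})
print(vol.shape)
```

This assumes copilot and the control layer share a filesystem; if they do not, the VizServer's /api/images/{uid} endpoint would serve the bytes instead.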
Task 6: Clean Up Unused Code
- Remove queue_server_startup.py if not used
- Remove start_services.bat (no longer needed)
Files to Modify/Create
| File | Change |
| --- | --- |
| NEW `start_device_layer.py` | Single script for device layer |
| `start_server.py` | Remove or merge |
| `backend/simple_server.py` | Merge, add direct disk writes |
| `backend/sam_server.py` | Merge or keep separate |
| `start_services.bat` | Remove |
| `gently/core/data_store.py` | Add `register_external_file()` |
| `gently/visualization/server.py` | Update to serve from new storage |
| `gently/agent/queue_server_client.py` | Handle UID-based responses |
| `gently/agent/image_manager.py` | Load from disk by UID |
| `launch_copilot.py` | Update connection to unified device layer |