
Refactor device layer: consolidate servers, direct disk writes, reduce latency #14

@pskeshu

Note: This analysis is based on the agent branch.

cc @subindevs

Background

The device layer architecture has evolved organically and now contains several inefficiencies that introduce latency and confusion. This issue documents the current state and its problems, and lays out a refactoring roadmap.

Why Current Design Exists (Benefits to Preserve)

  1. Process Isolation - RPyC keeps application crashes from taking down the hardware process
  2. Bluesky Templates - Plan generation provides safe scaffold for agentic access
  3. Stage Limits - Ophyd devices enforce physical safety limits
  4. Restricted MMCore Access - Agents don't get raw hardware access

Current Architecture

Startup Sequence (Production)

1. start_server.py      (RPyC server with MMCore on port 18861)
2. start_services.bat   (launches simple_server.py + sam_server.py)
3. launch_copilot.py    (orchestrator)

Components

Component                  Port    Role
start_server.py            18861   RPyC MMCore server: initializes MMCore, creates RunEngine + devices, but unused by copilot
backend/simple_server.py   60610   HTTP microscope API: connects to MMCore via RPyC, creates separate RunEngine + devices, actually used by copilot
backend/sam_server.py      18862   RPyC SAM detection server
launch_copilot.py          -       Orchestrator: connects to simple_server (HTTP) and SAM server (RPyC)
queue_server_startup.py    -       Bluesky queueserver: unused in current startup

Data Flow (Current - Multiple Hops)

Physical Hardware
    ↓
MMCore (in start_server.py via RPyC)
    ↓
RPyC bridge to simple_server.py  ← HOP 1
    ↓
Ophyd Device (in simple_server.py)
    ↓
RunEngine executes plan
    ↓
JSON serialize numpy array
    ↓
HTTP response to copilot  ← HOP 2 (full data transfer)
    ↓
Copilot stores to DataStore
    ↓
VizServer retrieves from DataStore

Problems

1. RPyC/HTTP Latency for Large Image Stacks

  • Volume acquisition returns entire numpy arrays through HTTP as JSON
  • Pickle/JSON serialization + network transfer introduces significant latency
  • Becomes worse with larger volumes (e.g., 200 slices × 2048 × 2048)
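The size inflation is easy to see with a toy stdlib-only sketch (real acquisitions go through numpy and HTTP, but the effect is the same):

```python
import array
import json

# A tiny stand-in for one camera frame: 4096 uint16 pixels.
# (A real 200 x 2048 x 2048 uint16 volume is ~1.7 GB raw.)
frame = array.array("H", range(4096))

raw_bytes = frame.tobytes()                     # binary, 2 bytes/pixel
json_bytes = json.dumps(list(frame)).encode()   # decimal text + separators

print(len(raw_bytes))                    # 8192 bytes
print(len(json_bytes) > 2 * len(raw_bytes))  # True: JSON is several times larger
```

On top of the size inflation, every pixel is converted to decimal text and parsed back, which dominates response time for large volumes.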

2. Duplicate Hardware Initialization

  • start_server.py creates MMCore + RunEngine + devices
  • backend/simple_server.py creates separate RunEngine + devices
  • queue_server_startup.py creates yet another set
  • Multiple independent instances, unclear which is canonical

3. Device Wiring Confusion

  • start_server.py has RunEngine and devices but they're not used
  • simple_server.py is the actual control server used by copilot
  • Unclear which system controls hardware

4. Data Persistence Location

  • Data flows: Control layer → HTTP → Client → DataStore
  • Should be: Control layer → Direct disk write → Pass UID via HTTP
  • Affects: DataStore, VizServer, ImageManager

Target Architecture

Startup (2 scripts)

1. start_device_layer.py   (all hardware: MMCore + devices + HTTP + SAM)
2. launch_copilot.py       (application layer)

Data Flow (Target - Direct Writes)

Physical Hardware
    ↓
MMCore (in start_device_layer.py)
    ↓
Ophyd Device + RunEngine
    ↓
Write volume to disk: {storage}/volumes/{uid}.tif
    ↓
HTTP response: {uid, path, shape, dtype, session_id}  ← METADATA ONLY
    ↓
Copilot registers UID in DataStore index
    ↓
VizServer reads from disk by UID

Refactoring Tasks

Task 1: Consolidate Hardware Control Layer into Single Script

Create start_device_layer.py that:

  • Initializes MMCore directly
  • Creates single set of Ophyd devices
  • Creates single RunEngine
  • Starts HTTP server for plan execution
  • Starts SAM server (or integrates into same process)
  • Uses HTTP only for application communication
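A rough shape for the consolidated script, as a sketch only: the MMCore/ophyd/RunEngine calls below are stand-ins for the project's real initialization code, not its actual API.

```python
def init_mmcore():
    # Real code: pymmcore.CMMCore() + loading the system configuration.
    return object()

def build_devices(core):
    # One set of ophyd devices (stage with limits, camera) wrapping core.
    return {"stage": "xy_stage", "camera": "camera"}

def build_run_engine():
    # One RunEngine, shared by every HTTP plan-execution request.
    return object()

def main():
    core = init_mmcore()
    devices = build_devices(core)
    engine = build_run_engine()
    # start_http_server(engine, devices)   # plan execution endpoint
    # start_sam_server()                   # or run SAM in this process
    return devices

print(sorted(main()))   # ['camera', 'stage']
```

The key property is that MMCore, the devices, and the RunEngine are each created exactly once, in one process, so there is no ambiguity about which instance is canonical.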

Task 2: Direct Data Writing from Control Layer

  • Write volumes directly to disk in the control layer
  • Return UID + metadata instead of full numpy arrays
  • Flat storage with session in metadata:
    {storage_path}/volumes/{uid}.tif
    
  • Session association tracked in index/metadata, not file path
  • Return in HTTP response: {uid, path, shape, dtype, session_id}
  • Pass session_id from copilot to control layer when acquiring
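The control-layer side of this task could look roughly like the following self-contained sketch. Raw bytes and the stdlib stand in for tifffile and the real acquisition objects, and `acquire_and_store` is a hypothetical name:

```python
import json
import os
import tempfile
import uuid

def acquire_and_store(volume_bytes, shape, dtype, session_id, storage_path):
    # Persist the volume on the control-layer machine...
    uid = str(uuid.uuid4())
    path = os.path.join(storage_path, "volumes", f"{uid}.tif")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:   # real code: tifffile.imwrite(path, volume)
        f.write(volume_bytes)
    # ...and hand back metadata only; session lives in metadata, not the path.
    return {"uid": uid, "path": path, "shape": shape,
            "dtype": dtype, "session_id": session_id}

storage = tempfile.mkdtemp()
meta = acquire_and_store(b"\x00" * 16, [2, 2, 2], "uint16", "sess-001", storage)
print(os.path.exists(meta["path"]))   # True: pixels live on disk
print(len(json.dumps(meta)) < 1024)   # True: HTTP payload stays tiny
```

The HTTP response size is now constant regardless of volume size, which removes HOP 2 from the current data flow entirely.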

Task 3: Update DataStore Integration

  • Modify gently/core/data_store.py to index files written by control layer
  • Implement register_external_file(uid, path, metadata) method
  • Ensure lineage tracking still works with external writes
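`register_external_file` is the method name proposed above; here is a minimal sketch of what it might record. The index layout (including `parent_uid` for lineage) is an assumption, not the existing `data_store.py` schema:

```python
class DataStore:
    def __init__(self):
        self._index = {}

    def register_external_file(self, uid, path, metadata):
        # The control layer already wrote the file; only index it here.
        # Keeping parent_uid in the metadata preserves lineage tracking.
        self._index[uid] = {"path": path, **metadata}

    def lookup(self, uid):
        return self._index[uid]

store = DataStore()
store.register_external_file(
    "abc123", "/data/volumes/abc123.tif",
    {"session_id": "sess-001", "parent_uid": None},
)
print(store.lookup("abc123")["session_id"])   # sess-001
```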

Task 4: Update VizServer

  • Modify gently/visualization/server.py to serve data by UID
  • Watch for new files or receive push notifications from control layer
  • Ensure existing /api/images/{uid} endpoints work with new storage
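On the VizServer side, serving by UID reduces to resolving the UID through the index and streaming bytes from disk. This is a sketch; the actual handler wiring for the /api/images/{uid} endpoint is omitted:

```python
import os
import tempfile

def serve_image_bytes(uid, index):
    # Resolve UID -> path via the DataStore index, then read from disk
    # instead of pulling pixels out of an in-memory store.
    path = index[uid]["path"]
    with open(path, "rb") as f:
        return f.read()

# Simulate a file the control layer wrote earlier.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "abc123.tif")
with open(path, "wb") as f:
    f.write(b"pixels")

index = {"abc123": {"path": path}}
print(serve_image_bytes("abc123", index))   # b'pixels'
```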

Task 5: Update Copilot/Orchestrator

  • Modify gently/agent/queue_server_client.py to work with UID-based responses
  • Load data from disk when needed instead of receiving through HTTP
  • Update ImageManager to handle new data flow
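ImageManager then becomes a lazy, UID-keyed loader: it fetches from disk on first access and caches, instead of holding data pushed through HTTP. A hypothetical shape, not the current class:

```python
import os
import tempfile

class ImageManager:
    def __init__(self, index):
        self._index = index    # uid -> {"path": ...} from the DataStore
        self._cache = {}

    def get(self, uid):
        # Load from disk on first access only; later calls hit the cache.
        if uid not in self._cache:
            with open(self._index[uid]["path"], "rb") as f:
                self._cache[uid] = f.read()
        return self._cache[uid]

tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "vol.tif")
with open(path, "wb") as f:
    f.write(b"pixels")

manager = ImageManager({"uid-1": {"path": path}})
print(manager.get("uid-1"))   # b'pixels' (read from disk)
print(manager.get("uid-1"))   # b'pixels' (served from cache)
```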

Task 6: Clean Up Unused Code

  • Remove or document queue_server_startup.py if not used
  • Clarify which server pattern is canonical
  • Remove start_services.bat (no longer needed)

Files to Modify/Create

File                                  Change
start_device_layer.py (NEW)           Single script for device layer
start_server.py                       Remove or merge
backend/simple_server.py              Merge, add direct disk writes
backend/sam_server.py                 Merge or keep separate
start_services.bat                    Remove
gently/core/data_store.py             Add register_external_file()
gently/visualization/server.py        Update to serve from new storage
gently/agent/queue_server_client.py   Handle UID-based responses
gently/agent/image_manager.py         Load from disk by UID
launch_copilot.py                     Update connection to unified device layer

Labels

architecture (System architecture), hardware (Hardware/device related), refactor (Code refactoring)
