
Refactor device layer: consolidate servers, direct disk writes, reduce latency #14

@pskeshu

Note: This analysis is based on the agent branch.

cc @subindevs

Background

The device layer architecture has evolved organically and now contains several inefficiencies that introduce latency and confusion. This issue documents the current state and its problems, and lays out a refactoring roadmap.

Why Current Design Exists (Benefits to Preserve)

  1. Process Isolation - RPyC keeps application crashes from taking down the hardware process
  2. Bluesky Templates - Plan generation provides safe scaffold for agentic access
  3. Stage Limits - Ophyd devices enforce physical safety limits
  4. Restricted MMCore Access - Agents don't get raw hardware access

Current Architecture

Startup Sequence (Production)

1. start_server.py      (RPyC server with MMCore on port 18861)
2. start_services.bat   (launches simple_server.py + sam_server.py)
3. launch_copilot.py    (orchestrator)

Components

Component                  Port    Role
start_server.py            18861   RPyC MMCore server: initializes MMCore, creates RunEngine + devices, but unused by copilot
backend/simple_server.py   60610   HTTP microscope API: connects to MMCore via RPyC, creates separate RunEngine + devices, actually used by copilot
backend/sam_server.py      18862   RPyC SAM detection server
launch_copilot.py          -       Orchestrator: connects to simple_server (HTTP) and SAM server (RPyC)
queue_server_startup.py    -       Bluesky queueserver: unused in current startup

Data Flow (Current - Multiple Hops)

Physical Hardware
    ↓
MMCore (in start_server.py via RPyC)
    ↓
RPyC bridge to simple_server.py  ← HOP 1
    ↓
Ophyd Device (in simple_server.py)
    ↓
RunEngine executes plan
    ↓
JSON serialize numpy array
    ↓
HTTP response to copilot  ← HOP 2 (full data transfer)
    ↓
Copilot stores to DataStore
    ↓
VizServer retrieves from DataStore

Problems

1. RPyC/HTTP Latency for Large Image Stacks

  • Volume acquisition returns entire numpy arrays through HTTP as JSON
  • Pickle/JSON serialization + network transfer introduces significant latency
  • Becomes worse with larger volumes (e.g., 200 slices × 2048 × 2048)
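The size inflation is easy to see with a toy stdlib-only sketch (real acquisitions go through numpy and HTTP, but the effect is the same):

```python
import array
import json

# A tiny stand-in for one camera frame: 4096 uint16 pixels.
# (A real 200 x 2048 x 2048 uint16 volume is ~1.7 GB raw.)
frame = array.array("H", range(4096))

raw_bytes = frame.tobytes()                     # binary, 2 bytes/pixel
json_bytes = json.dumps(list(frame)).encode()   # decimal text + separators

print(len(raw_bytes))                    # 8192 bytes
print(len(json_bytes) > 2 * len(raw_bytes))  # True: JSON is several times larger
```

On top of the size inflation, every pixel is converted to decimal text and parsed back, which dominates response time for large volumes.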

2. Duplicate Hardware Initialization

  • start_server.py creates MMCore + RunEngine + devices
  • backend/simple_server.py creates separate RunEngine + devices
  • queue_server_startup.py creates yet another set
  • Multiple independent instances, unclear which is canonical

3. Device Wiring Confusion

  • start_server.py has RunEngine and devices but they're not used
  • simple_server.py is the actual control server used by copilot
  • Unclear which system controls hardware

4. Data Persistence Location

  • Data flows: Control layer → HTTP → Client → DataStore
  • Should be: Control layer → Direct disk write → Pass UID via HTTP
  • Affects: DataStore, VizServer, ImageManager

Target Architecture

Startup (2 scripts)

1. start_device_layer.py   (all hardware: MMCore + devices + HTTP + SAM)
2. launch_copilot.py       (application layer)

Data Flow (Target - Direct Writes)

Physical Hardware
    ↓
MMCore (in start_device_layer.py)
    ↓
Ophyd Device + RunEngine
    ↓
Write volume to disk: {storage}/volumes/{uid}.tif
    ↓
HTTP response: {uid, path, shape, dtype, session_id}  ← METADATA ONLY
    ↓
Copilot registers UID in DataStore index
    ↓
VizServer reads from disk by UID

Refactoring Tasks

Task 1: Consolidate Hardware Control Layer into Single Script

Create start_device_layer.py that:

  • Initializes MMCore directly
  • Creates single set of Ophyd devices
  • Creates single RunEngine
  • Starts HTTP server for plan execution
  • Starts SAM server (or integrates into same process)
  • Uses HTTP only for application communication
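A rough shape for the consolidated script, as a sketch only: the MMCore/ophyd/RunEngine calls below are stand-ins for the project's real initialization code, not its actual API.

```python
def init_mmcore():
    # Real code: pymmcore.CMMCore() + loading the system configuration.
    return object()

def build_devices(core):
    # One set of ophyd devices (stage with limits, camera) wrapping core.
    return {"stage": "xy_stage", "camera": "camera"}

def build_run_engine():
    # One RunEngine, shared by every HTTP plan-execution request.
    return object()

def main():
    core = init_mmcore()
    devices = build_devices(core)
    engine = build_run_engine()
    # start_http_server(engine, devices)   # plan execution endpoint
    # start_sam_server()                   # or run SAM in this process
    return devices

print(sorted(main()))   # ['camera', 'stage']
```

The key property is that MMCore, the devices, and the RunEngine are each created exactly once, in one process, so there is no ambiguity about which instance is canonical.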

Task 2: Direct Data Writing from Control Layer

  • Write volumes directly to disk in the control layer
  • Return UID + metadata instead of full numpy arrays
  • Flat storage with session in metadata:
    {storage_path}/volumes/{uid}.tif
    
  • Session association tracked in index/metadata, not file path
  • Return in HTTP response: {uid, path, shape, dtype, session_id}
  • Pass session_id from copilot to control layer when acquiring
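The control-layer side of this task could look roughly like the following self-contained sketch. Raw bytes and the stdlib stand in for tifffile and the real acquisition objects, and `acquire_and_store` is a hypothetical name:

```python
import json
import os
import tempfile
import uuid

def acquire_and_store(volume_bytes, shape, dtype, session_id, storage_path):
    # Persist the volume on the control-layer machine...
    uid = str(uuid.uuid4())
    path = os.path.join(storage_path, "volumes", f"{uid}.tif")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:   # real code: tifffile.imwrite(path, volume)
        f.write(volume_bytes)
    # ...and hand back metadata only; session lives in metadata, not the path.
    return {"uid": uid, "path": path, "shape": shape,
            "dtype": dtype, "session_id": session_id}

storage = tempfile.mkdtemp()
meta = acquire_and_store(b"\x00" * 16, [2, 2, 2], "uint16", "sess-001", storage)
print(os.path.exists(meta["path"]))   # True: pixels live on disk
print(len(json.dumps(meta)) < 1024)   # True: HTTP payload stays tiny
```

The HTTP response size is now constant regardless of volume size, which removes HOP 2 from the current data flow entirely.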

Task 3: Update DataStore Integration

  • Modify gently/core/data_store.py to index files written by control layer
  • Implement register_external_file(uid, path, metadata) method
  • Ensure lineage tracking still works with external writes
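`register_external_file` is the method name proposed above; here is a minimal sketch of what it might record. The index layout (including `parent_uid` for lineage) is an assumption, not the existing `data_store.py` schema:

```python
class DataStore:
    def __init__(self):
        self._index = {}

    def register_external_file(self, uid, path, metadata):
        # The control layer already wrote the file; only index it here.
        # Keeping parent_uid in the metadata preserves lineage tracking.
        self._index[uid] = {"path": path, **metadata}

    def lookup(self, uid):
        return self._index[uid]

store = DataStore()
store.register_external_file(
    "abc123", "/data/volumes/abc123.tif",
    {"session_id": "sess-001", "parent_uid": None},
)
print(store.lookup("abc123")["session_id"])   # sess-001
```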

Task 4: Update VizServer

  • Modify gently/visualization/server.py to serve data by UID
  • Watch for new files or receive push notifications from control layer
  • Ensure existing /api/images/{uid} endpoints work with new storage
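On the VizServer side, serving by UID reduces to resolving the UID through the index and streaming bytes from disk. This is a sketch; the actual handler wiring for the /api/images/{uid} endpoint is omitted:

```python
import os
import tempfile

def serve_image_bytes(uid, index):
    # Resolve UID -> path via the DataStore index, then read from disk
    # instead of pulling pixels out of an in-memory store.
    path = index[uid]["path"]
    with open(path, "rb") as f:
        return f.read()

# Simulate a file the control layer wrote earlier.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "abc123.tif")
with open(path, "wb") as f:
    f.write(b"pixels")

index = {"abc123": {"path": path}}
print(serve_image_bytes("abc123", index))   # b'pixels'
```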

Task 5: Update Copilot/Orchestrator

  • Modify gently/agent/queue_server_client.py to work with UID-based responses
  • Load data from disk when needed instead of receiving through HTTP
  • Update ImageManager to handle new data flow
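ImageManager then becomes a lazy, UID-keyed loader: it fetches from disk on first access and caches, instead of holding data pushed through HTTP. A hypothetical shape, not the current class:

```python
import os
import tempfile

class ImageManager:
    def __init__(self, index):
        self._index = index    # uid -> {"path": ...} from the DataStore
        self._cache = {}

    def get(self, uid):
        # Load from disk on first access only; later calls hit the cache.
        if uid not in self._cache:
            with open(self._index[uid]["path"], "rb") as f:
                self._cache[uid] = f.read()
        return self._cache[uid]

tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "vol.tif")
with open(path, "wb") as f:
    f.write(b"pixels")

manager = ImageManager({"uid-1": {"path": path}})
print(manager.get("uid-1"))   # b'pixels' (read from disk)
print(manager.get("uid-1"))   # b'pixels' (served from cache)
```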

Task 6: Clean Up Unused Code

  • Remove or document queue_server_startup.py if not used
  • Clarify which server pattern is canonical
  • Remove start_services.bat (no longer needed)

Files to Modify/Create

File                                  Change
start_device_layer.py (NEW)           Single script for device layer
start_server.py                       Remove or merge
backend/simple_server.py              Merge, add direct disk writes
backend/sam_server.py                 Merge or keep separate
start_services.bat                    Remove
gently/core/data_store.py             Add register_external_file()
gently/visualization/server.py        Update to serve from new storage
gently/agent/queue_server_client.py   Handle UID-based responses
gently/agent/image_manager.py         Load from disk by UID
launch_copilot.py                     Update connection to unified device layer

Labels

architecture (System architecture), hardware (Hardware/device related), refactor (Code refactoring)
