Skip to content

Conversation

@rtuck99
Copy link
Contributor

@rtuck99 rtuck99 commented Jan 6, 2026

This change addresses the requirements of Hyperion as discussed in GH issue

DeviceManagerSource now accepts ensure_connected boolean option. When enabled any failure to build or connect a device will result in an exception. The default value for ensure_connected is False, which retains existing behaviour.

Launching blueapi in the event that devices cannot be connected will result in blueapi running, however the environment will not be initialised. Probing the environment via the REST API using GET /environment will result in a response similar to the following:

{
  "environment_id": "0631a4de-2f36-487e-9b3c-cf8edbcc2482",
  "initialized": false,
  "error_message": "ExceptionGroup: Errors occurred while connecting the following devices: fastcs_eiger, pin_tip_detection, panda (3 sub-exceptions)"
}

@rtuck99 rtuck99 requested a review from a team as a code owner January 6, 2026 11:51
@codecov
Copy link

codecov bot commented Jan 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.77%. Comparing base (038fe03) to head (65fefad).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1324   +/-   ##
=======================================
  Coverage   94.76%   94.77%           
=======================================
  Files          42       42           
  Lines        2752     2755    +3     
=======================================
+ Hits         2608     2611    +3     
  Misses        144      144           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@tpoliaw tpoliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok and works in my limited testing. Might be nitpicking but I'm not sure ensure_connected really describes what it's doing given it doesn't do anything more to ensure the devices are connected. Maybe required or check_connected? Happy to ignore it and move on though if others are ok with it.

Copy link
Contributor

@abbiemery abbiemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea, but there are some things I think I need clarification on and other which are potentially worth discussing.

For some clarification: So with this change (and =true) blueapi still runs, there is an error code on the /environment endpoint, and the environment is now in unusable state. The initial issue said it shouldn't start at all. Is this endpoint being queried as a check on whether to proceed instead?

I played with this change with ensure_connected as true with one of my devices down, and brought it up and down in various states.

With a device down:

If you have ensure_connected as false :

  • you get one set of logs showing you all the pv's that didn't connect
  • if from the client you do: blueapi -c config.yaml controller devices only the devices which built and connected are returned.
  • you can still query for plans
  • if you try and run a plan with the device that didn't connect (and therefore isn't in the context) you get the error
Error: Incorrect parameters supplied
    Invalid value 'det' for field detectors.0: Value error, Device det is not of type bluesky.protocols.Readable

and the plan doesn't run. Error message should be improved but hey ho.

If you have ensure_connected as true :

  • On start in addition to the previous pv connection error logs, there are now duplicated NotConnectedErrors which is noisy.
  • if from the client you do: blueapi -c config.yaml controller devices you now get a 500 error and a reasonably obscure stack trace.
  • You get the same if you query for plans.

So blueapi is "up", but in a broken state which is not obviously different from the previous behaviour if you are just looking at the server logs.

Would the following be more helpful to you?:

Have the blueapi context instead add all devices that have built, even if they don't connect, but have some distinction for those that connected.

For the false case as if you query the devices you get a more useful list of all potentially available devices, as well as the ones which are available. We could also add a connect <device> command to the client so if you started blueapi with motor1 disconnected, then connected it, you could ensure blueapi has caught up properly before trying to run a scan with it.

In the true case the same occurs but if there are any connection errors the /environment endpoint says "error you can't scan with me, i don't have all my devices". Maybe if it receives a run scan command in that state it returns a "no i cant until you connect my devices". If there is a "connect" command as above it could also potentially avoid the need to reload the whole environment to check things have connected.

There are undoubtedly requirements and ramifications I have not considered here so some weigh in would be appreciated: @tpoliaw @rtuck99

@rtuck99
Copy link
Contributor Author

rtuck99 commented Jan 7, 2026

and the plan doesn't run. Error message should be improved but hey ho.

It may be that this behaviour is acceptable at least for the current hyperion implementation, given that at the moment it is mostly just one big plan which injects all devices. Although there is I suppose the potential for it to successfully get through the udc_default_state plan with the queue empty, then some time later when something enters the queue, the actual collection fails due to a missing device, which would probably be annoying.

The way I envisage this being used is that blueapi would start but it could e.g. fail a k8s readiness check or hyperion-supervisor could check the environment endpoint and determine this explicitly. What we don't want is for someone to launch hyperion, think it's working and then clock off for the day, only to find out later something is amiss.

I don't think we'd need any facility to reconnect devices, since we would probably just restart the container, since it's likely that either there is a hw problem or at least some device needs to be dummied out.

For the false case I think current behaviour is fine, at least as far as UDC is concerned, we would always run with it set to true, check the endpoint and then not run any plans at all if there is a problem.

Point accepted about the log noise, I will remove the ExceptionGroup and downgrade it to a normal exception if that helps.

DeviceManagerSource now accepts ensure_connected boolean option.
When enabled any failure to build or connect a device will result in
an exception.
Rename ensure_connected to check_connected
@rtuck99 rtuck99 force-pushed the 1236_option_to_enforce_all_devices_connected branch from 756c108 to 65fefad Compare January 9, 2026 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants