Description
The Ray dashboard service runs on the Ray head node, so it goes away when the Ray cluster is deleted. This makes the dashboard of little use when you are running ephemeral or short-lived Ray clusters. Long-running Ray clusters have reliability issues of their own, so they are not a good alternative. As a result, Ray clusters give an impression of limited transparency and observability.
If we can move the dashboard service outside the Ray cluster, we can have a global dashboard that shows the jobs across all Ray clusters, even after a cluster is deleted. It needs to be backed by a dedicated data store that receives data from each Ray cluster's dashboard service while the cluster is still running. This would preserve the history of all the rich data we see on the Ray dashboards, independent of how long the Ray cluster lives.
In a nutshell, we need a way to export the Ray dashboard data to an external data store while the cluster is running, and we need an external dashboard service that shows the exported data. The external dashboard can have the same GUI as the Ray dashboard service, to keep it consistent.
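To make the proposal concrete, here is a minimal sketch of the export side in Python. It is only an illustration under stated assumptions, not a proposed design: it assumes the running cluster's dashboard exposes its job list as JSON at `/api/jobs/`, that each job record carries a `submission_id` or `job_id` field, and it uses SQLite as a stand-in for the dedicated data store. The names `fetch_jobs`, `init_store`, `export_once`, `list_jobs`, and the `job_snapshots` table are hypothetical.

```python
import json
import sqlite3
import time
import urllib.request


def fetch_jobs(dashboard_url):
    """Fetch job records from a running Ray dashboard.

    Assumption: the dashboard serves the job list as JSON at /api/jobs/.
    """
    with urllib.request.urlopen(f"{dashboard_url}/api/jobs/", timeout=10) as resp:
        return json.load(resp)


def init_store(conn):
    """Create the (hypothetical) snapshot table in the external store."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS job_snapshots (
               cluster_id TEXT NOT NULL,
               job_key    TEXT NOT NULL,
               payload    TEXT NOT NULL,
               updated_at REAL NOT NULL,
               PRIMARY KEY (cluster_id, job_key)
           )"""
    )
    conn.commit()


def export_once(conn, cluster_id, jobs):
    """Upsert the latest snapshot of each job; return the number written."""
    written = 0
    for job in jobs:
        # Assumption: a job is identified by submission_id or job_id.
        key = job.get("submission_id") or job.get("job_id")
        if key is None:
            continue  # skip records we cannot identify
        conn.execute(
            "INSERT OR REPLACE INTO job_snapshots VALUES (?, ?, ?, ?)",
            (cluster_id, str(key), json.dumps(job), time.time()),
        )
        written += 1
    conn.commit()
    return written


def list_jobs(conn):
    """Read side: what an external dashboard would query, across all clusters."""
    rows = conn.execute(
        "SELECT cluster_id, payload FROM job_snapshots "
        "ORDER BY cluster_id, job_key"
    )
    return [(cluster_id, json.loads(payload)) for cluster_id, payload in rows]
```

An exporter would call `export_once(conn, cluster_id, fetch_jobs(dashboard_url))` periodically while the cluster is alive. Because the rows live outside the cluster, `list_jobs` keeps returning the last known state of every job after the head node is gone.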