-
Notifications
You must be signed in to change notification settings - Fork 14
Open
Description
@RaiqaRasool presents a problem scenario that is easy to create and hard to debug:
- Launch JANA2 with a large number of threads and no timeout set. A small number of these threads hang when they all attempt to connect to an external resource.
- The events corresponding to non-hanging are all processed just fine. Because no timeout has been set, processing will continue with slightly fewer threads and one event dropped for each hanging thread. Eventually we'd notice when the event source finished and the topology fails to shut down. In this case, however, we are doing online monitoring, so we won't notice this.
- Eventually a barrier event is emitted into the stream. The barrier event cannot be emitted until the hanging events are marked as finished.
- The pending barrier cannot be emitted until the hanging events complete, so the entire topology idles, all output stops, and the event rate slowly drops to zero. This won't happen until we are tens of thousands of events in, and on a production environment. Worse, this can't be reproduced locally.
How to debug this?
- Set a timeout. The hanging events are the root cause. However,
- Send Ctrl-Z (or
bin/jana-status, orkill -s USR2 $JANA_PID) to obtain stack traces for each worker, which includes the worker's current arrow and current event number. Eventually notice that worker 22 is processing event 12345, but worker 23 is processing event 7.
What would I like?
- We have this whole JInspector interface, which is great except for the fact that it won't run until the topology is fully paused. Of course, most of the time we want to inspect, it is because something is misbehaving and hence the topology can't pause.
- JExecutionEngine doesn't expose the
ArrowStateandWorkerState, even though this would be easy and would clear the issue right up. - JExecutionEngine doesn't have a concept of pending events. This one is also pretty easy.
Consider if we made the following info available from the Ctrl-Z handler or JInspector:
| Arrow | Worker | Event | Time [ms] |
|---|---|---|---|
| PhysicsEventSource | - | (PhysicsEvent 222) | - |
| PhysicsEventMap | 2 | PhysicsEvent 19 | 10000 |
| 3 | PhysicsEvent 40 | 8042 | |
| PhysicsEventUnfolder | 1 | PhysicsEvent 50 | 10 |
| (Subevent 9) | - | ||
| SubeventMap | 4 | Subevent 7 | 19 |
| SubeventMap | 5 | Subevent 5 | 501 |
| SubeventTap | - | - | - |
Metadata
Metadata
Assignees
Labels
No labels