Skip to content

Inspecting a hung processing topology #445

@nathanwbrei

Description

@nathanwbrei

@RaiqaRasool presents a problem scenario that is easy to create and hard to debug:

  1. Launch JANA2 with a large number of threads and no timeout set. A small number of these threads hang when they all attempt to connect to an external resource.
  2. The events corresponding to non-hanging are all processed just fine. Because no timeout has been set, processing will continue with slightly fewer threads and one event dropped for each hanging thread. Eventually we'd notice when the event source finished and the topology fails to shut down. In this case, however, we are doing online monitoring, so we won't notice this.
  3. Eventually a barrier event is emitted into the stream. The barrier event cannot be emitted until the hanging events are marked as finished.
  4. The pending barrier cannot be emitted until the hanging events complete, so the entire topology idles, all output stops, and the event rate slowly drops to zero. This won't happen until we are tens of thousands of events in, and on a production environment. Worse, this can't be reproduced locally.

How to debug this?

  • Set a timeout. The hanging events are the root cause. However,
  • Send Ctrl-Z (or bin/jana-status, or kill -s USR2 $JANA_PID) to obtain stack traces for each worker, which includes the worker's current arrow and current event number. Eventually notice that worker 22 is processing event 12345, but worker 23 is processing event 7.

What would I like?

  • We have this whole JInspector interface, which is great except for the fact that it won't run until the topology is fully paused. Of course, most of the time we want to inspect, it is because something is misbehaving and hence the topology can't pause.
  • JExecutionEngine doesn't expose the ArrowState and WorkerState, even though this would be easy and would clear the issue right up.
  • JExecutionEngine doesn't have a concept of pending events. This one is also pretty easy.

Consider if we made the following info available from the Ctrl-Z handler or JInspector:

Arrow Worker Event Time [ms]
PhysicsEventSource - (PhysicsEvent 222) -
PhysicsEventMap 2 PhysicsEvent 19 10000
3 PhysicsEvent 40 8042
PhysicsEventUnfolder 1 PhysicsEvent 50 10
(Subevent 9) -
SubeventMap 4 Subevent 7 19
SubeventMap 5 Subevent 5 501
SubeventTap - - -

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions