Hello @lee218llnl,
I have been a happy user of STAT for a long time and have used it for deadlock detection at scale on many systems.
I am now working with the AI/ML stack and wonder whether STAT would also be useful for distributed training workloads.
In particular, I am curious about frameworks like PyTorch, where NCCL is the default backend. Do you have any experience or suggestions on this?
Also, since I don't see many updates on the STAT repo, I was wondering whether there are other ongoing efforts or alternative tools being developed at LLNL (or elsewhere).