For large jobs with little data, most of the time is spent in the two MPI_Allreduce calls for the CLE arrays during iterations.
Checking the MPI vendor's manual pages for ways to increase the block size of the reduce operation could help.
MPI_Iallreduce, the nonblocking variant, could also help by letting the reductions overlap with each other or with other work.
- Related: on some systems the smart Alltoall sparse is not very useful. We should provide a way to disable it.
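
One possible way to provide such a switch is an environment variable read at startup. The sketch below is only an illustration: the variable name APP_DISABLE_SPARSE_ALLTOALL and the function name are made up, and a command-line option or config entry would work just as well.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical switch: returns 1 if the sparse Alltoall should be used,
 * 0 if the user disabled it. The variable name is an illustrative
 * assumption, not an existing option. */
int sparse_alltoall_enabled(void)
{
    const char *v = getenv("APP_DISABLE_SPARSE_ALLTOALL");

    /* Unset, empty, or "0" leaves the optimization on. */
    if (v == NULL || v[0] == '\0' || strcmp(v, "0") == 0)
        return 1;

    /* Any other value turns it off. */
    return 0;
}
```

The caller would check this once and fall back to the regular exchange when it returns 0, for example by running with `APP_DISABLE_SPARSE_ALLTOALL=1` on systems where the sparse path does not pay off.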