@mlund mlund commented Jan 26, 2026

Summary

Fixes MPI deadlock when both moltransrot and gibbs_matter moves are enabled in Gibbs ensemble simulations.

Root cause: RNG desynchronization between MPI processes. std::uniform_int_distribution can make a variable number of engine calls depending on the requested range (rejection sampling). Because the two cells have different molecule counts, the mpi.random states diverged between processes, which led to different move selections and an eventual deadlock.
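
To illustrate the root cause, here is a minimal sketch that counts engine calls for two different ranges. `CountingEngine3Bit` and `engineCallsFor` are hypothetical names, not part of Faunus, and the engine is deliberately limited to 3-bit output so that rejections are frequent and easy to observe. The exact number of calls is implementation-defined, so no specific counts are claimed.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Toy engine (hypothetical, for illustration only): yields values in [0, 7]
// and counts how often it is invoked.
struct CountingEngine3Bit {
    using result_type = std::uint32_t;
    static constexpr result_type min() { return 0; }
    static constexpr result_type max() { return 7; }

    explicit CountingEngine3Bit(std::uint32_t seed) : source(seed) {}

    result_type operator()() {
        ++calls;
        return static_cast<result_type>(source() & 7u); // keep the low 3 bits
    }

    std::mt19937 source;
    long calls = 0;
};

// Draw `draws` indices from [0, count - 1] and report how many engine
// calls the distribution needed in total.
long engineCallsFor(int count, int draws, std::uint32_t seed = 42) {
    assert(count > 0 && draws >= 0);
    CountingEngine3Bit engine(seed);
    std::uniform_int_distribution<int> pick(0, count - 1);
    for (int i = 0; i < draws; ++i) {
        (void)pick(engine); // the value is irrelevant here, only the call count
    }
    return engine.calls;
}
```

Comparing `engineCallsFor(5, 1000)` with `engineCallsFor(15, 1000)` should show that the same number of draws advances the engine by different amounts, which is the mechanism by which two processes with 5 and 15 molecules would drift apart.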

Fix:

  • Use Faunus::random (local) for molecule selection instead of mpi.random
  • Add MPI exchange to synchronize early returns when one process fails

Test plan

  • Verified fix with test case that previously deadlocked
  • Multiple successful runs confirm stability

The Gibbs ensemble simulation would deadlock when both moltransrot and
gibbs_matter moves were enabled. Root cause was RNG desynchronization
between MPI processes.

Problem:
- MoveCollection::sample() uses mpi.random to select moves (must stay in sync)
- GibbsMatterMove::_move() was using mpi.random for molecule selection
- The two cells have different molecule counts (e.g., 5 vs 15)
- std::uniform_int_distribution uses rejection sampling, which makes
  variable numbers of engine calls for different ranges
- This caused mpi.random states to diverge between processes
- Subsequent sample() calls returned different moves on each process
- When one selected gibbs_matter (with MPI calls in bias()) and the
  other selected moltransrot, the first blocked waiting for an MPI
  call that its partner never made
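
The chain of events above can be reproduced in a small single-process sketch. This is not Faunus code: `ToyEngine3Bit` and `moveSequenceSharedEngine` are hypothetical, the engine is narrowed to 3 bits so that the effect appears quickly, and the move indices 0 and 1 are arbitrary stand-ins for moltransrot and gibbs_matter.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Toy engine (hypothetical): 3-bit output makes rejection sampling frequent.
struct ToyEngine3Bit {
    using result_type = std::uint32_t;
    static constexpr result_type min() { return 0; }
    static constexpr result_type max() { return 7; }

    explicit ToyEngine3Bit(std::uint32_t seed) : source(seed) {}
    result_type operator()() { return static_cast<result_type>(source() & 7u); }

    std::mt19937 source;
};

// Problematic pattern: a single engine serves both move selection (which must
// match on every process) and molecule selection (whose range depends on the
// local molecule count). Returns the sequence of selected moves.
std::vector<int> moveSequenceSharedEngine(int moleculeCount, std::size_t steps,
                                          std::uint32_t seed = 1) {
    ToyEngine3Bit sharedEngine(seed); // same seed on every simulated process
    std::uniform_int_distribution<int> pickMove(0, 1); // 0: moltransrot, 1: gibbs_matter
    std::uniform_int_distribution<int> pickMolecule(0, moleculeCount - 1);

    std::vector<int> moves;
    moves.reserve(steps);
    for (std::size_t i = 0; i < steps; ++i) {
        const int move = pickMove(sharedEngine);
        moves.push_back(move);
        if (move == 1) {
            // Range differs between cells, so the engine may advance by a
            // different amount on each process from here on.
            (void)pickMolecule(sharedEngine);
        }
    }
    return moves;
}
```

Calling the function with molecule counts 5 and 15 and the same seed is expected to yield different move sequences, while two calls with the same count stay identical. In an MPI run, the first step at which the sequences differ is where one process would enter gibbs_matter alone.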

Fix:
1. Use Faunus::random (local, non-MPI) for molecule selection instead
   of mpi.random. This keeps mpi.random synchronized for move selection.

2. Add MPI exchange to synchronize early returns. When one process fails
   (no molecules found or cell full) while the other succeeds, both must
   skip the move to avoid one blocking in bias() waiting for MPI calls.
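
Both parts of the fix can be sketched in a single-process simulation of the two cells. Everything here is an assumption made for illustration: `Cell`, `Outcome`, and `simulateFixed` are hypothetical names, the transfer direction is drawn from the synchronized engine, every agreed move is accepted, and the logical AND of the two success flags stands in for the MPI exchange, whose actual form is not shown in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

struct Cell {
    int molecules;            // current molecule count
    int capacity;             // maximum molecule count
    std::mt19937 localRandom; // per-process engine, free to diverge
};

struct Outcome {
    std::vector<int> movesA, movesB;    // selected moves per process
    int biasCallsA = 0, biasCallsB = 0; // how often each process reached bias()
    int finalA = 0, finalB = 0;         // final molecule counts
};

Outcome simulateFixed(int moleculesA, int moleculesB, int capacity, int steps) {
    // Two copies of the synchronized engine, one per simulated process.
    std::mt19937 syncA(7u), syncB(7u);
    Cell a{moleculesA, capacity, std::mt19937(100u)};
    Cell b{moleculesB, capacity, std::mt19937(200u)};
    std::uniform_int_distribution<int> coin(0, 1);

    Outcome out;
    for (int i = 0; i < steps; ++i) {
        const int moveA = coin(syncA); // 0: moltransrot, 1: gibbs_matter
        const int moveB = coin(syncB);
        out.movesA.push_back(moveA);
        out.movesB.push_back(moveB);
        assert(moveA == moveB); // holds: both engines saw identical calls
        if (moveA != 1) continue;

        // Direction drawn from the synchronized engine with identical arguments.
        const bool aDonates = coin(syncA) == 0;
        (void)coin(syncB);
        Cell& donor = aDonates ? a : b;
        Cell& acceptor = aDonates ? b : a;

        // Fix 1: molecule selection uses the local engine only.
        const bool donorOk = donor.molecules > 0;
        if (donorOk) {
            std::uniform_int_distribution<int> pick(0, donor.molecules - 1);
            (void)pick(donor.localRandom);
        }
        const bool acceptorOk = acceptor.molecules < acceptor.capacity;

        // Fix 2: stand-in for the MPI exchange of success flags.
        const bool bothOk = donorOk && acceptorOk;
        if (!bothOk) continue; // both processes skip the move together

        ++out.biasCallsA; // both processes reach bias() together
        ++out.biasCallsB;
        --donor.molecules;
        ++acceptor.molecules;
    }
    out.finalA = a.molecules;
    out.finalB = b.molecules;
    return out;
}
```

For example, `simulateFixed(0, 20, 20, 500)` starts with one empty and one full cell, so the early-return path is exercised immediately. The synchronized engines only ever see identical calls, which is why the move sequences match regardless of how the local engines advance.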
@mlund mlund added the fix 🔧 Fix broken functionality label Jan 26, 2026