
Add PhysicsNeMo domain-parallel mesh sharding support #1

Draft
rmclaren wants to merge 1 commit into main from codex/investigate-and-implement-physics-nemo-integration

Conversation

rmclaren (Owner) commented Feb 7, 2026

Motivation

  • Enable domain (mesh) parallelism so the global icosahedral mesh can be split across GPUs using PhysicsNeMo ShardTensor primitives, reducing per-GPU memory and enabling larger models and datasets.
  • Integrate mesh-parallel sharding with the existing data-parallel bin sampling and multi-node DDP training so existing sampling/resampling flows continue to work.

Description

  • Add a new utility module gnn_model/domain_parallel.py that provides MeshShard, build_mesh_shard, filter_mesh_edges, filter_bipartite_edges, shard_tensor_dim0, init_domain_parallel_context, and maybe_build_shardtensor to manage simple mesh splits and optional PhysicsNeMo ShardTensor construction.
  • Wire domain-parallel options into the data module by adding domain_parallel flags to GNNDataModule and applying mesh sharding in setup() and graph construction (_create_graph_structure) with filter_mesh_edges and filter_bipartite_edges.
  • Add distributed-aware sampling by using torch.utils.data.distributed.DistributedSampler in train_dataloader() and val_dataloader() so bins are split cleanly across ranks when DDP is active.
  • Wire domain-parallel CLI/config options into train_gnn.py (--domain_parallel, --domain_parallel_mesh_shape, --domain_parallel_mesh_dim_names) and pass them into GNNLightning and GNNDataModule. Add an on_fit_start hook in GNNLightning that builds the mesh shard and optionally converts the sharded local mesh tensors into PhysicsNeMo ShardTensors.
  • Document usage and installation notes in gnn_model/README.md, including the new flags and optional pip install physicsnemo step.
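The PR body names shard_tensor_dim0 and build_mesh_shard but does not include their source. As a minimal sketch of the underlying idea, splitting mesh nodes along dimension 0 reduces to computing each rank's half-open row range; the function name and balancing policy below are assumptions, not the actual implementation in gnn_model/domain_parallel.py:

```python
def shard_bounds_dim0(num_rows: int, rank: int, world_size: int) -> tuple[int, int]:
    """Half-open [start, stop) row range owned by `rank` for a dim-0 split.

    Remainder rows go to the lowest-numbered ranks, so per-rank sizes
    differ by at most one row (illustrative sketch only).
    """
    base, rem = divmod(num_rows, world_size)
    start = rank * base + min(rank, rem)
    stop = start + base + (1 if rank < rem else 0)
    return start, stop
```

A rank would then take `mesh_nodes[start:stop]` as its local shard before any ShardTensor construction.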
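Once nodes are partitioned, graph construction must drop edges that cross shard boundaries and remap the survivors to local indices. A hedged sketch of what filter_mesh_edges might do (the real signature in gnn_model/domain_parallel.py may differ, e.g. it may operate on tensors rather than Python lists, or handle halo edges):

```python
def filter_mesh_edges(edges, local_nodes):
    """Keep only edges whose endpoints are both locally owned, remapping
    global node ids to local indices. Illustrative sketch only."""
    g2l = {g: i for i, g in enumerate(local_nodes)}  # global -> local id map
    return [(g2l[s], g2l[d]) for s, d in edges if s in g2l and d in g2l]
```

The bipartite variant mentioned in the PR would presumably apply the same remapping with separate ownership maps for the source and destination node sets.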
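The dataloader change uses torch.utils.data.distributed.DistributedSampler directly. To show how the bins end up split across ranks, here is a pure-Python mirror of that sampler's default behavior (shuffle=False, drop_last=False): the index list is padded by wrapping so every rank draws the same count, then each rank takes a strided slice. This is a sketch of the partitioning scheme, not code from the PR:

```python
def ddp_partition(num_bins: int, rank: int, world_size: int) -> list[int]:
    """Mirror DistributedSampler's default split: pad the index list by
    wrapping around, then give each rank every world_size-th index."""
    indices = list(range(num_bins))
    pad = (-num_bins) % world_size   # wrap-around padding so all ranks match
    indices += indices[:pad]
    return indices[rank::world_size]
```

Note the wrap-around padding means some bins repeat within an epoch when num_bins is not divisible by world_size, which is usually acceptable for training but worth remembering for validation metrics.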

Testing

  • No automated tests or CI runs were executed as part of this change.

Codex Task

