test with 8b on 4 nodes - models with tied word embeddings might fail…

ea97b77
Open

Add megatron_ray_fault_tolerant example with comprehensive fault tolerance implementation #19

Workflow runs completed with no jobs