Add parallelism_auto flag to automatically set dp, tp and micro batch size #516
Open
fn get_gpu_type() -> String {
    // Try nvidia-smi first
    let raw = Command::new("nvidia-smi")
Contributor
small nit: maybe we can rename this variable to something better?
arilotter reviewed Jan 27, 2026
Comment on lines 21 to 57
fn get_gpu_type() -> String {
    // Try nvidia-smi first
    let raw_gpu_name = Command::new("nvidia-smi")
        .args(["--query-gpu=name", "--format=csv,noheader"])
        .output()
        .ok()
        .and_then(|o| String::from_utf8(o.stdout).ok())
        .and_then(|s| s.lines().next().map(|l| l.trim().to_string()))
        .filter(|s| !s.is_empty())
        // Fallback: read from /proc/driver/nvidia (works in containers without nvidia-smi)
        .or_else(|| {
            std::fs::read_dir("/proc/driver/nvidia/gpus")
                .ok()?
                .filter_map(|e| e.ok())
                .next()
                .and_then(|entry| {
                    let info_path = entry.path().join("information");
                    std::fs::read_to_string(info_path).ok()
                })
                .and_then(|content| {
                    content
                        .lines()
                        .find(|line| line.starts_with("Model:"))
                        .map(|line| line.trim_start_matches("Model:").trim().to_string())
                })
        })
        .unwrap_or_default();

    // Normalize GPU name to match table keys
    if raw_gpu_name.to_uppercase().contains("H200") {
        "H200".to_string()
    } else if raw_gpu_name.to_uppercase().contains("H100") {
        "H100".to_string()
    } else {
        raw_gpu_name
    }
}
Collaborator
have you considered using the nvml_wrapper crate instead of shelling out / reading /proc/fs stuff? we can grab gpu count from there too 🤷 and assert that they're all the same GPU for sanity checking :D
we use this in some of the metrics stuff already.
it would be something like:
use nvml_wrapper::Nvml;

#[derive(Debug)]
struct GpuInfo {
    name: String,
    device_count: u32,
}

fn get_gpu_info() -> anyhow::Result<GpuInfo> {
    let nvml = Nvml::init()?;
    let device_count = nvml.device_count()?;
    if device_count == 0 {
        anyhow::bail!("No GPUs found!");
    }
    let mut gpu_names = Vec::new();
    for i in 0..device_count {
        let device = nvml.device_by_index(i)?;
        gpu_names.push(device.name()?);
    }
    let first_name = &gpu_names[0];
    if !gpu_names.iter().all(|name| name == first_name) {
        anyhow::bail!(
            "All GPUs must be of the same type, but we have mismatching names: {:?}",
            gpu_names
        );
    }
    Ok(GpuInfo {
        name: gpu_names.pop().unwrap(),
        device_count,
    })
}
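For illustration only (not part of the reviewer's suggestion): one way this helper could slot in for the original get_gpu_type, reusing the H200/H100 normalization from the diff above.

// Hypothetical call site: map the NVML-reported name onto the table keys,
// mirroring the normalization in the original get_gpu_type.
fn get_gpu_type() -> anyhow::Result<String> {
    let info = get_gpu_info()?;
    let upper = info.name.to_uppercase();
    Ok(if upper.contains("H200") {
        "H200".to_string()
    } else if upper.contains("H100") {
        "H100".to_string()
    } else {
        info.name
    })
}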
Contributor
Author
Thanks Ari!
I added it and it works.
Contributor
Author
I just changed the let device_count = nvml.device_count()?; part because it was not taking into account the CUDA_VISIBLE_DEVICES env var.
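A minimal sketch of one way to account for CUDA_VISIBLE_DEVICES when counting devices; the actual change in the PR may differ, and visible_device_count is a hypothetical helper name.

use nvml_wrapper::Nvml;

// Hypothetical helper: prefer the number of entries in CUDA_VISIBLE_DEVICES
// (if set and non-empty), falling back to NVML's physical device count.
fn visible_device_count(nvml: &Nvml) -> anyhow::Result<u32> {
    if let Ok(visible) = std::env::var("CUDA_VISIBLE_DEVICES") {
        let n = visible.split(',').filter(|s| !s.trim().is_empty()).count() as u32;
        if n > 0 {
            return Ok(n);
        }
    }
    Ok(nvml.device_count()?)
}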
Adds a --parallelism-auto flag to automatically detect optimal dp, tp, and micro_batch_size based on the model and GPU hardware.
We considered sharing the parallelism config via P2P but it added unnecessary complexity—dp/tp/micro_batch_size are needed before model download begins, requiring a separate request/wait flow. Since the file is ~200 bytes, fetching from HF is fast and keeps the code simple.
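For a sense of how the pieces fit together, here is a rough sketch of the lookup step; ParallelismConfig, lookup_parallelism, the model key, and all numbers are placeholders, not the PR's actual API or table.

// Hypothetical shape of the auto-detected settings; the real file fetched
// from HF (~200 bytes) may use different names and fields.
#[derive(Debug)]
struct ParallelismConfig {
    dp: u32,
    tp: u32,
    micro_batch_size: u32,
}

// Hypothetical lookup keyed on (model, normalized GPU type, GPU count);
// the values below are placeholders, not entries from the PR's table.
fn lookup_parallelism(model: &str, gpu_type: &str, gpu_count: u32) -> Option<ParallelismConfig> {
    match (model, gpu_type, gpu_count) {
        ("example-model-70b", "H100", 8) => Some(ParallelismConfig { dp: 2, tp: 4, micro_batch_size: 1 }),
        ("example-model-70b", "H200", 8) => Some(ParallelismConfig { dp: 4, tp: 2, micro_batch_size: 2 }),
        _ => None,
    }
}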
Note:
I think we can merge this PR now; we should only take into account that if the run is private, it is possible that some trainers don't have access to HuggingFace, so I don't recommend using this flag for private runs for now.
Once #455 is merged, all trainers must have access to HF/GCP, so it won't be a problem.