Add parallelism_auto flag to automatically set dp, tp and micro batch size #516
Open
fn get_gpu_type() -> String {
    // Try nvidia-smi first
    let raw = Command::new("nvidia-smi")
Contributor
small nit: maybe we can rename this variable to something better?
arilotter reviewed Jan 27, 2026
Comment on lines 21 to 57
fn get_gpu_type() -> String {
    // Try nvidia-smi first
    let raw_gpu_name = Command::new("nvidia-smi")
        .args(["--query-gpu=name", "--format=csv,noheader"])
        .output()
        .ok()
        .and_then(|o| String::from_utf8(o.stdout).ok())
        .and_then(|s| s.lines().next().map(|l| l.trim().to_string()))
        .filter(|s| !s.is_empty())
        // Fallback: read from /proc/driver/nvidia (works in containers without nvidia-smi)
        .or_else(|| {
            std::fs::read_dir("/proc/driver/nvidia/gpus")
                .ok()?
                .filter_map(|e| e.ok())
                .next()
                .and_then(|entry| {
                    let info_path = entry.path().join("information");
                    std::fs::read_to_string(info_path).ok()
                })
                .and_then(|content| {
                    content
                        .lines()
                        .find(|line| line.starts_with("Model:"))
                        .map(|line| line.trim_start_matches("Model:").trim().to_string())
                })
        })
        .unwrap_or_default();

    // Normalize GPU name to match table keys
    if raw_gpu_name.to_uppercase().contains("H200") {
        "H200".to_string()
    } else if raw_gpu_name.to_uppercase().contains("H100") {
        "H100".to_string()
    } else {
        raw_gpu_name
    }
}
Collaborator
have you considered using the nvml_wrapper crate instead of shelling out / reading /proc/fs stuff? we can grab gpu count from there too 🤷 and assert that they're all the same GPU for sanity checking :D
we use this in some of the metrics stuff already.
it would be something like:
use nvml_wrapper::Nvml;

#[derive(Debug)]
struct GpuInfo {
    name: String,
    device_count: u32,
}

fn get_gpu_info() -> anyhow::Result<GpuInfo> {
    let nvml = Nvml::init()?;
    let device_count = nvml.device_count()?;
    if device_count == 0 {
        anyhow::bail!("No GPUs found!");
    }
    let mut gpu_names = Vec::new();
    for i in 0..device_count {
        let device = nvml.device_by_index(i)?;
        gpu_names.push(device.name()?);
    }
    let first_name = &gpu_names[0];
    if !gpu_names.iter().all(|name| name == first_name) {
        anyhow::bail!(
            "All GPUs must be of the same type, but we have mismatching names: {:?}",
            gpu_names
        );
    }
    Ok(GpuInfo {
        name: gpu_names.pop().unwrap(),
        device_count,
    })
}
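For illustration only (not part of the reviewer's suggestion): one way this helper could slot in for the original get_gpu_type, reusing the H200/H100 normalization from the diff above.

// Hypothetical call site: map the NVML-reported name onto the table keys,
// mirroring the normalization in the original get_gpu_type.
fn get_gpu_type() -> anyhow::Result<String> {
    let info = get_gpu_info()?;
    let upper = info.name.to_uppercase();
    Ok(if upper.contains("H200") {
        "H200".to_string()
    } else if upper.contains("H100") {
        "H100".to_string()
    } else {
        info.name
    })
}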
Contributor
Author
Thanks Ari!
I added it and it works.
Contributor
Author
I just changed the let device_count = nvml.device_count()?; part because it was not taking into account the CUDA_VISIBLE_DEVICES env var.
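A minimal sketch of one way to account for CUDA_VISIBLE_DEVICES when counting devices; the actual change in the PR may differ, and visible_device_count is a hypothetical helper name.

use nvml_wrapper::Nvml;

// Hypothetical helper: prefer the number of entries in CUDA_VISIBLE_DEVICES
// (if set and non-empty), falling back to NVML's physical device count.
fn visible_device_count(nvml: &Nvml) -> anyhow::Result<u32> {
    if let Ok(visible) = std::env::var("CUDA_VISIBLE_DEVICES") {
        let n = visible.split(',').filter(|s| !s.trim().is_empty()).count() as u32;
        if n > 0 {
            return Ok(n);
        }
    }
    Ok(nvml.device_count()?)
}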
Adds a --parallelism-auto flag to automatically detect optimal dp, tp, and micro_batch_size based on the model and GPU hardware.
We considered sharing the parallelism config via P2P but it added unnecessary complexity—dp/tp/micro_batch_size are needed before model download begins, requiring a separate request/wait flow. Since the file is ~200 bytes, fetching from HF is fast and keeps the code simple.
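For a sense of how the pieces fit together, here is a rough sketch of the lookup step; ParallelismConfig, lookup_parallelism, the model key, and all numbers are placeholders, not the PR's actual API or table.

// Hypothetical shape of the auto-detected settings; the real file fetched
// from HF (~200 bytes) may use different names and fields.
#[derive(Debug)]
struct ParallelismConfig {
    dp: u32,
    tp: u32,
    micro_batch_size: u32,
}

// Hypothetical lookup keyed on (model, normalized GPU type, GPU count);
// the values below are placeholders, not entries from the PR's table.
fn lookup_parallelism(model: &str, gpu_type: &str, gpu_count: u32) -> Option<ParallelismConfig> {
    match (model, gpu_type, gpu_count) {
        ("example-model-70b", "H100", 8) => Some(ParallelismConfig { dp: 2, tp: 4, micro_batch_size: 1 }),
        ("example-model-70b", "H200", 8) => Some(ParallelismConfig { dp: 4, tp: 2, micro_batch_size: 2 }),
        _ => None,
    }
}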
Note:
I think we can merge this PR now; we should only take into account that if the run is private, it is possible that some trainers don't have access to HuggingFace, so I don't recommend using this flag for private runs for now.
Once #455 is merged, all trainers must have access to HF/GCP, so it won't be a problem.