Hi, I'm encountering significant fluctuations in perplexity (PPL) when reproducing the VPTQ 4.05-bit quantization results using the following configuration:
"--vector_lens", "-1", "6",
"--group_num", "1",
"--num_centroids", "-1", "4096",
"--num_res_centroids", "-1", "4096",
"--npercent", "0",
"--blocksize", "128",
"--new_eval",
"--seq_len", "2048",
"--kmeans_mode", "hessian",
"--num_gpus", "8",
# "--enable_perm",
"--enable_norm",
"--save_model",
"--save_packed_model",
"--hessian_path", "/workshop/Hessians/H",
"--inv_hessian_path", "/workshop/Hessians/INVH",
"--ktol", "1e-5",
"--kiter", "100"Setup details:
- Model: Llama3-8B
- Dataset: wikitext-2
- Hardware: 8× A100 GPUs
- Random seed: default (0)
- Hessian files are precomputed and reused across runs
Observed behavior:
Across multiple independent runs with the exact same command and environment, I obtained widely varying PPL scores: 29.83, 15.52, and 50.56.
To debug, I verified that:
- The inference code itself is deterministic: when I load a saved quantized model and run evaluation, the PPL is consistent across repeated evaluations of the same quantized checkpoint.
- However, different quantization runs (even with identical seeds and inputs) produce quantized models with drastically different PPLs (a minimal checkpoint-comparison sketch follows this list).
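For reference, a minimal sketch of how the checkpoints from two runs can be compared directly, to confirm the divergence really happens at quantization time rather than at evaluation. It assumes the quantized model is saved as a flat PyTorch state dict of tensors; the two run paths are placeholders:

```python
import torch

# Load checkpoints produced by two independent runs of the same command
# (paths are placeholders). map_location="cpu" avoids touching the GPUs.
a = torch.load("run1/quantized_model.pt", map_location="cpu")
b = torch.load("run2/quantized_model.pt", map_location="cpu")

# Any tensor that differs bit-for-bit means the quantization step itself,
# not the evaluation, introduced the divergence.
for name in sorted(a):
    if torch.is_tensor(a[name]) and not torch.equal(a[name], b[name]):
        print(f"mismatch in {name}")
```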
This suggests that non-determinism is introduced during the quantization process, possibly in the k-means clustering step (`--kmeans_mode hessian`). Could this be due to:
- Non-deterministic behavior in PyTorch/CUDA operations despite a fixed seed?
- Initialization sensitivity in k-means when using Hessian-weighted distances?
- Race conditions or non-determinism across multi-GPU execution?
Could you please help clarify why such large fluctuations occur and how to achieve reproducible quantization results? Any guidance on ensuring determinism (e.g., additional seeding, disabling certain optimizations, or adjusting k-means parameters) would be greatly appreciated.
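For reference, this is the standard PyTorch determinism checklist I plan to try before the next quantization run. It is a generic sketch, not VPTQ-specific, and it would need to run in every worker process before any CUDA work is done:

```python
import os
import random

import numpy as np
import torch

# Must be set before CUDA is initialized for deterministic cuBLAS GEMMs.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
os.environ["PYTHONHASHSEED"] = "0"

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)            # seeds CPU and all CUDA devices
torch.cuda.manual_seed_all(0)

# Raise an error whenever a non-deterministic kernel would be used,
# which helps locate the exact op responsible for run-to-run drift.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```

Note that `torch.use_deterministic_algorithms(True)` can slow things down or raise errors on ops that have no deterministic implementation, so it is mainly useful for diagnosing which op is responsible.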