[torch.compile] Use Inductor Process Pool in Compilation#36028
eellison wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request introduces parallel compilation for torch.compile using an Inductor process pool, which is a great performance improvement. The implementation correctly calculates the number of processes and manages the pool's lifecycle. I've found one critical issue related to ensuring a fallback to single-threaded compilation when parallel compilation is not supported, which could lead to hangs. My review includes a suggested fix for this.
Note: Security Review did not run due to the size of the PR.
Hi @eellison, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged.
Enable parallel inductor compilation via a subprocess pool, allowing triton kernel compilation to happen asynchronously across multiple processes. Previously, vLLM hard-coded compile_threads=1, so all triton kernels were compiled sequentially in the main process.

This change auto-computes a default based on available CPUs and GPUs: min(8, cpu_count // num_gpus - 1), capped at 8 since vLLM's graph splitting typically produces only ~4 unique kernels. The default can be overridden with VLLM_COMPILE_PROCESSES.

Lifecycle:
- The pool is warmed up at the end of load_model(), before the first torch.compile.
- The pool is quiesced before cudagraph capture.
- After quiesce, the sidecar subprocess settles to 0% CPU within a few seconds (just a sleeping process waiting on a pipe read), so it does not interfere with inference, which is CPU-sensitive.

Pool overhead (8 processes):
- warm_pool: ~124ms
- quiesce: <1ms

Benchmark (facebook/opt-1.3b, 1 GPU):

| Config | Graph compile | torch.compile total | Init time |
|-----------|---------------|---------------------|-----------|
| threads=1 | 6.73s | 9.40s | 34.21s |
| threads=8 | 5.87s | 8.45s | 32.33s |
| Speedup | 13% | 10% | 5.5% |

Measurement scripts: https://gist.github.com/eellison/4137011c0cc9c9260b1e5a35522ef90b

Signed-off-by: Elias Ellison <elias.ellison@gmail.com>
Signed-off-by: eellison <elias.ellison@gmail.com>
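The default worker-count formula in the description can be sketched as follows. This is a minimal illustration, not vLLM's actual code: the function name `default_compile_processes` is hypothetical, and the `max(1, ...)` floor is an assumption added here so degenerate CPU/GPU ratios never yield zero workers.

```python
import os

def default_compile_processes(cpu_count: int, num_gpus: int) -> int:
    """Sketch of the default: min(8, cpu_count // num_gpus - 1),
    overridable via the VLLM_COMPILE_PROCESSES environment variable."""
    override = os.environ.get("VLLM_COMPILE_PROCESSES")
    if override is not None:
        return int(override)
    # Cap at 8: vLLM's graph splitting typically yields only ~4 unique kernels,
    # so more workers would mostly sit idle. The max(1, ...) floor is an
    # assumption added for this sketch.
    return max(1, min(8, cpu_count // num_gpus - 1))
```

For example, a 32-CPU, 2-GPU machine gets min(8, 16 - 1) = 8 workers, while an 8-CPU, 2-GPU machine gets min(8, 4 - 1) = 3.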
# TODO: warm start should not save any artifacts
# https://github.com/vllm-project/vllm/issues/35708
num_compiled_artifacts_saved=1,
assert counters["aot_autograd"]["autograd_cache_hit"] == 1

# Clean up for other tests in the same pytest session
shutdown_compile_workers()
Does the other test in this file affect this test?
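The warm-up/shutdown lifecycle the PR describes (warm the pool before the first torch.compile, stop it before cudagraph capture) can be sketched with a standard process pool. This is illustrative only: `CompilePool`, `warm`, and `quiesce` are hypothetical names, and quiesce here simply shuts the pool down, whereas the real quiesce keeps workers alive but idle. The "fork" start method is used so the sketch runs without a `__main__` guard on Linux.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def _noop() -> int:
    # Trivial task used only to force workers to start.
    return 0

class CompilePool:
    """Illustrative warm/quiesce lifecycle (hypothetical names, not vLLM's API)."""

    def __init__(self, processes: int):
        self.processes = processes
        ctx = multiprocessing.get_context("fork")
        self._pool = ProcessPoolExecutor(max_workers=processes, mp_context=ctx)

    def warm(self) -> list[int]:
        # Submit one no-op per worker so process-startup cost is paid up
        # front, before the first real compile job arrives.
        futures = [self._pool.submit(_noop) for _ in range(self.processes)]
        return [f.result() for f in futures]

    def quiesce(self) -> None:
        # Stop the workers so they consume no CPU during cudagraph capture
        # and inference (the real quiesce leaves them sleeping on a pipe read).
        self._pool.shutdown(wait=True)
```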
# Verify the PyTorch version supports quiesce — we need it
# to stop the pool before cudagraph capture. If missing,
# fall back to single-threaded compilation.
from torch._inductor.compile_worker.subproc_pool import (
    SubprocPool,
)

if not hasattr(SubprocPool, "quiesce"):
    return 0
Which PyTorch versions don't have it? If the answer is < 2.9, then we can just remove the extra check.
zou3519 left a comment:
Main thing is that there shouldn't be the test_moe_startup change; otherwise, this LGTM.
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.