[torch.compile] Use Inductor Process Pool in Compilation #36028

Open

eellison wants to merge 2 commits into vllm-project:main from eellison:process_pool

Conversation

@eellison
Contributor

@eellison eellison commented Mar 4, 2026

Enable parallel inductor compilation via a subprocess pool, allowing triton kernel compilation to happen asynchronously across multiple processes. Previously, vLLM hard-coded compile_threads=1, so all triton kernels were compiled sequentially in the main process.

This change auto-computes a default based on available CPUs and GPUs: min(8, cpu_count // num_gpus - 1), capped at 8 since vLLM's graph splitting typically produces only ~4 unique kernels. The default can be overridden with VLLM_COMPILE_PROCESSES.
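The default computation above can be sketched as follows. The function name and the clamp to zero are illustrative; only the `min(8, cpu_count // num_gpus - 1)` formula and the cap rationale come from the PR text:

```python
def default_compile_processes(cpu_count: int, num_gpus: int, cap: int = 8) -> int:
    # Reserve one CPU per GPU for the main process; cap at 8 since vLLM's
    # graph splitting typically produces only ~4 unique kernels, so more
    # workers give diminishing returns. The max(0, ...) clamp is an assumed
    # guard for CPU-starved hosts, not stated in the PR.
    return max(0, min(cap, cpu_count // num_gpus - 1))

# e.g. a 64-CPU, 8-GPU host gets 64 // 8 - 1 = 7 compile processes
```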

Lifecycle:

  • Pool is warmed up at the end of load_model(), before first torch.compile.
  • Pool is quiesced before cudagraph capture.
  • After quiesce, the sidecar subprocess settles to 0% CPU within a few seconds (just a sleeping process waiting on a pipe read), so it does not interfere with inference, which is CPU-sensitive.
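The warm-up/quiesce lifecycle above can be sketched with a stand-in pool. The real PR uses Inductor's `SubprocPool`, not `ProcessPoolExecutor`, and `warm_pool`/`_noop` here are illustrative names; the `fork` context is an assumption to keep the sketch self-contained on Linux:

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def _noop() -> None:
    """Stand-in for a Triton kernel compile job."""
    return None

def warm_pool(num_workers: int) -> ProcessPoolExecutor:
    # Start workers eagerly so the first real compile job pays no spawn cost.
    pool = ProcessPoolExecutor(
        max_workers=num_workers, mp_context=mp.get_context("fork")
    )
    # One trivial task per worker forces all workers to start now.
    for f in [pool.submit(_noop) for _ in range(num_workers)]:
        f.result()
    return pool

def quiesce(pool: ProcessPoolExecutor) -> None:
    # Drain outstanding work and stop workers before cudagraph capture,
    # so idle compile processes don't compete for CPU afterwards.
    pool.shutdown(wait=True)

pool = warm_pool(2)
results = [f.result() for f in [pool.submit(_noop) for _ in range(4)]]
quiesce(pool)
```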

Pool overhead (8 processes):

warm_pool: ~124ms
quiesce: <1ms

Benchmark (facebook/opt-1.3b, 1 GPU):

| Config    | Graph compile | torch.compile total | Init time |
|-----------|---------------|---------------------|-----------|
| threads=1 | 6.73s         | 9.40s               | 34.21s    |
| threads=8 | 5.87s         | 8.45s               | 32.33s    |
| Speedup   | 13%           | 10%                 | 5.5%      |
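The speedup row follows directly from the two timing rows; as a quick arithmetic check:

```python
def speedup_pct(before: float, after: float) -> float:
    # Percent reduction relative to the slower (threads=1) run.
    return round((before - after) / before * 100, 1)

assert speedup_pct(6.73, 5.87) == 12.8    # graph compile, reported as ~13%
assert speedup_pct(9.40, 8.45) == 10.1    # torch.compile total, ~10%
assert speedup_pct(34.21, 32.33) == 5.5   # init time, 5.5%
```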

Measurement scripts for pool overhead: https://gist.github.com/eellison/4137011c0cc9c9260b1e5a35522ef90b


@eellison eellison requested a review from njhill as a code owner March 4, 2026 17:56
@mergify mergify bot added the v1 label Mar 4, 2026
@eellison eellison marked this pull request as draft March 4, 2026 17:59
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces parallel compilation for torch.compile using an Inductor process pool, which is a great performance improvement. The implementation correctly calculates the number of processes and manages the pool's lifecycle. I've found one critical issue related to ensuring a fallback to single-threaded compilation when parallel compilation is not supported, which could lead to hangs. My review includes a suggested fix for this.

Note: Security Review did not run due to the size of the PR.

@eellison eellison marked this pull request as ready for review March 4, 2026 18:02
@mergify

mergify bot commented Mar 4, 2026

Hi @eellison, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@eellison eellison marked this pull request as draft March 4, 2026 18:15
@eellison eellison marked this pull request as ready for review March 4, 2026 19:49
@zou3519 zou3519 added ready ONLY add when PR is ready to merge/full CI is needed and removed ready ONLY add when PR is ready to merge/full CI is needed labels Mar 4, 2026
@mergify

mergify bot commented Mar 5, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @eellison.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 5, 2026
eellison added 2 commits March 9, 2026 15:57
Comment on lines +64 to +66
# TODO: warm start should not save any artifacts
# https://github.com/vllm-project/vllm/issues/35708
num_compiled_artifacts_saved=1,
Collaborator

@eellison we fixed this, please rebase

Comment on lines +71 to +72
assert counters["aot_autograd"]["autograd_cache_hit"] == 1

Collaborator

same here

Comment on lines +99 to +100
# Clean up for other tests in the same pytest session
shutdown_compile_workers()
Collaborator

Does the other test in this file affect this test?

Comment on lines +367 to +375
# Verify the PyTorch version supports quiesce — we need it
# to stop the pool before cudagraph capture. If missing,
# fall back to single-threaded compilation.
from torch._inductor.compile_worker.subproc_pool import (
    SubprocPool,
)

if not hasattr(SubprocPool, "quiesce"):
    return 0
Collaborator

Which PyTorch versions don't have quiesce? If the answer is < 2.9, then we can just remove the extra check.

Collaborator

@zou3519 zou3519 left a comment

Main thing is that there shouldn't be the test_moe_startup change; otherwise, this LGTM.
