fix: auto-detect Blackwell GPUs and avoid vLLM hangs#825
SirMal wants to merge 4 commits into ace-step:main
Conversation
Commits:
- Add Dockerfile for x86_64 CUDA (Docker Desktop on Windows/Linux), docker-compose.yml for easy launch with GPU passthrough, and CLAUDE.md with build/test commands, architecture overview, and conventions. (Co-Authored-By: Claude Opus 4.6)
- …trypoint: the env var was ignored because the pipeline only reads it when --enable-api is set. Without --backend, nano-vllm (the default) OOMs on consumer GPUs like the RTX 5070 (11.5 GB). Now defaults to the pt (PyTorch) backend. (Co-Authored-By: Claude Opus 4.6)
- Blackwell GPUs (compute capability ≥12, RTX 50-series) have known compatibility issues with vLLM/nano-vllm, causing hangs and segfaults. Add is_blackwell_gpu() detection in gpu_config.py; default to the PyTorch backend on Blackwell GPUs instead of vLLM; add a safety net in llm_inference.py to override vLLM on Blackwell; add a 180s timeout wrapper around vLLM init with automatic PyTorch fallback; add debug logging in the Docker entrypoint for INIT_ARGS visibility; log backend/device/model at LM initialization start. (Co-Authored-By: Claude Opus 4.6)
📝 Walkthrough

Added documentation and containerization for ACE-Step 1.5; introduced NVIDIA Blackwell (RTX 50-series) GPU detection and runtime backend selection; implemented threaded vLLM initialization with timeout and fallback to PyTorch to avoid hangs.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as User/Entrypoint
    participant GPU as GPU Detection
    participant Selector as Backend Selector
    participant LLM as LLM Initializer
    User->>GPU: Query device (compute capability)
    GPU-->>User: Is Blackwell? (True/False)
    User->>Selector: Choose backend (mlx/vllm/pt)
    alt Blackwell detected
        Selector-->>User: Select PyTorch (warn vLLM hang)
    else Mac
        Selector-->>User: Select mlx
    else
        Selector-->>User: Select vLLM
    end
    User->>LLM: Initialize LLM (threaded for vLLM)
    alt Init completes within timeout
        LLM-->>User: Return initialized LLM
    else Timeout / hang
        LLM-->>User: Initialization failed -> suggest PyTorch fallback
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
acestep/acestep_v15_pipeline.py (1)
71-79: ⚠️ Potential issue | 🔴 Critical

Import `is_blackwell_gpu` in the primary package-import branch too.

This only fixes the script fallback path. The normal launch path (`python -m acestep.acestep_v15_pipeline`, which the Docker entrypoint uses) goes through the first import block, so `main()` reaches line 147 with `is_blackwell_gpu` undefined and the UI crashes before startup.

Suggested change
```diff
 from .gpu_config import (
     get_gpu_config,
     get_gpu_memory_gb,
     print_gpu_config_info,
     set_global_gpu_config,
     VRAM_16GB_MIN_GB,
     VRAM_AUTO_OFFLOAD_THRESHOLD_GB,
+    is_blackwell_gpu,
     is_mps_platform,
 )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/acestep_v15_pipeline.py` around lines 71 - 79, The primary import block in acestep_v15_pipeline.py is missing is_blackwell_gpu, causing main() to reference an undefined symbol; update the first package-import branch to include is_blackwell_gpu alongside get_gpu_config, get_gpu_memory_gb, print_gpu_config_info, set_global_gpu_config, VRAM_16GB_MIN_GB, and VRAM_AUTO_OFFLOAD_THRESHOLD_GB so that the symbol is defined when main() runs (ensure the same import appears in the first import group as in the fallback branch).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@acestep/llm_inference.py`:
- Around line 739-750: vLLM init in a daemon thread (init_thread running
_init_vllm) can continue touching CUDA after join(timeout= vllm_timeout_s)
returns, risking concurrent GPU access; replace the in-thread strategy with a
killable subprocess (e.g., use multiprocessing.Process or spawn a separate
process to run _init_vllm) and join with timeout, and if the process is still
alive call terminate() and ensure it is cleaned up, set self.llm_initialized =
False and return the timeout error without falling back to the in-process
PyTorch path (i.e., avoid calling LLM() or any other backend init in-process on
that code path).
In `@CLAUDE.md`:
- Around line 62-64: The VRAM-tier example in CLAUDE.md is incorrect; update the
mapping to match the logic in acestep/gpu_config.py by changing the tiers to:
6GB → 0.6B; 8–12GB → 0.6B (not 1.7B); 12–24GB → 1.7B; and >=24GB (unlimited) →
4B; reference gpu_config.py in the text and ensure the sentence "VRAM tiers
auto-select LM model size" lists these exact ranges and model sizes so the
documentation aligns with the live config.
In `@Dockerfile`:
- Line 91: The Dockerfile currently creates /app/checkpoints and
/app/gradio_outputs as root (RUN mkdir -p /app/checkpoints /app/gradio_outputs)
and the container continues running as root; change this by creating a dedicated
non-root runtime user (e.g., "appuser"), chown the application directories (/app
and the created subdirs) to that user after creation, and switch to that user
with USER before the ENTRYPOINT/CMD so HTTP endpoints and bind-mounted files are
not written as root; apply the same pattern for the other mkdir/chown
occurrences in the file (the block referenced at lines 160-162).
- Around line 141-142: The Dockerfile currently echoes ACESTEP_EXTRA_ARGS
verbatim, which may leak secrets; change the echo of ACESTEP_EXTRA_ARGS to
either remove it entirely or print a redacted version instead (e.g., detect and
mask values for flags like --auth-password and --api-key before printing). Keep
the existing INIT_ARGS echo but update the ACESTEP_EXTRA_ARGS reference in the
block containing the echo commands so that sensitive flags are not logged (look
for the lines that reference INIT_ARGS and ACESTEP_EXTRA_ARGS and modify/remove
the ACESTEP_EXTRA_ARGS echo).
- Around line 94-100: Remove the hard-coded default for the ACESTEP LM backend
in the Docker image by deleting or unsetting the ENV ACESTEP_LM_BACKEND
declaration in the Dockerfile so the image does not force "pt" at container
start; instead, document ACESTEP_LM_BACKEND as a runtime/user override (e.g.,
via docker run -e or compose) so the runtime selection logic can choose vLLM or
other backends dynamically when Python starts.
---
Outside diff comments:
In `@acestep/acestep_v15_pipeline.py`:
- Around line 71-79: The primary import block in acestep_v15_pipeline.py is
missing is_blackwell_gpu, causing main() to reference an undefined symbol;
update the first package-import branch to include is_blackwell_gpu alongside
get_gpu_config, get_gpu_memory_gb, print_gpu_config_info, set_global_gpu_config,
VRAM_16GB_MIN_GB, and VRAM_AUTO_OFFLOAD_THRESHOLD_GB so that the symbol is
defined when main() runs (ensure the same import appears in the first import
group as in the fallback branch).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0feb43e7-cced-4945-997f-5ff71fa12659
📒 Files selected for processing (6)
CLAUDE.md, Dockerfile, acestep/acestep_v15_pipeline.py, acestep/gpu_config.py, acestep/llm_inference.py, docker-compose.yml
```python
init_thread = threading.Thread(target=_init_vllm, daemon=True)
init_thread.start()
init_thread.join(timeout=vllm_timeout_s)

if init_thread.is_alive():
    logger.error(
        f"vLLM initialization timed out after {vllm_timeout_s}s "
        "(may indicate GPU compatibility issue). "
        "The daemon thread will be abandoned."
    )
    self.llm_initialized = False
    return f"❌ vLLM initialization timed out after {vllm_timeout_s}s — try setting LM Backend to 'pt' (PyTorch)"
```
A vLLM timeout still leaves the initializer running.
join(timeout=...) only stops waiting; it does not stop LLM(). After this returns, the caller immediately falls back to PyTorch, so two backend initializations can touch CUDA in the same process on the exact hang path. If this timeout needs to be recoverable, the vLLM init has to live in a killable subprocess; otherwise treat the timeout as terminal and skip the in-process fallback.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@acestep/llm_inference.py` around lines 739 - 750, vLLM init in a daemon
thread (init_thread running _init_vllm) can continue touching CUDA after
join(timeout= vllm_timeout_s) returns, risking concurrent GPU access; replace
the in-thread strategy with a killable subprocess (e.g., use
multiprocessing.Process or spawn a separate process to run _init_vllm) and join
with timeout, and if the process is still alive call terminate() and ensure it
is cleaned up, set self.llm_initialized = False and return the timeout error
without falling back to the in-process PyTorch path (i.e., avoid calling LLM()
or any other backend init in-process on that code path).
```markdown
### Multi-Platform Support

Supports CUDA, ROCm, Intel XPU, MPS, MLX, and CPU. **Do not alter non-target platform paths** unless the task requires it. Use `gpu_config.py` for hardware detection. VRAM tiers auto-select LM model size (6GB→0.6B, 8GB→0.6B/1.7B, 16GB+→4B).
```
Update the VRAM-tier example to match acestep/gpu_config.py.
The current note says 16GB+ -> 4B, but the live config still recommends 1.7B through the 16-24GB tiers and only recommends 4B in the unlimited tier (>=24GB). The 8GB example is also off: 8-12GB still defaults to 0.6B. Right now this doc teaches the wrong auto-selection behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@CLAUDE.md` around lines 62 - 64, The VRAM-tier example in CLAUDE.md is
incorrect; update the mapping to match the logic in acestep/gpu_config.py by
changing the tiers to: 6GB → 0.6B; 8–12GB → 0.6B (not 1.7B); 12–24GB → 1.7B; and
>=24GB (unlimited) → 4B; reference gpu_config.py in the text and ensure the
sentence "VRAM tiers auto-select LM model size" lists these exact ranges and
model sizes so the documentation aligns with the live config.
```dockerfile
    && pip install --no-cache-dir --no-deps /app/acestep/third_parts/nano-vllm

# ==================== Runtime directories ====================
RUN mkdir -p /app/checkpoints /app/gradio_outputs
```
Run the final container as an unprivileged user.
This image exposes HTTP endpoints and writes bind-mounted checkpoints/outputs as root. That increases the blast radius of a compromise and leaves host files root-owned. Create a dedicated runtime user, chown /app, /app/checkpoints, and /app/gradio_outputs, then switch with USER before the entrypoint.
Suggested change
```diff
 RUN mkdir -p /app/checkpoints /app/gradio_outputs
+RUN groupadd --system acestep \
+    && useradd --system --gid acestep --create-home --home-dir /home/acestep acestep \
+    && chown -R acestep:acestep /app /home/acestep
@@
 RUN sed -i 's/\r$//' /app/docker-entrypoint.sh && chmod +x /app/docker-entrypoint.sh
+USER acestep
 ENTRYPOINT ["/app/docker-entrypoint.sh"]
```
Also applies to: 160-162
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Dockerfile` at line 91, The Dockerfile currently creates /app/checkpoints and
/app/gradio_outputs as root (RUN mkdir -p /app/checkpoints /app/gradio_outputs)
and the container continues running as root; change this by creating a dedicated
non-root runtime user (e.g., "appuser"), chown the application directories (/app
and the created subdirs) to that user after creation, and switch to that user
with USER before the ENTRYPOINT/CMD so HTTP endpoints and bind-mounted files are
not written as root; apply the same pattern for the other mkdir/chown
occurrences in the file (the block referenced at lines 160-162).
```dockerfile
ENV GRADIO_SERVER_NAME=0.0.0.0
ENV ACESTEP_MODE=gradio
ENV ACESTEP_INIT_SERVICE=true
ENV ACESTEP_CONFIG_PATH=acestep-v15-turbo
ENV ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-0.6B
ENV ACESTEP_LM_BACKEND=pt
ENV TOKENIZERS_PARALLELISM=false
```
Don't pin the container's default LM backend to PyTorch.
ENV ACESTEP_LM_BACKEND=pt short-circuits the new runtime selection before Python starts, so non-Blackwell CUDA runs never keep vLLM as the default and docker compose inherits that regression too. Leave this unset in the image and only document it as a user override.
Suggested change
```diff
-ENV ACESTEP_LM_BACKEND=pt
 ENV TOKENIZERS_PARALLELISM=false
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Dockerfile` around lines 94 - 100, Remove the hard-coded default for the
ACESTEP LM backend in the Docker image by deleting or unsetting the ENV
ACESTEP_LM_BACKEND declaration in the Dockerfile so the image does not force
"pt" at container start; instead, document ACESTEP_LM_BACKEND as a runtime/user
override (e.g., via docker run -e or compose) so the runtime selection logic can
choose vLLM or other backends dynamically when Python starts.
```shell
echo "INIT_ARGS: ${INIT_ARGS}"
echo "EXTRA_ARGS: ${ACESTEP_EXTRA_ARGS:-}"
```
Don't log ACESTEP_EXTRA_ARGS verbatim.
This is the escape hatch for flags like `--auth-password` and `--api-key`, so echoing it writes secrets straight into container logs. Keeping INIT_ARGS for debugging is fine, but EXTRA_ARGS should be dropped or redacted first.
Suggested change
```diff
 echo "INIT_ARGS: ${INIT_ARGS}"
-echo "EXTRA_ARGS: ${ACESTEP_EXTRA_ARGS:-}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Dockerfile` around lines 141 - 142, The Dockerfile currently echoes
ACESTEP_EXTRA_ARGS verbatim, which may leak secrets; change the echo of
ACESTEP_EXTRA_ARGS to either remove it entirely or print a redacted version
instead (e.g., detect and mask values for flags like --auth-password and
--api-key before printing). Keep the existing INIT_ARGS echo but update the
ACESTEP_EXTRA_ARGS reference in the block containing the echo commands so that
sensitive flags are not logged (look for the lines that reference INIT_ARGS and
ACESTEP_EXTRA_ARGS and modify/remove the ACESTEP_EXTRA_ARGS echo).
ChuxiJ
left a comment
Code Review: fix: auto-detect Blackwell GPUs and avoid vLLM hangs
Overall Assessment: Request Changes
The Blackwell GPU detection feature is valuable for RTX 50-series users, but the PR has several issues: a confirmed crash bug, a risky thread-based timeout approach, scope creep (Docker + CLAUDE.md), and no unit tests for the core GPU detection logic.
🔴 BUG (Blocking): Missing import in primary code path
CodeRabbit correctly flagged this. The PR adds is_blackwell_gpu to the fallback import block (except ImportError, line 78) but NOT to the primary import block (try, lines 50-59).
When running normally via python -m acestep.acestep_v15_pipeline, the try block succeeds, but is_blackwell_gpu is not imported. Then main() calls is_blackwell_gpu() at line ~147 → NameError: name 'is_blackwell_gpu' is not defined → crash on startup for every user.
Fix: Add is_blackwell_gpu to the primary import block:
```python
from .gpu_config import (
    get_gpu_config,
    get_gpu_memory_gb,
    print_gpu_config_info,
    set_global_gpu_config,
    VRAM_16GB_MIN_GB,
    VRAM_AUTO_OFFLOAD_THRESHOLD_GB,
    is_blackwell_gpu,
    is_mps_platform,
)
```

🟡 Concern: Daemon thread timeout for vLLM init
)🟡 Concern: Daemon thread timeout for vLLM init
The PR wraps LLM() initialization in a daemon thread with 180s timeout:
```python
init_thread = threading.Thread(target=_init_vllm, daemon=True)
init_thread.start()
init_thread.join(timeout=vllm_timeout_s)
if init_thread.is_alive():
    # "The daemon thread will be abandoned"
```

This is problematic:
- Abandoned CUDA state: the daemon thread continues running after the timeout. It holds the CUDA context, allocated GPU memory, and potentially corrupted allocator state. Subsequent PyTorch operations (even on a different backend) may fail or behave unpredictably.
- No cleanup: there is no `torch.cuda.empty_cache()` or CUDA context reset after the timeout.
- Thread safety: `init_result` (a mutable list) is shared between threads without synchronization primitives.

Recommendation: Since the PR already adds `is_blackwell_gpu()` detection in `initialize()` (which switches to `backend="pt"` BEFORE reaching `_initialize_5hz_lm_vllm`), the thread timeout is a redundant safety net. Consider removing it entirely — the Blackwell check at lines 622-628 already prevents vLLM from ever being attempted on these GPUs. If a timeout safety net is truly needed, use `multiprocessing.Process`, which can be terminated cleanly.
🟡 Scope: Docker + CLAUDE.md are separate concerns
The PR includes 3 files unrelated to Blackwell GPU detection:
- `CLAUDE.md` (+84 lines) — AI coding assistant guidance
- `Dockerfile` (+162 lines) — Docker containerization
- `docker-compose.yml` (+21 lines) — Compose config
Per AGENTS.md: "Solve one problem per task/PR" and "Keep edits minimal: touch only files/functions required for the requested change."
Recommendation: Split Docker support into a separate PR. This keeps the Blackwell fix focused and reviewable. The Docker setup also has its own issues (runs as root, hardcodes ACESTEP_LM_BACKEND=pt, echoes ACESTEP_EXTRA_ARGS which could leak secrets) that deserve dedicated review.
🟡 Missing: No unit tests
The core feature (is_blackwell_gpu()) has no unit tests. It's a simple function that would be easy to test:
```python
from unittest.mock import patch

def test_blackwell_detected_when_capability_12(self):
    with patch("torch.cuda.is_available", return_value=True), \
         patch("torch.cuda.get_device_capability", return_value=(12, 0)):
        self.assertTrue(is_blackwell_gpu())

def test_non_blackwell_when_capability_8(self):
    with patch("torch.cuda.is_available", return_value=True), \
         patch("torch.cuda.get_device_capability", return_value=(8, 9)):
        self.assertFalse(is_blackwell_gpu())
```

✅ Good: GPU detection logic
`is_blackwell_gpu()` itself is clean and correct:
- Uses `torch.cuda.get_device_capability()`, which is the standard way to detect GPU architecture
- `major >= 12` correctly identifies Blackwell (compute capability 12.x)
- Wrapped in `try/except Exception` for graceful degradation
- Properly gates on `torch.cuda.is_available()` first
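For reference, a sketch matching the behavior described above (the actual implementation lives in `acestep/gpu_config.py`; this is a reconstruction from the review's description, not the PR's code):

```python
def is_blackwell_gpu() -> bool:
    """Best-effort detection of NVIDIA Blackwell (compute capability 12.x)."""
    try:
        import torch  # imported lazily so a missing torch degrades gracefully
        if not torch.cuda.is_available():
            return False
        major, _minor = torch.cuda.get_device_capability()
        return major >= 12
    except Exception:
        # Any probe failure is treated as "not Blackwell" rather than a crash.
        return False
```

Because every failure path returns False, callers can use it as an unconditional guard without their own try/except.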
Summary of Requested Changes
- [Blocking] Add `is_blackwell_gpu` to the primary import block in `acestep_v15_pipeline.py`
- [Important] Remove or replace the daemon thread timeout with `multiprocessing.Process`, or remove it entirely since `is_blackwell_gpu()` already prevents vLLM on Blackwell GPUs
- [Important] Split Docker/CLAUDE.md into a separate PR
- [Should have] Add unit tests for `is_blackwell_gpu()`
♻️ Duplicate comments (1)
acestep/llm_inference.py (1)
740-773: ⚠️ Potential issue | 🔴 Critical

Timeout branch is not actually recoverable and can race with PyTorch fallback.

Line 759 only times out the wait; it does not stop `LLM()` in the daemon thread. That leaves vLLM init running while the caller may start PyTorch fallback in the same process, which is unsafe on CUDA.

Suggested direction
```diff
@@ def _initialize_5hz_lm_vllm(...):
         if init_thread.is_alive():
             logger.error(
                 f"vLLM initialization timed out after {vllm_timeout_s}s "
                 "(may indicate GPU compatibility issue). "
                 "The daemon thread will be abandoned."
             )
             self.llm_initialized = False
-            return f"❌ vLLM initialization timed out after {vllm_timeout_s}s — try setting LM Backend to 'pt' (PyTorch)"
+            self._vllm_timeout_unrecoverable = True
+            return (
+                f"❌ vLLM initialization timed out after {vllm_timeout_s}s. "
+                "In-process fallback is disabled for safety; restart with backend='pt'."
+            )
@@ def initialize(...):
         if status_msg.startswith("❌"):
+            if getattr(self, "_vllm_timeout_unrecoverable", False):
+                return status_msg, False
             if not self.llm_initialized:
                 ...
```

If you need automatic fallback, move vLLM init to a killable subprocess and terminate it on timeout before attempting PyTorch in-process.
Run this to verify the risky control flow is present:
```bash
#!/bin/bash
set -euo pipefail
# 1) Confirm timeout only waits and then abandons thread
rg -n -C3 'init_thread = threading.Thread|init_thread\.join\(timeout=|if init_thread\.is_alive\(\)|daemon thread will be abandoned' acestep/llm_inference.py
# 2) Confirm failure path can immediately fall back to PyTorch
rg -n -C4 'status_msg\.startswith\("❌"\)|Falling back to PyTorch backend|_load_pytorch_model\(' acestep/llm_inference.py
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@acestep/llm_inference.py` around lines 740 - 773, The timeout branch currently only stops waiting for the daemon thread running _init_vllm (which calls LLM(...)) but does not terminate that thread, leaving vLLM initialization running and racing with any in-process PyTorch fallback (_load_pytorch_model or code that sets self.llm), which is unsafe on CUDA; fix by moving the vLLM initialization out of a lingering daemon thread into a separate killable subprocess (spawn a subprocess to run the LLM(...) init, monitor it with a timeout, terminate the subprocess if it exceeds vllm_timeout_s, and only then proceed to set self.llm or fall back), or alternatively serialize GPU backends so that when init_thread is timed out you block further PyTorch initialization until the vLLM subprocess is fully killed; update references to _init_vllm, init_thread, init_result, and setting self.llm accordingly so no background vLLM init can run concurrently with PyTorch fallback.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@acestep/llm_inference.py`:
- Around line 740-773: The timeout branch currently only stops waiting for the
daemon thread running _init_vllm (which calls LLM(...)) but does not terminate
that thread, leaving vLLM initialization running and racing with any in-process
PyTorch fallback (_load_pytorch_model or code that sets self.llm), which is
unsafe on CUDA; fix by moving the vLLM initialization out of a lingering daemon
thread into a separate killable subprocess (spawn a subprocess to run the
LLM(...) init, monitor it with a timeout, terminate the subprocess if it exceeds
vllm_timeout_s, and only then proceed to set self.llm or fall back), or
alternatively serialize GPU backends so that when init_thread is timed out you
block further PyTorch initialization until the vLLM subprocess is fully killed;
update references to _init_vllm, init_thread, init_result, and setting self.llm
accordingly so no background vLLM init can run concurrently with PyTorch
fallback.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ef44a067-2a29-4e4c-9abd-c71714deb0b0
📒 Files selected for processing (2)
acestep/gpu_config.py, acestep/llm_inference.py
ChuxiJ
left a comment
Code Review
Verdict: Needs Changes ❌
The PR addresses a real issue (Blackwell GPUs hanging during vLLM init), but has several problems that must be fixed before merge.
Blocking Bug: Missing import crashes ALL users
In acestep/acestep_v15_pipeline.py, is_blackwell_gpu is imported only in the except ImportError fallback block but NOT in the primary try block. When running normally, the try block succeeds, is_blackwell_gpu is never imported, and main() raises NameError — crashing startup for all users, not just Blackwell GPU owners.
Fix: Add from acestep.gpu_config import is_blackwell_gpu to the primary try import block.
Medium: Daemon thread timeout leaks CUDA state
The thread-based vLLM init timeout has issues:
- After `join(timeout=180)` returns with the thread still alive, the daemon thread continues running, holding the CUDA context and GPU memory — Python threads cannot be killed
- No `torch.cuda.empty_cache()` or CUDA context cleanup after the timeout
- This is largely redundant, since the `is_blackwell_gpu()` guard already prevents vLLM from being attempted on Blackwell GPUs
Recommendation: Remove the thread-based timeout (the Blackwell guard handles it), or replace with multiprocessing.Process which can be properly terminated.
Scope: Docker and CLAUDE.md should be separate PRs
The PR mixes the Blackwell fix + Docker support + CLAUDE.md updates. Docker-specific issues:
- `ENV ACESTEP_LM_BACKEND=pt` hardcodes PyTorch for ALL Docker users, bypassing auto-detection that benefits non-Blackwell GPUs
- Container runs as root — should use an unprivileged user
- Entrypoint logs potentially sensitive values (`--api-key`, `--auth-password`)
Other
- Mixed `print()` vs `logger` for Blackwell detection messages — use loguru consistently
- No unit tests for `is_blackwell_gpu()` despite being trivially testable with mocks
- CLAUDE.md VRAM tier description ("16GB+ -> 4B") doesn't match actual `gpu_config.py` logic
The core detection logic (is_blackwell_gpu using get_device_capability() >= 12) is solid. Please fix the import bug, simplify the timeout approach, and split Docker into its own PR.
ChuxiJ
left a comment
Code Review
Overall: The Blackwell GPU detection concept is good, but there is a critical import bug that must be fixed before merge. Docker additions should be split into a separate PR.
Critical
- Missing import in primary import branch (`acestep_v15_pipeline.py`): `is_blackwell_gpu` is only imported in the `except ImportError` fallback branch (line 78), but not in the primary import path (lines 51-58). When run normally (`python -m acestep.acestep_v15_pipeline`), `is_blackwell_gpu` will be undefined at line 147, crashing the application at startup. This is a blocking bug.
High
- Daemon thread resource leak (`llm_inference.py`): when vLLM init times out, the daemon thread continues running in the background, potentially holding GPU memory and CUDA contexts. Python threads cannot be forcefully terminated. Consider using `multiprocessing.Process` instead, or document that the process should be restarted after a timeout.
Medium
- Docker defaults `ACESTEP_LM_BACKEND=pt` for all GPUs: the Dockerfile sets `ENV ACESTEP_LM_BACKEND=pt`, meaning even non-Blackwell GPUs in Docker use PyTorch instead of vLLM by default. This seems overly conservative.
- Scope creep: the PR bundles Docker support (Dockerfile + docker-compose.yml) and a `CLAUDE.md` file with a bug fix. Per the project's conventions, these should be separate PRs.
Low
- `print()` vs `logger` in `acestep_v15_pipeline.py`: the Blackwell detection message uses `print()` while `llm_inference.py` correctly uses `logger.warning()`.
- `AGENTS.md` is referenced in `CLAUDE.md` but doesn't exist in the repo.
- Docker layer caching: `COPY . /app/` before `pip install` invalidates the cache on any source change. Copy `pyproject.toml`/requirements first.
Recommendation: Fix the critical import bug before merge. Split Docker support into a separate PR.
ChuxiJ
left a comment
Code Review
Good initiative addressing Blackwell GPU compatibility! Several comments:
Positive:
- `is_blackwell_gpu()` with compute capability ≥12 detection is correct
- The 180s timeout wrapper for vLLM init is a smart safety net
- Docker + docker-compose support is a nice addition
- Debug logging in the entrypoint is helpful for troubleshooting
Concerns:
- Abandoned daemon threads: when vLLM init times out, the daemon thread is "abandoned" but may still hold GPU resources (CUDA context, allocated VRAM). This could leave the system in a bad state. Consider adding a comment noting this limitation, or documenting that a process restart may be needed after a timeout.
- CLAUDE.md inclusion: adding `CLAUDE.md` is helpful but should be a separate PR — it's unrelated to the Blackwell fix and adds review burden. Some of the content (architecture overview, conventions) should be validated by maintainers.
- Dockerfile:
  - Pinning `torch==2.10.0+cu128` is very specific — this may not exist yet or may become outdated quickly. Consider using a version range or documenting the pin reason.
  - The `pip install` approach (vs `uv`) diverges from the project's standard tooling.
- Test plan: the test plan is manual-only. Could you add an automated test for `is_blackwell_gpu()` (mocking `torch.cuda.get_device_capability`)?
Suggestion:
Split this into two PRs:
- Blackwell detection + vLLM timeout (core fix)
- Docker support + CLAUDE.md (infrastructure)
Code review
Found 3 issues.
Additionally: PR has merge conflicts with main and needs a rebase.
🤖 Generated with Claude Code
Summary
- Adds `is_blackwell_gpu()` detection and automatically defaults to the PyTorch backend on these GPUs
- Adds debug logging in the Docker entrypoint (`INIT_ARGS`) and at Python LM initialization (backend/device/model)

Test plan
- Set `ACESTEP_LM_BACKEND=vllm` on a Blackwell GPU — verify the safety net overrides to PyTorch with a warning
- Verify `INIT_ARGS` appears in Docker container logs for debugging

🤖 Generated with Claude Code