
fix: auto-detect Blackwell GPUs and avoid vLLM hangs#825

Open
SirMal wants to merge 4 commits into ace-step:main from SirMal:fix/blackwell-vllm-hang

Conversation


@SirMal commented Mar 13, 2026

Summary

  • Blackwell GPUs (RTX 50-series, compute capability ≥12) hang during vLLM/nano-vllm initialization, preventing the Gradio UI from ever starting
  • Adds is_blackwell_gpu() detection and automatically defaults to PyTorch backend on these GPUs
  • Adds a 180s timeout wrapper around vLLM init — if it hangs, auto-falls back to PyTorch
  • Adds debug logging in Docker entrypoint (INIT_ARGS) and Python LM initialization (backend/device/model)
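The detection the first two bullets describe can be sketched roughly as follows. This is an illustration reconstructed from the PR description and review comments, not the exact body of `gpu_config.py`; the lazy `torch` import is an assumption so that CPU-only environments degrade gracefully:

```python
def is_blackwell_gpu() -> bool:
    """Best-effort check for an NVIDIA Blackwell GPU (compute capability >= 12)."""
    try:
        import torch  # lazy import: missing/broken torch just means "not Blackwell"
        if not torch.cuda.is_available():
            return False
        major, _minor = torch.cuda.get_device_capability(0)
        return major >= 12
    except Exception:
        # Fail safe: any detection error is treated as "not Blackwell",
        # matching the graceful-degradation behavior the review praises.
        return False
```

The caller can then default the backend to `pt` whenever this returns `True`, leaving `vllm` the default elsewhere.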

Test plan

  • Build Docker image and run on Blackwell GPU (RTX 5070) — verify PyTorch backend is auto-selected and Gradio UI loads
  • Run on non-Blackwell GPU — verify vLLM is still used by default
  • Set ACESTEP_LM_BACKEND=vllm on Blackwell GPU — verify safety net overrides to PyTorch with warning
  • Verify INIT_ARGS appears in Docker container logs for debugging

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Containerized deployment with Docker and Compose, GPU support, healthchecks, and configurable runtime mode (UI or API).
    • Automatic NVIDIA Blackwell (RTX 50-series) detection with backend selection and warnings.
    • Threaded LLM initialization with timeout to detect hangs and improve startup reliability.
  • Documentation

    • Added a comprehensive guide covering setup, architecture, configuration, and development conventions.

SirMal and others added 3 commits March 12, 2026 08:01
- Dockerfile for x86_64 CUDA (Docker Desktop on Windows/Linux)
- docker-compose.yml for easy launch with GPU passthrough
- CLAUDE.md with build/test commands, architecture overview, and conventions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…trypoint

The env var was ignored because the pipeline only reads it when --enable-api
is set. Without --backend, nano-vllm (default) OOMs on consumer GPUs like
RTX 5070 (11.5GB). Now defaults to pt (PyTorch) backend.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Blackwell GPUs (compute capability ≥12, RTX 50-series) have known
compatibility issues with vLLM/nano-vllm causing hangs and segfaults.

- Add is_blackwell_gpu() detection in gpu_config.py
- Default to PyTorch backend on Blackwell GPUs instead of vLLM
- Add safety net in llm_inference.py to override vLLM on Blackwell
- Add 180s timeout wrapper around vLLM init with auto PyTorch fallback
- Add debug logging in Docker entrypoint for INIT_ARGS visibility
- Log backend/device/model at LM initialization start

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

Added documentation and containerization for ACE-Step 1.5; introduced NVIDIA Blackwell (RTX 50-series) GPU detection and runtime backend selection; implemented threaded vLLM initialization with timeout and fallback to PyTorch to avoid hangs.

Changes

  • Docs & Containerization — CLAUDE.md, Dockerfile, docker-compose.yml: New repository documentation (CLAUDE.md) and container setup: multi-stage CUDA 12.8 Dockerfile, Python 3.11 environment, PyTorch/CUDA deps, entrypoint/startup logic, healthcheck; docker-compose service with GPU allocation and env defaults.
  • GPU detection — acestep/gpu_config.py: Added is_blackwell_gpu() to detect NVIDIA Blackwell (compute capability >= 12) with safe exception handling.
  • Pipeline backend selection — acestep/acestep_v15_pipeline.py: Backend default logic updated to prefer PyTorch when Blackwell is detected; retains mlx on macOS and vLLM otherwise; imports is_blackwell_gpu.
  • LLM initialization — acestep/llm_inference.py: vLLM initialization moved into a worker thread with timeout and hang detection; uses the Blackwell guard to preemptively select PyTorch; improved runtime logging and failure reporting during initialization.

Sequence Diagram

sequenceDiagram
    participant User as User/Entrypoint
    participant GPU as GPU Detection
    participant Selector as Backend Selector
    participant LLM as LLM Initializer
    User->>GPU: Query device (compute capability)
    GPU-->>User: Is Blackwell? (True/False)
    User->>Selector: Choose backend (mlx/vllm/pt)
    alt Blackwell detected
        Selector-->>User: Select PyTorch (warn vLLM hang)
    else Mac
        Selector-->>User: Select mlx
    else
        Selector-->>User: Select vLLM
    end
    User->>LLM: Initialize LLM (threaded for vLLM)
    alt Init completes within timeout
        LLM-->>User: Return initialized LLM
    else Timeout / hang
        LLM-->>User: Initialization failed -> suggest PyTorch fallback
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇
I sniffed the CUDA breeze at night,
Found Blackwell stars that dimmed vLLM's light,
I threaded threads to wake the stack,
Wrote docs and Docker to pack the pack,
Hop, build, run — the models take flight!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly and concisely summarizes the main purpose of the changeset: auto-detecting Blackwell GPUs and preventing vLLM initialization hangs by switching backends.
  • Docstring Coverage — ✅ Passed: docstring coverage is 85.71%, which meets the required threshold of 80.00%.


@coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
acestep/acestep_v15_pipeline.py (1)

71-79: ⚠️ Potential issue | 🔴 Critical

Import is_blackwell_gpu in the primary package-import branch too.

This only fixes the script fallback path. The normal launch path (python -m acestep.acestep_v15_pipeline, which the Docker entrypoint uses) goes through the first import block, so main() reaches Line 147 with is_blackwell_gpu undefined and the UI crashes before startup.

Suggested change
     from .gpu_config import (
         get_gpu_config,
         get_gpu_memory_gb,
         print_gpu_config_info,
         set_global_gpu_config,
         VRAM_16GB_MIN_GB,
         VRAM_AUTO_OFFLOAD_THRESHOLD_GB,
+        is_blackwell_gpu,
         is_mps_platform,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/acestep_v15_pipeline.py` around lines 71 - 79, The primary import
block in acestep_v15_pipeline.py is missing is_blackwell_gpu, causing main() to
reference an undefined symbol; update the first package-import branch to include
is_blackwell_gpu alongside get_gpu_config, get_gpu_memory_gb,
print_gpu_config_info, set_global_gpu_config, VRAM_16GB_MIN_GB, and
VRAM_AUTO_OFFLOAD_THRESHOLD_GB so that the symbol is defined when main() runs
(ensure the same import appears in the first import group as in the fallback
branch).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0feb43e7-cced-4945-997f-5ff71fa12659

📥 Commits

Reviewing files that changed from the base of the PR and between 0aabbce and f97734b.

📒 Files selected for processing (6)
  • CLAUDE.md
  • Dockerfile
  • acestep/acestep_v15_pipeline.py
  • acestep/gpu_config.py
  • acestep/llm_inference.py
  • docker-compose.yml

Comment on lines +739 to +750
init_thread = threading.Thread(target=_init_vllm, daemon=True)
init_thread.start()
init_thread.join(timeout=vllm_timeout_s)

if init_thread.is_alive():
    logger.error(
        f"vLLM initialization timed out after {vllm_timeout_s}s "
        "(may indicate GPU compatibility issue). "
        "The daemon thread will be abandoned."
    )
    self.llm_initialized = False
    return f"❌ vLLM initialization timed out after {vllm_timeout_s}s — try setting LM Backend to 'pt' (PyTorch)"

⚠️ Potential issue | 🟠 Major

A vLLM timeout still leaves the initializer running.

join(timeout=...) only stops waiting; it does not stop LLM(). After this returns, the caller immediately falls back to PyTorch, so two backend initializations can touch CUDA in the same process on the exact hang path. If this timeout needs to be recoverable, the vLLM init has to live in a killable subprocess; otherwise treat the timeout as terminal and skip the in-process fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/llm_inference.py` around lines 739 - 750, vLLM init in a daemon
thread (init_thread running _init_vllm) can continue touching CUDA after
join(timeout= vllm_timeout_s) returns, risking concurrent GPU access; replace
the in-thread strategy with a killable subprocess (e.g., use
multiprocessing.Process or spawn a separate process to run _init_vllm) and join
with timeout, and if the process is still alive call terminate() and ensure it
is cleaned up, set self.llm_initialized = False and return the timeout error
without falling back to the in-process PyTorch path (i.e., avoid calling LLM()
or any other backend init in-process on that code path).

Comment on lines +62 to +64
### Multi-Platform Support

Supports CUDA, ROCm, Intel XPU, MPS, MLX, and CPU. **Do not alter non-target platform paths** unless the task requires it. Use `gpu_config.py` for hardware detection. VRAM tiers auto-select LM model size (6GB→0.6B, 8GB→0.6B/1.7B, 16GB+→4B).

⚠️ Potential issue | 🟡 Minor

Update the VRAM-tier example to match acestep/gpu_config.py.

The current note says 16GB+ -> 4B, but the live config still recommends 1.7B through the 16-24GB tiers and only recommends 4B in the unlimited tier (>=24GB). The 8GB example is also off: 8-12GB still defaults to 0.6B. Right now this doc teaches the wrong auto-selection behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` around lines 62 - 64, The VRAM-tier example in CLAUDE.md is
incorrect; update the mapping to match the logic in acestep/gpu_config.py by
changing the tiers to: 6GB → 0.6B; 8–12GB → 0.6B (not 1.7B); 12–24GB → 1.7B; and
>=24GB (unlimited) → 4B; reference gpu_config.py in the text and ensure the
sentence "VRAM tiers auto-select LM model size" lists these exact ranges and
model sizes so the documentation aligns with the live config.
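As a concrete illustration of the tiering this review describes, a small helper might look like the following. `recommend_lm_size` is a hypothetical name, and the thresholds are taken from the review comment rather than verified against `gpu_config.py`:

```python
def recommend_lm_size(vram_gb: float) -> str:
    """Map available VRAM (GB) to an LM size per the reviewer's corrected tiers."""
    if vram_gb >= 24:   # "unlimited" tier
        return "4B"
    if vram_gb >= 12:   # 12-24 GB tier
        return "1.7B"
    return "0.6B"       # the 6 GB and 8-12 GB tiers both stay on 0.6B
```

Keeping the documented tiers in one place like this (or citing `gpu_config.py` directly) avoids the doc drifting from the live config.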

&& pip install --no-cache-dir --no-deps /app/acestep/third_parts/nano-vllm

# ==================== Runtime directories ====================
RUN mkdir -p /app/checkpoints /app/gradio_outputs

⚠️ Potential issue | 🟠 Major

Run the final container as an unprivileged user.

This image exposes HTTP endpoints and writes bind-mounted checkpoints/outputs as root. That increases the blast radius of a compromise and leaves host files root-owned. Create a dedicated runtime user, chown /app, /app/checkpoints, and /app/gradio_outputs, then switch with USER before the entrypoint.

Suggested change
 RUN mkdir -p /app/checkpoints /app/gradio_outputs
+RUN groupadd --system acestep \
+    && useradd --system --gid acestep --create-home --home-dir /home/acestep acestep \
+    && chown -R acestep:acestep /app /home/acestep
@@
 RUN sed -i 's/\r$//' /app/docker-entrypoint.sh && chmod +x /app/docker-entrypoint.sh
+USER acestep
 
 ENTRYPOINT ["/app/docker-entrypoint.sh"]

Also applies to: 160-162

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` at line 91, The Dockerfile currently creates /app/checkpoints and
/app/gradio_outputs as root (RUN mkdir -p /app/checkpoints /app/gradio_outputs)
and the container continues running as root; change this by creating a dedicated
non-root runtime user (e.g., "appuser"), chown the application directories (/app
and the created subdirs) to that user after creation, and switch to that user
with USER before the ENTRYPOINT/CMD so HTTP endpoints and bind-mounted files are
not written as root; apply the same pattern for the other mkdir/chown
occurrences in the file (the block referenced at lines 160-162).

Comment on lines +94 to +100
ENV GRADIO_SERVER_NAME=0.0.0.0
ENV ACESTEP_MODE=gradio
ENV ACESTEP_INIT_SERVICE=true
ENV ACESTEP_CONFIG_PATH=acestep-v15-turbo
ENV ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-0.6B
ENV ACESTEP_LM_BACKEND=pt
ENV TOKENIZERS_PARALLELISM=false

⚠️ Potential issue | 🟠 Major

Don't pin the container's default LM backend to PyTorch.

ENV ACESTEP_LM_BACKEND=pt short-circuits the new runtime selection before Python starts, so non-Blackwell CUDA runs never keep vLLM as the default and docker compose inherits that regression too. Leave this unset in the image and only document it as a user override.

Suggested change
-ENV ACESTEP_LM_BACKEND=pt
 ENV TOKENIZERS_PARALLELISM=false
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 94 - 100, Remove the hard-coded default for the
ACESTEP LM backend in the Docker image by deleting or unsetting the ENV
ACESTEP_LM_BACKEND declaration in the Dockerfile so the image does not force
"pt" at container start; instead, document ACESTEP_LM_BACKEND as a runtime/user
override (e.g., via docker run -e or compose) so the runtime selection logic can
choose vLLM or other backends dynamically when Python starts.

Comment on lines +141 to +142
echo "INIT_ARGS: ${INIT_ARGS}"
echo "EXTRA_ARGS: ${ACESTEP_EXTRA_ARGS:-}"

⚠️ Potential issue | 🟠 Major

Don't log ACESTEP_EXTRA_ARGS verbatim.

This is the escape hatch for flags like --auth-password and --api-key, so echoing it writes secrets straight into container logs. Keeping INIT_ARGS for debugging is fine, but EXTRA_ARGS should be dropped or redacted first.

Suggested change
 echo "INIT_ARGS: ${INIT_ARGS}"
-echo "EXTRA_ARGS: ${ACESTEP_EXTRA_ARGS:-}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 141 - 142, The Dockerfile currently echoes
ACESTEP_EXTRA_ARGS verbatim, which may leak secrets; change the echo of
ACESTEP_EXTRA_ARGS to either remove it entirely or print a redacted version
instead (e.g., detect and mask values for flags like --auth-password and
--api-key before printing). Keep the existing INIT_ARGS echo but update the
ACESTEP_EXTRA_ARGS reference in the block containing the echo commands so that
sensitive flags are not logged (look for the lines that reference INIT_ARGS and
ACESTEP_EXTRA_ARGS and modify/remove the ACESTEP_EXTRA_ARGS echo).

@ChuxiJ left a comment

Code Review: fix: auto-detect Blackwell GPUs and avoid vLLM hangs

Overall Assessment: Request Changes

The Blackwell GPU detection feature is valuable for RTX 50-series users, but the PR has several issues: a confirmed crash bug, a risky thread-based timeout approach, scope creep (Docker + CLAUDE.md), and no unit tests for the core GPU detection logic.


🔴 BUG (Blocking): Missing import in primary code path

CodeRabbit correctly flagged this. The PR adds is_blackwell_gpu to the fallback import block (except ImportError, line 78) but NOT to the primary import block (try, lines 50-59).

When running normally via python -m acestep.acestep_v15_pipeline, the try block succeeds, but is_blackwell_gpu is not imported. Then main() calls is_blackwell_gpu() at line ~147 → NameError: name 'is_blackwell_gpu' is not defined → crash on startup for every user.

Fix: Add is_blackwell_gpu to the primary import block:

from .gpu_config import (
    get_gpu_config,
    get_gpu_memory_gb,
    print_gpu_config_info,
    set_global_gpu_config,
    VRAM_16GB_MIN_GB,
    VRAM_AUTO_OFFLOAD_THRESHOLD_GB,
    is_blackwell_gpu,
    is_mps_platform,
)

🟡 Concern: Daemon thread timeout for vLLM init

The PR wraps LLM() initialization in a daemon thread with 180s timeout:

init_thread = threading.Thread(target=_init_vllm, daemon=True)
init_thread.start()
init_thread.join(timeout=vllm_timeout_s)
if init_thread.is_alive():
    # "The daemon thread will be abandoned"

This is problematic:

  1. Abandoned CUDA state: The daemon thread continues running after timeout. It holds CUDA context, allocated GPU memory, and potentially corrupted allocator state. Subsequent PyTorch operations (even on a different backend) may fail or behave unpredictably.
  2. No cleanup: There is no torch.cuda.empty_cache() or CUDA context reset after timeout.
  3. Thread safety: init_result (a mutable list) is shared between threads without synchronization primitives.

Recommendation: Since the PR already adds is_blackwell_gpu() detection in initialize() (which switches to backend="pt" BEFORE reaching _initialize_5hz_lm_vllm), the thread timeout is a redundant safety net. Consider removing it entirely — the Blackwell check at line 622-628 already prevents vLLM from ever being attempted on these GPUs. If a timeout safety net is truly needed, use multiprocessing.Process which can be terminated cleanly.
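If a killable timeout guard is kept, the multiprocessing variant could look roughly like this. It is a sketch, not the PR's code: `_init_vllm_worker` is a hypothetical stand-in for the real initializer, and in practice the model would have to be constructed (or served) in the child process, so this pattern fits a probe/health-check better than loading the model for in-process use:

```python
import multiprocessing as mp

def _init_vllm_worker(model_path: str) -> None:
    # Hypothetical stand-in: the real worker would construct LLM(model_path)
    # and exit 0 on success. Here it is a no-op so the sketch is runnable.
    pass

def init_vllm_with_timeout(model_path: str, timeout_s: float = 180.0) -> bool:
    """Run vLLM init in a child process; kill it if it exceeds the timeout."""
    ctx = mp.get_context("spawn")  # fresh interpreter: no inherited CUDA context
    proc = ctx.Process(target=_init_vllm_worker, args=(model_path,))
    proc.start()
    proc.join(timeout=timeout_s)
    if proc.is_alive():
        proc.terminate()  # unlike a thread, a process can actually be stopped
        proc.join()
        return False      # caller should treat this as terminal, not fall back in-process
    return proc.exitcode == 0
```

Because the child is terminated before the caller proceeds, no background initializer can race a subsequent PyTorch load, which is the hazard both reviews flag with the daemon-thread approach.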


🟡 Scope: Docker + CLAUDE.md are separate concerns

The PR includes 3 files unrelated to Blackwell GPU detection:

  • CLAUDE.md (+84 lines) — AI coding assistant guidance
  • Dockerfile (+162 lines) — Docker containerization
  • docker-compose.yml (+21 lines) — Compose config

Per AGENTS.md: "Solve one problem per task/PR" and "Keep edits minimal: touch only files/functions required for the requested change."

Recommendation: Split Docker support into a separate PR. This keeps the Blackwell fix focused and reviewable. The Docker setup also has its own issues (runs as root, hardcodes ACESTEP_LM_BACKEND=pt, echoes ACESTEP_EXTRA_ARGS which could leak secrets) that deserve dedicated review.


🟡 Missing: No unit tests

The core feature (is_blackwell_gpu()) has no unit tests, even though it is a simple function that is easy to test with mocks, for example:

import unittest
from unittest.mock import patch

from acestep.gpu_config import is_blackwell_gpu

class TestBlackwellDetection(unittest.TestCase):
    def test_blackwell_detected_when_capability_12(self):
        with patch("torch.cuda.is_available", return_value=True), \
             patch("torch.cuda.get_device_capability", return_value=(12, 0)):
            self.assertTrue(is_blackwell_gpu())

    def test_non_blackwell_when_capability_8(self):
        with patch("torch.cuda.is_available", return_value=True), \
             patch("torch.cuda.get_device_capability", return_value=(8, 9)):
            self.assertFalse(is_blackwell_gpu())

✅ Good: GPU detection logic

is_blackwell_gpu() itself is clean and correct:

  • Uses torch.cuda.get_device_capability() which is the standard way to detect GPU architecture
  • major >= 12 correctly identifies Blackwell (compute capability 12.x)
  • Wrapped in try/except Exception for graceful degradation
  • Properly gates on torch.cuda.is_available() first

Summary of Requested Changes

  1. [Blocking] Add is_blackwell_gpu to the primary import block in acestep_v15_pipeline.py
  2. [Important] Remove or replace the daemon thread timeout with multiprocessing.Process, or remove it entirely since is_blackwell_gpu() already prevents vLLM on Blackwell GPUs
  3. [Important] Split Docker/CLAUDE.md into a separate PR
  4. [Should have] Add unit tests for is_blackwell_gpu()

@coderabbitai bot left a comment

♻️ Duplicate comments (1)
acestep/llm_inference.py (1)

740-773: ⚠️ Potential issue | 🔴 Critical

Timeout branch is not actually recoverable and can race with PyTorch fallback.

Line 759 only times out the wait; it does not stop LLM() in the daemon thread. That leaves vLLM init running while the caller may start PyTorch fallback in the same process, which is unsafe on CUDA.

Suggested direction
@@ def _initialize_5hz_lm_vllm(...):
-            if init_thread.is_alive():
+            if init_thread.is_alive():
                 logger.error(
                     f"vLLM initialization timed out after {vllm_timeout_s}s "
                     "(may indicate GPU compatibility issue). "
                     "The daemon thread will be abandoned."
                 )
                 self.llm_initialized = False
-                return f"❌ vLLM initialization timed out after {vllm_timeout_s}s — try setting LM Backend to 'pt' (PyTorch)"
+                self._vllm_timeout_unrecoverable = True
+                return (
+                    f"❌ vLLM initialization timed out after {vllm_timeout_s}s. "
+                    "In-process fallback is disabled for safety; restart with backend='pt'."
+                )
@@ def initialize(...):
-                    if status_msg.startswith("❌"):
+                    if status_msg.startswith("❌"):
+                        if getattr(self, "_vllm_timeout_unrecoverable", False):
+                            return status_msg, False
                         if not self.llm_initialized:
                             ...

If you need automatic fallback, move vLLM init to a killable subprocess and terminate on timeout before attempting PyTorch in-process.

Run this to verify the risky control flow is present:

#!/bin/bash
set -euo pipefail

# 1) Confirm timeout only waits and then abandons thread
rg -n -C3 'init_thread = threading.Thread|init_thread\.join\(timeout=|if init_thread\.is_alive\(\)|daemon thread will be abandoned' acestep/llm_inference.py

# 2) Confirm failure path can immediately fall back to PyTorch
rg -n -C4 'status_msg\.startswith\("❌"\)|Falling back to PyTorch backend|_load_pytorch_model\(' acestep/llm_inference.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/llm_inference.py` around lines 740 - 773, The timeout branch
currently only stops waiting for the daemon thread running _init_vllm (which
calls LLM(...)) but does not terminate that thread, leaving vLLM initialization
running and racing with any in-process PyTorch fallback (_load_pytorch_model or
code that sets self.llm), which is unsafe on CUDA; fix by moving the vLLM
initialization out of a lingering daemon thread into a separate killable
subprocess (spawn a subprocess to run the LLM(...) init, monitor it with a
timeout, terminate the subprocess if it exceeds vllm_timeout_s, and only then
proceed to set self.llm or fall back), or alternatively serialize GPU backends
so that when init_thread is timed out you block further PyTorch initialization
until the vLLM subprocess is fully killed; update references to _init_vllm,
init_thread, init_result, and setting self.llm accordingly so no background vLLM
init can run concurrently with PyTorch fallback.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ef44a067-2a29-4e4c-9abd-c71714deb0b0

📥 Commits

Reviewing files that changed from the base of the PR and between f97734b and 1885f52.

📒 Files selected for processing (2)
  • acestep/gpu_config.py
  • acestep/llm_inference.py

@ChuxiJ left a comment

Code Review

Verdict: Needs Changes ❌

The PR addresses a real issue (Blackwell GPUs hanging during vLLM init), but has several problems that must be fixed before merge.

Blocking Bug: Missing import crashes ALL users

In acestep/acestep_v15_pipeline.py, is_blackwell_gpu is imported only in the except ImportError fallback block but NOT in the primary try block. When running normally, the try block succeeds, is_blackwell_gpu is never imported, and main() raises NameError — crashing startup for all users, not just Blackwell GPU owners.

Fix: Add from acestep.gpu_config import is_blackwell_gpu to the primary try import block.

Medium: Daemon thread timeout leaks CUDA state

The thread-based vLLM init timeout has issues:

  1. After join(timeout=180) returns with the thread still alive, the daemon thread continues running, holding CUDA context and GPU memory — Python threads cannot be killed
  2. No torch.cuda.empty_cache() or CUDA context cleanup after timeout
  3. This is largely redundant since the is_blackwell_gpu() guard already prevents vLLM from being attempted on Blackwell GPUs

Recommendation: Remove the thread-based timeout (the Blackwell guard handles it), or replace with multiprocessing.Process which can be properly terminated.

Scope: Docker and CLAUDE.md should be separate PRs

The PR mixes Blackwell fix + Docker support + CLAUDE.md updates. Docker-specific issues:

  • ENV ACESTEP_LM_BACKEND=pt hardcodes PyTorch for ALL Docker users, bypassing auto-detection that benefits non-Blackwell GPUs
  • Container runs as root — should use an unprivileged user
  • Entrypoint logs potentially sensitive values (--api-key, --auth-password)

Other

  • Mixed print() vs logger for Blackwell detection messages — use loguru consistently
  • No unit tests for is_blackwell_gpu() despite being trivially testable with mocks
  • CLAUDE.md VRAM tier description ("16GB+ -> 4B") doesn't match actual gpu_config.py logic

The core detection logic (is_blackwell_gpu using get_device_capability() >= 12) is solid. Please fix the import bug, simplify the timeout approach, and split Docker into its own PR.

@ChuxiJ (Contributor) left a comment

Code Review

Overall: The Blackwell GPU detection concept is good, but there is a critical import bug that must be fixed before merge. Docker additions should be split into a separate PR.

Critical

  1. Missing import in primary import branch (acestep_v15_pipeline.py): is_blackwell_gpu is only imported in the except ImportError fallback branch (line 78), but not in the primary import path (lines 51-58). When run normally (python -m acestep.acestep_v15_pipeline), is_blackwell_gpu will be undefined at line 147, crashing the application at startup. This is a blocking bug.

High

  1. Daemon thread resource leak (llm_inference.py): When vLLM init times out, the daemon thread continues running in the background, potentially holding GPU memory and CUDA contexts. Python threads cannot be forcefully terminated. Consider using multiprocessing.Process instead, or document that the process should be restarted after a timeout.

Medium

  1. Docker defaults ACESTEP_LM_BACKEND=pt for all GPUs: The Dockerfile sets ENV ACESTEP_LM_BACKEND=pt, meaning even non-Blackwell GPUs in Docker use PyTorch instead of vLLM by default. This seems overly conservative.

  2. Scope creep: The PR bundles Docker support (Dockerfile + docker-compose.yml) and a CLAUDE.md file with a bug fix. Per the project's conventions, these should be separate PRs.

Low

  1. print() vs logger in acestep_v15_pipeline.py: The Blackwell detection message uses print() while llm_inference.py correctly uses logger.warning().
  2. AGENTS.md referenced in CLAUDE.md but doesn't exist in the repo.
  3. Docker layer caching: COPY . /app/ before pip install invalidates the cache on any source change. Copy pyproject.toml/requirements first.
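A minimal sketch of the caching-friendly ordering (assuming the project installs from `pyproject.toml`/`requirements.txt` at the repo root; adjust file names to what the Dockerfile actually uses):

```dockerfile
# Dependency manifests change rarely: copy them alone first so the
# install layer stays cached across source-only edits.
COPY pyproject.toml requirements.txt /app/
RUN pip install --no-cache-dir -r /app/requirements.txt

# Source changes only invalidate layers from here down.
COPY . /app/
```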

Recommendation: Fix the critical import bug before merge. Split Docker support into a separate PR.

@ChuxiJ (Contributor) left a comment

Code Review

Good initiative addressing Blackwell GPU compatibility! Several comments:

Positive:

  • is_blackwell_gpu() with compute capability ≥12 detection is correct
  • The 180s timeout wrapper for vLLM init is a smart safety net
  • Docker + docker-compose support is a nice addition
  • Debug logging in the entrypoint is helpful for troubleshooting

Concerns:

  1. Abandoned daemon threads: When vLLM init times out, the daemon thread is "abandoned" but may still hold GPU resources (CUDA context, allocated VRAM). This could leave the system in a bad state. Consider adding a comment noting this limitation, or documenting that a process restart may be needed after a timeout.

  2. CLAUDE.md inclusion: Adding CLAUDE.md is helpful but should be a separate PR — it's unrelated to the Blackwell fix and adds review burden. Some of the content (architecture overview, conventions) should be validated by maintainers.

  3. Dockerfile:

    • Pinning torch==2.10.0+cu128 is very specific — this may not exist yet or may become outdated quickly. Consider using a version range or documenting the pin reason.
    • The pip install approach (vs uv) diverges from the project's standard tooling.
  4. Test plan: The test plan is manual-only. Could you add an automated test for is_blackwell_gpu() (mocking torch.cuda.get_device_capability)?

Suggestion:

Split this into two PRs:

  1. Blackwell detection + vLLM timeout (core fix)
  2. Docker support + CLAUDE.md (infrastructure)

@ChuxiJ
Copy link
Contributor

ChuxiJ commented Mar 20, 2026

Code review

Found 3 issues:

  1. Critical: is_blackwell_gpu not imported in primary code path — is_blackwell_gpu is imported only in the except ImportError fallback block (line ~78), not in the primary try block (lines 51-59). The normal execution path (python -m acestep.acestep_v15_pipeline, uv run acestep) enters the try block and will crash with NameError: name 'is_blackwell_gpu' is not defined when main() calls it. This affects all users, not just Blackwell GPU owners. Fix: add is_blackwell_gpu to the try block's import statement.

  2. Daemon thread timeout leaves orphaned CUDA context — When init_thread.join(timeout=180) returns with the thread still alive, it continues running LLM(...) holding CUDA resources. The fallback to _load_pytorch_model() then creates a second concurrent GPU-using path with no synchronization. Since is_blackwell_gpu() already prevents vLLM on Blackwell GPUs, the thread timeout is redundant for the stated use case. Recommend removing it entirely.

  3. Dockerfile hard-codes ENV ACESTEP_LM_BACKEND=pt — This bypasses auto-detection for all Docker users, even on GPUs where vLLM works fine (A100, RTX 3090, etc.). The Dockerfile also references the removed nano-vllm directory, which will fail at build time.

Additionally: PR has merge conflicts with main and needs a rebase.

🤖 Generated with Claude Code

