
fix: auto-detect Blackwell GPUs and avoid vLLM hangs#825

Open
SirMal wants to merge 4 commits into ace-step:main from SirMal:fix/blackwell-vllm-hang

Conversation


@SirMal commented Mar 13, 2026

Summary

  • Blackwell GPUs (RTX 50-series, compute capability ≥12) hang during vLLM/nano-vllm initialization, preventing the Gradio UI from ever starting
  • Adds is_blackwell_gpu() detection and automatically defaults to PyTorch backend on these GPUs
  • Adds a 180s timeout wrapper around vLLM init — if it hangs, auto-falls back to PyTorch
  • Adds debug logging in Docker entrypoint (INIT_ARGS) and Python LM initialization (backend/device/model)
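The detection the first two bullets describe can be sketched roughly as follows. This is an illustration reconstructed from the PR description and review comments, not the exact body of `gpu_config.py`; the lazy `torch` import is an assumption so that CPU-only environments degrade gracefully:

```python
def is_blackwell_gpu() -> bool:
    """Best-effort check for an NVIDIA Blackwell GPU (compute capability >= 12)."""
    try:
        import torch  # lazy import: missing/broken torch just means "not Blackwell"
        if not torch.cuda.is_available():
            return False
        major, _minor = torch.cuda.get_device_capability(0)
        return major >= 12
    except Exception:
        # Fail safe: any detection error is treated as "not Blackwell",
        # matching the graceful-degradation behavior the review praises.
        return False
```

The caller can then default the backend to `pt` whenever this returns `True`, leaving `vllm` the default elsewhere.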

Test plan

  • Build Docker image and run on Blackwell GPU (RTX 5070) — verify PyTorch backend is auto-selected and Gradio UI loads
  • Run on non-Blackwell GPU — verify vLLM is still used by default
  • Set ACESTEP_LM_BACKEND=vllm on Blackwell GPU — verify safety net overrides to PyTorch with warning
  • Verify INIT_ARGS appears in Docker container logs for debugging

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Containerized deployment with Docker and Compose, GPU support, healthchecks, and configurable runtime mode (UI or API).
    • Automatic NVIDIA Blackwell (RTX 50-series) detection with backend selection and warnings.
    • Threaded LLM initialization with timeout to detect hangs and improve startup reliability.
  • Documentation

    • Added a comprehensive guide covering setup, architecture, configuration, and development conventions.

SirMal and others added 3 commits March 12, 2026 08:01
- Dockerfile for x86_64 CUDA (Docker Desktop on Windows/Linux)
- docker-compose.yml for easy launch with GPU passthrough
- CLAUDE.md with build/test commands, architecture overview, and conventions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…trypoint

The env var was ignored because the pipeline only reads it when --enable-api
is set. Without --backend, nano-vllm (default) OOMs on consumer GPUs like
RTX 5070 (11.5GB). Now defaults to pt (PyTorch) backend.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Blackwell GPUs (compute capability ≥12, RTX 50-series) have known
compatibility issues with vLLM/nano-vllm causing hangs and segfaults.

- Add is_blackwell_gpu() detection in gpu_config.py
- Default to PyTorch backend on Blackwell GPUs instead of vLLM
- Add safety net in llm_inference.py to override vLLM on Blackwell
- Add 180s timeout wrapper around vLLM init with auto PyTorch fallback
- Add debug logging in Docker entrypoint for INIT_ARGS visibility
- Log backend/device/model at LM initialization start

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

Added documentation and containerization for ACE-Step 1.5; introduced NVIDIA Blackwell (RTX 50-series) GPU detection and runtime backend selection; implemented threaded vLLM initialization with timeout and fallback to PyTorch to avoid hangs.

Changes

  • Docs & Containerization — CLAUDE.md, Dockerfile, docker-compose.yml: New repository documentation (CLAUDE.md) and container setup: multi-stage CUDA 12.8 Dockerfile, Python 3.11 environment, PyTorch/CUDA deps, entrypoint/startup logic, healthcheck; docker-compose service with GPU allocation and env defaults.
  • GPU detection — acestep/gpu_config.py: Added is_blackwell_gpu() to detect NVIDIA Blackwell (compute capability >= 12) with safe exception handling.
  • Pipeline backend selection — acestep/acestep_v15_pipeline.py: Backend default logic updated to prefer PyTorch when Blackwell is detected; retains mlx on macOS and vLLM otherwise; imports is_blackwell_gpu.
  • LLM initialization — acestep/llm_inference.py: vLLM initialization moved into a worker thread with timeout and hang detection; uses the Blackwell guard to preemptively select PyTorch; improved runtime logging and failure reporting during initialization.

Sequence Diagram

sequenceDiagram
    participant User as User/Entrypoint
    participant GPU as GPU Detection
    participant Selector as Backend Selector
    participant LLM as LLM Initializer
    User->>GPU: Query device (compute capability)
    GPU-->>User: Is Blackwell? (True/False)
    User->>Selector: Choose backend (mlx/vllm/pt)
    alt Blackwell detected
        Selector-->>User: Select PyTorch (warn vLLM hang)
    else Mac
        Selector-->>User: Select mlx
    else
        Selector-->>User: Select vLLM
    end
    User->>LLM: Initialize LLM (threaded for vLLM)
    alt Init completes within timeout
        LLM-->>User: Return initialized LLM
    else Timeout / hang
        LLM-->>User: Initialization failed -> suggest PyTorch fallback
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇
I sniffed the CUDA breeze at night,
Found Blackwell stars that dimmed vLLM's light,
I threaded threads to wake the stack,
Wrote docs and Docker to pack the pack,
Hop, build, run — the models take flight!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly and concisely summarizes the main purpose of the changeset: auto-detecting Blackwell GPUs and preventing vLLM initialization hangs by switching backends.
  • Docstring Coverage — ✅ Passed: docstring coverage is 85.71%, which meets the required threshold of 80.00%.


@coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
acestep/acestep_v15_pipeline.py (1)

71-79: ⚠️ Potential issue | 🔴 Critical

Import is_blackwell_gpu in the primary package-import branch too.

This only fixes the script fallback path. The normal launch path (python -m acestep.acestep_v15_pipeline, which the Docker entrypoint uses) goes through the first import block, so main() reaches Line 147 with is_blackwell_gpu undefined and the UI crashes before startup.

Suggested change
     from .gpu_config import (
         get_gpu_config,
         get_gpu_memory_gb,
         print_gpu_config_info,
         set_global_gpu_config,
         VRAM_16GB_MIN_GB,
         VRAM_AUTO_OFFLOAD_THRESHOLD_GB,
+        is_blackwell_gpu,
         is_mps_platform,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/acestep_v15_pipeline.py` around lines 71 - 79, The primary import
block in acestep_v15_pipeline.py is missing is_blackwell_gpu, causing main() to
reference an undefined symbol; update the first package-import branch to include
is_blackwell_gpu alongside get_gpu_config, get_gpu_memory_gb,
print_gpu_config_info, set_global_gpu_config, VRAM_16GB_MIN_GB, and
VRAM_AUTO_OFFLOAD_THRESHOLD_GB so that the symbol is defined when main() runs
(ensure the same import appears in the first import group as in the fallback
branch).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0feb43e7-cced-4945-997f-5ff71fa12659

📥 Commits

Reviewing files that changed from the base of the PR and between 0aabbce and f97734b.

📒 Files selected for processing (6)
  • CLAUDE.md
  • Dockerfile
  • acestep/acestep_v15_pipeline.py
  • acestep/gpu_config.py
  • acestep/llm_inference.py
  • docker-compose.yml

Comment on lines +739 to +750
init_thread = threading.Thread(target=_init_vllm, daemon=True)
init_thread.start()
init_thread.join(timeout=vllm_timeout_s)

if init_thread.is_alive():
    logger.error(
        f"vLLM initialization timed out after {vllm_timeout_s}s "
        "(may indicate GPU compatibility issue). "
        "The daemon thread will be abandoned."
    )
    self.llm_initialized = False
    return f"❌ vLLM initialization timed out after {vllm_timeout_s}s — try setting LM Backend to 'pt' (PyTorch)"

⚠️ Potential issue | 🟠 Major

A vLLM timeout still leaves the initializer running.

join(timeout=...) only stops waiting; it does not stop LLM(). After this returns, the caller immediately falls back to PyTorch, so two backend initializations can touch CUDA in the same process on the exact hang path. If this timeout needs to be recoverable, the vLLM init has to live in a killable subprocess; otherwise treat the timeout as terminal and skip the in-process fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/llm_inference.py` around lines 739 - 750, vLLM init in a daemon
thread (init_thread running _init_vllm) can continue touching CUDA after
join(timeout= vllm_timeout_s) returns, risking concurrent GPU access; replace
the in-thread strategy with a killable subprocess (e.g., use
multiprocessing.Process or spawn a separate process to run _init_vllm) and join
with timeout, and if the process is still alive call terminate() and ensure it
is cleaned up, set self.llm_initialized = False and return the timeout error
without falling back to the in-process PyTorch path (i.e., avoid calling LLM()
or any other backend init in-process on that code path).

Comment on lines +62 to +64
### Multi-Platform Support

Supports CUDA, ROCm, Intel XPU, MPS, MLX, and CPU. **Do not alter non-target platform paths** unless the task requires it. Use `gpu_config.py` for hardware detection. VRAM tiers auto-select LM model size (6GB→0.6B, 8GB→0.6B/1.7B, 16GB+→4B).

⚠️ Potential issue | 🟡 Minor

Update the VRAM-tier example to match acestep/gpu_config.py.

The current note says 16GB+ -> 4B, but the live config still recommends 1.7B through the 16-24GB tiers and only recommends 4B in the unlimited tier (>=24GB). The 8GB example is also off: 8-12GB still defaults to 0.6B. Right now this doc teaches the wrong auto-selection behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` around lines 62 - 64, The VRAM-tier example in CLAUDE.md is
incorrect; update the mapping to match the logic in acestep/gpu_config.py by
changing the tiers to: 6GB → 0.6B; 8–12GB → 0.6B (not 1.7B); 12–24GB → 1.7B; and
>=24GB (unlimited) → 4B; reference gpu_config.py in the text and ensure the
sentence "VRAM tiers auto-select LM model size" lists these exact ranges and
model sizes so the documentation aligns with the live config.
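As a concrete illustration of the tiering this review describes, a small helper might look like the following. `recommend_lm_size` is a hypothetical name, and the thresholds are taken from the review comment rather than verified against `gpu_config.py`:

```python
def recommend_lm_size(vram_gb: float) -> str:
    """Map available VRAM (GB) to an LM size per the reviewer's corrected tiers."""
    if vram_gb >= 24:   # "unlimited" tier
        return "4B"
    if vram_gb >= 12:   # 12-24 GB tier
        return "1.7B"
    return "0.6B"       # the 6 GB and 8-12 GB tiers both stay on 0.6B
```

Keeping the documented tiers in one place like this (or citing `gpu_config.py` directly) avoids the doc drifting from the live config.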

&& pip install --no-cache-dir --no-deps /app/acestep/third_parts/nano-vllm

# ==================== Runtime directories ====================
RUN mkdir -p /app/checkpoints /app/gradio_outputs

⚠️ Potential issue | 🟠 Major

Run the final container as an unprivileged user.

This image exposes HTTP endpoints and writes bind-mounted checkpoints/outputs as root. That increases the blast radius of a compromise and leaves host files root-owned. Create a dedicated runtime user, chown /app, /app/checkpoints, and /app/gradio_outputs, then switch with USER before the entrypoint.

Suggested change
 RUN mkdir -p /app/checkpoints /app/gradio_outputs
+RUN groupadd --system acestep \
+    && useradd --system --gid acestep --create-home --home-dir /home/acestep acestep \
+    && chown -R acestep:acestep /app /home/acestep
@@
 RUN sed -i 's/\r$//' /app/docker-entrypoint.sh && chmod +x /app/docker-entrypoint.sh
+USER acestep
 
 ENTRYPOINT ["/app/docker-entrypoint.sh"]

Also applies to: 160-162

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` at line 91, The Dockerfile currently creates /app/checkpoints and
/app/gradio_outputs as root (RUN mkdir -p /app/checkpoints /app/gradio_outputs)
and the container continues running as root; change this by creating a dedicated
non-root runtime user (e.g., "appuser"), chown the application directories (/app
and the created subdirs) to that user after creation, and switch to that user
with USER before the ENTRYPOINT/CMD so HTTP endpoints and bind-mounted files are
not written as root; apply the same pattern for the other mkdir/chown
occurrences in the file (the block referenced at lines 160-162).

Comment on lines +94 to +100
ENV GRADIO_SERVER_NAME=0.0.0.0
ENV ACESTEP_MODE=gradio
ENV ACESTEP_INIT_SERVICE=true
ENV ACESTEP_CONFIG_PATH=acestep-v15-turbo
ENV ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-0.6B
ENV ACESTEP_LM_BACKEND=pt
ENV TOKENIZERS_PARALLELISM=false

⚠️ Potential issue | 🟠 Major

Don't pin the container's default LM backend to PyTorch.

ENV ACESTEP_LM_BACKEND=pt short-circuits the new runtime selection before Python starts, so non-Blackwell CUDA runs never keep vLLM as the default and docker compose inherits that regression too. Leave this unset in the image and only document it as a user override.

Suggested change
-ENV ACESTEP_LM_BACKEND=pt
 ENV TOKENIZERS_PARALLELISM=false
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 94 - 100, Remove the hard-coded default for the
ACESTEP LM backend in the Docker image by deleting or unsetting the ENV
ACESTEP_LM_BACKEND declaration in the Dockerfile so the image does not force
"pt" at container start; instead, document ACESTEP_LM_BACKEND as a runtime/user
override (e.g., via docker run -e or compose) so the runtime selection logic can
choose vLLM or other backends dynamically when Python starts.

Comment on lines +141 to +142
echo "INIT_ARGS: ${INIT_ARGS}"
echo "EXTRA_ARGS: ${ACESTEP_EXTRA_ARGS:-}"

⚠️ Potential issue | 🟠 Major

Don't log ACESTEP_EXTRA_ARGS verbatim.

This is the escape hatch for flags like --auth-password and --api-key, so echoing it writes secrets straight into container logs. Keeping INIT_ARGS for debugging is fine, but EXTRA_ARGS should be dropped or redacted first.

Suggested change
 echo "INIT_ARGS: ${INIT_ARGS}"
-echo "EXTRA_ARGS: ${ACESTEP_EXTRA_ARGS:-}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 141 - 142, The Dockerfile currently echoes
ACESTEP_EXTRA_ARGS verbatim, which may leak secrets; change the echo of
ACESTEP_EXTRA_ARGS to either remove it entirely or print a redacted version
instead (e.g., detect and mask values for flags like --auth-password and
--api-key before printing). Keep the existing INIT_ARGS echo but update the
ACESTEP_EXTRA_ARGS reference in the block containing the echo commands so that
sensitive flags are not logged (look for the lines that reference INIT_ARGS and
ACESTEP_EXTRA_ARGS and modify/remove the ACESTEP_EXTRA_ARGS echo).

@ChuxiJ left a comment

Code Review: fix: auto-detect Blackwell GPUs and avoid vLLM hangs

Overall Assessment: Request Changes

The Blackwell GPU detection feature is valuable for RTX 50-series users, but the PR has several issues: a confirmed crash bug, a risky thread-based timeout approach, scope creep (Docker + CLAUDE.md), and no unit tests for the core GPU detection logic.


🔴 BUG (Blocking): Missing import in primary code path

CodeRabbit correctly flagged this. The PR adds is_blackwell_gpu to the fallback import block (except ImportError, line 78) but NOT to the primary import block (try, lines 50-59).

When running normally via python -m acestep.acestep_v15_pipeline, the try block succeeds, but is_blackwell_gpu is not imported. Then main() calls is_blackwell_gpu() at line ~147 → NameError: name 'is_blackwell_gpu' is not defined → crash on startup for every user.

Fix: Add is_blackwell_gpu to the primary import block:

from .gpu_config import (
    get_gpu_config,
    get_gpu_memory_gb,
    print_gpu_config_info,
    set_global_gpu_config,
    VRAM_16GB_MIN_GB,
    VRAM_AUTO_OFFLOAD_THRESHOLD_GB,
    is_blackwell_gpu,
    is_mps_platform,
)

🟡 Concern: Daemon thread timeout for vLLM init

The PR wraps LLM() initialization in a daemon thread with 180s timeout:

init_thread = threading.Thread(target=_init_vllm, daemon=True)
init_thread.start()
init_thread.join(timeout=vllm_timeout_s)
if init_thread.is_alive():
    # "The daemon thread will be abandoned"

This is problematic:

  1. Abandoned CUDA state: The daemon thread continues running after timeout. It holds CUDA context, allocated GPU memory, and potentially corrupted allocator state. Subsequent PyTorch operations (even on a different backend) may fail or behave unpredictably.
  2. No cleanup: There is no torch.cuda.empty_cache() or CUDA context reset after timeout.
  3. Thread safety: init_result (a mutable list) is shared between threads without synchronization primitives.

Recommendation: Since the PR already adds is_blackwell_gpu() detection in initialize() (which switches to backend="pt" BEFORE reaching _initialize_5hz_lm_vllm), the thread timeout is a redundant safety net. Consider removing it entirely — the Blackwell check at line 622-628 already prevents vLLM from ever being attempted on these GPUs. If a timeout safety net is truly needed, use multiprocessing.Process which can be terminated cleanly.
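If a killable timeout guard is kept, the multiprocessing variant could look roughly like this. It is a sketch, not the PR's code: `_init_vllm_worker` is a hypothetical stand-in for the real initializer, and in practice the model would have to be constructed (or served) in the child process, so this pattern fits a probe/health-check better than loading the model for in-process use:

```python
import multiprocessing as mp

def _init_vllm_worker(model_path: str) -> None:
    # Hypothetical stand-in: the real worker would construct LLM(model_path)
    # and exit 0 on success. Here it is a no-op so the sketch is runnable.
    pass

def init_vllm_with_timeout(model_path: str, timeout_s: float = 180.0) -> bool:
    """Run vLLM init in a child process; kill it if it exceeds the timeout."""
    ctx = mp.get_context("spawn")  # fresh interpreter: no inherited CUDA context
    proc = ctx.Process(target=_init_vllm_worker, args=(model_path,))
    proc.start()
    proc.join(timeout=timeout_s)
    if proc.is_alive():
        proc.terminate()  # unlike a thread, a process can actually be stopped
        proc.join()
        return False      # caller should treat this as terminal, not fall back in-process
    return proc.exitcode == 0
```

Because the child is terminated before the caller proceeds, no background initializer can race a subsequent PyTorch load, which is the hazard both reviews flag with the daemon-thread approach.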


🟡 Scope: Docker + CLAUDE.md are separate concerns

The PR includes 3 files unrelated to Blackwell GPU detection:

  • CLAUDE.md (+84 lines) — AI coding assistant guidance
  • Dockerfile (+162 lines) — Docker containerization
  • docker-compose.yml (+21 lines) — Compose config

Per AGENTS.md: "Solve one problem per task/PR" and "Keep edits minimal: touch only files/functions required for the requested change."

Recommendation: Split Docker support into a separate PR. This keeps the Blackwell fix focused and reviewable. The Docker setup also has its own issues (runs as root, hardcodes ACESTEP_LM_BACKEND=pt, echoes ACESTEP_EXTRA_ARGS which could leak secrets) that deserve dedicated review.


🟡 Missing: No unit tests

The core feature (is_blackwell_gpu()) has no unit tests, even though it is a simple function that is easy to test with mocks, for example:

import unittest
from unittest.mock import patch

from acestep.gpu_config import is_blackwell_gpu

class TestBlackwellDetection(unittest.TestCase):
    def test_blackwell_detected_when_capability_12(self):
        with patch("torch.cuda.is_available", return_value=True), \
             patch("torch.cuda.get_device_capability", return_value=(12, 0)):
            self.assertTrue(is_blackwell_gpu())

    def test_non_blackwell_when_capability_8(self):
        with patch("torch.cuda.is_available", return_value=True), \
             patch("torch.cuda.get_device_capability", return_value=(8, 9)):
            self.assertFalse(is_blackwell_gpu())

✅ Good: GPU detection logic

is_blackwell_gpu() itself is clean and correct:

  • Uses torch.cuda.get_device_capability() which is the standard way to detect GPU architecture
  • major >= 12 correctly identifies Blackwell (compute capability 12.x)
  • Wrapped in try/except Exception for graceful degradation
  • Properly gates on torch.cuda.is_available() first

Summary of Requested Changes

  1. [Blocking] Add is_blackwell_gpu to the primary import block in acestep_v15_pipeline.py
  2. [Important] Remove or replace the daemon thread timeout with multiprocessing.Process, or remove it entirely since is_blackwell_gpu() already prevents vLLM on Blackwell GPUs
  3. [Important] Split Docker/CLAUDE.md into a separate PR
  4. [Should have] Add unit tests for is_blackwell_gpu()

@coderabbitai bot left a comment

♻️ Duplicate comments (1)
acestep/llm_inference.py (1)

740-773: ⚠️ Potential issue | 🔴 Critical

Timeout branch is not actually recoverable and can race with PyTorch fallback.

Line 759 only times out the wait; it does not stop LLM() in the daemon thread. That leaves vLLM init running while the caller may start PyTorch fallback in the same process, which is unsafe on CUDA.

Suggested direction
@@ def _initialize_5hz_lm_vllm(...):
-            if init_thread.is_alive():
+            if init_thread.is_alive():
                 logger.error(
                     f"vLLM initialization timed out after {vllm_timeout_s}s "
                     "(may indicate GPU compatibility issue). "
                     "The daemon thread will be abandoned."
                 )
                 self.llm_initialized = False
-                return f"❌ vLLM initialization timed out after {vllm_timeout_s}s — try setting LM Backend to 'pt' (PyTorch)"
+                self._vllm_timeout_unrecoverable = True
+                return (
+                    f"❌ vLLM initialization timed out after {vllm_timeout_s}s. "
+                    "In-process fallback is disabled for safety; restart with backend='pt'."
+                )
@@ def initialize(...):
-                    if status_msg.startswith("❌"):
+                    if status_msg.startswith("❌"):
+                        if getattr(self, "_vllm_timeout_unrecoverable", False):
+                            return status_msg, False
                         if not self.llm_initialized:
                             ...

If you need automatic fallback, move vLLM init to a killable subprocess and terminate on timeout before attempting PyTorch in-process.

Run this to verify the risky control flow is present:

#!/bin/bash
set -euo pipefail

# 1) Confirm timeout only waits and then abandons thread
rg -n -C3 'init_thread = threading.Thread|init_thread\.join\(timeout=|if init_thread\.is_alive\(\)|daemon thread will be abandoned' acestep/llm_inference.py

# 2) Confirm failure path can immediately fall back to PyTorch
rg -n -C4 'status_msg\.startswith\("❌"\)|Falling back to PyTorch backend|_load_pytorch_model\(' acestep/llm_inference.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@acestep/llm_inference.py` around lines 740 - 773, The timeout branch
currently only stops waiting for the daemon thread running _init_vllm (which
calls LLM(...)) but does not terminate that thread, leaving vLLM initialization
running and racing with any in-process PyTorch fallback (_load_pytorch_model or
code that sets self.llm), which is unsafe on CUDA; fix by moving the vLLM
initialization out of a lingering daemon thread into a separate killable
subprocess (spawn a subprocess to run the LLM(...) init, monitor it with a
timeout, terminate the subprocess if it exceeds vllm_timeout_s, and only then
proceed to set self.llm or fall back), or alternatively serialize GPU backends
so that when init_thread is timed out you block further PyTorch initialization
until the vLLM subprocess is fully killed; update references to _init_vllm,
init_thread, init_result, and setting self.llm accordingly so no background vLLM
init can run concurrently with PyTorch fallback.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ef44a067-2a29-4e4c-9abd-c71714deb0b0

📥 Commits

Reviewing files that changed from the base of the PR and between f97734b and 1885f52.

📒 Files selected for processing (2)
  • acestep/gpu_config.py
  • acestep/llm_inference.py

@ChuxiJ left a comment

Code Review

Verdict: Needs Changes ❌

The PR addresses a real issue (Blackwell GPUs hanging during vLLM init), but has several problems that must be fixed before merge.

Blocking Bug: Missing import crashes ALL users

In acestep/acestep_v15_pipeline.py, is_blackwell_gpu is imported only in the except ImportError fallback block but NOT in the primary try block. When running normally, the try block succeeds, is_blackwell_gpu is never imported, and main() raises NameError — crashing startup for all users, not just Blackwell GPU owners.

Fix: Add from acestep.gpu_config import is_blackwell_gpu to the primary try import block.

Medium: Daemon thread timeout leaks CUDA state

The thread-based vLLM init timeout has issues:

  1. After join(timeout=180) returns with the thread still alive, the daemon thread continues running, holding CUDA context and GPU memory — Python threads cannot be killed
  2. No torch.cuda.empty_cache() or CUDA context cleanup after timeout
  3. This is largely redundant since the is_blackwell_gpu() guard already prevents vLLM from being attempted on Blackwell GPUs

Recommendation: Remove the thread-based timeout (the Blackwell guard handles it), or replace with multiprocessing.Process which can be properly terminated.

Scope: Docker and CLAUDE.md should be separate PRs

The PR mixes Blackwell fix + Docker support + CLAUDE.md updates. Docker-specific issues:

  • ENV ACESTEP_LM_BACKEND=pt hardcodes PyTorch for ALL Docker users, bypassing auto-detection that benefits non-Blackwell GPUs
  • Container runs as root — should use an unprivileged user
  • Entrypoint logs potentially sensitive values (--api-key, --auth-password)

Other

  • Mixed print() vs logger for Blackwell detection messages — use loguru consistently
  • No unit tests for is_blackwell_gpu() despite being trivially testable with mocks
  • CLAUDE.md VRAM tier description ("16GB+ -> 4B") doesn't match actual gpu_config.py logic

The core detection logic (is_blackwell_gpu using get_device_capability() >= 12) is solid. Please fix the import bug, simplify the timeout approach, and split Docker into its own PR.

@ChuxiJ (Contributor) left a comment

Code Review

Overall: The Blackwell GPU detection concept is good, but there is a critical import bug that must be fixed before merge. Docker additions should be split into a separate PR.

Critical

  1. Missing import in primary import branch (acestep_v15_pipeline.py): is_blackwell_gpu is only imported in the except ImportError fallback branch (line 78), but not in the primary import path (lines 51-58). When run normally (python -m acestep.acestep_v15_pipeline), is_blackwell_gpu will be undefined at line 147, crashing the application at startup. This is a blocking bug.

High

  1. Daemon thread resource leak (llm_inference.py): When vLLM init times out, the daemon thread continues running in the background, potentially holding GPU memory and CUDA contexts. Python threads cannot be forcefully terminated. Consider using multiprocessing.Process instead, or document that the process should be restarted after a timeout.

Medium

  1. Docker defaults ACESTEP_LM_BACKEND=pt for all GPUs: The Dockerfile sets ENV ACESTEP_LM_BACKEND=pt, meaning even non-Blackwell GPUs in Docker use PyTorch instead of vLLM by default. This seems overly conservative.

  2. Scope creep: The PR bundles Docker support (Dockerfile + docker-compose.yml) and a CLAUDE.md file with a bug fix. Per the project's conventions, these should be separate PRs.

Low

  1. print() vs logger in acestep_v15_pipeline.py: The Blackwell detection message uses print() while llm_inference.py correctly uses logger.warning().
  2. AGENTS.md referenced in CLAUDE.md but doesn't exist in the repo.
  3. Docker layer caching: COPY . /app/ before pip install invalidates the cache on any source change. Copy pyproject.toml/requirements first.
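A minimal sketch of the caching-friendly ordering (assuming the project installs from `pyproject.toml`/`requirements.txt` at the repo root; adjust file names to what the Dockerfile actually uses):

```dockerfile
# Dependency manifests change rarely: copy them alone first so the
# install layer stays cached across source-only edits.
COPY pyproject.toml requirements.txt /app/
RUN pip install --no-cache-dir -r /app/requirements.txt

# Source changes only invalidate layers from here down.
COPY . /app/
```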

Recommendation: Fix the critical import bug before merge. Split Docker support into a separate PR.

@ChuxiJ (Contributor) left a comment

Code Review

Good initiative addressing Blackwell GPU compatibility! Several comments:

Positive:

  • is_blackwell_gpu() with compute capability ≥12 detection is correct
  • The 180s timeout wrapper for vLLM init is a smart safety net
  • Docker + docker-compose support is a nice addition
  • Debug logging in the entrypoint is helpful for troubleshooting

Concerns:

  1. Abandoned daemon threads: When vLLM init times out, the daemon thread is "abandoned" but may still hold GPU resources (CUDA context, allocated VRAM). This could leave the system in a bad state. Consider adding a comment noting this limitation, or documenting that a process restart may be needed after a timeout.

  2. CLAUDE.md inclusion: Adding CLAUDE.md is helpful but should be a separate PR — it's unrelated to the Blackwell fix and adds review burden. Some of the content (architecture overview, conventions) should be validated by maintainers.

  3. Dockerfile:

    • Pinning torch==2.10.0+cu128 is very specific — this may not exist yet or may become outdated quickly. Consider using a version range or documenting the pin reason.
    • The pip install approach (vs uv) diverges from the project's standard tooling.
  4. Test plan: The test plan is manual-only. Could you add an automated test for is_blackwell_gpu() (mocking torch.cuda.get_device_capability)?

Suggestion:

Split this into two PRs:

  1. Blackwell detection + vLLM timeout (core fix)
  2. Docker support + CLAUDE.md (infrastructure)

@ChuxiJ
Copy link
Contributor

ChuxiJ commented Mar 20, 2026

Code review

Found 3 issues:

  1. Critical: is_blackwell_gpu not imported in primary code path — is_blackwell_gpu is imported only in the except ImportError fallback block (line ~78), not in the primary try block (lines 51-59). The normal execution path (python -m acestep.acestep_v15_pipeline, uv run acestep) enters the try block and will crash with NameError: name 'is_blackwell_gpu' is not defined when main() calls it. This affects all users, not just Blackwell GPU owners. Fix: add is_blackwell_gpu to the try block's import statement.

  2. Daemon thread timeout leaves orphaned CUDA context — When init_thread.join(timeout=180) returns with the thread still alive, it continues running LLM(...) holding CUDA resources. The fallback to _load_pytorch_model() then creates a second concurrent GPU-using path with no synchronization. Since is_blackwell_gpu() already prevents vLLM on Blackwell GPUs, the thread timeout is redundant for the stated use case. Recommend removing it entirely.

  3. Dockerfile hard-codes ENV ACESTEP_LM_BACKEND=pt — This bypasses auto-detection for all Docker users, even on GPUs where vLLM works fine (A100, RTX 3090, etc.). The Dockerfile also references the removed nano-vllm directory, which will fail at build time.

Additionally: PR has merge conflicts with main and needs a rebase.

🤖 Generated with Claude Code

