@yondonfu commented Jan 8, 2026

Summary

  • Upgrade torch 2.8.0 → 2.9.1, torchvision 0.23.0 → 0.24.1
  • Upgrade torchao 0.13.0 → 0.15.0, triton 3.4 → 3.5.1
  • Bump Python 3.10 → 3.12 (flash-attn 2.8.3 wheels for torch 2.9 only support cp312)
  • Add nvidia-cudnn-cu12 override (9.15+) to fix Conv3D bf16 performance regression
  • Add automatic cuDNN patching for Windows via .pth file

Why

PyTorch 2.9.1 has a Conv3D bf16 performance regression when it runs with cuDNN < 9.15. On Windows, PyTorch bundles cuDNN in torch/lib and loads it by full path, so a newer cuDNN installed from pip is ignored. The .pth file works around this by automatically copying the newer cuDNN DLLs at Python startup.

The flash-attn 2.8.3 prebuilt wheels for torch 2.9 are only available for Python 3.12 (cp312), requiring the Python version bump.
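
For reviewers unfamiliar with the .pth mechanism: Python's `site` module executes any line in a `.pth` file that starts with `import`, which is what lets a patch run at interpreter startup. The sketch below only demonstrates that mechanism with a throwaway file; the file name `demo_patch.pth` and the environment variable `PTH_DEMO_RAN` are made up for illustration and are not the contents of this PR's patches.pth.

```python
import os
import site
import sys
import tempfile
from pathlib import Path
from typing import Optional


def run_pth_demo() -> Optional[str]:
    """Create a throwaway .pth file and let the site module process it."""
    saved_path = list(sys.path)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            # A line that starts with "import" in a .pth file is executed.
            pth = Path(tmp) / "demo_patch.pth"  # hypothetical file name
            pth.write_text(
                "import os; os.environ['PTH_DEMO_RAN'] = '1'\n",
                encoding="utf-8",
            )
            # addsitedir handles .pth files the same way site-packages
            # is handled when the interpreter starts.
            site.addsitedir(tmp)
    finally:
        # addsitedir also appends the directory to sys.path; undo that.
        sys.path[:] = saved_path
    return os.environ.get("PTH_DEMO_RAN")


if __name__ == "__main__":
    print(run_pth_demo())
```

Because the line runs on every interpreter start, a real patch hooked in this way should be cheap and should tolerate the target packages being absent.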

Changes

  • .python-version: 3.10.12 → 3.12.8
  • pyproject.toml:
    • requires-python>=3.12
    • Bump deps + add cuDNN override + force-include for .pth
    • Update wheel URLs for cp312 (flash-attn, sageattention)
    • ruff target-version → py312
  • .github/workflows/lint.yml: Python 3.10 → 3.12
  • src/scope/core/patches/cudnn.py: Use importlib.util.find_spec() to find package paths WITHOUT importing torch (prevents DLL locking)
  • patches.pth: Installed to site-packages, runs at Python startup
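
A minimal sketch of the `find_spec()` approach described above, not the actual contents of src/scope/core/patches/cudnn.py. The helper names `package_dir` and `copy_dlls`, the `cudnn*.dll` glob, the size-based "already copied" check, and the `nvidia/cudnn/bin` source directory in the usage comment are all assumptions for illustration. Note that `find_spec()` on a dotted name imports the parent package (here the `nvidia` namespace), but not torch.

```python
import importlib.util
import shutil
from pathlib import Path
from typing import List, Optional


def package_dir(name: str) -> Optional[Path]:
    """Return a package's install directory without importing the package."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


def copy_dlls(src_dir: Path, dst_dir: Path, pattern: str = "cudnn*.dll") -> List[str]:
    """Copy matching files from src_dir to dst_dir; return the copied names."""
    copied = []
    for src in sorted(src_dir.glob(pattern)):
        dst = dst_dir / src.name
        # Assumed heuristic: skip files that already match in size.
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            continue
        shutil.copy2(src, dst)
        copied.append(src.name)
    return copied


# Hypothetical usage on Windows (directory layout is an assumption):
#   torch_dir = package_dir("torch")
#   cudnn_dir = package_dir("nvidia.cudnn")
#   if torch_dir and cudnn_dir:
#       copy_dlls(cudnn_dir / "bin", torch_dir / "lib")
```

Resolving paths this way matters because a DLL that torch has already loaded is locked on Windows and cannot be overwritten, so the copy has to happen before torch is imported.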

Test plan

  • Fresh uv sync installs all deps correctly
  • cuDNN version reports 91701 (9.17.1) automatically on fresh install
  • Pipeline FPS restored to ~10 (from ~7)
  • pytest passes
  • ruff check passes
  • Linux CI passes
  • Windows CI passes
  • Linux benchmarking
  • Windows benchmarking
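
To make the 91701 check in the test plan easier to read, here is a small sketch that decodes a cuDNN version integer, assuming the cuDNN 9.x encoding of major * 10000 + minor * 100 + patch (consistent with 91701 meaning 9.17.1 above). The function names are hypothetical; the integer itself would presumably come from `torch.backends.cudnn.version()`.

```python
from typing import Tuple


def decode_cudnn_version(version: int) -> Tuple[int, int, int]:
    """Split a cuDNN 9.x version integer into (major, minor, patch)."""
    major, rest = divmod(version, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def meets_minimum(version: int, minimum: Tuple[int, int, int] = (9, 15, 0)) -> bool:
    """Check whether the decoded version is at least the 9.15 floor."""
    return decode_cudnn_version(version) >= minimum


if __name__ == "__main__":
    print(decode_cudnn_version(91701))  # (9, 17, 1)
    print(meets_minimum(91701))         # True
```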

🤖 Generated with Claude Code

Signed-off-by: Yondon Fu <yondon.fu@gmail.com>
@yondonfu marked this pull request as draft January 8, 2026 20:35
@yondonfu commented:

Linux benchmarking

The LongLive test script shows throughput and latency comparable to main.

Windows benchmarking

LongLive, StreamDiffusionV2, MemFlow, and Krea show throughput and latency comparable to main.

@yondonfu marked this pull request as ready for review January 14, 2026 22:25
@yondonfu merged commit b35751a into main Jan 14, 2026
5 checks passed
@yondonfu deleted the scope-torch-bump branch January 14, 2026 22:34