A PyTorch-native and Flexible Inference Engine with
Hybrid Cache Acceleration and Parallelism for 🤗DiTs
| Baseline | SCM S S* | SCM F D* | SCM U D* | +TS | +compile | +FP8* |
|---|---|---|---|---|---|---|
| 24.85s | 15.4s | 11.4s | 8.2s | 8.2s | 🎉7.1s | 🎉4.5s |
🤗Why Cache-DiT❓ Cache-DiT is built on top of the Diffusers library and now supports nearly 🔥ALL DiTs from Diffusers (70+ models). Please refer to our online documentation at readthedocs.io for more details. The optimizations provided by Cache-DiT include (UAA: Ulysses Anything Attention):
- 🎉Hybrid Cache Acceleration (DBCache, DBPrune, TaylorSeer, SCM and more)
- 🎉Context Parallelism (w/ Extended Diffusers' CP APIs, UAA, Async Ulysses, FP8 comm)
- 🎉Tensor Parallelism (w/ PyTorch native DTensor and Tensor Parallelism APIs)
- 🎉Text Encoder Parallelism (w/ PyTorch native DTensor and Tensor Parallelism APIs)
- 🎉Auto Encoder (VAE) Parallelism (w/ Data or Tile Parallelism to avoid OOM)
- 🎉ControlNet Parallelism (w/ Context Parallelism for ControlNet module)
- 🎉Built-in HTTP serving deployment support with simple REST APIs
- 🎉Natively compatible with torch.compile, Offloading, Quantization, ... (see the sketch after this list)
- 🎉Integration into vLLM-Omni, SGLang Diffusion, SD.Next, ...
- 🎉Natively supports NVIDIA GPUs, Ascend NPUs (>= 1.2.0), ...
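For example, the one-line cache call composes with standard PyTorch and Diffusers optimizations. A minimal sketch as a quick preview (full quickstart below; the model id and settings here are illustrative, not a tuned recipe):

>>> import torch
>>> import cache_dit
>>> from diffusers import DiffusionPipeline
>>> # Any supported DiT pipeline works; Qwen-Image is used here as an example.
>>> pipe = DiffusionPipeline.from_pretrained(
...     "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
... ).to("cuda")
>>> cache_dit.enable_cache(pipe)  # hybrid cache acceleration, one line
>>> # Standard optimizations can still be layered on top of the cached pipeline:
>>> pipe.transformer = torch.compile(pipe.transformer)
>>> image = pipe("a cat wearing sunglasses").images[0]

Offloading (e.g. pipe.enable_model_cpu_offload() instead of .to("cuda")) and quantization can be layered in the same way; the "+compile" and "+FP8*" columns in the table above reflect this kind of stacking.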
You can install cache-dit from PyPI or from source:
pip3 install -U cache-dit  # or: pip3 install git+https://github.com/vipshop/cache-dit.git

Then try:
>>> import cache_dit
>>> from diffusers import DiffusionPipeline
>>> # The pipe can be any diffusion pipeline.
>>> pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image")
>>> # Cache Acceleration with One-line code.
>>> cache_dit.enable_cache(pipe)
>>> # Or, Hybrid Cache Acceleration + Parallelism.
>>> from cache_dit import DBCacheConfig, ParallelismConfig
>>> cache_dit.enable_cache(
... pipe, cache_config=DBCacheConfig(),
... parallelism_config=ParallelismConfig(ulysses_size=2)
... )
>>> from cache_dit import load_configs
>>> # Or, Load Acceleration config from a custom yaml file.
>>> cache_dit.enable_cache(pipe, **load_configs("config.yaml"))
>>> output = pipe(...)  # Just call the pipe as normal.

Please refer to our online documentation at readthedocs.io for more details.
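Note that when a parallelism_config is passed (e.g. ulysses_size=2 above), the script typically needs to be launched with one process per GPU, for example via torchrun, rather than with plain python. A minimal launch sketch, assuming the quickstart above is saved as run_pipe.py (the file name is just an example):

torchrun --nproc_per_node=2 run_pipe.py

Single-GPU runs that only use cache_dit.enable_cache(pipe) need no special launcher.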
- 📊Examples - The easiest way to enable hybrid cache acceleration and parallelism for DiTs with cache-dit is to start with our examples for popular models: FLUX, Z-Image, Qwen-Image, Wan, etc.
- 🌐HTTP Serving - Deploy cache-dit models via an HTTP API for text-to-image, image editing, multi-image editing, and text/image-to-video generation.
- 🎉User Guide - Covers more advanced features and configuration options in detail.
- ❓FAQ - Frequently asked questions including attention backend configuration, troubleshooting, and optimization tips.
- 🔥Ascend NPU x Cache-DiT
- 🎉Diffusers x Cache-DiT
- 🎉SGLang Diffusion x Cache-DiT
- 🎉vLLM-Omni x Cache-DiT
- 🎉Nunchaku x Cache-DiT
- 🎉SD.Next x Cache-DiT
- 🎉stable-diffusion.cpp x Cache-DiT
- 🎉jetson-containers x Cache-DiT
Special thanks to vipshop's Computer Vision AI Team for supporting the documentation, testing, and deployment of this project. We learned from the designs of, and reused code from, the following projects: Diffusers, SGLang, vLLM-Omni, ParaAttention, xDiT, TaylorSeer, and LeMiCa.
@misc{cache-dit@2025,
title={cache-dit: A PyTorch-native and Flexible Inference Engine with Hybrid Cache Acceleration and Parallelism for DiTs.},
url={https://github.com/vipshop/cache-dit.git},
note={Open-source software available at https://github.com/vipshop/cache-dit.git},
author={DefTruth, vipshop.com},
year={2025}
}




