PCPP: Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference

Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation. However, their inference speed is limited by the slow sampling process, restricting their use cases. The sequential denoising steps required for generating a single sample can take tens or hundreds of iterations and have thus become a significant bottleneck. This limitation is especially salient for applications that are interactive in nature or require low latency. To address this challenge, we propose Partially Conditioned Patch Parallelism (PCPP) to accelerate the inference of high-resolution diffusion models. Exploiting the fact that the difference between the images in adjacent diffusion steps is nearly zero, Patch Parallelism (PP) uses multiple GPUs communicating asynchronously: each device computes one patch of the image conditioned on the entire image (all patches) from the previous diffusion step. PCPP develops PP further by conditioning each patch only on parts of its neighboring patches in each diffusion step, which reduces computation during inference and also decreases communication among devices. As a result, PCPP decreases the communication cost by around 70% compared to DistriFusion (the state-of-the-art implementation of PP) and achieves a 2.36~8.02x inference speed-up using 4~8 GPUs, versus the 2.32~6.71x achieved by DistriFusion, depending on the computing device configuration and generation resolution, at the cost of a possible decrease in image quality. PCPP demonstrates the potential to strike a favorable trade-off, enabling high-quality image generation with substantially reduced latency.

Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference
XiuYu Zhang, Zening Luo, Michelle E. Lu
UC Berkeley
2024

Overview

Building on DistriFusion, we introduce PCPP to parallelize the inference of diffusion models across multiple devices that communicate asynchronously via point-to-point communication. The key idea is to partition the image horizontally into non-overlapping patches and process each patch on a separate device, conditioned only on itself and on parts of its neighboring patches from the previous step. This approach is based on the hypothesis that generating an image patch does not always require a dependency on all other patches; in many cases, satisfactory results can be achieved by relying solely on neighboring patches. Please refer to the paper for details.

The major difference in communication is the replacement of DistriFusion's PatchParallelismCommManager class with our PCPPCommManager class defined in distrifuser/utils.py. We also made the necessary changes in distrifuser/models/base_model.py, distrifuser/models/distri_sdxl_unet_pp.py, distrifuser/modules/pp/attn.py, and distrifuser/pipelines.py.
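
As a minimal sketch of the key idea (assuming the horizontal split described above; all names here are illustrative rather than the repository's actual API, which lives in distrifuser/modules/pp/attn.py), the partial conditioning can be thought of as slicing a fraction of each neighboring patch from the previous step and concatenating it with the local patch:

import torch

def build_partial_context(patch, prev_up, prev_down, portion=0.3):
    """Hypothetical sketch: build the conditioning context for one patch.

    patch:     (B, C, H_patch, W) activations of this device's patch
    prev_up:   the patch directly above, taken from the previous diffusion
               step (None at the top image border); prev_down analogously
    portion:   fraction of each neighboring patch to condition on
    """
    k = int(portion * patch.shape[2])  # neighbor rows to keep per side
    parts = []
    if prev_up is not None and k > 0:
        parts.append(prev_up[:, :, -k:, :])   # bottom rows of the patch above
    parts.append(patch)
    if prev_down is not None and k > 0:
        parts.append(prev_down[:, :, :k, :])  # top rows of the patch below
    # With portion < 1, each device exchanges only a fraction of a patch
    # with each neighbor per step, rather than the whole image as in PP.
    return torch.cat(parts, dim=2)

With portion = 1 this degenerates to conditioning on full neighbors; smaller values trade image quality for less computation and communication.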

Prerequisites

  • Python 3
  • NVIDIA GPU + CUDA >= 12.0 and the corresponding cuDNN
  • PyTorch >= 2.2

Usage Example (adapted from DistriFusion)

In scripts/sdxl_example.py, we provide a minimal script for running SDXL with PCPP.

import torch

from distrifuser.pipelines import DistriSDXLPipeline
from distrifuser.utils import DistriConfig

# Configure the distributed run: output resolution and warmup steps.
distri_config = DistriConfig(height=1024, width=1024, warmup_steps=4, split_batch=True)
pipeline = DistriSDXLPipeline.from_pretrained(
    distri_config=distri_config,
    pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0",
    variant="fp16",
    use_safetensors=True,
)

# Only rank 0 shows the progress bar.
pipeline.set_progress_bar_config(disable=distri_config.rank != 0)

image = pipeline(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    generator=torch.Generator(device="cuda").manual_seed(233),
    num_inference_steps=20,
).images[0]

# Only rank 0 saves the resulting image.
if distri_config.rank == 0:
    image.save("doctor-PCPP-0.3.png")

The running command is

torchrun --nproc_per_node=$N_GPUS scripts/sdxl_example.py

where $N_GPUS is the number of GPUs you want to use.

Specifically, our distrifuser shares the same APIs as DistriFusion and can be used in a similar way. The default partial value for PCPP generation is 0.3; for now, it needs to be changed manually in distrifuser/modules/pp/attn.py.

    # ---------------------------------------------------------------------------- #
    #                               Edit partial here                              #
    # ---------------------------------------------------------------------------- #
    portion = 0.3
    # ---------------------------------------------------------------------------- #
    #                               Edit partial here                              #
    # ---------------------------------------------------------------------------- #
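
For intuition (a back-of-the-envelope reading of the parameter, not a measured figure): with portion = 0.3, each patch is denoised conditioned on itself plus roughly the nearest 30% of each adjacent patch from the previous step, so an interior device exchanges about 0.3 of a patch with each of its two neighbors per step instead of receiving all other patches as in full Patch Parallelism.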

Benchmark (adapted from DistriFusion)

Our benchmark results use PyTorch 2.2 and diffusers 0.24.0. First, you may need to install some additional dependencies:

pip install git+https://github.com/zhijian-liu/torchprofile datasets torchmetrics dominate clean-fid

COCO Quality

You can use scripts/generate_coco.py to generate images with COCO captions. The command is

torchrun --nproc_per_node=$N_GPUS scripts/generate_coco.py --no_split_batch

where $N_GPUS is the number of GPUs you want to use. By default, the generated results will be stored in results/coco. You can also customize the output location with --output_root. Some additional arguments that you may want to tune (a combined example follows the list):

  • --num_inference_steps: The number of inference steps. We use 50 by default.
  • --guidance_scale: The classifier-free guidance scale. We use 5 by default.
  • --scheduler: The diffusion sampler. We use the DDIM sampler by default.
  • --warmup_steps: The number of additional warmup steps (4 by default).
  • --sync_mode: The GroupNorm synchronization mode. By default, it uses our corrected asynchronous GroupNorm.
  • --parallelism: The parallelism paradigm. By default, it is patch parallelism implemented following PCPP. You can use tensor for tensor parallelism and naive_patch for naïve patch parallelism.
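
For example, the following command spells out the stated defaults explicitly on 8 GPUs (the flag values simply restate the defaults listed above; adjust them as needed):

torchrun --nproc_per_node=8 scripts/generate_coco.py --no_split_batch --num_inference_steps 50 --guidance_scale 5 --warmup_steps 4 --output_root results/coco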

After you generate all the images, you can use our script scripts/compute_metrics.py to calculate PSNR, LPIPS and FID. The usage is

python scripts/compute_metrics.py --input_root0 $IMAGE_ROOT0 --input_root1 $IMAGE_ROOT1

where $IMAGE_ROOT0 and $IMAGE_ROOT1 are paths to the image folders you are trying to compare. If $IMAGE_ROOT0 is the ground-truth folder, please add an --is_gt flag for resizing. We also provide a script scripts/dump_coco.py to dump the ground-truth images.
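
For example, to compare generated images against the dumped ground truth (the folder names below are illustrative; substitute your own paths):

python scripts/compute_metrics.py --input_root0 results/coco_gt --input_root1 results/coco --is_gt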

Citation

If you find our work useful, please consider citing it in your own research.

@misc{zhang2024pcpp,
      title={Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference}, 
      author={XiuYu Zhang and Zening Luo and Michelle E. Lu},
      year={2024},
      eprint={2412.02962},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.02962}, 
}

Acknowledgments

Our code is developed based on DistriFusion and thus adopts the same MIT license.
