A practical, step-by-step guide for creating SVDQuants (Singular Value Decomposition Quantizations) of diffusion models using DeepCompressor.
The Nunchaku team built an excellent quantization tool, but practical documentation was limited. This guide fills that gap with real-world instructions and pre-configured settings.
Check out my HuggingFace profile for ready-to-use SVDQuants I've already created.
- Before You Start: Important Considerations
- Step 1: Environment Setup
- Step 2: Configure Your Model
- Step 3: Create Baseline (Optional)
- Step 4: Prepare Calibration Dataset
- Step 5: Run Quantization
- Step 6: Convert for Deployment
- Configuration Reference
- Troubleshooting
- Flux.1 (Dev and Schnell) - covered in this guide
- SANA
- PixArt
- Coming soon: Qwen, WAN
GPU (VRAM):
- Minimum: 48GB (slow, lower quality results)
- Optimal: 96GB
- My pick: RTX Pro 6000 Blackwell (96GB, more affordable than H100)
CPU:
- Don't skimp here! Single-core performance matters a lot for this workload.
Storage:
- 200GB of storage (container size) or more
Time:
- Expect 18-20 hours per quantization for Flux.1 Dev models
- Budget accordingly if using cloud GPUs (time × hourly rate)
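- For example, a 20-hour run on a $2/hour instance works out to roughly $40 per quantization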
Cloud GPU Providers:
I've created an automated setup script that handles everything: PyTorch, Poetry, DeepCompressor, dependencies, and configurations.
cd /workspace # or wherever you want to work
wget https://raw.githubusercontent.com/spooknik/deepcompressor-guide/refs/heads/main/install_deepcompressor.sh
chmod +x install_deepcompressor.sh
./install_deepcompressor.sh
What this script does:
- Installs system dependencies
- Installs Poetry (Python package manager)
- Installs PyTorch 2.8.0 with CUDA 12.8
- Clones DeepCompressor
- Downloads config files from this repository
- Fixes a critical bug in DeepCompressor's dependencies (pyav → av)
- Sets exact package versions to avoid conflicts
- Configures environment variables for optimal performance
- Prompts for HuggingFace authentication
Time: ~10 minutes depending on internet speed
HuggingFace Login:
- When prompted, paste your HF token (get it from https://huggingface.co/settings/tokens)
- Or skip and login later:
huggingface-cli login
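If you prefer to authenticate non-interactively (for example inside a setup script), recent versions of huggingface_hub let you pass the token directly:

```bash
# Non-interactive login; replace the placeholder with a token from https://huggingface.co/settings/tokens
huggingface-cli login --token hf_your_token_here
```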
cd /workspace/deepcompressor
poetry run python test_installation.py
Before quantizing, you need to configure your specific model. The config files are in /workspace/deepcompressor/configs/.
Open and modify configs/models/your-model.yaml:
pipeline:
  name: CenKreChro # Your model's name
  path: Tiwaz/CenKreChro # HuggingFace repo path
  dtype: torch.bfloat16
eval:
  num_steps: 25 # Number of inference steps
  guidance_scale: 1 # CFG scale
  num_samples: 128 # Number of images to generate for evaluation
Important: Update the quant.calib.path to match your model name:
quant:
  calib:
    batch_size: 32
    path: datasets/torch.bfloat16/YOUR-MODEL-NAME/fmeuler25-g1/qdiff/s128
Available configs in configs/svdquant/:
- int4.yaml - 4-bit integer quantization (most common)
- nvfp4.yaml - NVIDIA FP4 format
- gptq.yaml - GPTQ method
- fast.yaml - Faster processing (5 grids vs 20, fewer samples)
The fast.yaml speeds up quantization but may reduce quality slightly. Combine it with other configs:
# Standard quality (slow)
configs/svdquant/int4.yaml
# Faster processing (recommended)
configs/svdquant/int4.yaml configs/svdquant/fast.yaml
Recommended but not required. This step samples the full-precision model to create reference metrics for comparison.
Time: 2-3 hours
cd /workspace/deepcompressor
poetry run python -m deepcompressor.app.diffusion.ptq \
configs/models/your-model.yaml \
--output-dirname reference
What this does:
- Generates images using the original BF16/FP16 model
- Creates baseline metrics (FID, CLIP scores, etc.)
- Saves to the baselines/ directory
- Allows objective quality comparison after quantization
Sample count options:
- 4 - Speedrun, don't care about evaluation results
- 128 - Fast, less accurate comparison
- 256 - Balanced (my recommendation)
- 1024 - More accurate, takes longer
- 5000 - Default, very thorough but very slow
Adjust by modifying num_samples in your model config.
DeepCompressor needs sample images to calibrate the quantization. This step generates those images using your model.
Time: 30-60 minutes
cd /workspace/deepcompressor
poetry run python -m deepcompressor.app.diffusion.dataset.collect.calib \
configs/models/your-model.yaml \
configs/collect/qdiff.yaml
What this does:
- Loads prompts from configs/prompts/qdiff.yaml
- Generates 128 images (configurable in configs/collect/qdiff.yaml)
- Saves the calibration dataset to the path specified in your model config
- These images are used to measure activation ranges during quantization
Customize sample count:
Edit configs/collect/qdiff.yaml:
collect:
  num_samples: 128 # Increase for better calibration, decrease to save time
This is the main event. Quantization takes the longest - budget 18-20 hours for Flux.1 models.
cd /workspace/deepcompressor
poetry run python -m deepcompressor.app.diffusion.ptq \
configs/models/your-model.yaml \
configs/svdquant/int4.yaml \
configs/svdquant/fast.yaml \
--eval-benchmarks MJHQ \
--eval-num-samples 256 \
--save-model output/quantized
Command breakdown:
- configs/models/your-model.yaml - Your model configuration
- configs/svdquant/int4.yaml - Quantization method (4-bit integer)
- configs/svdquant/fast.yaml - Speed optimization (optional)
- --eval-benchmarks MJHQ - Evaluate against the MJHQ benchmark
- --eval-num-samples 256 - Number of images for evaluation
- --save-model output/quantized - Where to save the quantized model
Monitor progress:
Open another terminal and watch GPU usage:
watch -n 1 nvidia-smi
or
nvtop
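For long unattended runs (common on cloud instances), a simple alternative is to log GPU stats to a file you can review afterwards; this assumes nvidia-smi is on the PATH:

```bash
# Append utilization, memory, and temperature readings every 60 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,temperature.gpu \
  --format=csv -l 60 >> gpu_usage.csv &
```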
What happens during quantization:
- Loads your full model into memory
- Applies smooth quantization to reduce errors
- Performs low-rank decomposition on weights
- Quantizes weights and activations to 4-bit
- Evaluates quality against calibration and benchmarks
- Saves quantized checkpoint
After quantization completes, convert the checkpoint to Nunchaku's deployment format.
For Int4
cd /workspace/deepcompressor
poetry run python -m deepcompressor.backend.nunchaku.convert \
--quant-path output/quantized \
--output-root output/deployed \
--model-name your-model-name
For Fp4
cd /workspace/deepcompressor
poetry run python -m deepcompressor.backend.nunchaku.convert \
--quant-path output/quantized \
--output-root output/deployed \
--model-name your-model-name \
--float-point
Parameters:
- --quant-path - Path to the quantized checkpoint from Step 5
- --output-root - Where to save deployment files
- --model-name - Name for the deployed model
- --float-point - Must be used for FP4
The output of the previous command leaves us with two .safetensors files, which we need to merge to get something usable in ComfyUI.
Download transformer_blocks.safetensors and unquantized_layers.safetensors and place them in a folder in ComfyUI\models\unet\your-model-name
Create this workflow and run it (or drag this image into ComfyUI)
Now in ComfyUI\models\unet you have the finished merged SVDQuant!
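If you'd rather merge the two files without ComfyUI, a minimal sketch along these lines should work, assuming the two files contain disjoint tensor keys. The ComfyUI workflow above is the path I've actually verified; this sketch may miss metadata the Nunchaku loader expects:

```python
# merge_safetensors.py - minimal sketch, not an official Nunchaku/DeepCompressor tool
from safetensors.torch import load_file, save_file

# Example paths; point these at the two files produced in Step 6
blocks = load_file("transformer_blocks.safetensors")
extras = load_file("unquantized_layers.safetensors")

# Assumes the key sets do not overlap, so a plain union is safe
merged = {**blocks, **extras}
save_file(merged, "your-model-name_svdquant.safetensors")
print(f"Merged {len(blocks)} + {len(extras)} tensors into {len(merged)}")
```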
configs/
├── models/
│ └── your-model.yaml # Model-specific settings
├── svdquant/
│ ├── __default__.yaml # Base quantization settings
│ ├── int4.yaml # 4-bit integer quantization
│ ├── nvfp4.yaml # NVIDIA FP4 quantization
│ ├── gptq.yaml # GPTQ method
│ └── fast.yaml # Speed optimization
├── collect/
│ └── qdiff.yaml # Calibration collection settings
└── prompts/
├── qdiff.yaml # General prompts
└── lora/ # LoRA-specific prompts
Batch Sizes (in model config):
If you encounter out-of-memory errors, reduce these values:
quant:
  calib:
    batch_size: 32 # Lower if OOM
  wgts:
    calib_range:
      element_batch_size: 128 # Try 64 if OOM
      sample_batch_size: 32 # Try 16 if OOM
Skip Patterns:
These layers are sensitive to quantization and should be skipped:
- embed - Embedding layers
- resblock_* - ResNet components
- transformer_proj_* - Transformer projections
- transformer_norm - Normalization layers
- down_sample / up_sample - Resolution changes
Don't modify these unless you know what you're doing!
- Reduce batch sizes in your model config (see above)
- Use fast.yaml to reduce memory pressure
- Lower num_samples in calibration and evaluation
- Close other GPU applications
The install script should fix this automatically, but if you see this error:
cd /workspace/deepcompressor
sed -i 's/pyav = ">= 13.0.0"/av = ">= 13.0.0"/' pyproject.toml
poetry run pip install "av>=13.0.0"
poetry install
Make sure you're authenticated with HuggingFace:
huggingface-cli login
Some models require accepting terms on their HuggingFace page first.
- This is normal! Flux.1 takes 18-20 hours for quality quantization
- Ensure you're using a powerful CPU and that it isn't in powersave mode (check with htop; see the quick check below)
- Make sure the GPU isn't throttling (check temps with nvidia-smi)
- Consider using fast.yaml for quicker (but potentially lower quality) results
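A quick way to verify the powersave point on most Linux hosts (assuming the cpufreq interface is exposed; some cloud containers hide it):

```bash
# Should print "performance"; "powersave" will noticeably slow quantization
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```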
- Increase calibration samples: Edit configs/collect/qdiff.yaml and increase num_samples
- Remove fast.yaml: Use the full grid search (20 grids instead of 5)
- Check your hardware: 48GB VRAM produces lower quality than 80GB+
- Try a different quantization method: Test nvfp4.yaml instead of int4.yaml
Sadly, no. Chroma was developed from Flux Schnell, but the model architecture is different and DeepCompressor / Nunchaku don't support it. We need to wait for someone to add support.
I made an attempt, with help from others, to merge Chroma into Flux, but the results were subpar and not worth pursuing.
A very good alternative is CenKreChro, a Krea + Chroma merge.
- DeepCompressor Issues: https://github.com/nunchaku-tech/deepcompressor/issues
- This Guide Issues: https://github.com/spooknik/deepcompressor-guide/issues
MIT License - See LICENSE file for details.