This guide provides detailed instructions for quantizing the Microsoft Phi-4-mini-instruct model using the Model Quantizer tool.
- Model: Microsoft Phi-4-mini-instruct
- Size: 4.2B parameters
- Original Format: 16-bit floating point (FP16)
- Original Memory Usage: ~8.4GB
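The ~8.4GB figure follows directly from the parameter count: FP16 stores two bytes per weight. A quick sketch of the arithmetic (weights only; activations and KV cache add to this at inference time):

```python
# Weight memory for Phi-4-mini in FP16: 2 bytes per parameter.
PARAMS = 4.2e9          # parameter count from this guide
BYTES_PER_FP16 = 2

weight_gb = PARAMS * BYTES_PER_FP16 / 1e9
print(f"FP16 weights: ~{weight_gb:.1f} GB")  # ~8.4 GB
```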
The Phi-4-mini model is a powerful yet relatively compact model that can run on consumer hardware. However, even at 4.2B parameters, it requires significant memory in its original form. Quantization offers several benefits:
- Reduced Memory Usage: Quantized versions use significantly less memory
- Faster Loading: Smaller models load faster
- Broader Accessibility: Can run on devices with limited memory
- Comparable Performance: Maintains most of the original model's capabilities
The Model Quantizer supports multiple quantization methods for Phi-4-mini:
GPTQ works well across all platforms, including macOS:
```bash
# 8-bit GPTQ (better quality, ~50% memory reduction)
model-quantizer microsoft/Phi-4-mini-instruct --bits 8 --method gptq --output-dir phi4-mini-gptq-8bit

# 4-bit GPTQ (better memory efficiency, ~75% memory reduction)
model-quantizer microsoft/Phi-4-mini-instruct --bits 4 --method gptq --output-dir phi4-mini-gptq-4bit
```

For Mac users, explicitly specifying the CPU device can help ensure compatibility:
```bash
# Explicitly use CPU device on Mac
model-quantizer microsoft/Phi-4-mini-instruct --bits 4 --method gptq --device cpu --output-dir phi4-mini-gptq-4bit
```

BitsAndBytes is optimized for CUDA devices:
```bash
# 8-bit BitsAndBytes
model-quantizer microsoft/Phi-4-mini-instruct --bits 8 --method bitsandbytes --output-dir phi4-mini-bnb-8bit

# 4-bit BitsAndBytes
model-quantizer microsoft/Phi-4-mini-instruct --bits 4 --method bitsandbytes --output-dir phi4-mini-bnb-4bit
```

AWQ can provide good results for 4-bit quantization:
```bash
# 4-bit AWQ
model-quantizer microsoft/Phi-4-mini-instruct --bits 4 --method awq --output-dir phi4-mini-awq-4bit
```

After quantizing, benchmark the model to evaluate its performance:
```bash
# Run the automated benchmark process
run-benchmark --original microsoft/Phi-4-mini-instruct --quantized ./phi4-mini-gptq-4bit --device cpu --max_tokens 50 --output_dir benchmark_results --update-model-card
```

This will generate a comprehensive report comparing the original and quantized models and update the model card with the benchmark results.
Test the quantized model interactively:
```bash
# Chat with the model
chat-with-model --model_path ./phi4-mini-gptq-4bit --device cpu

# Use a custom system prompt
chat-with-model --model_path ./phi4-mini-gptq-4bit --system_prompt "You are a helpful AI assistant specialized in science."
```

Share your quantized model with the community:
```bash
# Publish to Hugging Face Hub
model-quantizer microsoft/Phi-4-mini-instruct --bits 4 --method gptq --output-dir phi4-mini-gptq-4bit --publish --repo-id YOUR_USERNAME/phi4-mini-gptq-4bit
```

| Model Version | Memory Usage | Loading Time | Generation Speed | Quality |
|---|---|---|---|---|
| Original (FP16) | ~8.4GB | Baseline | Baseline | Baseline |
| GPTQ 8-bit | ~4.2GB | Similar | Slightly slower | Very close to original |
| GPTQ 4-bit | ~2.1GB | Faster | Similar or faster | Slight degradation |
| BnB 8-bit | ~4.2GB | Similar | Similar | Very close to original |
| BnB 4-bit | ~2.1GB | Faster | Similar | Moderate degradation |
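The memory column above follows from bit width alone. A small sketch reproducing the table's figures (the 4.2B parameter count is from this guide; real checkpoints carry some extra overhead for scales, zero-points, and embeddings, so treat these as approximations):

```python
# Estimated weight memory and reduction vs. FP16 for each bit width.
PARAMS = 4.2e9  # Phi-4-mini parameter count (from this guide)

def weight_gb(bits: int) -> float:
    """Approximate weight memory in GB at the given bits per parameter."""
    return PARAMS * bits / 8 / 1e9

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    reduction = (1 - bits / 16) * 100
    print(f"{label:>5}: ~{weight_gb(bits):.1f} GB ({reduction:.0f}% reduction)")
```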
- For macOS users: Use GPTQ 4-bit for best memory efficiency or GPTQ 8-bit for best quality
- For Windows/Linux with CUDA: Try both GPTQ and BitsAndBytes to see which performs better on your hardware
- For memory-constrained devices: Use 4-bit quantization (GPTQ recommended)
- For quality-sensitive applications: Use 8-bit quantization
If you encounter issues with GPTQ quantization, try explicitly specifying the device:
```bash
# Try CPU device
model-quantizer microsoft/Phi-4-mini-instruct --method gptq --bits 4 --device cpu

# Try CUDA device if available
model-quantizer microsoft/Phi-4-mini-instruct --method gptq --bits 4 --device cuda
```

If you encounter other issues on macOS, try:
```bash
export PYTORCH_ENABLE_MPS_FALLBACK=1
```

This allows PyTorch to fall back to the CPU for operations that are not supported on MPS.
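The same fallback can also be enabled from Python instead of the shell; a minimal sketch (assumption: set the variable at the top of the script, before any MPS work is done):

```python
import os

# Equivalent to `export PYTORCH_ENABLE_MPS_FALLBACK=1` in the shell:
# lets PyTorch route MPS-unsupported operations to the CPU.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

print(os.environ["PYTORCH_ENABLE_MPS_FALLBACK"])  # 1
```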
If you encounter CUDA out of memory errors:
- Try a lower bit width (4-bit instead of 8-bit)
- Reduce the batch size for calibration
- Use CPU for quantization instead of CUDA
GPTQ quantization can be time-consuming. For faster results:
- Use a smaller calibration dataset
- Increase the group size parameter
- Use a more powerful GPU if available
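To see why a larger group size speeds things up: GPTQ stores and solves for one set of quantization scales per group of weights, so fewer groups means less per-group work and less metadata. A rough sketch (parameter count from this guide; group sizes are typical values, and exact overhead depends on the implementation):

```python
# Approximate quantization-group count for common GPTQ group sizes.
PARAMS = 4.2e9  # Phi-4-mini parameter count

def num_groups(group_size: int) -> float:
    """Approximate number of quantization groups (one scale set per group)."""
    return PARAMS / group_size

for gs in (32, 64, 128):
    print(f"group_size={gs:>3}: ~{num_groups(gs) / 1e6:.0f}M groups")
```

Doubling the group size halves the number of groups, at some cost in quantization quality since each scale must cover more weights.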
The Phi-4-mini model is an excellent candidate for quantization, offering significant memory savings while maintaining most of its capabilities. The 4-bit GPTQ quantized version is particularly impressive, reducing memory usage by approximately 75% while still providing good performance.