This guide provides detailed instructions for quantizing the Microsoft Phi-4-mini-instruct model using the Model Quantizer tool.
- Model: Microsoft Phi-4-mini-instruct
- Size: 4.2B parameters
- Original Format: 16-bit floating point (FP16)
- Original Memory Usage: ~8.4GB
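The ~8.4GB figure follows directly from the parameter count: FP16 stores two bytes per weight. A quick sketch of the arithmetic (weights only; activations and KV cache add to this at inference time):

```python
# Weight memory for Phi-4-mini in FP16: 2 bytes per parameter.
PARAMS = 4.2e9          # parameter count from this guide
BYTES_PER_FP16 = 2

weight_gb = PARAMS * BYTES_PER_FP16 / 1e9
print(f"FP16 weights: ~{weight_gb:.1f} GB")  # ~8.4 GB
```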
The Phi-4-mini model is a powerful yet relatively compact model that can run on consumer hardware. However, even at 4.2B parameters, it requires significant memory in its original form. Quantization offers several benefits:
- Reduced Memory Usage: Quantized versions use significantly less memory
- Faster Loading: Smaller models load faster
- Broader Accessibility: Can run on devices with limited memory
- Comparable Performance: Maintains most of the original model's capabilities
The Model Quantizer supports multiple quantization methods for Phi-4-mini:
GPTQ works well across all platforms, including macOS:
```bash
# 8-bit GPTQ (better quality, ~50% memory reduction)
model-quantizer microsoft/Phi-4-mini-instruct --bits 8 --method gptq --output-dir phi4-mini-gptq-8bit

# 4-bit GPTQ (better memory efficiency, ~75% memory reduction)
model-quantizer microsoft/Phi-4-mini-instruct --bits 4 --method gptq --output-dir phi4-mini-gptq-4bit
```

For Mac users, explicitly specifying the CPU device can help ensure compatibility:
```bash
# Explicitly use CPU device on Mac
model-quantizer microsoft/Phi-4-mini-instruct --bits 4 --method gptq --device cpu --output-dir phi4-mini-gptq-4bit
```

BitsAndBytes is optimized for CUDA devices:
```bash
# 8-bit BitsAndBytes
model-quantizer microsoft/Phi-4-mini-instruct --bits 8 --method bitsandbytes --output-dir phi4-mini-bnb-8bit

# 4-bit BitsAndBytes
model-quantizer microsoft/Phi-4-mini-instruct --bits 4 --method bitsandbytes --output-dir phi4-mini-bnb-4bit
```

AWQ can provide good results for 4-bit quantization:
```bash
# 4-bit AWQ
model-quantizer microsoft/Phi-4-mini-instruct --bits 4 --method awq --output-dir phi4-mini-awq-4bit
```

After quantizing, benchmark the model to evaluate its performance:
```bash
# Run the automated benchmark process
run-benchmark --original microsoft/Phi-4-mini-instruct --quantized ./phi4-mini-gptq-4bit --device cpu --max_tokens 50 --output_dir benchmark_results --update-model-card
```

This will generate a comprehensive report comparing the original and quantized models and update the model card with the benchmark results.
Test the quantized model interactively:
```bash
# Chat with the model
chat-with-model --model_path ./phi4-mini-gptq-4bit --device cpu

# Use a custom system prompt
chat-with-model --model_path ./phi4-mini-gptq-4bit --system_prompt "You are a helpful AI assistant specialized in science."
```

Share your quantized model with the community:
```bash
# Publish to Hugging Face Hub
model-quantizer microsoft/Phi-4-mini-instruct --bits 4 --method gptq --output-dir phi4-mini-gptq-4bit --publish --repo-id YOUR_USERNAME/phi4-mini-gptq-4bit
```

| Model Version | Memory Usage | Loading Time | Generation Speed | Quality |
|---|---|---|---|---|
| Original (FP16) | ~8.4GB | Baseline | Baseline | Baseline |
| GPTQ 8-bit | ~4.2GB | Similar | Slightly slower | Very close to original |
| GPTQ 4-bit | ~2.1GB | Faster | Similar or faster | Slight degradation |
| BnB 8-bit | ~4.2GB | Similar | Similar | Very close to original |
| BnB 4-bit | ~2.1GB | Faster | Similar | Moderate degradation |
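The memory column above follows from bit width alone. A small sketch reproducing the table's figures (the 4.2B parameter count is from this guide; real checkpoints carry some extra overhead for scales, zero-points, and embeddings, so treat these as approximations):

```python
# Estimated weight memory and reduction vs. FP16 for each bit width.
PARAMS = 4.2e9  # Phi-4-mini parameter count (from this guide)

def weight_gb(bits: int) -> float:
    """Approximate weight memory in GB at the given bits per parameter."""
    return PARAMS * bits / 8 / 1e9

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    reduction = (1 - bits / 16) * 100
    print(f"{label:>5}: ~{weight_gb(bits):.1f} GB ({reduction:.0f}% reduction)")
```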
- For macOS users: Use GPTQ 4-bit for best memory efficiency or GPTQ 8-bit for best quality
- For Windows/Linux with CUDA: Try both GPTQ and BitsAndBytes to see which performs better on your hardware
- For memory-constrained devices: Use 4-bit quantization (GPTQ recommended)
- For quality-sensitive applications: Use 8-bit quantization
If you encounter issues with GPTQ quantization, try explicitly specifying the device:
```bash
# Try CPU device
model-quantizer microsoft/Phi-4-mini-instruct --method gptq --bits 4 --device cpu

# Try CUDA device if available
model-quantizer microsoft/Phi-4-mini-instruct --method gptq --bits 4 --device cuda
```

If you encounter other issues on macOS, try:
```bash
export PYTORCH_ENABLE_MPS_FALLBACK=1
```

This allows PyTorch to fall back to the CPU for operations that are not supported on MPS.
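The same fallback can also be enabled from Python instead of the shell; a minimal sketch (assumption: set the variable at the top of the script, before any MPS work is done):

```python
import os

# Equivalent to `export PYTORCH_ENABLE_MPS_FALLBACK=1` in the shell:
# lets PyTorch route MPS-unsupported operations to the CPU.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

print(os.environ["PYTORCH_ENABLE_MPS_FALLBACK"])  # 1
```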
If you encounter CUDA out of memory errors:
- Try a lower bit width (4-bit instead of 8-bit)
- Reduce the batch size for calibration
- Use CPU for quantization instead of CUDA
GPTQ quantization can be time-consuming. For faster results:
- Use a smaller calibration dataset
- Increase the group size parameter
- Use a more powerful GPU if available
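To see why a larger group size speeds things up: GPTQ stores and solves for one set of quantization scales per group of weights, so fewer groups means less per-group work and less metadata. A rough sketch (parameter count from this guide; group sizes are typical values, and exact overhead depends on the implementation):

```python
# Approximate quantization-group count for common GPTQ group sizes.
PARAMS = 4.2e9  # Phi-4-mini parameter count

def num_groups(group_size: int) -> float:
    """Approximate number of quantization groups (one scale set per group)."""
    return PARAMS / group_size

for gs in (32, 64, 128):
    print(f"group_size={gs:>3}: ~{num_groups(gs) / 1e6:.0f}M groups")
```

Doubling the group size halves the number of groups, at some cost in quantization quality since each scale must cover more weights.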
The Phi-4-mini model is an excellent candidate for quantization, offering significant memory savings while maintaining most of its capabilities. The 4-bit GPTQ quantized version is particularly impressive, reducing memory usage by approximately 75% while still providing good performance.