This project provides tools for quantizing and testing large language models from the DeepSeek family, specifically focusing on the DeepSeek-R1-Distill-Qwen-1.5B model. The primary goal is to make these models more accessible by reducing their memory footprint through quantization to Int4 precision and ONNX format conversion.
Our quantization pipeline achieves impressive results:
| Metric | Value | Impact |
|---|---|---|
| Model Size Reduction | 78.1% | 7.75GB → 1.70GB |
| Average Speedup | 5.24x | 0.6 → 1.7 tokens/sec |
| Accuracy Retention | 80.2% | 62.14% → 49.84% |
| Memory Efficiency | ~75% less | Run on 8GB RAM devices |
Quantization reduces the precision of numerical values in neural networks:
FP32 (32-bit)        ➡️        INT8 (8-bit)        ➡️        INT4 (4-bit)
┌─────────┐                    ┌─────┐                       ┌────┐
│111010101│          →         │10101│           →           │1010│
└─────────┘                    └─────┘                       └────┘
More Precise                   Smaller                       Tiny!
(~7.75GB model)                (~3GB model)                  (~1.7GB model)
Quantization dramatically reduces model size and memory footprint while attempting to preserve model quality!
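As a concrete illustration, blockwise INT4 quantization keeps one floating-point scale per small block of weights and stores 4-bit integer codes in place of the FP32 values. The snippet below is a minimal NumPy sketch of that idea, not the code this repository uses:

```python
# Minimal, illustrative blockwise INT4 quantization (NOT this repository's implementation).
import numpy as np

def quantize_int4_blockwise(weights: np.ndarray, block_size: int = 32):
    """Symmetric INT4 quantization of a 1-D FP32 vector, one scale per block."""
    blocks = weights.reshape(-1, block_size)                    # assumes length divisible by block_size
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-8) / 7.0
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)  # 4-bit signed range is [-8, 7]
    return codes, scales

def dequantize_int4_blockwise(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (codes.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
codes, scales = quantize_int4_blockwise(w)
w_hat = dequantize_int4_blockwise(codes, scales)
print("max abs reconstruction error:", float(np.abs(w - w_hat).max()))
```

Real INT4 pipelines additionally pack two 4-bit codes per byte and usually keep sensitive tensors (for example, embeddings) at higher precision.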
Distillation is a technique in which a smaller model (the student) learns from a larger model (the teacher):
Teacher Model                       Student Model
(Large, Slow)                       (Small, Fast)
┌───────────┐                       ┌────────────┐
│           │                       │            │
│  7B-175B  │  →   Knowledge   →    │    1-3B    │
│ Parameters│      Transfer         │   Params   │
└───────────┘                       └────────────┘
DeepSeek-R1-Distill-Qwen-1.5B is already a distilled model that learned from larger models!
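For intuition, knowledge distillation is usually trained by matching the student's output distribution to a temperature-softened copy of the teacher's. The snippet below is a generic PyTorch sketch of that loss; it is not part of this repository and does not reproduce DeepSeek's actual distillation recipe:

```python
# Generic knowledge-distillation loss (illustrative sketch only).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 4 positions over a 10-token vocabulary.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
loss = distillation_loss(student, teacher)
loss.backward()
```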
# Clone the repository (if you haven't already)
git clone https://github.com/shyamsridhar123/Quantization
# Create and activate a virtual environment (recommended)
python -m venv .venv
.\.venv\Scripts\activate # On Windows
# Install dependencies
pip install -r requirements.txt

# NEW: Complete pipeline with fixes for ONNX generation
python download_model.py --model-id "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
python convert_to_onnx.py --model-path ./model --output-path ./onnx_model
python fix_onnx_model.py --input-path ./onnx_model --output-path ./onnx_fixed
python quantize_model.py --input-path ./onnx_fixed --output-path ./quantized_model --quant-type int4
# Or use the all-in-one script:
python reexport_with_position_ids.py --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --output-dir ./onnx_fixed --quantize

# Interactive testing with direct ONNX Runtime approach
python direct_interactive_test.py --model-path ./quantized_model/onnx_int4
# NEW: Enhanced comparison with accuracy metrics
python enhanced_compare_models.py --num-samples 5 --max-length 200 --temperature 0.7

# Run performance benchmarks
python benchmark_model.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx --num-threads 4

If you encounter issues:
# Run diagnostics
python diagnose_onnx_model.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx
# Try direct ONNX inference if Optimum integration fails
python enhanced_onnx_inference.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx

Based on our comprehensive evaluation across different prompt types:
| Task Type | Original Accuracy | Quantized Accuracy | Similarity | Speedup |
|---|---|---|---|---|
| Definition | 76.2% | 41.5% | 11.3% | 2.4x |
| Explanation | 63.1% | 50.4% | 11.9% | 14.2x |
| Technical | 55.6% | 60.6% | 8.1% | 6.5x |
| Introduction | 63.2% | 44.2% | 13.6% | 0.6x |
| Average | 62.14% | 49.84% | 12.53% | 5.24x |
Our evaluation measures multiple aspects of output quality (a minimal sketch of one metric follows the list):
- Keyword Coverage: Presence of expected domain-specific terms
- Concept Coverage: Inclusion of key concepts and relationships
- Relevance Score: Alignment with the prompt topic
- Sentence Coherence: Grammatical and structural quality
- Completeness: Presence of introduction, body, and conclusion
- Factual Accuracy: Absence of contradictions or errors
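As an example of how such a metric can be computed, keyword coverage can be read as the fraction of expected domain terms that appear in the generated text. The function below is a simplified, hypothetical illustration, not the repository's evaluation code:

```python
# Hypothetical keyword-coverage metric (simplified illustration only).
def keyword_coverage(generated_text: str, expected_keywords: list[str]) -> float:
    """Fraction of expected domain-specific terms found in the generated output."""
    text = generated_text.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return hits / len(expected_keywords) if expected_keywords else 0.0

# Example: an answer about quantization scored against four expected terms.
score = keyword_coverage(
    "Quantization lowers precision to shrink models and speed up inference.",
    ["precision", "int4", "model size", "inference"],
)
print(f"keyword coverage: {score:.0%}")  # 2 of 4 terms present -> 50%
```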
Quantizing models provides these amazing benefits:
┌─────────────────────────────────────┐
│ 🚀 78.1% Smaller File Size │
│ 💾 ~75% Less Memory Usage │
│ ⚡ 5.24x Faster Inference Speed │
│ 🖥️ Run on Consumer Hardware │
│ 🔋 Lower Energy Consumption │
│ 🏠 Enable Edge & Local Deployment │
└─────────────────────────────────────┘
- Install Required Dependencies:

  pip install -r requirements.txt

- Verify Environment Setup:

  python environment_check.py

- Download and Convert to ONNX (NEW recommended approach):

  # Download model
  python download_model.py --model-id "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
  # Convert to ONNX
  python convert_to_onnx.py --model-path ./model --output-path ./onnx_model
  # Fix ONNX model for generation
  python fix_onnx_model.py --input-path ./onnx_model --output-path ./onnx_fixed

- Quantize the Model to Int4:

  # Quantize to INT4
  python quantize_model.py --input-path ./onnx_fixed --output-path ./quantized_model --quant-type int4
  # Or use the all-in-one script:
  python reexport_with_position_ids.py --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --output-dir ./onnx_fixed --quantize

- Run Diagnostics:

  python diagnose_onnx_model.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx

- Interactive Testing (a minimal Optimum loading sketch follows this list):

  # Recommended approach (handles common issues):
  python direct_interactive_test.py --model-path ./quantized_model/onnx_int4
  # Alternative approach using Optimum:
  python interactive_test.py --model-path ./quantized_model/onnx_int4

- NEW: Comprehensive Comparison:

  # Compare original vs quantized with accuracy metrics
  python enhanced_compare_models.py \
    --original-path ./onnx_fixed \
    --quantized-path ./quantized_model/onnx_int4 \
    --num-samples 5 \
    --max-length 200

- Performance Benchmarking:

  python benchmark_model.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx --num-threads 4

- Compare with Original Model:

  python compare_models.py --quantized-path ./quantized_model/onnx_int4 --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

- Generate Detailed Reports:

  # Generate HTML report
  python generate_report.py --results-dir ./inference_results --model-path ./quantized_model/onnx_int4 --output-file ./model_report.html
  # View enhanced comparison results
  cat ./comparison_results_enhanced/enhanced_comparison_report.txt

- Convert to FP16 for Troubleshooting:

  python convert_to_fp16.py --input-model ./quantized_model/onnx_int4/model_quantized.onnx --output-model ./quantized_model/onnx_fp16/model_fp16.onnx
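For the Optimum-based alternative mentioned under Interactive Testing, loading the quantized folder might look like the sketch below. The directory layout, file name, and model ID are assumptions taken from the steps above; check them against your own output paths:

```python
# Hypothetical Optimum loading sketch; paths and file names are assumptions.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_dir = "./quantized_model/onnx_int4"
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
model = ORTModelForCausalLM.from_pretrained(model_dir, file_name="model_quantized.onnx")

inputs = tokenizer("Explain INT4 quantization in one sentence.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```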
Quantization in this repository follows this process:
Original Model (FP32)
⬇️
┌─────────────────────────────────────────────┐
│ 1. Export to ONNX │
│ - Convert model architecture to ONNX │
│ - Fix generation issues │
│ - Handle onnx::Gather_3 parameter │
└─────────────────────────────────────────────┘
⬇️
┌─────────────────────────────────────────────┐
│ 2. Calibration │
│ - Analyze weight distributions │
│ - Determine optimal scaling factors │
└─────────────────────────────────────────────┘
⬇️
┌─────────────────────────────────────────────┐
│ 3. INT4 Quantization │
│ - Apply scaling to weights │
│ - Convert FP32 values to INT4 values │
│ - Store quantization parameters │
│ - Result: 78.1% size reduction │
└─────────────────────────────────────────────┘
⬇️
┌─────────────────────────────────────────────┐
│ 4. Optimization │
│ - Apply ONNX Runtime optimizations │
│ - Fuse operations where possible │
│ - Result: 5.24x speedup │
└─────────────────────────────────────────────┘
⬇️
Quantized Model (INT4)
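For orientation, recent ONNX Runtime releases ship a weight-only 4-bit quantizer for MatMul nodes, and step 3 above looks roughly like the sketch below. This is an assumption about what quantize_model.py does rather than a copy of it, and constructor arguments vary between ONNX Runtime versions:

```python
# Rough sketch of weight-only INT4 quantization with ONNX Runtime's MatMul4BitsQuantizer.
# Treat this as an outline: defaults and arguments differ across onnxruntime versions.
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("./onnx_fixed/model.onnx")                 # assumed path from step 1
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()                                          # rewrites MatMul weights as 4-bit blocks
quantizer.model.save_model_to_file(
    "./quantized_model/onnx_int4/model_quantized.onnx",
    use_external_data_format=True,                           # keep large tensors outside the .onnx file
)
```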
- 🔧 Core Quantization Pipeline: Complete workflow from model download to INT4 quantization
- 🧪 Testing Framework: Comprehensive testing with accuracy metrics and performance benchmarks
- 📊 Evaluation Suite: Detailed comparison tools with visualizations
- 🧹 Maintenance Utilities: Scripts for project organization and cleanup
- `download_model.py` - Download model from HuggingFace
- `convert_to_onnx.py` - Convert PyTorch to ONNX format
- `fix_onnx_model.py` - Fix ONNX model for proper generation
- `quantize_model.py` - Quantize to INT4/INT8
- `enhanced_compare_models.py` - Comprehensive accuracy comparison

- `reexport_with_position_ids.py` - 🛠️ All-in-one export with position_ids fix
- `run_quantization.py` - 🔢 Dedicated quantization script
- `quantize_int4.ipynb` - 📓 Jupyter notebook for Int4 quantization

- `direct_interactive_test.py` - 💬 Interactive testing using direct ONNX Runtime (most reliable; see the sketch after this list)
- `test_quantized_model.py` - 🧪 Basic inference testing
- `diagnose_onnx_model.py` - 🔍 Detailed diagnostic checks for model issues
- `diagnose_generation.py` - 🔍 Debug generation issues
- `benchmark_model.py` - 📊 Performance benchmarking across different input sizes

- `enhanced_compare_models.py` - 📈 Full comparison with accuracy metrics
- `generate_report.py` - 📄 Generate HTML reports
- `compare_models.py` - 🔄 Compare quantized vs original

- `run_inference_tests.ps1` - 🧪 Comprehensive testing suite
- `run_master_test.ps1` - 🚀 Runs all tests in sequence

- `maintain.bat` - 🛠️ Central maintenance dashboard with menu options
- `cleanup_simple.ps1` - 🧹 Removes temporary files and Python cache
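For reference, the "direct ONNX Runtime" style used by direct_interactive_test.py amounts to running the session in a plain decoding loop. The sketch below is a heavily simplified, hypothetical version: the input and output names (input_ids, attention_mask, position_ids, logits) are assumptions about the fixed export, and no KV cache is used, so it is slow but easy to follow:

```python
# Hypothetical minimal greedy-decoding loop against the quantized ONNX graph.
# Assumes the exported model takes input_ids/attention_mask/position_ids and returns logits.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
session = ort.InferenceSession("./quantized_model/onnx_int4/model_quantized.onnx")

ids = tokenizer("Explain quantization:", return_tensors="np")["input_ids"].astype(np.int64)
for _ in range(32):                                            # generate up to 32 new tokens
    feeds = {
        "input_ids": ids,
        "attention_mask": np.ones_like(ids),
        "position_ids": np.arange(ids.shape[1], dtype=np.int64)[None, :],
    }
    logits = session.run(["logits"], feeds)[0]                 # shape: (1, seq_len, vocab_size)
    next_id = int(logits[0, -1].argmax())                      # greedy pick for the last position
    ids = np.concatenate([ids, np.array([[next_id]], dtype=np.int64)], axis=1)
    if next_id == tokenizer.eos_token_id:
        break
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```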
Symptom: KeyError: 'logits' or similar errors during model inference.
Solution: Use the fixed ONNX model:
python fix_onnx_model.py --input-path ./onnx_model --output-path ./onnx_fixed
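To confirm the fix took effect, you can list the output names the graph actually exposes. This quick check is not one of the repository's scripts, and the model path is an assumption:

```python
# Hypothetical sanity check: confirm the fixed graph exposes a 'logits' output.
import onnxruntime as ort

session = ort.InferenceSession("./onnx_fixed/model.onnx")  # assumed path
print([output.name for output in session.get_outputs()])   # 'logits' should appear here
```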
Symptom: Model fails to load or initialize.
Solution: Try using different loading parameters:
# For Qwen2-based models
import onnxruntime as ort  # required import

session_options = ort.SessionOptions()
session_options.intra_op_num_threads = 4  # match your physical core count
session = ort.InferenceSession(model_path, session_options)  # model_path points at the quantized .onnx file

Symptom: Slow inference or high memory usage.
Solution: Adjust thread count to match your CPU cores:
python benchmark_model.py --model-path ./model_path --num-threads 4  # Adjust to your CPU

Symptom: Low accuracy scores or irrelevant outputs.
Solution: Tune generation parameters:
python enhanced_compare_models.py --temperature 0.5 --max-length 150

- 🧵 Thread Count: Adjust `--num-threads` based on your CPU. Generally, setting it to the number of physical cores works best.
- 🔢 Token Limit: Use `--max-tokens` to control generation length.
- 📁 Model Path: All scripts accept `--model-path` to specify the model location.
- 🌡️ Temperature: Control randomness with `--temperature` (0.0-1.0).
This project demonstrates several important techniques:
┌─────────────────────────────────────────────────┐
│ 🚀 ONNX Conversion │
│ - Hardware-agnostic deployment │
│ - Runtime optimizations for any hardware │
│ - Fixed generation issues with Gather_3 │
├─────────────────────────────────────────────────┤
│ 🧮 INT4 Quantization │
│ - 78.1% size reduction achieved │
│ - 7.75GB → 1.70GB model size │
│ - 80.2% accuracy retention │
│ - 5.24x speedup in inference │
├─────────────────────────────────────────────────┤
│ 🔧 Architecture-Specific Fixes │
│ - Fixes for Qwen2 architecture │
│ - Custom onnx::Gather_3 handling │
│ - Position tracking for generation │
├─────────────────────────────────────────────────┤
│ 📊 Comprehensive Evaluation │
│ - Multi-metric accuracy assessment │
│ - Performance vs quality trade-off analysis │
│ - Detailed visualizations and reports │
│ - Real-world prompt testing │
└─────────────────────────────────────────────────┘
- ONNX Runtime Documentation 📚
- Optimum Documentation 🤗
- DeepSeek-R1 Model Card 🧠
- ONNX Runtime Quantization Guide 📖
This project is licensed under the MIT No Attribution License - see the LICENSE file for details.
Contributions are welcome! Areas of interest:
- Support for additional quantization methods (GPTQ, AWQ)
- Automatic mixed-precision quantization
- Model-specific optimization profiles
- Streaming inference implementation
- Better accuracy preservation techniques
See CONTRIBUTING.md for guidelines. 🌟
┌─────────────────────────────────────────────────────────────────┐
│ │
│ 🖥️ Run Advanced AI locally on your own hardware │
│ 🔒 Keep your data private - no cloud required │
│ 💸 No subscription costs or API fees │
│ 🌱 Lower environmental impact than cloud inference │
│ 🛠️ Full control over inference parameters │
│ 🚀 Deploy in resource-constrained environments │
│ │
│ With INT4 quantization, you can run this 1.5B parameter │
│ model on devices with just 8GB RAM while maintaining │
│ 80% of the original accuracy! │
│ │
└─────────────────────────────────────────────────────────────────┘
The enhanced comparison generates several visualizations:
- Accuracy Radar Chart: Compares keyword coverage, concept coverage, relevance, coherence, completeness, and factual accuracy
- Performance vs Accuracy Scatter: Shows the trade-off between speed and quality
- Accuracy Heatmap: Detailed metric breakdown by sample type
- Accuracy by Sample Bar Chart: Overall accuracy comparison across different prompt types
View these in ./comparison_results_enhanced/ after running the enhanced comparison.
This project is for educational purposes only. The models and code provided are intended for research and learning, and should not be used in production environments without further testing and validation.
Note: Results may vary based on hardware, ONNX Runtime version, and specific use cases. The accuracy metrics are based on our evaluation suite and may not reflect performance on all tasks.