Model quantization techniques for efficient LLM inference. Experiments with INT8, INT4, and mixed-precision quantization.

Optimizing DeepSeek: Int4 Quantization & Local Inference Engine 🚀

This project provides tools for quantizing and testing large language models from the DeepSeek family, specifically focusing on the DeepSeek-R1-Distill-Qwen-1.5B model. The primary goal is to make these models more accessible by reducing their memory footprint through quantization to Int4 precision and ONNX format conversion.

🎯 Quantization Results Summary

Our quantization pipeline achieves the following results:

Metric                  Value        Impact
Model Size Reduction    78.1%        7.75GB → 1.70GB
Average Speedup         5.24x        0.6 → 1.7 tokens/sec
Accuracy Retention      80.2%        62.14% → 49.84%
Memory Efficiency       ~75% less    Run on 8GB RAM devices

🔥 What is Model Quantization & Distillation? 🔥

📊 Quantization Explained

Quantization reduces the precision of numerical values in neural networks:

FP32 (32-bit) ➡️ INT8 (8-bit) ➡️ INT4 (4-bit)
   ┌──────────┐      ┌────────┐      ┌──────┐
   │ 32 bits  │  →   │ 8 bits │  →   │4 bits│
   └──────────┘      └────────┘      └──────┘
  More Precise        Smaller          Tiny!
  (~7.75GB model)    (~3GB model)    (~1.7GB model)

Quantization dramatically reduces model size and memory footprint while attempting to preserve model quality!
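
To make this concrete, below is a minimal, illustrative sketch of symmetric per-tensor weight quantization in NumPy. It is not the repository's quantize_model.py implementation (which operates on the ONNX graph); it only shows the core idea of mapping FP32 values onto a small integer range with a scale factor:

import numpy as np

def quantize_symmetric(weights, num_bits=4):
    """Map FP32 weights to signed integers in [-(2^(b-1)-1), 2^(b-1)-1]."""
    qmax = 2 ** (num_bits - 1) - 1              # 7 for INT4, 127 for INT8
    scale = np.abs(weights).max() / qmax        # per-tensor scale (per-channel in practice)
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_symmetric(w, num_bits=4)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())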

🧪 Distillation Explained

Distillation trains a smaller model (the student) to reproduce the behavior of a larger model (the teacher):

       Teacher Model              Student Model
       (Large, Slow)            (Small, Fast)
       ┌───────────┐              ┌──────┐
       │           │      →       │      │
       │  7B-175B  │   Knowledge  │ 1-3B │
       │ Parameters│   Transfer   │Params│
       └───────────┘              └──────┘

DeepSeek-R1-Distill-Qwen-1.5B is already a distilled model that learned from larger models!
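
For intuition, a typical distillation objective blends ordinary cross-entropy on the labels with a KL term that pulls the student's output distribution toward the teacher's. The PyTorch sketch below is illustrative only; this repository consumes the already-distilled model and performs no distillation itself:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with temperature-softened KL divergence."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)        # teacher's soft targets
    soft_student = F.log_softmax(student_logits / T, dim=-1)    # student's log-probabilities
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)                # standard hard-label loss
    return alpha * kd + (1.0 - alpha) * ce

# Toy shapes: batch of 8, vocabulary of 32000
student_logits = torch.randn(8, 32000)
teacher_logits = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))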

🚀 Quick Start Guide

1. 🛠️ Setup Environment

# Clone the repository (if you haven't already)
git clone https://github.com/shyamsridhar123/Quantization

# Create and activate a virtual environment (recommended)
python -m venv .venv
.\.venv\Scripts\activate  # On Windows

# Install dependencies
pip install -r requirements.txt

2. 🌟 Export and Quantize the Model

# NEW: Complete pipeline with fixes for ONNX generation
python download_model.py --model-id "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
python convert_to_onnx.py --model-path ./model --output-path ./onnx_model
python fix_onnx_model.py --input-path ./onnx_model --output-path ./onnx_fixed
python quantize_model.py --input-path ./onnx_fixed --output-path ./quantized_model --quant-type int4

# Or use the all-in-one script:
python reexport_with_position_ids.py --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --output-dir ./onnx_fixed --quantize

3. 🔍 Test the Quantized Model

# Interactive testing with direct ONNX Runtime approach
python direct_interactive_test.py --model-path ./quantized_model/onnx_int4

# NEW: Enhanced comparison with accuracy metrics
python enhanced_compare_models.py --num-samples 5 --max-length 200 --temperature 0.7

4. 📊 Benchmark Performance

# Run performance benchmarks
python benchmark_model.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx --num-threads 4

5. 🔧 Troubleshooting

If you encounter issues:

# Run diagnostics
python diagnose_onnx_model.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx

# Try direct ONNX inference if Optimum integration fails
python enhanced_onnx_inference.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx

📊 Detailed Performance Analysis

Inference Performance by Task Type

Based on our comprehensive evaluation across different prompt types:

Task Type       Original Accuracy   Quantized Accuracy   Similarity   Speedup
Definition      76.2%               41.5%                11.3%        2.4x
Explanation     63.1%               50.4%                11.9%        14.2x
Technical       55.6%               60.6%                8.1%         6.5x
Introduction    63.2%               44.2%                13.6%        0.6x
Average         62.14%              49.84%               12.53%       5.24x

Accuracy Metrics Breakdown

Our evaluation measures multiple aspects of output quality; a sketch of how one such metric can be computed follows the list:

  • Keyword Coverage: Presence of expected domain-specific terms
  • Concept Coverage: Inclusion of key concepts and relationships
  • Relevance Score: Alignment with the prompt topic
  • Sentence Coherence: Grammatical and structural quality
  • Completeness: Presence of introduction, body, and conclusion
  • Factual Accuracy: Absence of contradictions or errors
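
As a hedged illustration (the actual scoring logic lives in enhanced_compare_models.py and may differ), a keyword-coverage style metric can be as simple as:

def keyword_coverage(output, expected_keywords):
    """Fraction of expected domain terms that appear in the generated text."""
    text = output.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords) if expected_keywords else 0.0

# Example: scoring a generated definition of quantization
score = keyword_coverage(
    "Quantization maps FP32 weights to lower-precision integers such as INT4.",
    ["quantization", "precision", "integer", "weights", "calibration"],
)
print(f"keyword coverage: {score:.0%}")  # 80% for this toy example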

🌈 Why Quantize Models? 🌈

Quantizing models provides the following benefits:

┌─────────────────────────────────────┐
│ 🚀 78.1% Smaller File Size          │
│ 💾 ~75% Less Memory Usage           │
│ ⚡ 5.24x Faster Inference Speed     │
│ 🖥️ Run on Consumer Hardware         │
│ 🔋 Lower Energy Consumption         │
│ 🏠 Enable Edge & Local Deployment   │
└─────────────────────────────────────┘

📝 Complete Workflow

Step 1: 🧰 Prepare Your Environment

  1. Install Required Dependencies:

    pip install -r requirements.txt
  2. Verify Environment Setup:

    python environment_check.py

Step 2: 🔄 Export and Quantize the Model

  1. Download and Convert to ONNX (NEW recommended approach):

    # Download model
    python download_model.py --model-id "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
    
    # Convert to ONNX
    python convert_to_onnx.py --model-path ./model --output-path ./onnx_model
    
    # Fix ONNX model for generation
    python fix_onnx_model.py --input-path ./onnx_model --output-path ./onnx_fixed
  2. Quantize the Model to Int4:

    # Quantize to INT4
    python quantize_model.py --input-path ./onnx_fixed --output-path ./quantized_model --quant-type int4
    
    # Or use the all-in-one script:
    python reexport_with_position_ids.py --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --output-dir ./onnx_fixed --quantize

Step 3: 🧪 Test the Quantized Model

  1. Run Diagnostics:

    python diagnose_onnx_model.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx
  2. Interactive Testing:

    # Recommended approach (handles common issues):
    python direct_interactive_test.py --model-path ./quantized_model/onnx_int4
    
    # Alternative approach using Optimum:
    python interactive_test.py --model-path ./quantized_model/onnx_int4
  3. NEW: Comprehensive Comparison:

    # Compare original vs quantized with accuracy metrics
    python enhanced_compare_models.py \
        --original-path ./onnx_fixed \
        --quantized-path ./quantized_model/onnx_int4 \
        --num-samples 5 \
        --max-length 200

Step 4: 📈 Benchmark and Compare

  1. Performance Benchmarking:

    python benchmark_model.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx --num-threads 4
  2. Compare with Original Model:

    python compare_models.py --quantized-path ./quantized_model/onnx_int4 --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  3. Generate Detailed Reports:

    # Generate HTML report
    python generate_report.py --results-dir ./inference_results --model-path ./quantized_model/onnx_int4 --output-file ./model_report.html
    
    # View enhanced comparison results
    cat ./comparison_results_enhanced/enhanced_comparison_report.txt

Step 5: 🔄 Format Conversion (Optional)

  1. Convert to FP16 for Troubleshooting:
    python convert_to_fp16.py --input-model ./quantized_model/onnx_int4/model_quantized.onnx --output-model ./quantized_model/onnx_fp16/model_fp16.onnx
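
For reference, FP16 conversion typically loads the ONNX graph and casts FP32 tensors to FP16. Below is a minimal sketch using the onnx and onnxconverter-common packages (an assumption; the repository's convert_to_fp16.py may differ in detail):

import os
import onnx
from onnxconverter_common import float16

model = onnx.load("./quantized_model/onnx_int4/model_quantized.onnx")
# keep_io_types=True leaves the graph inputs/outputs in FP32 so callers need no changes
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
os.makedirs("./quantized_model/onnx_fp16", exist_ok=True)
onnx.save(model_fp16, "./quantized_model/onnx_fp16/model_fp16.onnx")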

🔍 Quantization Deep Dive

Quantization in this repository follows this process:

                 Original Model (FP32)
                         ⬇️
┌─────────────────────────────────────────────┐
│ 1. Export to ONNX                           │
│    - Convert model architecture to ONNX     │
│    - Fix generation issues                  │
│    - Handle onnx::Gather_3 parameter        │
└─────────────────────────────────────────────┘
                         ⬇️
┌─────────────────────────────────────────────┐
│ 2. Calibration                              │
│    - Analyze weight distributions           │
│    - Determine optimal scaling factors      │
└─────────────────────────────────────────────┘
                         ⬇️
┌─────────────────────────────────────────────┐
│ 3. INT4 Quantization                        │
│    - Apply scaling to weights               │
│    - Convert FP32 values to INT4 values     │
│    - Store quantization parameters          │
│    - Result: 78.1% size reduction           │
└─────────────────────────────────────────────┘
                         ⬇️
┌─────────────────────────────────────────────┐
│ 4. Optimization                             │
│    - Apply ONNX Runtime optimizations       │
│    - Fuse operations where possible         │
│    - Result: 5.24x speedup                  │
└─────────────────────────────────────────────┘
                         ⬇️
                 Quantized Model (INT4)
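
For reference, ONNX Runtime ships a quantization API that covers the weight-quantization step above. The snippet below shows dynamic INT8 quantization as an illustration (the file name under ./onnx_fixed is an assumption); the repository's quantize_model.py drives the INT4 path with its own settings:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Weights are quantized offline; activations are quantized on the fly at run time
quantize_dynamic(
    model_input="./onnx_fixed/model.onnx",            # path assumed for illustration
    model_output="./quantized_model/model_int8.onnx",
    weight_type=QuantType.QInt8,
)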

🏗️ Project Structure

🧩 Key Components

  • 🔧 Core Quantization Pipeline: Complete workflow from model download to INT4 quantization
  • 🧪 Testing Framework: Comprehensive testing with accuracy metrics and performance benchmarks
  • 📊 Evaluation Suite: Detailed comparison tools with visualizations
  • 🧹 Maintenance Utilities: Scripts for project organization and cleanup

📄 Important Scripts

🔄 NEW: Core Pipeline Scripts

  • download_model.py - Download model from HuggingFace
  • convert_to_onnx.py - Convert PyTorch to ONNX format
  • fix_onnx_model.py - Fix ONNX model for proper generation
  • quantize_model.py - Quantize to INT4/INT8
  • enhanced_compare_models.py - Comprehensive accuracy comparison

🔄 Legacy Quantization Scripts

  • reexport_with_position_ids.py - 🛠️ All-in-one export with position_ids fix
  • run_quantization.py - 🔢 Dedicated quantization script
  • quantize_int4.ipynb - 📓 Jupyter notebook for Int4 quantization

🧪 Testing and Inference

  • direct_interactive_test.py - 💬 Interactive testing using direct ONNX Runtime (most reliable)
  • test_quantized_model.py - 🧪 Basic inference testing
  • diagnose_onnx_model.py - 🔍 Detailed diagnostic checks for model issues
  • diagnose_generation.py - 🔍 Debug generation issues
  • benchmark_model.py - 📊 Performance benchmarking across different input sizes

📊 Analysis and Reporting

  • enhanced_compare_models.py - 📈 Full comparison with accuracy metrics
  • generate_report.py - 📄 Generate HTML reports
  • compare_models.py - 🔄 Compare quantized vs original

🔄 Execution Scripts

  • run_inference_tests.ps1 - 🧪 Comprehensive testing suite
  • run_master_test.ps1 - 🚀 Runs all tests in sequence

🧹 Maintenance

  • maintain.bat - 🛠️ Central maintenance dashboard with menu options
  • cleanup_simple.ps1 - 🧹 Removes temporary files and Python cache

❓ Common Issues and Solutions

1. 'Logits' Error During Generation ❌

Symptom: KeyError: 'logits' or similar errors during model inference.

Solution: Use the fixed ONNX model:

python fix_onnx_model.py --input-path ./onnx_model --output-path ./onnx_fixed

2. Model Loading Failures ❌

Symptom: Model fails to load or initialize.

Solution: Try using different loading parameters:

# For Qwen2-based models
import onnxruntime as ort

model_path = "./quantized_model/onnx_int4/model_quantized.onnx"
session_options = ort.SessionOptions()
session_options.intra_op_num_threads = 4  # match your physical CPU core count
session = ort.InferenceSession(model_path, sess_options=session_options)
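
Once the session loads, a single forward pass can be run directly through ONNX Runtime. The input and output names below (input_ids, attention_mask, position_ids, logits as the first output) are assumptions about the exported graph; verify them with session.get_inputs() and session.get_outputs(), and note that exports with past_key_values inputs need those supplied as well:

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
enc = tokenizer("Explain quantization in one sentence.", return_tensors="np")
position_ids = np.arange(enc["input_ids"].shape[1], dtype=np.int64)[None, :]

outputs = session.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
    "position_ids": position_ids,
})
next_token_id = int(outputs[0][0, -1].argmax())   # greedy pick from the final-position logits
print(tokenizer.decode([next_token_id]))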

3. Performance Issues ⏱️

Symptom: Slow inference or high memory usage.

Solution: Adjust thread count to match your CPU cores:

python benchmark_model.py --model-path ./model_path --num-threads 4  # Adjust to your CPU

4. Poor Generation Quality 📉

Symptom: Low accuracy scores or irrelevant outputs.

Solution: Tune generation parameters:

python enhanced_compare_models.py --temperature 0.5 --max-length 150

⚙️ Configuration Options

  • 🧵 Thread Count: Adjust --num-threads based on your CPU. Generally, setting it to the number of physical cores works best.
  • 🔢 Token Limit: Use --max-tokens to control generation length.
  • 📁 Model Path: All scripts accept --model-path to specify the model location.
  • 🌡️ Temperature: Control randomness with --temperature (0.0-1.0)

📚 The Technical Significance of This Project

This project demonstrates several important techniques:

┌─────────────────────────────────────────────────┐
│ 🚀 ONNX Conversion                              │
│   - Hardware-agnostic deployment                │
│   - Runtime optimizations for any hardware      │
│   - Fixed generation issues with Gather_3       │
├─────────────────────────────────────────────────┤
│ 🧮 INT4 Quantization                            │
│   - 78.1% size reduction achieved              │
│   - 7.75GB → 1.70GB model size                 │
│   - 80.2% accuracy retention                   │
│   - 5.24x speedup in inference                 │
├─────────────────────────────────────────────────┤
│ 🔧 Architecture-Specific Fixes                  │
│   - Fixes for Qwen2 architecture               │
│   - Custom onnx::Gather_3 handling             │
│   - Position tracking for generation            │
├─────────────────────────────────────────────────┤
│ 📊 Comprehensive Evaluation                     │
│   - Multi-metric accuracy assessment            │
│   - Performance vs quality trade-off analysis   │
│   - Detailed visualizations and reports         │
│   - Real-world prompt testing                   │
└─────────────────────────────────────────────────┘

🔗 Resources

📜 License

This project is licensed under the MIT No Attribution License - see the LICENSE file for details.

👥 Contributing

Contributions are welcome! Areas of interest:

  • Support for additional quantization methods (GPTQ, AWQ)
  • Automatic mixed-precision quantization
  • Model-specific optimization profiles
  • Streaming inference implementation
  • Better accuracy preservation techniques

See CONTRIBUTING.md for guidelines. 🌟

✨ Why This Matters

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  🖥️  Run Advanced AI locally on your own hardware               │
│  🔒  Keep your data private - no cloud required                 │
│  💸  No subscription costs or API fees                          │
│  🌱  Lower environmental impact than cloud inference            │
│  🛠️  Full control over inference parameters                     │
│  🚀  Deploy in resource-constrained environments                │
│                                                                 │
│  With INT4 quantization, you can run this 1.5B parameter       │
│  model on devices with just 8GB RAM while maintaining          │
│  80% of the original accuracy!                                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

📊 Visualization Examples

The enhanced comparison generates several visualizations:

  1. Accuracy Radar Chart: Compares keyword coverage, concept coverage, relevance, coherence, completeness, and factual accuracy
  2. Performance vs Accuracy Scatter: Shows the trade-off between speed and quality
  3. Accuracy Heatmap: Detailed metric breakdown by sample type
  4. Accuracy by Sample Bar Chart: Overall accuracy comparison across different prompt types

View these in ./comparison_results_enhanced/ after running the enhanced comparison.
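
To reproduce something like the performance-vs-accuracy scatter outside the comparison script, a minimal matplotlib sketch using the task-type numbers from the table above could look like this (the script's own plots may differ in styling):

import matplotlib.pyplot as plt

# Numbers from the "Inference Performance by Task Type" table above
tasks    = ["Definition", "Explanation", "Technical", "Introduction"]
speedup  = [2.4, 14.2, 6.5, 0.6]         # x-axis: quantized vs original speedup
accuracy = [41.5, 50.4, 60.6, 44.2]      # y-axis: quantized accuracy (%)

fig, ax = plt.subplots()
ax.scatter(speedup, accuracy)
for name, x, y in zip(tasks, speedup, accuracy):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 5))
ax.set_xlabel("Speedup (x)")
ax.set_ylabel("Quantized accuracy (%)")
ax.set_title("Performance vs Accuracy by Task Type")
fig.savefig("performance_vs_accuracy.png", dpi=150)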

Disclaimer

This project is for educational purposes only. The models and code provided are intended for research and learning, and should not be used in production environments without further testing and validation.


Note: Results may vary based on hardware, ONNX Runtime version, and specific use cases. The accuracy metrics are based on our evaluation suite and may not reflect performance on all tasks.
