Model quantization techniques for efficient LLM inference. Experiments with INT8, INT4, and mixed-precision quantization.

Optimizing DeepSeek: Int4 Quantization & Local Inference Engine 🚀

This project provides tools for quantizing and testing large language models from the DeepSeek family, specifically focusing on the DeepSeek-R1-Distill-Qwen-1.5B model. The primary goal is to make these models more accessible by reducing their memory footprint through quantization to Int4 precision and ONNX format conversion.

🎯 Quantization Results Summary

Our quantization pipeline achieves the following results:

Metric                  Value        Impact
Model Size Reduction    78.1%        7.75GB → 1.70GB
Average Speedup         5.24x        0.6 → 1.7 tokens/sec
Accuracy Retention      80.2%        62.14% → 49.84%
Memory Efficiency       ~75% less    Run on 8GB RAM devices

🔥 What is Model Quantization & Distillation? 🔥

📊 Quantization Explained

Quantization reduces the precision of numerical values in neural networks:

FP32 (32-bit) ➡️ INT8 (8-bit) ➡️ INT4 (4-bit)
   ┌──────────┐      ┌────────┐      ┌──────┐
   │ 32 bits  │  →   │ 8 bits │  →   │4 bits│
   └──────────┘      └────────┘      └──────┘
  More Precise        Smaller          Tiny!
  (~7.75GB model)    (~3GB model)    (~1.7GB model)

Quantization dramatically reduces model size and memory footprint while attempting to preserve model quality!
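
To make this concrete, below is a minimal, illustrative sketch of symmetric per-tensor weight quantization in NumPy. It is not the repository's quantize_model.py implementation (which operates on the ONNX graph); it only shows the core idea of mapping FP32 values onto a small integer range with a scale factor:

import numpy as np

def quantize_symmetric(weights, num_bits=4):
    """Map FP32 weights to signed integers in [-(2^(b-1)-1), 2^(b-1)-1]."""
    qmax = 2 ** (num_bits - 1) - 1              # 7 for INT4, 127 for INT8
    scale = np.abs(weights).max() / qmax        # per-tensor scale (per-channel in practice)
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_symmetric(w, num_bits=4)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())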

🧪 Distillation Explained

Distillation trains a smaller model (the student) to reproduce the behavior of a larger model (the teacher):

       Teacher Model              Student Model
       (Large, Slow)            (Small, Fast)
       ┌───────────┐              ┌──────┐
       │           │      →       │      │
       │  7B-175B  │   Knowledge  │ 1-3B │
       │ Parameters│   Transfer   │Params│
       └───────────┘              └──────┘

DeepSeek-R1-Distill-Qwen-1.5B is already a distilled model that learned from larger models!
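
For intuition, a typical distillation objective blends ordinary cross-entropy on the labels with a KL term that pulls the student's output distribution toward the teacher's. The PyTorch sketch below is illustrative only; this repository consumes the already-distilled model and performs no distillation itself:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with temperature-softened KL divergence."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)        # teacher's soft targets
    soft_student = F.log_softmax(student_logits / T, dim=-1)    # student's log-probabilities
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)                # standard hard-label loss
    return alpha * kd + (1.0 - alpha) * ce

# Toy shapes: batch of 8, vocabulary of 32000
student_logits = torch.randn(8, 32000)
teacher_logits = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))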

🚀 Quick Start Guide

1. 🛠️ Setup Environment

# Clone the repository (if you haven't already)
git clone https://github.com/shyamsridhar123/Quantization

# Create and activate a virtual environment (recommended)
python -m venv .venv
.\.venv\Scripts\activate  # On Windows

# Install dependencies
pip install -r requirements.txt

2. 🌟 Export and Quantize the Model

# NEW: Complete pipeline with fixes for ONNX generation
python download_model.py --model-id "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
python convert_to_onnx.py --model-path ./model --output-path ./onnx_model
python fix_onnx_model.py --input-path ./onnx_model --output-path ./onnx_fixed
python quantize_model.py --input-path ./onnx_fixed --output-path ./quantized_model --quant-type int4

# Or use the all-in-one script:
python reexport_with_position_ids.py --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --output-dir ./onnx_fixed --quantize

3. 🔍 Test the Quantized Model

# Interactive testing with direct ONNX Runtime approach
python direct_interactive_test.py --model-path ./quantized_model/onnx_int4

# NEW: Enhanced comparison with accuracy metrics
python enhanced_compare_models.py --num-samples 5 --max-length 200 --temperature 0.7

4. 📊 Benchmark Performance

# Run performance benchmarks
python benchmark_model.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx --num-threads 4

5. 🔧 Troubleshooting

If you encounter issues:

# Run diagnostics
python diagnose_onnx_model.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx

# Try direct ONNX inference if Optimum integration fails
python enhanced_onnx_inference.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx

📊 Detailed Performance Analysis

Inference Performance by Task Type

Based on our comprehensive evaluation across different prompt types:

Task Type       Original Accuracy   Quantized Accuracy   Similarity   Speedup
Definition      76.2%               41.5%                11.3%        2.4x
Explanation     63.1%               50.4%                11.9%        14.2x
Technical       55.6%               60.6%                8.1%         6.5x
Introduction    63.2%               44.2%                13.6%        0.6x
Average         62.14%              49.84%               12.53%       5.24x

Accuracy Metrics Breakdown

Our evaluation measures multiple aspects of output quality; a sketch of how one such metric can be computed follows the list:

  • Keyword Coverage: Presence of expected domain-specific terms
  • Concept Coverage: Inclusion of key concepts and relationships
  • Relevance Score: Alignment with the prompt topic
  • Sentence Coherence: Grammatical and structural quality
  • Completeness: Presence of introduction, body, and conclusion
  • Factual Accuracy: Absence of contradictions or errors
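
As a hedged illustration (the actual scoring logic lives in enhanced_compare_models.py and may differ), a keyword-coverage style metric can be as simple as:

def keyword_coverage(output, expected_keywords):
    """Fraction of expected domain terms that appear in the generated text."""
    text = output.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords) if expected_keywords else 0.0

# Example: scoring a generated definition of quantization
score = keyword_coverage(
    "Quantization maps FP32 weights to lower-precision integers such as INT4.",
    ["quantization", "precision", "integer", "weights", "calibration"],
)
print(f"keyword coverage: {score:.0%}")  # 80% for this toy example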

🌈 Why Quantize Models? 🌈

Quantizing models provides the following benefits:

┌─────────────────────────────────────┐
│ 🚀 78.1% Smaller File Size          │
│ 💾 ~75% Less Memory Usage           │
│ ⚡ 5.24x Faster Inference Speed     │
│ 🖥️ Run on Consumer Hardware         │
│ 🔋 Lower Energy Consumption         │
│ 🏠 Enable Edge & Local Deployment   │
└─────────────────────────────────────┘

📝 Complete Workflow

Step 1: 🧰 Prepare Your Environment

  1. Install Required Dependencies:

    pip install -r requirements.txt
  2. Verify Environment Setup:

    python environment_check.py

Step 2: 🔄 Export and Quantize the Model

  1. Download and Convert to ONNX (NEW recommended approach):

    # Download model
    python download_model.py --model-id "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
    
    # Convert to ONNX
    python convert_to_onnx.py --model-path ./model --output-path ./onnx_model
    
    # Fix ONNX model for generation
    python fix_onnx_model.py --input-path ./onnx_model --output-path ./onnx_fixed
  2. Quantize the Model to Int4:

    # Quantize to INT4
    python quantize_model.py --input-path ./onnx_fixed --output-path ./quantized_model --quant-type int4
    
    # Or use the all-in-one script:
    python reexport_with_position_ids.py --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --output-dir ./onnx_fixed --quantize

Step 3: 🧪 Test the Quantized Model

  1. Run Diagnostics:

    python diagnose_onnx_model.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx
  2. Interactive Testing:

    # Recommended approach (handles common issues):
    python direct_interactive_test.py --model-path ./quantized_model/onnx_int4
    
    # Alternative approach using Optimum:
    python interactive_test.py --model-path ./quantized_model/onnx_int4
  3. NEW: Comprehensive Comparison:

    # Compare original vs quantized with accuracy metrics
    python enhanced_compare_models.py \
        --original-path ./onnx_fixed \
        --quantized-path ./quantized_model/onnx_int4 \
        --num-samples 5 \
        --max-length 200

Step 4: 📈 Benchmark and Compare

  1. Performance Benchmarking:

    python benchmark_model.py --model-path ./quantized_model/onnx_int4/model_quantized.onnx --num-threads 4
  2. Compare with Original Model:

    python compare_models.py --quantized-path ./quantized_model/onnx_int4 --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  3. Generate Detailed Reports:

    # Generate HTML report
    python generate_report.py --results-dir ./inference_results --model-path ./quantized_model/onnx_int4 --output-file ./model_report.html
    
    # View enhanced comparison results
    cat ./comparison_results_enhanced/enhanced_comparison_report.txt

Step 5: 🔄 Format Conversion (Optional)

  1. Convert to FP16 for Troubleshooting:
    python convert_to_fp16.py --input-model ./quantized_model/onnx_int4/model_quantized.onnx --output-model ./quantized_model/onnx_fp16/model_fp16.onnx
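
For reference, FP16 conversion typically loads the ONNX graph and casts FP32 tensors to FP16. Below is a minimal sketch using the onnx and onnxconverter-common packages (an assumption; the repository's convert_to_fp16.py may differ in detail):

import os
import onnx
from onnxconverter_common import float16

model = onnx.load("./quantized_model/onnx_int4/model_quantized.onnx")
# keep_io_types=True leaves the graph inputs/outputs in FP32 so callers need no changes
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
os.makedirs("./quantized_model/onnx_fp16", exist_ok=True)
onnx.save(model_fp16, "./quantized_model/onnx_fp16/model_fp16.onnx")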

🔍 Quantization Deep Dive

Quantization in this repository follows this process:

                 Original Model (FP32)
                         ⬇️
┌─────────────────────────────────────────────┐
│ 1. Export to ONNX                           │
│    - Convert model architecture to ONNX     │
│    - Fix generation issues                  │
│    - Handle onnx::Gather_3 parameter        │
└─────────────────────────────────────────────┘
                         ⬇️
┌─────────────────────────────────────────────┐
│ 2. Calibration                              │
│    - Analyze weight distributions           │
│    - Determine optimal scaling factors      │
└─────────────────────────────────────────────┘
                         ⬇️
┌─────────────────────────────────────────────┐
│ 3. INT4 Quantization                        │
│    - Apply scaling to weights               │
│    - Convert FP32 values to INT4 values     │
│    - Store quantization parameters          │
│    - Result: 78.1% size reduction           │
└─────────────────────────────────────────────┘
                         ⬇️
┌─────────────────────────────────────────────┐
│ 4. Optimization                             │
│    - Apply ONNX Runtime optimizations       │
│    - Fuse operations where possible         │
│    - Result: 5.24x speedup                  │
└─────────────────────────────────────────────┘
                         ⬇️
                 Quantized Model (INT4)
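
For reference, ONNX Runtime ships a quantization API that covers the weight-quantization step above. The snippet below shows dynamic INT8 quantization as an illustration (the file name under ./onnx_fixed is an assumption); the repository's quantize_model.py drives the INT4 path with its own settings:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Weights are quantized offline; activations are quantized on the fly at run time
quantize_dynamic(
    model_input="./onnx_fixed/model.onnx",            # path assumed for illustration
    model_output="./quantized_model/model_int8.onnx",
    weight_type=QuantType.QInt8,
)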

🏗️ Project Structure

🧩 Key Components

  • 🔧 Core Quantization Pipeline: Complete workflow from model download to INT4 quantization
  • 🧪 Testing Framework: Comprehensive testing with accuracy metrics and performance benchmarks
  • 📊 Evaluation Suite: Detailed comparison tools with visualizations
  • 🧹 Maintenance Utilities: Scripts for project organization and cleanup

📄 Important Scripts

🔄 NEW: Core Pipeline Scripts

  • download_model.py - Download model from HuggingFace
  • convert_to_onnx.py - Convert PyTorch to ONNX format
  • fix_onnx_model.py - Fix ONNX model for proper generation
  • quantize_model.py - Quantize to INT4/INT8
  • enhanced_compare_models.py - Comprehensive accuracy comparison

🔄 Legacy Quantization Scripts

  • reexport_with_position_ids.py - 🛠️ All-in-one export with position_ids fix
  • run_quantization.py - 🔢 Dedicated quantization script
  • quantize_int4.ipynb - 📓 Jupyter notebook for Int4 quantization

🧪 Testing and Inference

  • direct_interactive_test.py - 💬 Interactive testing using direct ONNX Runtime (most reliable)
  • test_quantized_model.py - 🧪 Basic inference testing
  • diagnose_onnx_model.py - 🔍 Detailed diagnostic checks for model issues
  • diagnose_generation.py - 🔍 Debug generation issues
  • benchmark_model.py - 📊 Performance benchmarking across different input sizes

📊 Analysis and Reporting

  • enhanced_compare_models.py - 📈 Full comparison with accuracy metrics
  • generate_report.py - 📄 Generate HTML reports
  • compare_models.py - 🔄 Compare quantized vs original

🔄 Execution Scripts

  • run_inference_tests.ps1 - 🧪 Comprehensive testing suite
  • run_master_test.ps1 - 🚀 Runs all tests in sequence

🧹 Maintenance

  • maintain.bat - 🛠️ Central maintenance dashboard with menu options
  • cleanup_simple.ps1 - 🧹 Removes temporary files and Python cache

❓ Common Issues and Solutions

1. 'Logits' Error During Generation ❌

Symptom: KeyError: 'logits' or similar errors during model inference.

Solution: Use the fixed ONNX model:

python fix_onnx_model.py --input-path ./onnx_model --output-path ./onnx_fixed

2. Model Loading Failures ❌

Symptom: Model fails to load or initialize.

Solution: Try using different loading parameters:

# For Qwen2-based models
import onnxruntime as ort

model_path = "./quantized_model/onnx_int4/model_quantized.onnx"
session_options = ort.SessionOptions()
session_options.intra_op_num_threads = 4  # match your physical CPU core count
session = ort.InferenceSession(model_path, sess_options=session_options)
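
Once the session loads, a single forward pass can be run directly through ONNX Runtime. The input and output names below (input_ids, attention_mask, position_ids, logits as the first output) are assumptions about the exported graph; verify them with session.get_inputs() and session.get_outputs(), and note that exports with past_key_values inputs need those supplied as well:

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
enc = tokenizer("Explain quantization in one sentence.", return_tensors="np")
position_ids = np.arange(enc["input_ids"].shape[1], dtype=np.int64)[None, :]

outputs = session.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
    "position_ids": position_ids,
})
next_token_id = int(outputs[0][0, -1].argmax())   # greedy pick from the final-position logits
print(tokenizer.decode([next_token_id]))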

3. Performance Issues ⏱️

Symptom: Slow inference or high memory usage.

Solution: Adjust thread count to match your CPU cores:

python benchmark_model.py --model-path ./model_path --num-threads 4  # Adjust to your CPU

4. Poor Generation Quality 📉

Symptom: Low accuracy scores or irrelevant outputs.

Solution: Tune generation parameters:

python enhanced_compare_models.py --temperature 0.5 --max-length 150

⚙️ Configuration Options

  • 🧵 Thread Count: Adjust --num-threads based on your CPU. Generally, setting it to the number of physical cores works best.
  • 🔢 Token Limit: Use --max-tokens to control generation length.
  • 📁 Model Path: All scripts accept --model-path to specify the model location.
  • 🌡️ Temperature: Control randomness with --temperature (0.0-1.0)

📚 The Technical Significance of This Project

This project demonstrates several important techniques:

┌─────────────────────────────────────────────────┐
│ 🚀 ONNX Conversion                              │
│   - Hardware-agnostic deployment                │
│   - Runtime optimizations for any hardware      │
│   - Fixed generation issues with Gather_3       │
├─────────────────────────────────────────────────┤
│ 🧮 INT4 Quantization                            │
│   - 78.1% size reduction achieved              │
│   - 7.75GB → 1.70GB model size                 │
│   - 80.2% accuracy retention                   │
│   - 5.24x speedup in inference                 │
├─────────────────────────────────────────────────┤
│ 🔧 Architecture-Specific Fixes                  │
│   - Fixes for Qwen2 architecture               │
│   - Custom onnx::Gather_3 handling             │
│   - Position tracking for generation            │
├─────────────────────────────────────────────────┤
│ 📊 Comprehensive Evaluation                     │
│   - Multi-metric accuracy assessment            │
│   - Performance vs quality trade-off analysis   │
│   - Detailed visualizations and reports         │
│   - Real-world prompt testing                   │
└─────────────────────────────────────────────────┘

🔗 Resources

📜 License

This project is licensed under the MIT No Attribution License - see the LICENSE file for details.

👥 Contributing

Contributions are welcome! Areas of interest:

  • Support for additional quantization methods (GPTQ, AWQ)
  • Automatic mixed-precision quantization
  • Model-specific optimization profiles
  • Streaming inference implementation
  • Better accuracy preservation techniques

See CONTRIBUTING.md for guidelines. 🌟

✨ Why This Matters

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  🖥️  Run Advanced AI locally on your own hardware               │
│  🔒  Keep your data private - no cloud required                 │
│  💸  No subscription costs or API fees                          │
│  🌱  Lower environmental impact than cloud inference            │
│  🛠️  Full control over inference parameters                     │
│  🚀  Deploy in resource-constrained environments                │
│                                                                 │
│  With INT4 quantization, you can run this 1.5B parameter       │
│  model on devices with just 8GB RAM while maintaining          │
│  80% of the original accuracy!                                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

📊 Visualization Examples

The enhanced comparison generates several visualizations:

  1. Accuracy Radar Chart: Compares keyword coverage, concept coverage, relevance, coherence, completeness, and factual accuracy
  2. Performance vs Accuracy Scatter: Shows the trade-off between speed and quality
  3. Accuracy Heatmap: Detailed metric breakdown by sample type
  4. Accuracy by Sample Bar Chart: Overall accuracy comparison across different prompt types

View these in ./comparison_results_enhanced/ after running the enhanced comparison.
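
To reproduce something like the performance-vs-accuracy scatter outside the comparison script, a minimal matplotlib sketch using the task-type numbers from the table above could look like this (the script's own plots may differ in styling):

import matplotlib.pyplot as plt

# Numbers from the "Inference Performance by Task Type" table above
tasks    = ["Definition", "Explanation", "Technical", "Introduction"]
speedup  = [2.4, 14.2, 6.5, 0.6]         # x-axis: quantized vs original speedup
accuracy = [41.5, 50.4, 60.6, 44.2]      # y-axis: quantized accuracy (%)

fig, ax = plt.subplots()
ax.scatter(speedup, accuracy)
for name, x, y in zip(tasks, speedup, accuracy):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 5))
ax.set_xlabel("Speedup (x)")
ax.set_ylabel("Quantized accuracy (%)")
ax.set_title("Performance vs Accuracy by Task Type")
fig.savefig("performance_vs_accuracy.png", dpi=150)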

Disclaimer

This project is for educational purposes only. The models and code provided are intended for research and learning, and should not be used in production environments without further testing and validation.


Note: Results may vary based on hardware, ONNX Runtime version, and specific use cases. The accuracy metrics are based on our evaluation suite and may not reflect performance on all tasks.
