feat(quantized): add multi-dtype support for bf16/f16 activations #3310
Summary
This PR enables running quantized models with different activation data types (f32, bf16, f16) via the new `--dtype` flag in the quantized example. Using half-precision activations can significantly improve inference speed on GPUs with fast fp16/bf16 tensor cores while maintaining model quality.
Key Changes
User-Facing
- New `--dtype` flag in the quantized example to select activation precision:
  - `--dtype f32` (default): Standard 32-bit floating point
  - `--dtype bf16`: BFloat16 - better numerical range, ideal for newer NVIDIA GPUs
  - `--dtype f16`: Float16 - maximum memory savings, widely supported
CUDA Kernel Enhancements
- Extended `quantized.cu` with F16/BF16 output support for all quantized matmul kernels
Metal Kernel Enhancements
- Extended `quantized.metal` with F16/BF16 output support for Apple Silicon
Core Infrastructure
- `QMatMul::forward()` now handles dtype mismatches automatically via auto-conversion (see the sketch below)
- New `QMatMul::from_arc_with_transposed_data()` for GGUF files from diffusion tools (stable-diffusion.cpp) that use different data layouts
- New `RmsNorm::from_qtensor_with_dtype()` for eager dtype conversion at load time
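For illustration, here is a minimal sketch of the auto-conversion idea, assuming half-precision activations are round-tripped through f32 around the quantized matmul; the helper name is hypothetical and this is not the PR's actual implementation (the PR's kernels can also emit f16/bf16 directly):

```rust
use candle_core::quantized::QMatMul;
use candle_core::{DType, Module, Result, Tensor};

/// Hypothetical helper sketching the auto-conversion behaviour: accept
/// activations in any float dtype, run the quantized matmul in f32, and
/// return the result in the caller's original dtype.
fn qmatmul_any_dtype(qm: &QMatMul, xs: &Tensor) -> Result<Tensor> {
    let in_dtype = xs.dtype();
    // Quantized matmul kernels traditionally expect f32 activations, so
    // convert bf16/f16 inputs up-front (a cheap shallow clone for f32).
    let xs_f32 = if in_dtype == DType::F32 {
        xs.clone()
    } else {
        xs.to_dtype(DType::F32)?
    };
    let ys = qm.forward(&xs_f32)?;
    // Hand the output back in the caller's activation dtype (e.g. bf16/f16).
    if in_dtype == DType::F32 {
        Ok(ys)
    } else {
        ys.to_dtype(in_dtype)
    }
}
```

The PR performs this handling inside `QMatMul::forward()` itself, so callers do not need a wrapper like this.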
Model Loading Improvements
- `ModelWeights::from_gguf()` and `ModelWeights::from_ggml()` now auto-infer the activation dtype from the embedding tensor's storage format (F16/BF16 embeddings → matching activation dtype); see the sketch below
- New `ModelWeights::from_gguf_with_dtype()` and `ModelWeights::from_ggml_with_dtype()` for explicit dtype control
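As a rough sketch of that inference rule on the GGUF path: the tensor name `token_embd.weight`, the availability of a BF16 ggml dtype variant, and the f32 fallback are assumptions for illustration, not necessarily what the PR does:

```rust
use candle_core::quantized::{gguf_file, GgmlDType};
use candle_core::DType;

/// Hypothetical sketch: choose the activation dtype from how the token
/// embedding tensor is stored in the GGUF file. F16/BF16 embeddings imply
/// matching half-precision activations; anything else falls back to f32.
fn infer_activation_dtype(content: &gguf_file::Content) -> DType {
    match content.tensor_infos.get("token_embd.weight") {
        Some(info) => match info.ggml_dtype {
            GgmlDType::F16 => DType::F16,
            GgmlDType::BF16 => DType::BF16, // assumes a BF16 ggml dtype variant
            _ => DType::F32,
        },
        None => DType::F32,
    }
}
```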
Bug Fixes
Files Changed
- candle-core/src/quantized/cuda.rs
- candle-core/src/quantized/metal.rs
- candle-core/src/quantized/mod.rs
- candle-examples/examples/quantized/main.rs
- candle-kernels/src/quantized.cu
- candle-metal-kernels/src/kernels/quantized.rs
- candle-metal-kernels/src/metal_src/quantized.metal
- candle-nn/src/layer_norm.rs
- candle-nn/src/ops.rs
- candle-transformers/src/models/quantized_llama.rs
- candle-transformers/src/quantized_nn.rs
Total: +3,441 lines, -522 lines across 11 files
Usage Example
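A plausible invocation (the exact model, weight, and prompt flags are omitted here and depend on what you run): `cargo run --release --example quantized -- --dtype bf16`. Passing `--dtype f16` selects Float16 activations, and omitting the flag keeps the default f32 behaviour.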
Performance Impact
Using `--dtype bf16` or `--dtype f16` can improve inference speed on GPUs with fast fp16/bf16 tensor cores and reduce activation memory usage.
Compatibility Notes
- GGUF files from diffusion tools such as stable-diffusion.cpp, which use a transposed data layout, are supported via the new constructor (`QMatMul::from_arc_with_transposed_data()`)
Testing
Tested with various quantized models (Q4_0, Q4_K_M, Q5_K_M, Q8_0) using f32, bf16, and f16 activation dtypes on both Metal and CUDA hardware.
Although this is unrelated to my code changes, I'm fixing it here because it causes clippy to fail on macOS:
Clippy Error: