Chapter 3: Custom Quantized Layers
Welcome back! In the first chapter, we learned about the Quantizer class, the tool that kicks off the quantization process. In the second chapter, we explored the different recipes, W8A32 and W8A16, that tell the Quantizer how to quantize.
Now, let's look at the building blocks that make these quantization methods possible: the Custom Quantized Layers.
Think about a typical neural network layer, like torch.nn.Linear. This layer is designed to work with standard floating-point numbers, specifically float32. Its internal weights are stored as float32, and when you give it an input (also usually float32), it performs the calculation (output = input @ weights.T + bias) using float32 math.
import torch
import torch.nn as nn
# A standard Linear layer
linear_layer = nn.Linear(in_features=10, out_features=5)
print(f"Data type of weights: {linear_layer.weight.dtype}")
print(f"Example input data type: {torch.randn(1, 10).dtype}")
# When you do linear_layer(input), the calculation uses float32
Data type of weights: torch.float32
Example input data type: torch.float32
As we learned, quantization aims to use lower precision, like 8-bit integers (int8) for weights and maybe 16-bit floats (float16/bfloat16) for activations. A standard nn.Linear layer doesn't know how to store its weights as int8, nor does it have a special way to perform calculations mixing int8 weights with float32 or float16 inputs efficiently.
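To see why this matters for storage, here is a small, self-contained comparison (plain PyTorch, not TinyQ code) of how much memory a weight matrix needs in float32 versus int8 with one float32 scale per output row:
import torch

out_features, in_features = 1024, 1024

# Standard float32 weights: 4 bytes per value
fp32_weights = torch.randn(out_features, in_features, dtype=torch.float32)

# int8 weights: 1 byte per value, plus one float32 scale per output row
int8_weights = torch.randint(-128, 128, (out_features, in_features), dtype=torch.int8)
scales = torch.randn(out_features, dtype=torch.float32)

fp32_bytes = fp32_weights.numel() * fp32_weights.element_size()
int8_bytes = (int8_weights.numel() * int8_weights.element_size()
              + scales.numel() * scales.element_size())

print(f"float32 storage: {fp32_bytes / 1024:.0f} KiB")  # ~4096 KiB
print(f"int8 + scales:   {int8_bytes / 1024:.0f} KiB")  # ~1028 KiB
That is roughly a 4x reduction in weight storage, which is exactly the saving the custom layers below are after.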
So, to make a model quantized, we need layers that can handle these lower-precision numbers.
TinyQ solves this by introducing its own specialized versions of layers that commonly appear in models, particularly nn.Linear. These are called Custom Quantized Layers.
Instead of modifying the original nn.Linear layers, TinyQ replaces them entirely with new modules designed specifically for quantized operations. The two main custom layers corresponding to the quantization methods are:
- W8A32LinearLayer: For the W8A32 method (8-bit weights, 32-bit activations).
- W8A16LinearLayer: For the W8A16 method (8-bit weights, 16-bit activations).
These custom layers are also PyTorch nn.Modules, just like nn.Linear, meaning they fit seamlessly into your model's structure. However, their internal workings are different.
Let's look at what makes these layers special by examining their key components:
- Storage for Quantized Weights: They don't store weights as float32. Instead, they use int8 tensors (torch.int8) to hold the compressed weight values. Since int8 weights by themselves don't represent the original float32 values directly, they also need to store extra information like scales and zero points (as discussed in Weight Quantization Math) to convert the int8 values back into a usable format during calculation. These tensors are typically registered as buffers in PyTorch, meaning they are part of the model's state but aren't updated during training (since quantization is usually applied after training).
- Storage for Bias: The bias term is typically kept in full precision (float32 or the activation precision like float16) because it's a relatively small number of values and keeping it in higher precision helps maintain accuracy.
- A quantize() Method: These layers have a special method, often named quantize(), that is called after the layer is created. This method takes the original float32 weights (and sometimes bias) from the nn.Linear layer it's replacing, performs the Weight Quantization Math to convert them to int8, calculates the necessary scales and zero points, and stores these results in its own buffers. (A rough sketch of this conversion follows this list.)
- A Custom forward() Method: This is the heart of the layer. Instead of using a standard torch.matmul with float32 numbers, this forward method implements a specific calculation (Quantized Forward Pass Functions) that knows how to combine the input activation (which might be float32 or float16) with the int8 weights and scales to produce the output.
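To make the role of the scales concrete, here is a rough, standalone sketch of symmetric per-channel weight quantization, the kind of math Chapter 5 (Weight Quantization Math) covers properly. It is illustrative only, not TinyQ's actual quantize() implementation:
import torch

def naive_symmetric_quantize(weights):
    # One scale per output row: map the largest absolute value to 127
    max_abs = weights.abs().max(dim=-1).values      # shape: (out_features,)
    scales = max_abs / 127.0
    int8_weights = torch.round(weights / scales.unsqueeze(-1)).to(torch.int8)
    return int8_weights, scales

fp32_weights = torch.randn(5, 10)
int8_w, scales = naive_symmetric_quantize(fp32_weights)

# Dequantizing (int8 * scale) recovers an approximation of the original weights
reconstructed = int8_w.float() * scales.unsqueeze(-1)
print(f"max reconstruction error: {(fp32_weights - reconstructed).abs().max():.4f}")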
Let's peek at the basic structure of W8A32LinearLayer from tinyq.py.
# From tinyq.py
class W8A32LinearLayer(nn.Module):
    def __init__(self, in_features, out_features, bias=True, dtype=torch.float32):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

        # 1. Storage for Quantized Weights & Scales
        # int8_weights stores the compressed values
        self.register_buffer("int8_weights",
                             torch.randint(low=-128, high=127,
                                           size=(out_features, in_features),
                                           dtype=torch.int8))
        # scales stores the multiplication factor needed for dequantization
        self.register_buffer("scales",
                             torch.randn((out_features), dtype=dtype))
        # Zero points are zero for symmetric quantization (W8A32 uses this)
        self.register_buffer("zero_points",
                             torch.zeros((out_features), dtype=dtype))

        # 2. Storage for Bias (optional)
        if bias:
            self.register_buffer("bias",
                                 torch.randn((1, out_features), dtype=dtype))
        else:
            self.bias = None

    # 3. The quantize method (details in Chapter 5)
    def quantize(self, weights):
        # This method takes the original float32 weights...
        # ... performs the conversion to int8 and calculates scales ...
        # ... and stores them in self.int8_weights and self.scales.
        pass  # Actual implementation shown in Chapter 5

    # 4. The custom forward method (details in Chapter 6)
    def forward(self, input):
        # This method takes the input (float32 for W8A32)...
        # ... and performs the calculation using int8_weights, scales, and bias.
        pass  # Actual implementation shown in Chapter 6

# W8A16LinearLayer has a similar structure but might store slightly different scales
# and its forward method expects float16 input.

Explanation:
- The __init__ method sets up the layer's basic dimensions and creates placeholder tensors for int8_weights, scales, zero_points, and bias. register_buffer is key here: it tells PyTorch these tensors should be saved and loaded with the model's state, but are not parameters to be optimized by an optimizer.
- The quantize method (which we will detail in Chapter 5: Weight Quantization Math) is where the original float32 weights are processed and converted into the int8_weights and scales that the layer will use.
- The forward method (detailed in Chapter 6: Quantized Forward Pass Functions) defines how the layer performs the actual matrix multiplication and adds the bias using its stored quantized weights and the input activation.
Notice that the quantize and forward methods are placeholders in this structural view – their implementation details are crucial and covered in later chapters. But the existence of these methods is what defines a custom quantized layer in TinyQ.
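Because everything is registered as a buffer, you can already inspect the layer's quantized state right after creating it. The snippet below just uses the W8A32LinearLayer constructor shown above; keep in mind the buffers hold random placeholder values until quantize() is called:
layer = W8A32LinearLayer(in_features=10, out_features=5)

print(layer.int8_weights.dtype)  # torch.int8
print(layer.scales.dtype)        # torch.float32

# Buffers appear in the state_dict (saved and loaded with the model) ...
print(list(layer.state_dict().keys()))  # ['int8_weights', 'scales', 'zero_points', 'bias']
# ... but not in parameters(), so no optimizer will ever touch them
print(list(layer.parameters()))         # []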
Now that we know what these custom layers are, let's revisit how the Quantizer from Chapter 1 uses them.
When you call quantizer.quantize(q_method="w8a32") (or "w8a16"), the Quantizer does the following:
- Selects the Target Class: Based on "w8a32", it knows it needs W8A32LinearLayer. If it were "w8a16", it would select W8A16LinearLayer.
- Finds nn.Linear Layers: It traverses your original model's structure, looking for every nn.Linear layer.
- Creates a New Custom Layer: For each nn.Linear layer it finds, it creates a new instance of the selected custom layer class (W8A32LinearLayer in our example), using the same in_features, out_features, and bias setting as the original nn.Linear layer.
- Quantizes and Copies Data: It takes the float32 weights and bias from the original nn.Linear layer and passes the weights to the new custom layer's quantize() method. This method performs the quantization and stores the int8_weights, scales, etc., inside the new layer. The original bias is often copied directly to the new layer's bias buffer.
- Replaces the Layer: It then replaces the original nn.Linear layer in the model's structure with the newly created and quantized custom layer.
This process is orchestrated by the replace_linear_with_target_and_quantize function within tinyq.py, which we will dive into in Chapter 4: Model Structure Replacement.
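Chapter 4 covers the real function in detail, but as a preview, a minimal, hypothetical sketch of this kind of recursive replacement might look like the following (simplified, and not TinyQ's actual code):
import torch.nn as nn

def replace_linear_sketch(module, target_class):
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            # Build the replacement with the same shape and bias setting
            new_layer = target_class(child.in_features,
                                     child.out_features,
                                     bias=child.bias is not None)
            new_layer.quantize(child.weight)       # fills int8_weights and scales
            if child.bias is not None:
                new_layer.bias = child.bias.data   # bias stays in full precision
            setattr(module, name, new_layer)       # swap the layer in place
        else:
            replace_linear_sketch(child, target_class)  # recurse into submodules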
Here's a simplified diagram of this replacement for one layer:
sequenceDiagram
participant Quantizer as tinyq.Quantizer
participant ReplaceFunc as replace_linear...
participant OriginalLinear as nn.Linear Layer
participant CustomLayer as W8A32LinearLayer<br/>or W8A16LinearLayer
Quantizer->ReplaceFunc: Start replacement process<br/>(using CustomLayer class)
ReplaceFunc->ReplaceFunc: Traverse model structure
ReplaceFunc->OriginalLinear: Find an nn.Linear layer
ReplaceFunc->OriginalLinear: Get weights & bias
ReplaceFunc->CustomLayer: Create instance(in_features, out_features, ...)
ReplaceFunc->CustomLayer: Call quantize(original_weights)
Note over CustomLayer: CustomLayer calculates<br/>int8_weights, scales, etc.
ReplaceFunc->CustomLayer: Copy original_bias (if exists)
ReplaceFunc->ReplaceFunc: Replace OriginalLinear<br/>with CustomLayer in model
ReplaceFunc-->Quantizer: Continue/Finished
By replacing the standard layers with these custom ones, the model is transformed. When you later run inference on the quantized model, the computation for these replaced layers will use the custom forward method, which is designed for efficiency with lower precision numbers.
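For W8A32, one straightforward way such a forward pass can work is to dequantize the int8 weights on the fly and then fall back to an ordinary float32 matrix multiply. The sketch below only illustrates that idea; the exact functions TinyQ uses are covered in Chapter 6 (Quantized Forward Pass Functions):
def w8a32_forward_sketch(input, int8_weights, scales, bias=None):
    # Dequantize: cast int8 weights to the input's dtype and apply per-row scales
    dequantized = int8_weights.to(input.dtype) * scales.unsqueeze(-1)
    output = input @ dequantized.T
    if bias is not None:
        output = output + bias
    return output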
Custom quantized layers like W8A32LinearLayer and W8A16LinearLayer are fundamental to TinyQ. They serve as the specialized building blocks that replace the standard nn.Linear layers in your model.
These custom layers are unique because they:
- Store weights in a compressed format (int8) along with necessary scaling information.
- Have a dedicated quantize() method to convert original float32 weights to their internal low-precision format.
- Implement a custom forward() method designed to perform calculations efficiently using the low-precision weights and the specified activation precision (32-bit or 16-bit).
The Quantizer uses your chosen method (W8A32 or W8A16) to select the appropriate custom layer class and orchestrates the replacement of all nn.Linear layers in your model with instances of that class, calling the quantize() method on each new layer.
Understanding these custom layers is key to seeing how quantization goes from a concept (reducing precision) to a practical implementation (specialized modules).
In the next chapter, we'll delve into the replace_linear_with_target_and_quantize function and see exactly how TinyQ traverses the model structure and performs this layer replacement automatically.