
Chapter 3: Custom Quantized Layers

Welcome back! In the first chapter, we learned about the Quantizer class, the tool that kicks off the quantization process. In the second chapter, we explored the different recipes, W8A32 and W8A16, that tell the Quantizer how to quantize.

Now, let's look at the building blocks that make these quantization methods possible: the Custom Quantized Layers.

The Problem with Standard Layers

Think about a typical neural network layer, like torch.nn.Linear. This layer is designed to work with standard floating-point numbers, specifically float32. Its internal weights are stored as float32, and when you give it an input (also usually float32), it performs the calculation (output = input @ weights.T + bias) using float32 math.

import torch
import torch.nn as nn

# A standard Linear layer
linear_layer = nn.Linear(in_features=10, out_features=5)

print(f"Data type of weights: {linear_layer.weight.dtype}")
print(f"Example input data type: {torch.randn(1, 10).dtype}")

# When you call linear_layer(input), the calculation uses float32 math

Output:

Data type of weights: torch.float32
Example input data type: torch.float32

As we learned, quantization aims to use lower precision, such as 8-bit integers (int8) for weights and, in some schemes, 16-bit floats (float16/bfloat16) for activations. A standard nn.Linear layer has no way to store its weights as int8, and no special way to efficiently mix int8 weights with float32 or float16 inputs during computation.

So, to quantize a model, we need layers that can handle these lower-precision numbers.
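To see the problem concretely, here is a small standalone snippet (not TinyQ code) showing that PyTorch refuses to multiply a float32 input by int8 weights directly:

import torch

x = torch.randn(1, 10)                                  # float32 activation
w_int8 = torch.randint(-128, 127, (5, 10), dtype=torch.int8)

try:
    _ = x @ w_int8.T  # float32 @ int8: dtypes don't match
except RuntimeError as e:
    print(f"Mixed-dtype matmul failed: {e}")

The int8 weights must first be converted back to a float dtype (dequantized) before they can take part in the computation, and that is exactly what the custom layers below take care of.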

The Solution: Custom Quantized Layers

TinyQ solves this by introducing its own specialized versions of layers that commonly appear in models, particularly nn.Linear. These are called Custom Quantized Layers.

Instead of modifying the original nn.Linear layers, TinyQ replaces them entirely with new modules designed specifically for quantized operations. The two main custom layers corresponding to the quantization methods are:

  1. W8A32LinearLayer: For the W8A32 method (8-bit weights, 32-bit activations).
  2. W8A16LinearLayer: For the W8A16 method (8-bit weights, 16-bit activations).

These custom layers are also PyTorch nn.Modules, just like nn.Linear, meaning they fit seamlessly into your model's structure. However, their internal workings are different.

Anatomy of a Custom Quantized Layer

Let's look at what makes these layers special by examining their key components:

  1. Storage for Quantized Weights: They don't store weights as float32. Instead, they hold the compressed weight values in int8 tensors (torch.int8). Because int8 values alone don't represent the original float32 weights, the layers also store extra information, such as scales and zero points (as discussed in Weight Quantization Math), to convert the int8 values back into a usable format during calculation (see the dequantization sketch after this list). These tensors are registered as buffers in PyTorch, meaning they are part of the model's state but aren't updated during training (quantization is usually applied after training).
  2. Storage for Bias: The bias term is typically kept in full precision (float32 or the activation precision like float16) because it's a relatively small number of values and keeping it in higher precision helps maintain accuracy.
  3. A quantize() Method: These layers have a special method, often named quantize(), that is called after the layer is created. This method takes the original float32 weights (and sometimes bias) from the nn.Linear layer it's replacing, performs the Weight Quantization Math to convert them to int8, calculates the necessary scales and zero points, and stores these results in its own buffers.
  4. A Custom forward() Method: This is the heart of the layer. Instead of using a standard torch.matmul with float32 numbers, this forward method implements a specific calculation (Quantized Forward Pass Functions) that knows how to combine the input activation (which might be float32 or float16) with the int8 weights and scales to produce the output.
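
As a rough illustration of point 1, here is how int8 weights plus per-output-channel scales can be turned back into approximate float32 weights. This is a minimal sketch assuming symmetric quantization (zero points are all zero), not code taken from TinyQ:

import torch

int8_weights = torch.randint(-128, 127, (5, 10), dtype=torch.int8)
scales = torch.rand(5)  # one scale per output row

# Dequantize: cast int8 -> float32, then rescale each row by its scale
dequantized = int8_weights.to(torch.float32) * scales.unsqueeze(1)
print(dequantized.dtype, dequantized.shape)  # torch.float32 torch.Size([5, 10])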

Example: W8A32LinearLayer Structure

Let's peek at the basic structure of W8A32LinearLayer from tinyq.py.

# From tinyq.py

class W8A32LinearLayer(nn.Module):
    def __init__(self, in_features, out_features, bias=True, dtype=torch.float32):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

        # 1. Storage for Quantized Weights & Scales
        # int8_weights stores the compressed values
        self.register_buffer("int8_weights",
                             torch.randint(low=-128, high=127,
                                           size=(out_features, in_features),
                                           dtype=torch.int8))
        # scales stores the multiplication factor needed for dequantization
        self.register_buffer("scales",
                             torch.randn((out_features), dtype=dtype))
        # Zero points are zero for symmetric quantization (W8A32 uses this)
        self.register_buffer("zero_points",
                             torch.zeros((out_features), dtype=dtype))

        # 2. Storage for Bias (optional)
        if bias:
            self.register_buffer("bias",
                                 torch.randn((1, out_features), dtype=dtype))
        else:
            self.bias = None

    # 3. The quantize method (details in Chapter 5)
    def quantize(self, weights):
        # This method takes the original float32 weights...
        # ... performs the conversion to int8 and calculates scales ...
        # ... and stores them in self.int8_weights and self.scales.
        pass # Actual implementation shown in Chapter 5

    # 4. The custom forward method (details in Chapter 6)
    def forward(self, input):
        # This method takes the input (float32 for W8A32)...
        # ... and performs the calculation using int8_weights, scales, and bias.
        pass # Actual implementation shown in Chapter 6

# W8A16LinearLayer has a similar structure but might store slightly different scales
# and its forward method expects float16 input.

Explanation:

  • The __init__ method sets up the layer's basic dimensions and creates placeholder tensors for int8_weights, scales, zero_points, and bias. register_buffer is key here – it tells PyTorch these tensors should be saved and loaded with the model's state, but are not parameters to be optimized by an optimizer.
  • The quantize method (which we will detail in Chapter 5: Weight Quantization Math) is where the original float32 weights are processed and converted into the int8_weights and scales that the layer will use.
  • The forward method (detailed in Chapter 6: Quantized Forward Pass Functions) defines how the layer performs the actual matrix multiplication and adds the bias using its stored quantized weights and the input activation.

Notice that the quantize and forward methods are placeholders in this structural view – their implementation details are crucial and covered in later chapters. But the existence of these methods is what defines a custom quantized layer in TinyQ.
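To make the quantize placeholder a little less abstract, here is one plausible shape such a method could take, assuming symmetric per-output-channel quantization. This is a hypothetical sketch for intuition only, not TinyQ's actual implementation (Chapter 5 shows the real one):

import torch

# Illustrative sketch only -- the real quantize() appears in Chapter 5
def quantize_sketch(self, weights):
    # Symmetric quantization: one scale per output row, zero points stay zero
    w_fp32 = weights.detach().to(torch.float32)
    scales = w_fp32.abs().max(dim=-1).values / 127   # map the largest |w| per row to 127
    int8_weights = torch.round(w_fp32 / scales.unsqueeze(1)).to(torch.int8)
    self.int8_weights = int8_weights
    self.scales = scales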

How the Quantizer Uses Custom Layers

Now that we know what these custom layers are, let's revisit how the Quantizer from Chapter 1 uses them.

When you call quantizer.quantize(q_method="w8a32") (or "w8a16"), the Quantizer does the following:

  1. Selects the Target Class: Based on "w8a32", it knows it needs W8A32LinearLayer. If it were "w8a16", it would select W8A16LinearLayer.
  2. Finds nn.Linear Layers: It traverses your original model's structure, looking for every nn.Linear layer.
  3. Creates a New Custom Layer: For each nn.Linear layer it finds, it creates a new instance of the selected custom layer class (W8A32LinearLayer in our example), using the same in_features, out_features, and bias setting as the original nn.Linear layer.
  4. Quantizes and Copies Data: It takes the float32 weights and bias from the original nn.Linear layer and passes the weights to the new custom layer's quantize() method. This method performs the quantization and stores the int8_weights, scales, etc., inside the new layer. The original bias is often copied directly to the new layer's bias buffer.
  5. Replaces the Layer: It then replaces the original nn.Linear layer in the model's structure with the newly created and quantized custom layer.

This process is orchestrated by the replace_linear_with_target_and_quantize function within tinyq.py, which we will dive into in Chapter 4: Model Structure Replacement.
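As a preview of that chapter, the replacement boils down to a recursive walk over the model's submodules. The sketch below is a hypothetical simplification of the idea, not the actual TinyQ function:

import torch.nn as nn

# Simplified sketch of the replacement idea (not the actual TinyQ code)
def replace_linear_sketch(module, target_class):
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            new_layer = target_class(child.in_features, child.out_features,
                                     bias=child.bias is not None)
            new_layer.quantize(child.weight)       # compute int8_weights + scales
            if child.bias is not None:
                new_layer.bias = child.bias.data   # keep the bias in full precision
            setattr(module, name, new_layer)       # swap the layer into the model
        else:
            replace_linear_sketch(child, target_class)  # recurse into submodules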

Here's a simplified diagram of this replacement for one layer:

sequenceDiagram
    participant Quantizer as tinyq.Quantizer
    participant ReplaceFunc as replace_linear...
    participant OriginalLinear as nn.Linear Layer
    participant CustomLayer as W8A32LinearLayer<br/>or W8A16LinearLayer

    Quantizer->ReplaceFunc: Start replacement process<br/>(using CustomLayer class)
    ReplaceFunc->ReplaceFunc: Traverse model structure
    ReplaceFunc->OriginalLinear: Find an nn.Linear layer
    ReplaceFunc->OriginalLinear: Get weights & bias
    ReplaceFunc->CustomLayer: Create instance(in_features, out_features, ...)
    ReplaceFunc->CustomLayer: Call quantize(original_weights)
    Note over CustomLayer: CustomLayer calculates<br/>int8_weights, scales, etc.
    ReplaceFunc->CustomLayer: Copy original_bias (if exists)
    ReplaceFunc->ReplaceFunc: Replace OriginalLinear<br/>with CustomLayer in model
    ReplaceFunc-->Quantizer: Continue/Finished

By replacing the standard layers with these custom ones, the model is transformed. When you later run inference on the quantized model, the computation for these replaced layers will use the custom forward method, which is designed for efficiency with lower precision numbers.
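For intuition, a W8A32-style forward pass can be as simple as dequantizing the weights on the fly and calling the usual linear operation. The following is a rough sketch under the symmetric-quantization assumption; the real forward functions are the subject of Chapter 6:

import torch
import torch.nn.functional as F

# Illustrative sketch only -- the real forward pass is covered in Chapter 6
def w8a32_forward_sketch(input, int8_weights, scales, bias=None):
    # Dequantize: cast int8 -> float32 and rescale each output row
    weights_fp32 = int8_weights.to(torch.float32) * scales.unsqueeze(1)
    # Standard float32 linear computation with the recovered weights
    return F.linear(input, weights_fp32, bias)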

Conclusion

Custom quantized layers like W8A32LinearLayer and W8A16LinearLayer are fundamental to TinyQ. They serve as the specialized building blocks that replace the standard nn.Linear layers in your model.

These custom layers are unique because they:

  • Store weights in a compressed format (int8) along with necessary scaling information.
  • Have a dedicated quantize() method to convert original float32 weights to their internal low-precision format.
  • Implement a custom forward() method designed to perform calculations efficiently using the low-precision weights and the specified activation precision (32-bit or 16-bit).

The Quantizer uses your chosen method (W8A32 or W8A16) to select the appropriate custom layer class and orchestrates the replacement of all nn.Linear layers in your model with instances of that class, calling the quantize() method on each new layer.

Understanding these custom layers is key to seeing how quantization goes from a concept (reducing precision) to a practical implementation (specialized modules).

In the next chapter, we'll delve into the replace_linear_with_target_and_quantize function and see exactly how TinyQ traverses the model structure and performs this layer replacement automatically.

Next Chapter: Model Structure Replacement
