Lab 1 - AI Model Design and Quantization

Overview

  1. Lab 1.1 - AI Model Design
  2. Lab 1.2 - AI Model Quantization
  3. Homework Requirements

Lab 1.1 - AI Model Design

1. Targeting Task and Dataset

In the labs of this course, we will perform an image classification task on the CIFAR-10 dataset. CIFAR-10 consists of 60,000 colored images (32×32 pixels, RGB channels) across 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. It is split into 50,000 training and 10,000 test images.

Many convolutional neural networks (CNNs) have been evaluated on CIFAR-10, making it a valuable dataset for learning image classification techniques.

2. Model Architecture Design

Below, we introduce several commonly used operators in convolutional neural networks (CNNs). In the assignment of this lab, you will be asked to design your own model
architecture using these operators and apply it to the CIFAR-10 image classification task.

In the following formulas, we present:

$N$ = batch size, $C_{in}$ = number of input channels, $C_{out}$ = number of output channels, $H$ = height of the tensor, $W$ = width of the tensor

Convolution

Convolutions are widely used for feature extraction in computer vision tasks. A convolution operator can be defined by its kernel size, stride, padding, dilation, etc.

(Figure: 2D convolution operation)

Here is the mathematical representation:

$$\mathrm{out}(N_i, C_{out_j}) = \mathrm{bias}(C_{out_j}) + \sum_{k=0}^{C_{in}-1} \mathrm{weight}(C_{out_j}, k) \star \mathrm{input}(N_i, k)$$

In AlexNet, 11×11, 5×5, and 3×3 Conv2D layers are used. However, using multiple convolution kernel sizes in a model increases the complexity of hardware implementation. Therefore, in this lab, we will use only 3×3 Conv2D with a padding size of 1 and a stride of 1, following the approach in VGGNet. With these settings, the spatial dimensions of the output feature map will remain the same as those of the input feature map.
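For instance, a minimal PyTorch sketch (layer sizes chosen arbitrarily for illustration) shows that this setting preserves the spatial dimensions:

```python
import torch
import torch.nn as nn

# 3x3 convolution with stride 1 and padding 1: the output H and W equal the input H and W.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 32, 32)   # (N, C_in, H, W), a CIFAR-10-sized input
y = conv(x)
print(y.shape)                  # torch.Size([1, 64, 32, 32])
```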

Linear

Fully-connected layers, also known as linear layers or dense layers, connect every input neuron to every output neuron and are commonly used as the classifier in a neural network.

(Figure: fully-connected layer)

$$y = xW^{T} + b$$

Rectified Linear Unit (ReLU)

ReLU is an activation function, which sets all negative values to zero while keeping positive values unchanged. It introduces non-linearity to the model, helping neural networks learn complex patterns while mitigating the vanishing gradient problem compared to other activation functions like sigmoid, hyperbolic tangent, etc.

(Figure: ReLU activation function)

$$\mathrm{ReLU}(x) = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{otherwise} \end{cases}$$

Max Pooling

Max pooling is a downsampling operation commonly used in convolutional neural networks (CNNs) to reduce the spatial dimensions of feature maps while preserving important features. In the following formulas, we present the typical 2D max pooling operation.

(Figure: 2D max pooling)

$$\mathrm{out}(N_i, C_j, h, w) = \max_{m=0,\dots,kH-1}\ \max_{n=0,\dots,kW-1} \mathrm{input}\big(N_i, C_j, \mathrm{stride}[0] \times h + m,\ \mathrm{stride}[1] \times w + n\big)$$

Batch Normalization

Batch Normalization (BN) is a technique used in deep learning to stabilize and accelerate training by normalizing the inputs of each layer. It reduces internal covariate shift, making optimization more efficient.

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$$

where $\mathrm{E}[\cdot]$ is the mean, $\mathrm{Var}[\cdot]$ is the variance, $\gamma, \beta \in \mathbb{R}^{C_{out}}$ are learnable parameters, and $\epsilon$ is a small constant to avoid division by zero.

The learnable parameters $\gamma$ and $\beta$ are updated during training but remain fixed during inference.

  1. During forward propagation of training: the mean and variance are calculated from the current mini-batch.
  2. During backward propagation of training: $\gamma$ and $\beta$ are updated with the gradients $\frac{\partial L}{\partial \gamma}$ and $\frac{\partial L}{\partial \beta}$, respectively.
  3. During inference: $\gamma$ and $\beta$ are fixed.

As the following figure shows, batch normalization is applied per channel. That is, the mean and variance are computed over all elements in the batch dimension ($N$) and the spatial dimensions ($H$ and $W$). Each channel has its own independent $\gamma_c$ and $\beta_c$ parameters.

(Figure: per-channel batch normalization)
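As a sanity check, the per-channel statistics can be reproduced by hand; the sketch below (tensor sizes chosen arbitrarily) compares nn.BatchNorm2d in training mode against a manual computation that reduces over the N, H, and W dimensions:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 4, 32, 32)                  # (N, C, H, W)
bn = nn.BatchNorm2d(num_features=4).train()    # one (gamma, beta) pair per channel
y = bn(x)

# Reproduce the per-channel statistics by hand: reduce over N, H, W (dims 0, 2, 3).
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + bn.eps) \
    * bn.weight.view(1, -1, 1, 1) + bn.bias.view(1, -1, 1, 1)

print(torch.allclose(y, y_manual, atol=1e-5))  # True
```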

3. Model Training and Hyperparameter Tuning

You can use standard training techniques (e.g. data augmentation, learning-rate scheduling, weight decay) and hyperparameter tuning to improve the accuracy and efficiency of model training.

Lab 1.2 - AI Model Quantization

Why Quantization?

Before discussing quantization, we need to look at the different data types that can be used for computation and their associated hardware costs.

Number Representation

Integers

| Integer Format | Length (bits) | Value Range |
|---|---|---|
| INT32 | 32 | -2,147,483,648 ~ 2,147,483,647 |
| UINT32 | 32 | 0 ~ 4,294,967,295 |
| INT16 | 16 | -32,768 ~ 32,767 |
| UINT16 | 16 | 0 ~ 65,535 |
| INT8 | 8 | -128 ~ 127 |
| UINT8 | 8 | 0 ~ 255 |

Floating Point Numbers

| Floating Point Format | Length (bits) | Exponent (E) | Mantissa (M) | Applications |
|---|---|---|---|---|
| FP64 | 64 | 11 | 52 | High-precision computing, scientific simulations |
| FP32 | 32 | 8 | 23 | General computing, 3D rendering, machine learning |
| TF32 | 32 | 8 | 10 | Proposed by NVIDIA, AI training acceleration |
| FP16 | 16 | 5 | 10 | Low-power AI training and inference |
| BF16 | 16 | 8 | 7 | AI training, better compatibility with FP32 |
| FP8-E4M3 | 8 | 4 | 3 | Low-precision AI inference |
| FP8-E5M2 | 8 | 5 | 2 | Low-precision AI inference |

Hardware Energy/Area Cost of Different Numeric Operations

(Figure: hardware energy/area cost of different numeric operations)

Floating-point arithmetic is more computationally expensive than integer arithmetic due to the overhead of mantissa alignment and mantissa multiplication.

Quantization Schemes

Uniform/Non-uniform

Uniform Quantization (Linear Quantization)

$$q = Q(r) = \mathrm{clip}\!\left(\left\lfloor \frac{r}{s} \right\rceil + z,\ q_{min},\ q_{max}\right)$$
$$r \approx D(q) = s\,(q - z)$$

where $Q$ is the quantization operator, $D$ is the de-quantization operator, $q$ is the quantized value, $r$ is the real value, $s \in \mathbb{R}$ is the real difference between quantization steps, and $z \in \mathbb{Z}$ is the quantized value mapping to the real number 0.

The precise definitions of the scaling factor $s$ and the zero point $z$ vary with the quantization scheme that is used.
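As a rough illustration of these formulas (the values and the quantize/dequantize helper names are made up for the example), uniform quantization and de-quantization can be sketched as:

```python
import torch

def quantize(r, s, z, qmin, qmax):
    # q = clip(round(r / s) + z, qmin, qmax)
    return torch.clamp(torch.round(r / s) + z, qmin, qmax)

def dequantize(q, s, z):
    # r ~= D(q) = s * (q - z)
    return s * (q - z)

r = torch.tensor([-1.0, -0.3, 0.0, 0.4, 1.2])
s, z = 1.2 / 127, 0                 # illustrative symmetric int8-style parameters
q = quantize(r, s, z, -128, 127)
print(q)                            # integer codes
print(dequantize(q, s, z))          # approximate reconstruction of r
```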

Non-Uniform Quantization (Logarithmic/power-of-2 Quantization)

$$q = Q(r) = \mathrm{sign}(r)\cdot\mathrm{clip}\!\left(\left\lfloor \log_2 \frac{|r|}{s} \right\rceil + z,\ q_{min},\ q_{max}\right)$$
$$r \approx D(q) = \mathrm{sign}(q)\cdot s \cdot 2^{\,|q| - z}$$

Symmetric/Asymmetric

Asymmetric/Affine Uniform Quantization

$$s = \frac{\beta - \alpha}{q_{max} - q_{min}} \in \mathbb{R}, \qquad z = q_{min} - \left\lfloor \frac{\alpha}{s} \right\rceil \in \mathbb{Z}$$

Symmetric Uniform Quantization

$$s = \frac{2\max(|\alpha|, |\beta|)}{q_{max} - q_{min}} \in \mathbb{R}, \qquad z = \frac{q_{max} + q_{min}}{2} = \begin{cases} 0, & \text{if signed} \\ 128, & \text{if unsigned} \end{cases} \in \mathbb{Z}$$

Comparison

Compared to asymmetric quantization, symmetric quantization is more hardware-friendly because it eliminates the cross terms in quantized matrix multiplication.

  1. Faster Matrix Multiplication
    With symmetric quantization, all data is scaled using the same factor and the zero point is set to 0. This allows direct execution of integer-only operations (see the sketch after this list):
    $$C_q = A_q \times B_q$$

    In contrast, with asymmetric quantization, matrix multiplication becomes more complex because the zero points must be accounted for:
    $$C_q = (A_q - Z_A) \times (B_q - Z_B)$$

    This introduces additional subtraction operations, increasing the computational cost.
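A small sketch of how the two schemes derive their parameters from the same clipping range $[\alpha, \beta]$ (the tensor and the full-range signed 8-bit limits are illustrative):

```python
import torch

r = torch.randn(1000) * 0.5 + 0.2          # an example tensor to be quantized
alpha, beta = r.min(), r.max()             # clipping range
qmin, qmax = -128, 127                     # full-range signed 8-bit

# Asymmetric (affine): scale spans [alpha, beta]; the zero point is generally non-zero.
s_asym = (beta - alpha) / (qmax - qmin)
z_asym = qmin - torch.round(alpha / s_asym)

# Symmetric: scale from max(|alpha|, |beta|); zero point fixed at 0 for signed integers.
s_sym = 2 * torch.max(alpha.abs(), beta.abs()) / (qmax - qmin)
z_sym = 0

print(s_asym.item(), z_asym.item(), s_sym.item(), z_sym)
```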

Clipping Range

(Figure: clipping range selection)

Min-Max Clipping

For min-max clipping, clipping range = dynamic range

$$\alpha = r_{min}, \qquad \beta = r_{max}$$

Moving Average Clipping

$$\alpha_t = \begin{cases} r_{min}, & \text{if } t = 0 \\ c\, r_{min} + (1 - c)\,\alpha_{t-1}, & \text{otherwise} \end{cases} \qquad \beta_t = \begin{cases} r_{max}, & \text{if } t = 0 \\ c\, r_{max} + (1 - c)\,\beta_{t-1}, & \text{otherwise} \end{cases}$$

Percentile Clipping

For percentile clipping, the clipping range is taken from percentiles of the observed values, e.g. the 5th and 95th percentiles:

$$\alpha = P_{5}(r), \qquad \beta = P_{95}(r)$$
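The three clipping strategies can be sketched as follows (the momentum constant c and the percentiles are illustrative, and torch.quantile stands in for the percentile function $P$):

```python
import torch

c = 0.01                                    # moving-average constant (illustrative)
alpha_ma = beta_ma = None

for t in range(10):                         # pretend each iteration observes a new batch
    r = torch.randn(4096)

    # Min-max clipping: clipping range == dynamic range of the observed values.
    r_min, r_max = r.min(), r.max()

    # Moving-average clipping: smooth the per-batch min/max over time.
    if t == 0:
        alpha_ma, beta_ma = r_min, r_max
    else:
        alpha_ma = c * r_min + (1 - c) * alpha_ma
        beta_ma = c * r_max + (1 - c) * beta_ma

# Percentile clipping (on the last batch): ignore extreme outliers.
alpha_p, beta_p = torch.quantile(r, 0.05), torch.quantile(r, 0.95)
print(alpha_ma, beta_ma, alpha_p, beta_p)
```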

Reduced Range/Full Range

For $b$-bit signed integer quantization:

| | $q_{min}$ | $q_{max}$ |
|---|---|---|
| Full range | $-2^{b-1}$ | $2^{b-1} - 1$ |
| Reduced range | $-2^{b-1} + 1$ | $2^{b-1} - 1$ |

For $b$-bit unsigned integer quantization:

| | $q_{min}$ | $q_{max}$ |
|---|---|---|
| Full range | $0$ | $2^{b} - 1$ |
| Reduced range | $0$ | $2^{b} - 2$ |

For example, the integer representation of an 8-bit signed integer quantized with reduced range lies in the interval $[-127, 127]$.

Calibration Algorithms

The process of choosing the input clipping range is known as calibration. The simplest technique (also the default in PyTorch) is to record the running minimum and maximum values and assign them to $\alpha$ and $\beta$.

In PyTorch, Observer modules collect statistics on the input values and calculate the qparams scale and zero_point. Different calibration schemes result in different quantized outputs, and it's best to empirically verify which scheme works best for your application and architecture. (PyTorch documentation)

Weight-only Quantization

In weight-only quantization, only the weights are quantized, while activations remain in full precision (FP32).

Static Quantization

In the static quantization approach, the clipping ranges of both weights and activations are pre-calculated, and the resulting quantization parameters remain fixed during inference. This approach does not add any computational overhead at runtime.

Dynamic Quantization

In dynamic quantization, the activation clipping range and the corresponding quantization parameters are calculated dynamically for each activation map at runtime. This requires run-time computation of the signal statistics (min, max, percentile, etc.), which can incur a very high overhead.

| | Weight-only quantization | Static quantization | Dynamic quantization |
|---|---|---|---|
| Calibrate on weights | before inference | before inference | before inference |
| Quantize weights | before inference | before inference | before inference |
| Calibrate on activations | no | before inference | during inference |
| Quantize activations | no | during inference | during inference |
| Runtime overhead of quantization | none | low | high |

PTQ/QAT

PTQ (Post-Training Quantization)

(Figure: post-training quantization flow)
All the weights and activations quantization parameters are determined without any re-training of the NN model.
In this assignment, we will use this method to perform quantization on our model.

QAT (Quantization-Aware Training)

(Figure: quantization-aware training flow)
Quantization can slightly alter trained model parameters, shifting them from their original state. To mitigate this, the model can be re-trained with quantized parameters to achieve better convergence and lower loss.

Straight-Through Estimator (STE)

In QAT, since quantization is non-differentiable, standard backpropagation cannot compute gradients through it. The straight-through estimator (STE) allows gradients to bypass the quantization step, enabling the model to be trained as if it were using continuous values while still applying quantization constraints.

(Figure: straight-through estimator)
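A minimal sketch of the STE trick using a fake-quantize function (the scale, zero point, and ranges are illustrative): the forward pass applies quantize-dequantize, while the backward pass treats the operation as the identity.

```python
import torch

def fake_quantize_ste(x, s, z, qmin, qmax):
    # Forward: real quantize -> dequantize. Backward: pretend it was the identity.
    q = torch.clamp(torch.round(x / s) + z, qmin, qmax)
    x_hat = s * (q - z)
    # The detach() trick: forward value is x_hat, but gradients flow straight to x.
    return x + (x_hat - x).detach()

x = torch.randn(4, requires_grad=True)
y = fake_quantize_ste(x, s=0.1, z=0, qmin=-128, qmax=127).sum()
y.backward()
print(x.grad)   # all ones: the rounding step was bypassed by the STE
```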

Quantization Errors

A metric to evaluate the numerical error introduced by quantization.

$$L(r, \hat{r}), \quad \text{where } \hat{r} = D(Q(r))$$

where $r$ is the original tensor and $\hat{r} = D(Q(r))$ is the tensor after quantization and dequantization.

Mean-Square Error (MSE)

The most commonly-used metric for quantization error.

$$L(r, \hat{r}) = \frac{1}{N} \sum_{i=1}^{N} (r_i - \hat{r}_i)^2$$
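A small sketch computing the MSE between a tensor and its quantize-dequantize reconstruction (symmetric 8-bit parameters chosen for illustration):

```python
import torch

def mse(r, r_hat):
    # L(r, r_hat) = (1/N) * sum_i (r_i - r_hat_i)^2
    return torch.mean((r - r_hat) ** 2)

r = torch.randn(1000)
s = 2 * r.abs().max() / 255                               # symmetric 8-bit scale
r_hat = s * torch.clamp(torch.round(r / s), -128, 127)    # quantize, then dequantize
print(mse(r, r_hat))
```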

Other Quantization Error Metrics

Fake Quantization/Integer-only Quantization

Fake Quantization (Simulated Quantization)

In simulated quantization, the quantized model parameters are stored in low-precision, but the operations (e.g. matrix multiplications and convolutions) are carried out with floating point arithmetic.

Therefore, the quantized parameters need to be dequantized before the floating-point operations are performed.

Integer-only Quantization

In integer-only quantization, all the operations are performed using low-precision integer arithmetic.

Hardware-Friendly Design

Dyadic Quantization

A type of integer-only quantization in which all of the scaling factors are restricted to dyadic numbers, defined as:

$$s \approx \frac{b}{2^{c}}$$

where $s$ is a floating-point number, and $b$ and $c$ are integers.

Dyadic quantization can be implemented with only bit shifts and integer arithmetic, which eliminates the overhead of expensive dequantization and requantization.

Power-of-Two Uniform/Scale Quantization (Similar to Dyadic Quantization)

Using the same concept as dyadic quantization, we replace the numerator $b$ with $1$. This approach further improves hardware efficiency since it eliminates the need for an integer multiplier.

$$s \approx \frac{1}{2^{c}}$$

Power-of-Two Uniform/Scale Quantization constrains the scaling factor to a power-of-two value, enabling efficient computation through bit shifts, while Power-of-Two (Logarithmic) Quantization directly quantizes data to power-of-two values, reducing multiplications to simple bitwise operations.
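As an illustration (the helper name and values are made up), a calibrated floating-point scale can be snapped to a nearby power of two, after which applying it to a non-negative integer accumulator reduces to a bit shift:

```python
import torch

def power_of_two_scale(s: torch.Tensor) -> torch.Tensor:
    # Snap s to a nearby power of two: s ~= 2^(-c) with integer c.
    c = torch.round(torch.log2(s))
    return torch.pow(2.0, c)

s_float = torch.tensor(0.0123)          # a free floating-point scale from calibration
s_p2 = power_of_two_scale(s_float)      # -> 2^-6 = 0.015625
print(s_float.item(), s_p2.item())

# Applying a power-of-two scale to a non-negative integer accumulator is just a bit shift:
acc = torch.tensor(12345, dtype=torch.int32)
print(acc >> 6)                         # same as floor(acc * 2^-6)
```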

Derivation of Quantized MAC

In order to simplify the hardware implementation, we use layerwise symmetric uniform quantization for all layers.

Here are the data types for inputs, weights, biases, outputs, and partial sums:

| | input/output | weight | bias/psum |
|---|---|---|---|
| Data type | uint8 | int8 | int32 |

Note that the scaling factor of the bias is the product of the input's scale and the weight's scale, and that the rounding method is truncation instead of round-to-nearest.

$$\begin{aligned}
\bar{x} &= \mathrm{clamp}\!\left(\left\lfloor \frac{x}{s_x} \right\rfloor + 128,\ 0,\ 255\right) \in \mathbb{Z}_{uint8}^{\dim(x)} \\
\bar{w} &= \mathrm{clamp}\!\left(\left\lfloor \frac{w}{s_w} \right\rfloor,\ -128,\ 127\right) \in \mathbb{Z}_{int8}^{\dim(w)} \\
\bar{b} &= \mathrm{clamp}\!\left(\left\lfloor \frac{b}{s_x s_w} \right\rfloor,\ -2^{31},\ 2^{31}-1\right) \in \mathbb{Z}_{int32}^{\dim(b)} \\
\bar{y} &= \mathrm{clamp}\!\left(\left\lfloor \frac{y}{s_y} \right\rfloor + 128,\ 0,\ 255\right) \in \mathbb{Z}_{uint8}^{\dim(y)}
\end{aligned} \tag{1}$$

The notation $\mathbb{Z}^{N}$ denotes a vector space of dimension $N$ in which all elements (components) are integers. See also: Cartesian product.

where the scaling factors are calculated by

$$\begin{aligned}
s_x &= \frac{2 \max(|x_{min}|, |x_{max}|)}{255} \in \mathbb{R}_{float32} \\
s_w &= \frac{2 \max(|w_{min}|, |w_{max}|)}{255} \in \mathbb{R}_{float32} \\
s_y &= \frac{2 \max(|y_{min}|, |y_{max}|)}{255} \in \mathbb{R}_{float32}
\end{aligned} \tag{2}$$

The original values can be approximated by dequantizing the quantized numbers.

$$\begin{aligned}
x &\approx s_x (\bar{x} - 128) \\
w &\approx s_w \bar{w} \\
b &\approx s_x s_w \bar{b} \\
y &\approx s_y (\bar{y} - 128)
\end{aligned} \tag{3}$$

Quantized Linear Layer with ReLU

Rectified linear unit (ReLU) is one of the most commonly-used activation functions due to its simplicity.

$$\mathrm{ReLU}(x) = \max(x, 0) = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{otherwise} \end{cases}$$

Linear layer:

$$y_i = \mathrm{ReLU}\!\left(b_i + \sum_j x_j w_{ji}\right)$$

$$s_y(\bar{y}_i - 128) = \mathrm{ReLU}\!\left(s_x s_w \left(\bar{b}_i + \sum_j (\bar{x}_j - 128)\,\bar{w}_{ji}\right)\right)$$

The scaling factors $s_x$, $s_w$, and $s_y$ are typically in $[0, 1]$; since they are positive, factoring them out does not affect the result of ReLU.

$$\bar{y}_i = \underbrace{\frac{s_x s_w}{s_y}}_{\text{float32 operations involved}} \underbrace{\mathrm{ReLU}\!\left(\bar{b}_i + \sum_j (\bar{x}_j - 128)\,\bar{w}_{ji}\right)}_{\text{only int32 operations}} + 128 \tag{5}$$

Hardware-Friendly Design

Power-of-Two Quantization

With $b = 1$ in dyadic quantization, we further get power-of-two quantization:

$$s \approx \frac{1}{2^{c}} = 2^{-c}, \quad \text{where } c \in \mathbb{Z}$$

The matrix multiplication can be approximated as:

$$\begin{aligned}
\bar{y}_i &\approx 2^{-(c_x + c_w - c_y)}\,\mathrm{ReLU}\!\left(\bar{b}_i + \sum_j (\bar{x}_j - 128)\,\bar{w}_{ji}\right) + 128 \\
&= \left(\mathrm{ReLU}\!\left(\bar{b}_i + \sum_j (\bar{x}_j - 128)\,\bar{w}_{ji}\right) \gg (c_x + c_w - c_y)\right) + 128
\end{aligned} \tag{6}$$

We can use shifting to replace multiplication and division when applying a scaling factor.
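A rough integer-only sketch of Eq. (6), with assumed power-of-two exponents c_x, c_w, c_y and random integer tensors standing in for real quantized data:

```python
import torch

# Assumed power-of-two exponents: s_x = 2^-6, s_w = 2^-7, s_y = 2^-5 (illustrative).
c_x, c_w, c_y = 6, 7, 5

x_bar = torch.randint(0, 256, (16,), dtype=torch.int32)       # uint8 input codes
w_bar = torch.randint(-128, 128, (16, 8), dtype=torch.int32)  # int8 weight codes
b_bar = torch.randint(-1000, 1000, (8,), dtype=torch.int32)   # int32 bias (scale s_x * s_w)

# Integer MAC (PyTorch promotes the sum to int64; real hardware keeps an int32 accumulator).
acc = b_bar + ((x_bar.unsqueeze(1) - 128) * w_bar).sum(dim=0)
acc = torch.clamp(acc, min=0)                    # ReLU in the integer domain
y_bar = (acc >> (c_x + c_w - c_y)) + 128         # Eq. (6): shift replaces the float rescale
y_bar = torch.clamp(y_bar, 0, 255)               # keep the output in the uint8 range
print(y_bar)
```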

Derivation of Batch Normalization Folding

During inference, batch normalization (BN) can be fused with Conv2d or Linear layers to improve inference efficiency, reduce memory access, and increase computational throughput. This also simplifies the hardware implementation in Lab3 by eliminating the need for separate BN computation. The derivation is as follows.

Consider a batch normalization (BN) layer expressed by the following equation:

$$z_c = \frac{y_c - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}\,\gamma_c + \beta_c \tag{1}$$

where $y_c$ is the pre-normalization output of channel $c$, $\mu_c$ and $\sigma_c^2$ are the mean and variance of channel $c$, $\gamma_c$ and $\beta_c$ are the learnable parameters of channel $c$, and $\epsilon$ is a small constant to avoid division by zero.

Expanding Eq. 1, we obtain:

$$z_c = \left(\frac{\gamma_c}{\sqrt{\sigma_c^2 + \epsilon}}\right) y_c + \left(\beta_c - \frac{\gamma_c\,\mu_c}{\sqrt{\sigma_c^2 + \epsilon}}\right)$$

Assuming that the numerical distribution during inference is the same as in the training set, the statistics $\mu_c$ and $\sigma_c^2$ obtained during training are treated as fixed values during inference. These values can then be fused into the preceding Conv2d or Linear layer.

For example, consider a Linear layer:

$$y_c = b_c + \sum_i x_i w_{ic} \tag{2}$$

where the output $y$ is normalized by BatchNorm to obtain $z$. Substituting Eq. 2 into Eq. 1:

$$\begin{aligned}
z_c &= \left(\frac{\gamma_c}{\sqrt{\sigma_c^2 + \epsilon}}\right)\left(b_c + \sum_i x_i w_{ic}\right) + \left(\beta_c - \frac{\gamma_c\,\mu_c}{\sqrt{\sigma_c^2 + \epsilon}}\right) \\
&= \sum_i x_i \left(\frac{\gamma_c\, w_{ic}}{\sqrt{\sigma_c^2 + \epsilon}}\right) + \left(\beta_c - \frac{\gamma_c\,\mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \frac{\gamma_c\, b_c}{\sqrt{\sigma_c^2 + \epsilon}}\right) \\
&\equiv \sum_i x_i w'_{ic} + b'_c
\end{aligned}$$

After rearranging, we observe that the Linear + BN operation can be expressed as a new Linear operation with updated weights and biases:

$$w'_{ic} = \frac{\gamma_c}{\sqrt{\sigma_c^2 + \epsilon}}\, w_{ic}, \qquad b'_c = \frac{\gamma_c}{\sqrt{\sigma_c^2 + \epsilon}}\,(b_c - \mu_c) + \beta_c \tag{3}$$
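A small sketch verifying Eq. (3) numerically: the BN running statistics are filled with made-up values, folded into a Linear layer, and the folded layer is compared against Linear followed by BN in eval mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(16, 8)
bn = nn.BatchNorm1d(8).eval()            # inference mode: running statistics are fixed
bn.running_mean.uniform_(-1.0, 1.0)      # pretend these were accumulated during training
bn.running_var.uniform_(0.5, 2.0)
bn.weight.data.uniform_(0.5, 1.5)        # gamma
bn.bias.data.uniform_(-0.5, 0.5)         # beta

# Fold BN into the Linear layer following Eq. (3).
gamma_over_std = (bn.weight / torch.sqrt(bn.running_var + bn.eps)).detach()
folded = nn.Linear(16, 8)
folded.weight.data = linear.weight.data * gamma_over_std.view(-1, 1)
folded.bias.data = gamma_over_std * (linear.bias.data - bn.running_mean) + bn.bias.data

x = torch.randn(4, 16)
print(torch.allclose(bn(linear(x)), folded(x), atol=1e-5))   # True
```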

Quantization in Practice

In this section, we will demonstrate how to perform quantization with the PyTorch framework, using a simple yet comprehensive example.

(Figure: PyTorch post-training quantization workflow)

Let's discuss the quantization process using PyTorch step by step:

  1. Calibration Data
  2. Pre-trained Model
  3. Customize Quantization Scheme
  4. Operator Fusion
  5. Insert Observer
  6. Calibration
  7. Quantization

1. Calibration Data

During calibration, only a small amount of data is required, so the batch size is set to 1.

dataset = '{DATASET}'
backend = '{Quantization_scheme}'
model_path = 'path/to/your/model'
*_, test_loader = DATALOADERS[dataset](batch_size=1)

2. Pre-trained Model

Load model weights trained from Lab 1.1.

model = network(in_channels, in_size).eval().cpu()
model.load_state_dict(torch.load(model_path))

3. Customize Quantization Scheme

Configure Quantization

model = tq.QuantWrapper(model)  
model.qconfig = CustomQConfig[{Your_Quantization_Scheme_in_CustomQConfig_class}].value
print(f"Quantization backend: {model.qconfig}")
class CustomQConfig(Enum):
    POWER2 = torch.ao.quantization.QConfig(
        activation=PowerOfTwoObserver.with_args(
            dtype=torch.quint8, qscheme=torch.per_tensor_symmetric
        ),
        weight=PowerOfTwoObserver.with_args(
            dtype=torch.qint8, qscheme=torch.per_tensor_symmetric
        ),
    )
    DEFAULT = None

The torch.ao.quantization.QConfig class helps define custom quantization schemes by specifying:

  1. How activations should be quantized
  2. How weights should be quantized

These parameters tell your custom observer (class PowerOfTwoObserver(...)) how to calculate the scale and zero point.

| Parameter | Description |
|---|---|
| dtype=torch.quint8 | unsigned 8-bit quantization |
| dtype=torch.qint8 | signed 8-bit quantization |
| qscheme=torch.per_tensor_symmetric | symmetric quantization within one tensor |
| qscheme=torch.per_tensor_affine | asymmetric (affine) quantization within one tensor |
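For illustration only (this is not the PowerOfTwoObserver required by the lab), one way to build such an observer is to subclass an existing PyTorch observer and snap its scale to a power of two:

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

class PowerOfTwoMinMaxObserver(MinMaxObserver):
    """Illustrative sketch: a min-max observer whose scale is forced to a power of two."""

    def calculate_qparams(self):
        scale, zero_point = super().calculate_qparams()
        # Snap the scale to a nearby power of two so rescaling becomes a bit shift.
        scale = torch.pow(2.0, torch.round(torch.log2(scale)))
        return scale, zero_point

obs = PowerOfTwoMinMaxObserver(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric)
obs(torch.randn(100))              # feed calibration data through the observer
print(obs.calculate_qparams())     # (power-of-two scale, zero_point)
```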

4. Operator Fusion

Operator fusion is a technique that combines multiple operations into a single efficient computation to reduce memory overhead and improve execution speed.

Module fusion combines multiple sequential modules (e.g. [Conv2d, BatchNorm, ReLU]) into one. Fusing modules means the compiler needs to run only one kernel instead of many; this speeds things up and improves accuracy by reducing quantization error. (PyTorch documentation)

Common fusions include [Conv2d, BatchNorm2d], [Conv2d, BatchNorm2d, ReLU], [Conv2d, ReLU], and [Linear, ReLU].

If you want to perform module fusion to improve performance, call the following API before calibration.

tq.fuse_modules(model: nn.Module, modules_to_fuse: list[str], inplace=False) -> nn.Module

You can see the reference for more details.
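For example (the module names below are hypothetical; replace them with the names printed by print(model)):

```python
# Conv+BN(+ReLU) fusion expects the model to be in eval mode.
model.eval()
model = tq.fuse_modules(
    model,
    [["conv1", "bn1", "relu1"], ["conv2", "bn2", "relu2"]],
    inplace=False,
)
```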

5. Insert Observer

tq.prepare(model, inplace=True)

6. Calibration

Define calibration function first:

def calibrate(model, loader, device=DEFAULT_DEVICE):
    model.eval().to(device)   
    for x, _ in loader:       
        model(x.to(device))  
        break                 

Apply calibration by directly calling the function above after inserting the observers via tq.prepare.

calibrate(model, test_loader, "cpu")

This runs one batch of data (previous step) through the model to collect activation statistics.

7. Quantization

Use tq.convert(model.cpu(), inplace=True) to convert the model into a fully-quantized model.

tq.convert(model.cpu(), inplace=True)

Finally, save your quantized model with the given filename.

save_model({Your Model}, "filename.pt")

Reference

Homework Requirements

1. Train a VGG-like Model using CIFAR-10 dataset.

VGG is a classic CNN architecture used in computer vision. Compared to earlier CNNs, it uses only 3×3 convolutions, making it easy to implement and support on existing and customized hardware accelerators.

In this course, we are going to deploy a VGG-like model onto our custom hardware accelerator and complete an end-to-end inference of image recognition. In this lab, students are requested to implement a VGG-like model in PyTorch with only the following operators:

| Layer | Type |
|---|---|
| Conv1 | Conv2D (3 → 64) |
| MaxPool | 2×2 Pooling |
| Conv2 | Conv2D (64 → 192) |
| MaxPool | 2×2 Pooling |
| Conv3 | Conv2D (192 → 384) |
| Conv4 | Conv2D (384 → 256) |
| Conv5 | Conv2D (256 → 256) |
| MaxPool | 2×2 Pooling |
| Flatten | - |
| FC6 | Linear (256*fmap_size² → 256) |
| FC7 | Linear (256 → 128) |
| FC8 | Linear (128 → num_classes) |

Students are required to design and train their model using only the allowed operators and as few parameters as possible, while ensuring the accuracy is greater than 80%. You can use any training techniques and adjust the hyperparameters (e.g. learning rate, optimizer, etc.) to achieve this goal.

For the full-precision model, your model should achieve the following metrics:

2. Quantize the VGG-like Model to INT8 Precision

Then, quantize the model to INT8 precision while preserving a high level of accuracy compared to the full-precision model.

Quantization scheme

Use the power-of-two uniform/scale, symmetric quantization previously mentioned to quantize your model.

For the quantized model, your model should achieve the following metrics:

3. Complete Your Report (report.md)

Submission Rule

import torch
import torch.nn as nn
import torch.ao.quantization as tq


class VGG(nn.Module):
    """
    Implement your model here
    """

    def __init__(self, in_channels=3, in_size=32, num_classes=10) -> None:
        super().__init__()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)
        x = torch.flatten(x, start_dim=1)
        x = self.fc6(x)
        x = self.fc7(x)
        x = self.fc8(x)
        return x


if __name__ == "__main__":
    model = VGG()
    inputs = torch.randn(1, 3, 32, 32)
    print(model)

    from torchsummary import summary
    summary(model, (3, 32, 32), device="cpu")