How Quantization helps Huge Neural Networks run on Tiny Hardware

Nandini Tengli
Jun 5, 2024


Quantization refers to constraining an input from a continuous set of values to a discrete set of values. Constraining values this way reduces computational load, since floating-point computations are expensive. Restricting precision and representing weights with 16, 8, or 4 bits rather than 32 bits also reduces storage. This is how quantization helps a huge neural network run on our phones or laptops.

Another reason to quantize a model is hardware: some chips, like heterogeneous compute chips or the microcontrollers that go into edge devices (phones, or even the in-car devices that run neural networks), can only perform computations in 16-bit or 8-bit. These hardware constraints require models to be quantized before they can run on such devices.

Whenever we quantize a model, we should expect some loss in accuracy. The graph below shows how accuracy drops as the model is compressed more and more (compression increases from right to left).

Figure: Accuracy vs. compression rate of AlexNet (source: https://arxiv.org/pdf/1510.00149)

The graph above shows that we can compress the model to about 11% of its original size before there is a drastic loss in accuracy.

The goal of quantization and other compression techniques is to reduce model size and inference latency without losing accuracy.

This article will briefly go over two methods of quantizing neural networks:

  • K-Means Quantization
  • Linear Quantization

Note: this article is based on my notes from an MIT lecture on quantization.

K-Means Quantization

This method works by grouping the weights in the weight matrix using the K-means clustering algorithm. The centroid of each cluster becomes the quantized weight, and the weight matrix is then stored as indices into a lookup table of centroids.
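A minimal sketch of the idea in Python, using scikit-learn's KMeans (the 4×4 matrix and the 2-bit / 4-centroid setup are illustrative assumptions, not necessarily the lecture's exact values):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative 4x4 weight matrix (assumed values).
weights = np.array([
    [ 2.09, -0.98,  1.48,  0.09],
    [ 0.05, -0.14, -1.08,  2.12],
    [-0.91,  1.92,  0.00, -1.03],
    [ 1.87,  0.00,  1.53,  1.49],
], dtype=np.float32)

# 2-bit quantization -> 2**2 = 4 clusters, i.e. 4 centroids.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(weights.reshape(-1, 1))
indices = kmeans.labels_.astype(np.uint8)      # one small integer index per weight
codebook = kmeans.cluster_centers_.flatten()   # lookup table of centroids (the quantized weights)

# Reconstruct ("decompress") the weights from indices + lookup table.
reconstructed = codebook[indices].reshape(weights.shape)
```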

Figure: K-means quantization (source: https://arxiv.org/pdf/1510.00149)

Here, the squares with the same colors belong to the same cluster and are represented by an index into the centroid lookup table. Let's look at how this reduces storage.

Storage (16 weights, 4 centroids):

  • Original weights (32-bit float): 16 × 32 bits = 512 bits = 64 Bytes
  • Indices (2-bit uint): 16 × 2 bits = 32 bits = 4 Bytes
  • Lookup table (32-bit float): 4 × 32 bits = 128 bits = 16 Bytes
  • Quantized weights (indices + lookup table): 4 + 16 = 20 Bytes
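In other words, we store 20 Bytes instead of 64 Bytes, a compression ratio of 64 / 20 = 3.2×.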

Quantization error is the difference between the reconstructed weights (each weight replaced by its centroid) and the original weights. For the example above, the quantization error is simply the original weight matrix minus the reconstructed one.
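Using the names from the sketch above (a hypothetical helper, not code from the lecture):

```python
import numpy as np

def quantization_error(weights, codebook, indices):
    """Difference between the original weights and the weights
    reconstructed from the centroid lookup table."""
    reconstructed = codebook[indices].reshape(weights.shape)
    return weights - reconstructed
```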

We can “fine-tune” the quantized weights (the centroid values) to reduce this error. We compute the gradient of the loss with respect to each weight and group the gradients into the same clusters as the weights. We then sum the gradients within each cluster, multiply each sum by the learning rate, and subtract the result from the corresponding initial centroid. In this way, we can tune the quantized weights.
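A sketch of one such update step, reusing the codebook and indices names assumed earlier; the weight gradients would come from backpropagation, and the learning rate is a placeholder:

```python
import numpy as np

def finetune_centroids(codebook, indices, weight_grads, lr=0.01):
    """Group the weight gradients by cluster, sum each group, and take
    a gradient step on the corresponding centroid."""
    new_codebook = codebook.copy()
    flat_grads = weight_grads.flatten()
    for k in range(len(new_codebook)):
        grad_sum = flat_grads[indices == k].sum()   # accumulate gradients of cluster k
        new_codebook[k] -= lr * grad_sum            # update the quantized weight (centroid)
    return new_codebook
```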

Figure: Fine-tuning the centroids in K-means quantization (source: https://arxiv.org/pdf/1510.00149)

Here, the weights are stored as integers (the indices). During computation, we ‘decompress’ the weights using the lookup table (the centroid table) and use these decompressed weights for the computation. So we don’t reduce the computational load, since the centroids are still represented in floating point, but memory use is drastically reduced.
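As a sketch under the same assumptions, a layer's forward pass would look up the centroids first and then multiply in floating point:

```python
import numpy as np

def forward(x, codebook, indices, weight_shape):
    """Look up ('decompress') the float centroids for each index,
    then run the usual floating-point matrix multiply."""
    W = codebook[indices].reshape(weight_shape)   # float weights rebuilt from the lookup table
    return x @ W.T                                # computation is still floating point
```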

This method is useful when memory is the bottleneck, as it is for large language models such as LLaMA 2.

Linear Quantization

Linear quantization works by using an affine mapping between integers and real numbers (the weights).

The equation used for mapping is:
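Written here in the standard notation, which I'm assuming matches the lecture's: r is the real (floating-point) value, q the quantized integer, S the floating-point scale factor, and Z the integer zero point:

r = S · (q − Z)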

Here is how we figure out the zero point Z and the scale factor S:
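Assuming the standard convention, with r_max and r_min the largest and smallest real weights, and q_max and q_min the largest and smallest representable integers:

S = (r_max − r_min) / (q_max − q_min)

Z = round(q_min − r_min / S)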

Example:

Given a weight matrix, we first calculate S from the smallest and largest weights in the matrix and the integer range we are quantizing to. We then calculate the zero point Z; we need to round the result, since the goal is for Z to be an integer. Applying the mapping with this S and Z to every weight gives the quantized (integer) weight matrix, and that is what linear quantization looks like in practice.
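Here is a sketch of the whole procedure in Python. The 4×4 matrix, the 2-bit integer range [−2, 1], and the helper name linear_quantize are illustrative assumptions rather than the exact values from the lecture:

```python
import numpy as np

def linear_quantize(weights, q_min=-2, q_max=1):
    """Quantize a float weight matrix to integers in [q_min, q_max]
    using the affine mapping r = S * (q - Z)."""
    r_min, r_max = weights.min(), weights.max()
    S = (r_max - r_min) / (q_max - q_min)     # scale factor (float)
    Z = int(round(q_min - r_min / S))         # zero point, rounded to an integer
    q = np.clip(np.round(weights / S + Z), q_min, q_max).astype(np.int8)
    return q, S, Z

# Illustrative weight matrix.
W = np.array([
    [ 2.09, -0.98,  1.48,  0.09],
    [ 0.05, -0.14, -1.08,  2.12],
    [-0.91,  1.92,  0.00, -1.03],
    [ 1.87,  0.00,  1.53,  1.49],
], dtype=np.float32)

q_W, S_W, Z_W = linear_quantize(W)
W_approx = S_W * (q_W.astype(np.float32) - Z_W)   # de-quantize to inspect the error
```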

Now for the computation aspect: we can substitute the mapping equation for the weight matrix (and likewise for the input) into the computation.

For instance, a matrix multiplication with quantized weights would look like this:
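In the standard notation (assumed here, not copied verbatim from the lecture), with the input X quantized the same way as the weight matrix W:

Y = W X, where W = S_W (q_W − Z_W) and X = S_X (q_X − Z_X)

⇒ Y = S_W S_X (q_W − Z_W)(q_X − Z_X)

The (q_W − Z_W)(q_X − Z_X) part is pure integer arithmetic, and only the final rescaling by S_W S_X happens in floating point, which is how linear quantization reduces the computational load as well as the storage.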

