Model Compression — the what, why, and how of it.
In this article, we will go through the what, why, and how of model compression.
What is model compression?
Model compression is any technique that decreases the size, computation requirements, and overall footprint of a model while maintaining its performance. Generally, these savings matter at inference (test) time, whereas training can use as much compute as needed.
Why is model compression needed?
Recently, there has been a trend of developing big, bad, GPU-hungry models. From GPT-3 to BERT, they are everywhere. In production, these models are fine for server-side computation, where you can keep a large number of TPUs running in the background, but they are no good for edge devices.
Let’s look at an example via a science-fiction-esque story. Say you are visiting a friend in an autonomous taxi. A pedestrian walks in front of you, and the car has to decide, just in time, to apply the brakes. Here one can’t risk depending on a server-side connection. What if there is lag, or no network at all? So you need a GPU/TPU in the car itself, which limits the amount of compute available.
Now say you want to talk to your friend via video call. The video feed needs to be compressed at one end and decoded at the other to keep the call smooth, but would you want a server somewhere storing your personal video feed, or would you prefer the processing to happen on your mobile device, preventing any data leaks?
Now imagine you are in a foreign country, the taxi breaks down, and you want to talk to the locals using the translator in your smartwatch. Do you want to send the signal to a server and wait for its response, or do the computation on the watch itself? These are just a few scenarios among the many use cases born out of the need for lower latency, data privacy, and easier human-computer interaction.
Another under-the-radar use of model compression techniques is to improve the model's performance itself. Neural networks often carry far more parameters than they strictly need, and compression can act as a form of regularization, which leads to better generalization.
Why at Jumio?
Model compression is an essential step in putting models on edge devices, as described above, and for us it's important to have these models right in the hands of users, on their mobile phones, as they begin their user journey.
Having these models on the mobile device itself tackles minor, avoidable problems that lead to good IDs being rejected: cases where the ID is blurry or important parts are hidden by glare or other occlusions. This greatly simplifies the user journey, since good IDs no longer need to be rejected, which in turn leads to higher conversion rates for our users!
Now that you understand the need for model compression, let’s delve into how it’s done.
How is model compression done?
The model compression workflow falls into two stages: training the model and deploying the model.
Training the model
First, you need to decide which library to use; both TensorFlow and PyTorch offer model-optimization tooling, and we will touch on the relevant pieces below.
Next, you need to consider the problem at hand and choose the path to take. Now we discuss what the choices are and how to go about making the correct one.
Choice 1: Using smaller backbones
Intro:
The easiest way to reduce the model size is to use smaller backbones! A lot of work has been done to build new backbones that have good performance with fewer parameters.
Choice of model:
You can start with something like EfficientNets, which are highly optimized for classification and achieve state-of-the-art accuracy on ImageNet with an order of magnitude better efficiency.
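To make this concrete, here is a minimal sketch, assuming tf.keras.applications and an arbitrary 10-class head (both assumptions on my part), of how swapping a heavy backbone for EfficientNetB0 cuts the parameter count:

```python
import tensorflow as tf

# Load two ImageNet-style backbones without their classification heads
# and compare their parameter counts.
large = tf.keras.applications.ResNet152(include_top=False, weights=None)
small = tf.keras.applications.EfficientNetB0(include_top=False, weights=None)

print(f"ResNet152:      {large.count_params():,} parameters")
print(f"EfficientNetB0: {small.count_params():,} parameters")

# A custom head (the 10 classes here are purely illustrative) can then
# be attached on top of the smaller backbone.
x = tf.keras.layers.GlobalAveragePooling2D()(small.output)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(small.input, outputs)
```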
Consideration:
However, using smaller models directly is not the best option for complex tasks, due to their limited capacity. In such cases, we can choose one of the other options.
Choice 2: Pruning and Distillation
Pruning Intro:
Pruning means removing the weights of a model that have little impact on its performance, just as a tree’s dead leaves and branches are cut off so the remaining ones can grow better. This leads to leaner and often more generalizable models. There are two main approaches to pruning:
- Unstructured Pruning: Looks at each weight individually and checks how much it contributes to model performance. If a weight’s magnitude is low, it is zeroed out.
- Structured Pruning: Prunes whole structures, such as channels or entire layers, out of the model instead of individual weights.
Structured vs. Unstructured pruning:
Unstructured pruning is much more flexible in its formulation, since one can weed out more of the unnecessary weights. Still, it has an important bottleneck: there are no good public libraries that can run inference over sparse neural networks, so the pruned weights simply remain in the model as zeros. This means the memory the model uses in production stays the same. Where it is still useful is for zipped models: the zeroed weights compress to a smaller disk footprint, even though the model expands to its full size in memory (RAM).
Structured pruning, on the other hand, can lead to smaller models directly on export. The challenge is writing code that actually deletes the structures identified as having low impact on performance. This code needs to be written for every backbone, which means more engineering work.
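To illustrate both flavors, here is a minimal sketch assuming PyTorch and its built-in torch.nn.utils.prune module (which applies masks rather than physically shrinking tensors); the layer shapes and pruning amounts are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy layers standing in for layers of a trained network.
layer = nn.Linear(256, 128)
conv = nn.Conv2d(64, 64, kernel_size=3)

# Unstructured pruning: zero out the 30% of weights with the smallest
# L1 magnitude. The tensor keeps its shape, so memory use is unchanged,
# but the zeros compress well on disk.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Structured pruning: zero out the 25% of output channels with the
# smallest L2 norm. Actually deleting those channels from the exported
# model still requires backbone-specific surgery, as noted above.
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)

# Make the masks permanent before exporting.
prune.remove(layer, "weight")
prune.remove(conv, "weight")

# Fraction of weights that are now exactly zero.
sparsity = (layer.weight == 0).float().mean().item()
print(f"Linear layer sparsity: {sparsity:.2f}")
```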
Distillation Intro:
Knowledge distillation distills the knowledge of a larger, high-performance model into a smaller model, through student-teacher learning or other methods. It can be viewed similarly to pruning, except that the final network’s architecture is decided up front, whereas in pruning it emerges from the process.
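Here is a minimal sketch of the standard temperature-scaled student-teacher loss (assuming PyTorch; the temperature and weighting values are arbitrary placeholders):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend of soft-target (teacher) loss and hard-label cross-entropy."""
    # Soften both distributions with the temperature, then match them
    # with KL divergence. The T^2 factor keeps gradient magnitudes
    # comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Usual supervised loss against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Inside the training loop, the (frozen) teacher runs with no gradients:
#   with torch.no_grad():
#       teacher_logits = teacher(images)
#   loss = distillation_loss(student(images), teacher_logits, labels)
```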
Q. Why not use pruning with smaller models?
Smaller models like EfficientNets are already optimized for size, so they have fewer branches or nodes that are unimportant to model performance. This means methods like structured pruning quickly lose effectiveness. One can still use unstructured pruning, but its limitations, as mentioned above, make deployment challenging.
Choice of model:
- Use pruning to create a student network and train it in the student-teacher regime (though this is less common).
- Train a smaller member of the same backbone family as the student under a larger teacher.
Choice 3: Large model and pruning
This approach is ideal when one already has a model that achieves good performance and wants to create a smaller version of it. The pruning tooling discussed above is the natural place to start.
Quantization
The final step for all of the above options is quantization. Quantization is complementary to the other model compression methods: it can be applied after a model is trained (post-training quantization) or as a fine-tuning step over a smaller dataset.
The idea behind quantization is that instead of using float32, one can get decent performance even with float16 or lower-precision representations. This means weights take less space to store, and inference also runs faster.
This approach is the easiest to apply: TensorFlow supports quantization natively with TFLite, and PyTorch has some support for it as well. You can also achieve more aggressive quantization levels by using specialized deployment tools and/or hardware (processors).
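For example, here is a minimal sketch of post-training float16 quantization with the TFLite converter (the backbone choice and output file name are placeholders):

```python
import tensorflow as tf

# Stand-in for an already-trained Keras model.
model = tf.keras.applications.EfficientNetB0(weights=None)

converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Post-training quantization: let the converter optimize the weights
# and store them as float16 instead of float32.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

tflite_model = converter.convert()
with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_model)
```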
Low-rank approximation
You can decompose a matrix into smaller matrices whose product approximates it, saving space. The same can be done for the layers of a neural network. This approach isn’t very common, and the other approaches typically lead to better results.
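As a toy sketch in plain NumPy (the layer dimensions and rank are made up for illustration), factorizing a single dense layer’s weight matrix looks like this:

```python
import numpy as np

# Weight matrix of a dense layer (shapes are made up for illustration).
W = np.random.randn(1024, 512).astype(np.float32)

# Truncated SVD: keep only the top-k singular values.
k = 64
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * S[:k]        # shape (1024, k)
B = Vt[:k, :]               # shape (k, 512)

# One 1024x512 multiply becomes two smaller ones (x @ B.T then @ A.T),
# with 1024*64 + 64*512 parameters instead of 1024*512.
W_approx = A @ B
print("Original params:", W.size, "| Factorized params:", A.size + B.size)
print("Relative error:", np.linalg.norm(W - W_approx) / np.linalg.norm(W))
```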
Deploying on edge devices
After training your model, take note of the following key considerations when deploying your model on edge devices.
Model Size + Tasks
This is the obvious concern, described in the previous section. Oftentimes it’s not just one task that we try to run on the device but several. That means not one model but many, which further restricts the size each model can have. Where possible, use a single shared backbone, even if it’s larger than any single task needs, and use it in a multi-task setting with multiple heads, as sketched below.
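Here is a minimal sketch of that idea in Keras (the backbone choice, task names, and head sizes are purely illustrative):

```python
import tensorflow as tf

# A single shared backbone feeding several task-specific heads, e.g.
# one head for document classification and one for quality (glare/blur)
# detection. Output sizes are placeholders.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, pooling="avg", input_shape=(224, 224, 3))

features = backbone.output
doc_type = tf.keras.layers.Dense(20, activation="softmax", name="doc_type")(features)
quality = tf.keras.layers.Dense(2, activation="softmax", name="quality")(features)

model = tf.keras.Model(inputs=backbone.input, outputs=[doc_type, quality])
model.compile(
    optimizer="adam",
    loss={"doc_type": "categorical_crossentropy",
          "quality": "categorical_crossentropy"},
)
```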
Library Size
Apart from the model, the device must also contain the libraries used to run it, e.g., PyTorch or TensorFlow. In this respect, TensorFlow is currently ahead of PyTorch: its TFLite pipeline covers everything from training to model optimization to export and deployment with a small footprint.
Native Libraries
These days, mobile devices themselves ship with good native machine learning libraries: iOS has Core ML, while Android has TensorFlow Lite, making integration easier.
Summary
Model compression is a powerful tool in the ML toolkit that helps solve problems not only on a plethora of IoT devices but also on the server side. It can lead to gains in generalization on new data. We have only briefly touched on this topic; one interesting further read is the Lottery Ticket Hypothesis paper.
I hope this post helped you figure out where to get started with all these tools and build upon them!