Are you wondering when to use model quantization for a deep learning model? Or maybe you want to learn more about what model quantization entails? Either way, you are in the right place! In this article, we tell you everything you need to know to understand when to use model quantization.
First, we provide a high-level overview of what model quantization is and how it is applied. After that, we discuss some of its main advantages and disadvantages. Finally, we provide examples of situations where model quantization should and should not be used.
What is model quantization?
What is model quantization? Model quantization is a technique that reduces the amount of memory and compute required to train or run inference with a deep learning model. This is traditionally done by changing the data type used to represent model weights and activations from a high-precision data type to a lower-precision one that takes up less memory. This may mean moving from a larger float to a smaller float, such as from 32-bit to 16-bit floating point, or making a larger change, such as converting the representation from a 32-bit float to an 8-bit integer.
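To make the float-to-integer conversion concrete, here is a minimal sketch of one common scheme, affine (asymmetric) quantization, written in plain Python. The function names and the example weights are illustrative, not taken from any particular library.

```python
def quantize(values, num_bits=8):
    """Map a list of floats to integers in [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    # One step of the integer grid in float units; guard against a flat range.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    # The integer that represents the float value 0.0.
    zero_point = round(qmin - lo / scale)
    # Round each value onto the integer grid and clamp to the valid range.
    return [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values], scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Approximately recover the original floats."""
    return [(q - zero_point) * scale for q in quantized]

weights = [-1.2, -0.3, 0.0, 0.4, 1.5]
q, scale, zp = quantize(weights)
recovered = dequantize(q, scale, zp)
```

Each 8-bit integer in `q` occupies a quarter of the memory of a 32-bit float, and `recovered` differs from `weights` by at most about half a quantization step.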
What data is required for model quantization?
Is there any special data required in order to apply model quantization? The short answer is no. Model quantization is a space-saving technique that makes it possible for a large model to be deployed on a smaller machine, so you would use the same data to train or run inference with a quantized model as you would with the full-precision version of the model. The one nuance is that some post-training quantization methods use a small calibration set to choose the quantization ranges, but this is simply a representative sample of the data you already have.
Advantages and disadvantages of model quantization
What are some of the main advantages and disadvantages of model quantization? In this section, we will discuss some of the main advantages and disadvantages of model quantization.
Advantages of model quantization
Here are some of the primary advantages of model quantization.
- Run large models on small compute. One of the main advantages of model quantization is that it enables you to run increasingly large models on small compute resources. This can be very important if you work in an environment where you do not have access to large compute resources.
- Cheaper inference. Large compute resources can become very expensive very fast, especially if you are using cloud computing resources rather than resources that are owned by your company. If you are able to run your models on small compute instances, then you will almost certainly incur cost benefits. If you are using resources that are provisioned by a third party, then you will pay less to rent them. If you are using resources that are fully owned by your company, you still stand to save on energy costs.
- Faster inference. In addition to enabling you to run large models on smaller compute, model quantization can also reduce the time required to run inference with a model. Lower-precision arithmetic is typically faster and moves less data through memory, which can be an important advantage if you are operating in an environment where low latency matters.
- Enables running models on embedded devices. An additional benefit of being able to run large models on small compute is that this enables you to run large models on embedded devices that have limited compute power.
- Can be applied after training. Model quantization is a technique that can be applied at multiple stages in the model development and deployment lifecycle, including after a model has already been trained. That means that it is a viable option if you already have a large model trained and you want to be able to use that specific model for inference.
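The last point above, applying quantization after training, can be sketched in a few lines: take the weights of an already-trained model, quantize them, and compare the model's output before and after. This is a toy illustration in plain Python, assuming the same affine scheme described earlier; the tiny linear model and its weights are made up for the example.

```python
def affine_quantize(w, num_bits=8):
    """Quantize a list of floats to integers in [0, 2**num_bits - 1]."""
    qmax = 2 ** num_bits - 1
    lo, hi = min(w), max(w)
    scale = (hi - lo) / qmax or 1.0
    zp = round(-lo / scale)
    return [min(qmax, max(0, round(v / scale) + zp)) for v in w], scale, zp

def predict(weights, x):
    """A one-layer 'model': the dot product of weights with an input vector."""
    return sum(w * xi for w, xi in zip(weights, x))

trained_weights = [0.8, -0.5, 0.3, 1.1]   # pretend these came from training
q, scale, zp = affine_quantize(trained_weights)
deq = [(qi - zp) * scale for qi in q]     # weights as seen at inference time

x = [1.0, 2.0, 3.0, 4.0]
full_precision = predict(trained_weights, x)
quantized_out = predict(deq, x)
# The two outputs agree closely, and no retraining was needed.
```

Because the conversion only touches the stored weights, it can be bolted onto a finished model without revisiting the training pipeline.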
Disadvantages of model quantization
Here are some of the primary disadvantages of model quantization.
- Potential loss of accuracy. When you reduce the precision of the data types that you are using to represent your data, you are potentially losing some information in your data. This has the potential to result in small reductions in accuracy. While these reductions are typically relatively small, they can be costly if you are working in a situation where small increases in accuracy reap large gains.
- More decisions to be made on configuration. When you implement model quantization, there are often choices that need to be made regarding how to quantize the data. For example, you may need to decide the new data type that should be used to represent the data or the quantization algorithm that should be applied. This means that you will need to use more time and brain power to configure your model. There may also be more debugging required.
- Operations may not be available for new data type. One disadvantage of model quantization is that when you change the data type that is used to represent model weights and activations, you risk running into a situation where the new data type you have switched to does not support operations that you were previously performing on the data. This is particularly common if you make large changes to the data type, such as switching from a float representation to an integer representation. In these situations, you may need to change the operations you are using (if possible).
- Hardware may not support the new data type. It is also possible that you may run into situations where the hardware that you are using does not support the new data type that you want to use. In these situations, you may have to pivot and think of another way to reduce your memory usage.
- More difficult to represent very small or large values. When you reduce the size of the data type that you are using to represent a number, you usually reduce the range of values that you are able to represent with that data type. This means that you may run into difficulties when trying to represent very large or very small values. This may make it more difficult to represent outlier cases. Depending on the systems that you are using, this could result in errors being raised or null values being introduced.
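The last disadvantage above, the reduced range, is easy to demonstrate. In this minimal sketch (again assuming the affine scheme from earlier, with illustrative numbers), a single outlier stretches the quantization range, which makes the step size coarse and destroys precision for the small values:

```python
def quantize_dequantize(values, num_bits=8):
    """Round-trip a list of floats through an 8-bit integer representation."""
    qmax = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0
    zp = round(-lo / scale)
    q = [min(qmax, max(0, round(v / scale) + zp)) for v in values]
    return [(qi - zp) * scale for qi in q]

small = [0.01, 0.02, 0.03, 0.04]
with_outlier = small + [100.0]            # one very large outlier value

err_small = max(abs(a - b) for a, b in zip(small, quantize_dequantize(small)))
err_outlier = max(abs(a - b) for a, b in
                  zip(small, quantize_dequantize(with_outlier)))
# err_outlier is orders of magnitude larger: the outlier forced the 255
# available integer steps to cover the range up to 100.0.
```

This is why outlier handling (for example, clipping the range before quantizing) is a common configuration decision when applying quantization in practice.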
When to use model quantization
When should you use model quantization to reduce the size of your model? Here are some examples of situations where it makes sense to use model quantization.
- When a model is too large to run on available compute. The main reason to use model quantization is if the model you want to use cannot be run on the hardware you have available because it is too large. This may happen when you are working in a company that uses in-house compute resources rather than cloud compute resources. If no one has required large compute resources before, they may not be available. This may also happen if you are working on embedded devices.
When not to use model quantization
When should you avoid using model quantization to reduce the size of your model? Here are some examples of situations where it does not make sense to use model quantization.
- When small improvements in model accuracy are very advantageous. There are some situations where small improvements in model accuracy can result in large rewards. In these situations, model accuracy may be a much more important metric than other operational metrics like cost and latency. When you use model quantization, you are often trading small reductions in model accuracy for large improvements in operational metrics. If model accuracy matters above all else, then model quantization may not make sense for you.
- When large compute resources are readily available. If you work in an environment where large compute resources are readily available for cheap (or even for free), then you may not need to use model quantization. Model quantization is most commonly used to avoid having to use large compute resources, so if this is not a concern for you then you may not need to use model quantization.