Quantization in Machine Learning and Deep Learning
FP32 (32-bit floating point) is the most common precision for training new models, while FP16, INT8, and other lower-precision formats are used for quantization. Quantization converts higher-precision values to lower-precision ones; the resulting loss of precision can degrade model prediction quality, but it also brings substantial benefits.
Just to run inference on a large language model (LLM) such as BLOOM-176B, you would need 8x NVIDIA 80GB A100 GPUs[^1]. Currently, NVIDIA A100 80GB GPUs have an average price of $23,454.24 on Coin Poet.
To fine-tune BLOOM-176B, you'd need 72 of these GPUs!
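To see where numbers like these come from, here is a back-of-the-envelope calculation of how much memory just the weights of a 176-billion-parameter model occupy at different precisions (activations, KV cache, optimizer state, and framework overhead all come on top of this):

```python
# Rough memory footprint of BLOOM-176B's weights alone, at different precisions.
params = 176e9
for fmt, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{fmt}: {params * bytes_per_param / 1e9:.0f} GB")
# FP32: 704 GB, FP16: 352 GB, INT8: 176 GB
```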
Quantizing has the following benefits (a minimal quantization sketch follows the list):
- Reduced Memory Usage: For example, FP16 requires half the memory footprint of FP32 and INT8 half the memory of FP16. This enables training of larger models or larger mini-batches on the same hardware.
- Faster Computation: On many GPUs or TPUs (Tensor Processing Units), using FP16 or INT8 can speed up arithmetic operations over FP32, reducing training time.
- Improved Resource Utilization: Certain GPUs (for example, those with NVIDIA Tensor Cores) are optimized for FP16 operations, allowing for more efficient utilization of these resources.
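To make the "lower precision" idea concrete, here is a minimal sketch of symmetric (absmax) INT8 quantization of a single tensor in PyTorch. The function names are mine, not any particular library's API:

```python
import torch

# Minimal sketch: symmetric (absmax) INT8 quantization of one tensor.
def quantize_int8(x: torch.Tensor):
    scale = x.abs().max() / 127.0                      # map the largest magnitude to 127
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

x = torch.randn(1024, 1024)                            # FP32: 4 bytes per element
q, scale = quantize_int8(x)                            # INT8: 1 byte per element (4x smaller)
x_hat = dequantize_int8(q, scale)

print(x.nelement() * x.element_size())                 # 4194304 bytes
print(q.nelement() * q.element_size())                 # 1048576 bytes
print((x - x_hat).abs().max())                         # worst-case rounding error, about scale/2
```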
Is Quantization Necessary?
Lower precision is increasingly used right from the start of training new models, particularly in deep learning. The most common approach is known as mixed precision training: most computation runs in FP16 while master weights and numerically sensitive operations stay in FP32, combining the speed and memory savings of FP16 with the stability of FP32.
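As a rough illustration, here is what a mixed-precision training step looks like with PyTorch's automatic mixed precision (AMP). The tiny model and random data are placeholders, and a CUDA device is assumed:

```python
import torch
from torch import nn

# A minimal mixed-precision training loop using PyTorch AMP.
# (Newer PyTorch versions prefer torch.amp.autocast("cuda") / torch.amp.GradScaler("cuda").)
device = "cuda"
model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid FP16 gradient underflow

for step in range(10):
    inputs = torch.randn(64, 512, device=device)
    targets = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # eligible ops run in FP16, the rest stay in FP32
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()             # backward pass on the scaled loss
    scaler.step(optimizer)                    # unscales gradients, then takes the optimizer step
    scaler.update()                           # adapts the loss scale for the next iteration
```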
Quantization is evolving rapidly. For example, in 2022 the LLM.int8() paper[^2] demonstrated that quantizing from 16-bit down to 8-bit, while keeping a small number of outlier feature dimensions in 16-bit precision, preserves zero-shot LLM performance. In their 2023 study, Liu et al. demonstrated that 4-bit floating-point quantization can effectively reduce model size without substantially compromising performance.[^3] I'm sure there are others; these are just the ones I'm aware of.
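If you want to try 8-bit inference yourself, Hugging Face Transformers integrates bitsandbytes' LLM.int8() behind a single flag. A hedged sketch, using a small BLOOM checkpoint as a stand-in for BLOOM-176B; the exact arguments depend on your transformers/bitsandbytes versions, and newer releases prefer passing a `BitsAndBytesConfig`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a model with its linear-layer weights quantized to INT8 via bitsandbytes.
# In newer transformers, use quantization_config=BitsAndBytesConfig(load_in_8bit=True).
model_name = "bigscience/bloom-560m"   # a small BLOOM variant, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",      # spread layers across available GPUs/CPU
    load_in_8bit=True,      # LLM.int8() quantization of the weights
)

inputs = tokenizer("Quantization lets us", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```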
To gain a more intuitive understanding of quantization, I recommend Tim Dettmers' blog posts:
- LLM.int8() and Emergent Features
- A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes by Hugging Face.
Footnotes

[^1]: A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes by Hugging Face.

[^2]: Dettmers, Tim, et al. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." arXiv preprint arXiv:2208.07339 (2022).

[^3]: Liu, Shih-yang, et al. "LLM-FP4: 4-Bit Floating-Point Quantized Transformers." arXiv preprint arXiv:2310.16836 (2023).