Quantization in Machine Learning and Deep Learning

FP32 (32-bit floating point) is the most common precision for training new models, while lower-precision formats such as FP16 and INT8 are used for quantization. Quantization converts higher-precision values to lower-precision ones, and the resulting loss of precision may degrade model prediction quality. In return, quantization reduces the memory and hardware needed to run a model.

Just to run inference on a large language model (LLM) such as BLOOM-176B, you would need 8x NVIDIA A100 80GB GPUs [1]. At the time of writing, the NVIDIA A100 80GB has an average price of $22,628.30 on Coin Poet.

To fine-tune BLOOM-176B, you'd need 72 of these GPUs!
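As a rough sanity check on these numbers, the sketch below estimates the memory needed just to hold the weights at different precisions, plus the GPU bill implied by the price quoted above. It assumes 176 billion parameters (taken from the model's name) and counts weights only, ignoring activations, optimizer state, and framework overhead. The helper names `weight_memory_gb` and `cluster_cost_usd` are illustrative, not part of any library.

```python
# Assumption: parameter count read off the model name "BLOOM-176B".
NUM_PARAMS = 176e9

# Bytes used to store one weight at each precision.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

# Per-GPU price quoted in the text (USD).
GPU_PRICE_USD = 22_628.30


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def cluster_cost_usd(num_gpus: int, price_per_gpu: float) -> float:
    """Total purchase price for a given number of GPUs."""
    return num_gpus * price_per_gpu


if __name__ == "__main__":
    for name, size in BYTES_PER_PARAM.items():
        print(f"{name}: {weight_memory_gb(NUM_PARAMS, size):.0f} GB of weights")
    for gpus in (8, 72):
        print(f"{gpus} GPUs: ${cluster_cost_usd(gpus, GPU_PRICE_USD):,.2f}")
```

Under these assumptions the weights alone take about 704 GB in FP32, 352 GB at 16 bits, and 176 GB in INT8, so each halving of the bits per weight halves the memory the weights occupy. At the quoted price, 8 GPUs cost roughly $181,000 and 72 GPUs roughly $1.63 million.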

Quantizing has the following benefits:

Is Quantization Necessary?

FP16 is increasingly used right from the start of training new models, particularly in deep learning. This approach is known as mixed precision training: it combines FP32 and FP16 to reduce memory use and speed up training.
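One reason mixed precision training keeps some values in FP32 rather than using FP16 everywhere is that small weight updates can vanish when rounded to FP16. The toy sketch below illustrates this with Python's standard `struct` module, which can round a number to half precision (format `"e"`) or single precision (format `"f"`). The weight, update size, and step count are made-up illustrative values, and `apply_updates` is a hypothetical helper, not a real training loop.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 (half precision) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest FP32 (single precision) value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def apply_updates(weight: float, update: float, steps: int, cast) -> float:
    """Add `update` to `weight` `steps` times, storing the weight via `cast` each step."""
    w = cast(weight)
    for _ in range(steps):
        w = cast(w + update)
    return w


# Illustrative values: a weight of 1.0 receiving 1000 updates of 1e-4 each.
fp16_weight = apply_updates(1.0, 1e-4, 1000, to_fp16)
fp32_weight = apply_updates(1.0, 1e-4, 1000, to_fp32)

if __name__ == "__main__":
    print("weight stored in FP16:", fp16_weight)
    print("weight stored in FP32:", fp32_weight)
```

Near 1.0, adjacent FP16 values are about 0.001 apart, so each update of 0.0001 is rounded away and the FP16 weight never moves, while the FP32 weight ends up close to 1.1.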

Quantization is rapidly evolving. For example, the LLM.int8() paper [2] (2022) demonstrated that quantizing from 16-bit down to 8-bit, while keeping a few outlier dimensions in higher precision, maintains zero-shot LLM performance. In their 2023 study, Liu et al. [3] demonstrated that 4-bit floating-point quantization can reduce model size without substantially compromising performance. There are surely other results, but these are the ones I am aware of.
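To see why keeping a few dimensions in high precision can matter, here is a toy sketch of symmetric absmax quantization to 8-bit integers. It is a simplified illustration on a single list of numbers, not the implementation from the LLM.int8() paper or from bitsandbytes. The example values and the outlier `threshold` are assumptions chosen for illustration, and all function names are hypothetical.

```python
def absmax_quantize(values):
    """Scale values so the largest magnitude maps to 127, then round to integers."""
    largest = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    scale = 127 / largest
    return [round(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Map 8-bit integers back to approximate real values."""
    return [q / scale for q in quantized]


def quantize_with_outliers(values, threshold):
    """Quantize small values to INT8 but pass large-magnitude values through unchanged."""
    small = [v for v in values if abs(v) < threshold]
    restored_small = iter(dequantize(*absmax_quantize(small))) if small else iter([])
    # Rebuild the list in its original order.
    return [v if abs(v) >= threshold else next(restored_small) for v in values]


def max_abs_error(original, restored):
    """Largest absolute difference between original and restored values."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Illustrative data: three small values and one large outlier.
values = [0.1, -0.2, 0.3, 60.0]

naive = dequantize(*absmax_quantize(values))
mixed = quantize_with_outliers(values, threshold=6.0)

naive_error = max_abs_error(values, naive)
mixed_error = max_abs_error(values, mixed)

if __name__ == "__main__":
    print("all values in INT8, max error:", naive_error)
    print("outlier kept in high precision, max error:", mixed_error)
```

When everything shares one scale, the outlier stretches the 8-bit range so much that the small values collapse to 0 or 1 and lose most of their information. Handling the outlier separately lets the remaining values use the full 8-bit range, which brings the error down by orders of magnitude in this example.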

To gain a more intuitive understanding of quantization, I recommend Tim Dettmers' blog posts:

Footnotes

  1. Hugging Face. "A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes."

  2. Dettmers, Tim, et al. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." arXiv preprint arXiv:2208.07339 (2022).

  3. Liu, Shih-yang, et al. "LLM-FP4: 4-Bit Floating-Point Quantized Transformers." arXiv preprint arXiv:2310.16836 (2023).