How much memory does my GPU need to run an LLM?

Inference

For this section, let's assume we're talking about inference (getting the LLM to generate output based on a prompt), not training or fine-tuning.

The way to estimate this is to multiply the number of parameters by the size of each parameter at the model's precision (usually "half-precision," i.e. 16 bits per parameter). Since 16 bits is 2 bytes and a gigabyte is roughly 1 billion bytes, your 34-billion-parameter model at 16-bit precision works out to:

34 billion parameters * 2 bytes/parameter = 68 billion bytes ≈ 68 GB

Similarly, if a model stores its parameters in full precision (32 bits, or 4 bytes per parameter), the calculation becomes:

34 billion parameters * 4 bytes/parameter = 136 billion bytes ≈ 136 GB
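If you want to script this back-of-the-envelope math, here is a minimal Python sketch of the same formula (the function name and the list of precisions are just illustrative choices, and the estimate covers only the weights):

def estimate_weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Rough memory needed just to hold the model weights, in GB (1 GB = 1e9 bytes).

    This covers only the weights; real inference needs some extra room for
    activations and the KV cache.
    """
    bytes_per_parameter = bits_per_parameter / 8
    return num_parameters * bytes_per_parameter / 1e9


params = 34e9  # 34-billion-parameter model
for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: ~{estimate_weight_memory_gb(params, bits):.0f} GB")
# prints: 32-bit: ~136 GB, 16-bit: ~68 GB, 8-bit: ~34 GB, 4-bit: ~17 GB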

As far as I know, the single GPUs with the most memory available are the NVIDIA A100 and the NVIDIA H100, each offering 80 GB of GPU RAM.

Because of these large memory requirements, the larger models are usually quantized down to lower precision, which can reduce accuracy in some cases but makes them feasible to run on single-GPU machines.
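For example, quantizing the same 34-billion-parameter model down to 4-bit precision (0.5 bytes per parameter) gives roughly:

34 billion parameters * 0.5 bytes/parameter = 17 billion bytes ≈ 17 GB

which fits on a single consumer GPU with 24 GB of memory, with some headroom left over for the rest of the inference workload.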

There are also tools that optimize throughput and memory use when serving a model, such as vLLM, which is reasonably easy to set up if you're familiar with Linux.
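As a rough illustration, here is a minimal sketch using vLLM's Python API (the model name is a placeholder and the parameters shown are just common ones; check the vLLM documentation for the options supported by your version):

from vllm import LLM, SamplingParams

# Placeholder model name; substitute a model that fits in your GPU's memory.
llm = LLM(
    model="your-org/your-34b-model",
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM is allowed to use
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["How much GPU memory does a 34B model need?"], sampling_params)
print(outputs[0].outputs[0].text)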

Some other good reading that I've come across on this topic: