We Compare AI

Quantization

Infrastructure
Simple Definition

A technique for reducing an AI model's size and memory requirements by representing its weights with lower-precision numbers.

Full Explanation

A full-precision model stores each weight as a 32-bit float. 4-bit quantization reduces each weight to 4 bits, an 8x size reduction over FP32 (4x over the 16-bit formats most models ship in), with modest quality loss. This makes large models runnable on far less hardware: LLaMA 3.1 405B needs roughly 810GB of VRAM in 16-bit precision and about 203GB at 4 bits, while a 4-bit 70B model (~35GB) fits on a single A100 80GB GPU. Tools: llama.cpp, GGUF format, bitsandbytes.
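The core idea can be sketched in a few lines. This is a minimal illustration of absmax (symmetric) quantization to a signed 4-bit range, not the implementation used by llama.cpp or bitsandbytes; real tools pack two 4-bit values per byte and use per-block rather than per-tensor scales. All names here are hypothetical.

```python
import numpy as np

def quantize_4bit(w: np.ndarray):
    # Absmax (symmetric) quantization: map floats onto the signed
    # 4-bit integer range [-7, 7], keeping one float32 scale factor.
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate weights; information lost to rounding
    # is bounded by half a quantization step (scale / 2).
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 4)).astype(np.float32)

q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale)
max_err = float(np.abs(w - w_hat).max())
print(max_err <= scale / 2)  # rounding error stays within half a step
```

The storage saving comes from `q` holding 4 bits of information per weight instead of 32; only the single scale remains in full precision, which is why the per-weight cost approaches 4 bits as tensors grow.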

Last verified: 2026-03-30