We Compare AI

Quantization

Infrastructure
Simple Definition

A technique for reducing an AI model's size and memory requirements by representing its weights with lower-precision numbers.

Full Explanation

A full-precision model stores each weight as a 32-bit float. 4-bit quantization reduces each weight to 4 bits, an 8x size reduction over FP32 (4x over the 16-bit formats most models ship in), with modest quality loss. This makes large models runnable on far less hardware: LLaMA 3.1 405B needs roughly 810GB of VRAM in 16-bit precision and about 203GB at 4 bits, while a 4-bit 70B model (~35GB) fits on a single A100 80GB GPU. Tools: llama.cpp, GGUF format, bitsandbytes.
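The core idea can be sketched in a few lines. This is a minimal illustration of absmax (symmetric) quantization to a signed 4-bit range, not the implementation used by llama.cpp or bitsandbytes; real tools pack two 4-bit values per byte and use per-block rather than per-tensor scales. All names here are hypothetical.

```python
import numpy as np

def quantize_4bit(w: np.ndarray):
    # Absmax (symmetric) quantization: map floats onto the signed
    # 4-bit integer range [-7, 7], keeping one float32 scale factor.
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate weights; information lost to rounding
    # is bounded by half a quantization step (scale / 2).
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 4)).astype(np.float32)

q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale)
max_err = float(np.abs(w - w_hat).max())
print(max_err <= scale / 2)  # rounding error stays within half a step
```

The storage saving comes from `q` holding 4 bits of information per weight instead of 32; only the single scale remains in full precision, which is why the per-weight cost approaches 4 bits as tensors grow.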

Last verified: 2026-03-30