Inference
Infrastructure
The process of running a trained AI model to generate outputs — what happens when you use an AI tool.
Full Explanation
Training creates an AI model; inference is using that trained model to generate predictions or responses. For commercial deployment, AI companies optimize inference cost and latency, typically measured in tokens per second and cost per token. Faster, cheaper inference is achieved through hardware optimization (GPUs/TPUs), model distillation, quantization, and batching.
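The two metrics above are simple to compute from a single inference call. A minimal sketch, assuming a hypothetical response with a known output-token count, the elapsed wall-clock time, and a provider price quoted per million tokens (the function name and parameters are illustrative, not any specific API):

```python
def inference_metrics(output_tokens, elapsed_seconds, price_per_million_tokens):
    """Compute throughput (tokens/second) and cost for one inference call.

    Assumes hypothetical values: token counts and pricing vary by provider.
    """
    tokens_per_second = output_tokens / elapsed_seconds
    cost = output_tokens / 1_000_000 * price_per_million_tokens
    return tokens_per_second, cost

# Example: 480 output tokens generated in 6 seconds at $2.00 per million tokens.
tps, cost = inference_metrics(output_tokens=480, elapsed_seconds=6.0,
                              price_per_million_tokens=2.00)
print(f"{tps:.0f} tokens/s, ${cost:.5f}")
```

Batching improves the tokens-per-second figure by amortizing fixed per-request overhead across many concurrent requests, which is why it is listed alongside the model-level techniques.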
Related Terms
Latency: The time delay between sending a request to an AI API and receiving the first token of the response.
Large Language Model (LLM): A type of AI model trained on vast amounts of text data that can generate, summarize, translate, and reason about language.
Token: The basic unit of text that AI language models process — roughly equivalent to 3/4 of a word in English.