Inference
Infrastructure
The process of running a trained AI model to generate outputs — what happens when you use an AI tool.
Full Explanation
Training creates an AI model; inference is using that trained model to generate predictions or responses. For commercial deployment, AI companies optimize inference cost and latency, typically measured in tokens per second and cost per token. Faster, cheaper inference is achieved through hardware optimization (GPUs/TPUs), model distillation, quantization, and batching.
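The two metrics above are simple to compute from a single inference call. A minimal sketch, assuming a hypothetical response with a known output-token count, the elapsed wall-clock time, and a provider price quoted per million tokens (the function name and parameters are illustrative, not any specific API):

```python
def inference_metrics(output_tokens, elapsed_seconds, price_per_million_tokens):
    """Compute throughput (tokens/second) and cost for one inference call.

    Assumes hypothetical values: token counts and pricing vary by provider.
    """
    tokens_per_second = output_tokens / elapsed_seconds
    cost = output_tokens / 1_000_000 * price_per_million_tokens
    return tokens_per_second, cost

# Example: 480 output tokens generated in 6 seconds at $2.00 per million tokens.
tps, cost = inference_metrics(output_tokens=480, elapsed_seconds=6.0,
                              price_per_million_tokens=2.00)
print(f"{tps:.0f} tokens/s, ${cost:.5f}")
```

Batching improves the tokens-per-second figure by amortizing fixed per-request overhead across many concurrent requests, which is why it is listed alongside the model-level techniques.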
Related Terms
Latency: The time delay between sending a request to an AI API and receiving the first token of the response.
Large Language Model (LLM): A type of AI model trained on vast amounts of text data that can generate, summarize, translate, and reason about language.
Token: The basic unit of text that AI language models process — roughly equivalent to 3/4 of a word in English.