Tokenization
The process of splitting text into tokens (the basic units an LLM processes) before feeding it to a model.
Full Explanation
Different models use different tokenization approaches; GPT models use Byte Pair Encoding (BPE), which builds a vocabulary by repeatedly merging the most frequent adjacent symbol pairs. In English, one token averages roughly 4 characters, but the ratio varies by language: Chinese, Japanese, and Korean text typically requires more tokens per character. Tokenization therefore affects cost (API pricing is per token), context window usage, and model performance on non-English languages.
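To make the BPE merging idea concrete, here is a minimal character-level sketch in Python. This is an illustration of the core algorithm only, not GPT's actual tokenizer (which operates on bytes and uses a large pretrained merge table); the function names and the toy training string are invented for this example.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one (or None)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of the adjacent pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe(text, num_merges):
    """Learn `num_merges` BPE merges on one string, starting from characters."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens

# Frequent substrings like "low" fuse into single tokens after a few merges.
print(bpe("low lower lowest", 2))
```

After two merges the repeated prefix "low" has become a single token, while rarer suffixes like "er" and "est" remain as individual characters; this is why common words cost one token while rare or non-English words are split into several.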
Related Terms
Token: The basic unit of text that AI language models process — roughly equivalent to 3/4 of a word in English.
Large Language Model (LLM): A type of AI model trained on vast amounts of text data that can generate, summarize, translate, and reason about language.
Context Window: The maximum amount of text (measured in tokens) that an AI model can 'see' and process in a single interaction.