Tokenization
The process of splitting text into tokens (the basic units an LLM processes) before feeding it to a model.
Full Explanation
Different models use different tokenization approaches; GPT models use Byte Pair Encoding (BPE), which builds a vocabulary by repeatedly merging the most frequent adjacent symbol pairs. In English, one token averages roughly 4 characters, but the ratio varies by language: Chinese, Japanese, and Korean text typically requires more tokens per character. Tokenization therefore affects cost (API pricing is per token), context window usage, and model performance on non-English languages.
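To make the BPE merging idea concrete, here is a minimal character-level sketch in Python. This is an illustration of the core algorithm only, not GPT's actual tokenizer (which operates on bytes and uses a large pretrained merge table); the function names and the toy training string are invented for this example.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one (or None)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of the adjacent pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe(text, num_merges):
    """Learn `num_merges` BPE merges on one string, starting from characters."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens

# Frequent substrings like "low" fuse into single tokens after a few merges.
print(bpe("low lower lowest", 2))
```

After two merges the repeated prefix "low" has become a single token, while rarer suffixes like "er" and "est" remain as individual characters; this is why common words cost one token while rare or non-English words are split into several.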
Related Terms
Token: The basic unit of text that AI language models process — roughly equivalent to 3/4 of a word in English.
Large Language Model (LLM): A type of AI model trained on vast amounts of text data that can generate, summarize, translate, and reason about language.
Context Window: The maximum amount of text (measured in tokens) that an AI model can 'see' and process in a single interaction.