We Compare AI

Tokenization

Core Concepts
Simple Definition

The process of splitting text into tokens (the basic units an LLM processes) before feeding it to a model.

Full Explanation

Different models use different tokenization schemes; GPT models use Byte Pair Encoding (BPE). One token is roughly 4 characters of English text, but the ratio varies by language: Chinese, Japanese, and Korean typically need more tokens per character. Tokenization affects cost (API pricing is per token), how quickly the context window fills up, and model performance on non-English text.
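To make the BPE idea concrete, here is a toy sketch (not any production tokenizer): it starts from individual characters and greedily merges the most frequent adjacent pair. All function names are invented for this illustration, and real BPE implementations learn merges from a large corpus rather than the input string itself.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if no pairs exist."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of the adjacent pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def toy_bpe(text, num_merges):
    """Greedy BPE sketch: start from characters, repeatedly merge the top pair."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens

print(toy_bpe("low lower lowest", 3))
```

After a few merges, frequent substrings like "low" become single tokens, which is why common English words often cost one token while rarer words or non-Latin scripts split into many.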

Last verified: 2026-03-30