Notes on AIE012 · Act 2 — Behavior & Limits

Tokenization

This episode shows how tokenization explains strange model behavior with numbers, emojis, and languages.

Tokenization is the process of turning raw text into tokens before an AI model processes it. It is preprocessing, not thinking — the model only sees the resulting pieces.

Full Explanation


Tokenization is learned from training data, not designed by hand. It balances compression (efficient reuse of common patterns) and flexibility (representing rare words, new terms, different languages, and typos via subwords).
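That compression/flexibility trade-off can be sketched with a toy byte-pair-encoding (BPE) learner. Everything here (the four-word corpus, the merge count, the function names) is invented for illustration; real tokenizers learn tens of thousands of merges from massive corpora:

```python
# Toy BPE sketch: repeatedly merge the most frequent adjacent pair of
# symbols in a tiny corpus, then apply those merges to new words.
from collections import Counter

def learn_merges(corpus, num_merges):
    """Learn merge rules by greedily fusing the most frequent pair."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [w[i] + w[i + 1]]
                else:
                    i += 1
    return merges

def tokenize(word, merges):
    """Split a new word by replaying the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

corpus = ["lower", "lowest", "newer", "newest"] * 5
merges = learn_merges(corpus, 6)
print(tokenize("lowest", merges))    # frequent word -> few learned subwords
print(tokenize("slowness", merges))  # unseen word -> many smaller pieces
```

Frequent patterns collapse into one or two learned subwords (compression), while an unseen word like "slowness" falls back to many smaller pieces (flexibility), so nothing is ever out of vocabulary.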

It is model-specific: different models use different tokenizers, so the same sentence can become different tokens and different counts. This explains why limits, costs, and odd behavior with numbers, emojis, or languages vary between providers.
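A small sketch makes the model-specific point concrete. The two vocabularies below are invented stand-ins for two providers' tokenizers, and the greedy longest-match rule is a simplification of how real tokenizers apply their learned vocabularies:

```python
# Same sentence, two tokenizers: different tokens, different counts.

def tokenize(text, vocab):
    """Split text by always taking the longest vocabulary match."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: keep it as-is
            i += 1
    return tokens

# Two made-up vocabularies standing in for two providers' tokenizers.
vocab_a = {"token", "ization", "tokenization", " is", " fun"}
vocab_b = {"tok", "en", "iza", "tion", " ", "is", "fun"}

text = "tokenization is fun"
print(tokenize(text, vocab_a))  # fewer, larger tokens
print(tokenize(text, vocab_b))  # more, smaller tokens
```

Both splits reconstruct the same sentence, but one bills and counts 3 tokens while the other counts 8, which is exactly why context limits and per-token pricing are not comparable across models.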

Tokenization has nothing to do with meaning — the tokenizer only cuts text. A useful mental model: tokenization is the scissors; the model is the brain. Once you see this step, differences between models start making sense.
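The strange behavior with emojis and non-English text follows from the same cutting step. Many modern tokenizers operate on UTF-8 bytes, and symbols or scripts that are rare in the training data keep their multi-byte encoding instead of earning a single merged token; a quick check of raw byte counts hints at why:

```python
# One visible character can be several UTF-8 bytes. A byte-level
# tokenizer with few learned merges for a script may spend roughly one
# token per byte, making emojis and non-Latin text disproportionately
# expensive compared to common English words.
for s in ["cat", "é", "🙂", "猫"]:
    print(f"{s!r}: {len(s)} chars, {len(s.encode('utf-8'))} bytes")
```

Digit strings have a related issue: they are cut into uneven multi-digit chunks rather than one token per digit, so the model never sees cleanly aligned place values, which helps explain shaky arithmetic.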

Resources

No dedicated resources for this episode yet.



Alexey Makarov

AI Enablement Strategist and Educator. Leading the AI Center of Excellence at SEFE. Creator of the Unreasonable AI YouTube channel. Based in Berlin.
