Notes on AIE012 · Act 2 — Behavior & Limits

Tokenization

This episode shows how tokenization explains strange model behavior with numbers, emojis, and languages.

Tokenization is the process of turning raw text into tokens before an AI model processes it. It is preprocessing, not thinking — the model only sees the resulting pieces.

Full Explanation


Tokenization is learned from training data, not designed by hand. It balances compression (efficient reuse of common patterns) and flexibility (representing rare words, new terms, different languages, and typos via subwords).
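That compression/flexibility trade-off can be sketched with a toy byte-pair-encoding (BPE) learner. Everything here (the four-word corpus, the merge count, the function names) is invented for illustration; real tokenizers learn tens of thousands of merges from massive corpora:

```python
# Toy BPE sketch: repeatedly merge the most frequent adjacent pair of
# symbols in a tiny corpus, then apply those merges to new words.
from collections import Counter

def learn_merges(corpus, num_merges):
    """Learn merge rules by greedily fusing the most frequent pair."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [w[i] + w[i + 1]]
                else:
                    i += 1
    return merges

def tokenize(word, merges):
    """Split a new word by replaying the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

corpus = ["lower", "lowest", "newer", "newest"] * 5
merges = learn_merges(corpus, 6)
print(tokenize("lowest", merges))    # frequent word -> few learned subwords
print(tokenize("slowness", merges))  # unseen word -> many smaller pieces
```

Frequent patterns collapse into one or two learned subwords (compression), while an unseen word like "slowness" falls back to many smaller pieces (flexibility), so nothing is ever out of vocabulary.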

It is model-specific: different models use different tokenizers, so the same sentence can become different tokens and different counts. This explains why limits, costs, and odd behavior with numbers, emojis, or languages vary between providers.
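A small sketch makes the model-specific point concrete. The two vocabularies below are invented stand-ins for two providers' tokenizers, and the greedy longest-match rule is a simplification of how real tokenizers apply their learned vocabularies:

```python
# Same sentence, two tokenizers: different tokens, different counts.

def tokenize(text, vocab):
    """Split text by always taking the longest vocabulary match."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: keep it as-is
            i += 1
    return tokens

# Two made-up vocabularies standing in for two providers' tokenizers.
vocab_a = {"token", "ization", "tokenization", " is", " fun"}
vocab_b = {"tok", "en", "iza", "tion", " ", "is", "fun"}

text = "tokenization is fun"
print(tokenize(text, vocab_a))  # fewer, larger tokens
print(tokenize(text, vocab_b))  # more, smaller tokens
```

Both splits reconstruct the same sentence, but one bills and counts 3 tokens while the other counts 8, which is exactly why context limits and per-token pricing are not comparable across models.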

Tokenization has nothing to do with meaning — the tokenizer only cuts text. A useful mental model: tokenization is the scissors; the model is the brain. Once you see this step, differences between models start making sense.
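The strange behavior with emojis and non-English text follows from the same cutting step. Many modern tokenizers operate on UTF-8 bytes, and symbols or scripts that are rare in the training data keep their multi-byte encoding instead of earning a single merged token; a quick check of raw byte counts hints at why:

```python
# One visible character can be several UTF-8 bytes. A byte-level
# tokenizer with few learned merges for a script may spend roughly one
# token per byte, making emojis and non-Latin text disproportionately
# expensive compared to common English words.
for s in ["cat", "é", "🙂", "猫"]:
    print(f"{s!r}: {len(s)} chars, {len(s.encode('utf-8'))} bytes")
```

Digit strings have a related issue: they are cut into uneven multi-digit chunks rather than one token per digit, so the model never sees cleanly aligned place values, which helps explain shaky arithmetic.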

Resources

No dedicated resources for this episode yet.



Alexey Makarov

AI Enablement Strategist and Educator. Leading the AI Center of Excellence at SEFE. Creator of the Unreasonable AI YouTube channel. Based in Berlin.
