Tokenization - Breaking Language Into Pieces a Model Can Learn From // Megha Bose

Consider the expression “1 + 1 = 2”. How many tokens is that in GPT-4’s tokenizer? Seven: "1", " +", " ", "1", " =", " ", "2". The spaces are part of the tokens: they attach to the operators but stand alone in front of digits. Now try "2024" - that’s two tokens, "202" and "4", because digits are grouped in runs of at most three. What about "100,000"? Three tokens: "100", ",", "000". None of this is arbitrary. These choices cascade into whether your language model can do arithmetic, how many words of context it can actually hold, and why the same model might be twice as capable in English as in Swahili.

Tokenization is the step that converts a raw string into the integer IDs that a language model actually sees. It happens before any neural network computation, and it shapes almost everything downstream.

Why Not Just Use Characters?

The first obvious choice: one token per character. ASCII gives 128 symbols. Unicode gives 144,000+. This seems appealingly simple and handles every possible input.

The problem is sequence length. The sentence “The quick brown fox” is 19 characters but only 4 words, so encoding it character by character takes roughly five times as many steps as encoding it word by word. A transformer with a 4,096-token context window can hold roughly 4,096 characters - about two paragraphs. The same window with word-level tokens holds 4,096 words, which is several pages of text.
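To make the length gap concrete, here is a small sketch that counts tokens for the same sentence under the two extremes. It assumes the simplest possible definitions: one token per character (spaces included) and one token per whitespace-separated word.

```python
text = "The quick brown fox"

# Character-level: one token per character, spaces included.
char_tokens = list(text)

# Word-level: one token per whitespace-separated word.
word_tokens = text.split()

print(len(char_tokens))  # 19
print(len(word_tokens))  # 4
```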

More critically, the model has to learn language from scratch at the character level. “quick”, “quickly”, and “quicker” share the same root concept - but at the character level they’re just different sequences of integers. The model can learn that q-u-i-c-k and q-u-i-c-k-l-y are related, but it has to deduce this from examples. Every morphological relationship, every suffix and prefix, must be learned by observing patterns across characters. This is possible - character-level language models exist - but they require much more data and compute to reach the same quality as models that operate on longer units.

Why Not Just Use Words?

At the other extreme: one token per word. English has roughly 170,000 words. All languages combined, millions. And this immediately creates two problems.

The first is the vocabulary size. A model’s embedding table has one vector per token. At a million words, that’s a massive matrix, and most entries will be seen only a handful of times during training. Rare words get poor representations because the model has almost no data to learn from.

The second problem is out-of-vocabulary words. Any word you didn’t see during training is unknown. "unhappily" and "happiness" are completely different tokens, despite sharing "happy". A new proper noun - a person’s name, a city, a product - is unhandled. You need a special [UNK] token, and at inference time the model has no information about what that token represents.
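The out-of-vocabulary problem is easy to reproduce. The sketch below builds a word-level vocabulary from a made-up training sentence and then encodes a sentence containing an unseen word. The training text and the `encode_words` helper are illustrative assumptions, not part of any real tokenizer.

```python
# Toy training text; the vocabulary only ever contains words seen here.
train_text = "the happy dog saw the happy cat"

vocab = {"[UNK]": 0}
for word in train_text.split():
    # Assign the next free ID the first time a word is seen.
    vocab.setdefault(word, len(vocab))

def encode_words(text):
    """Map each whitespace-separated word to its ID, or to [UNK] if unseen."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

ids = encode_words("the unhappy dog")
print(ids)  # [1, 0, 3]
```

"unhappy" collapses to ID 0 even though "happy" is in the vocabulary: the model receives no hint that the two words are related.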

Word-level tokenization also fails immediately on code, mathematics, URLs, and any non-whitespace-delimited language like Chinese or Japanese.

The Sweet Spot: Subword Tokenization

The solution that has dominated for the last decade is subword tokenization. The key insight: common words should be one token; rare words should be split into meaningful pieces; nothing should be truly unhandled.

In practice this looks like:

"running" → ["running"] (common word, one token)
"unhappily" → ["un", "happily"] or ["unhappy", "ily"] (less common, split at morpheme)
"supercalifragilistic" → ["super", "cal", "ifrag", "ilis", "tic"] (rare word, split into plausible chunks)

The model sees coherent subword units that carry meaning, sequences are shorter than character-level, and nothing is truly unknown.

Three algorithms dominate this space. They share the same goal but differ in how they decide what to merge.

BPE: Byte-Pair Encoding

Byte-Pair Encoding (Sennrich et al., 2016) is the algorithm that produces the tokenizer used in GPT-2, GPT-3, and most GPT-family models.

The algorithm is strikingly simple:

Initialize: the vocabulary is the set of individual characters (plus a special end-of-word symbol).
Repeat until you reach your target vocabulary size:
- Count every consecutive pair of tokens in your training corpus.
- Find the most frequent pair.
- Merge it: replace every occurrence of the adjacent pair A, B with a single new token AB.
- Add AB to the vocabulary.
Stop when the vocabulary reaches the target size (e.g., 50,000 tokens).

The result: the most common sequences in your training corpus become single tokens. "the" gets merged early. "ing", "tion", "pre" get merged. Rare words end up as sequences of shorter pieces. And critically, every input can always be represented - even if a word is completely novel, it can be split down to individual characters.

Let’s trace a tiny example. Suppose your corpus contains only these words:

low, lower, lowest, new, newer

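Start with individual characters. If each word appears once, the pairs l, o and o, w and w, e are tied as the most frequent, at three occurrences each (e, r appears only twice, in lower and newer). Breaking the tie in favor of the first pair seen, l, o merges to lo. Then lo, w merges to low, so the stem shared by low, lower, and lowest becomes a single token after just two merges. Later merges pick up pieces such as e, r. The merge order mirrors the actual frequency structure of the corpus.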

BPE is deterministic given the training corpus and a fixed tie-breaking rule. Given the same data, you always get the same tokenizer.
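The merge loop can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any production tokenizer: it assumes each word appears once, uses a hypothetical `</w>` end-of-word marker, and breaks ties by taking the first pair encountered. The names `merge_pair`, `train_bpe`, and `encode` are made up for this sketch.

```python
from collections import Counter

END = "</w>"  # hypothetical end-of-word marker

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = [list(word) + [END] for word in words]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for symbols in corpus:
            counts.update(zip(symbols, symbols[1:]))
        if not counts:
            break
        # max() returns the first pair with the highest count,
        # so ties are broken by order of first appearance.
        best = max(counts, key=counts.get)
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges

def encode(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = list(word) + [END]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges = train_bpe(["low", "lower", "lowest", "new", "newer"], num_merges=2)
print(merges)                   # [('l', 'o'), ('lo', 'w')]
print(encode("lowly", merges))  # ['low', 'l', 'y', '</w>']
```

The unseen word "lowly" still encodes: the learned stem is reused, and the remainder falls back to single characters.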

WordPiece and SentencePiece

WordPiece (Schuster & Nakajima 2012, used in BERT) uses the same iterative merging idea as BPE, but the criterion for which pair to merge is different. Instead of raw frequency, WordPiece picks the pair whose merge maximizes the likelihood of the training corpus under a language model. This produces slightly different splits - you’ll see tokens like "##ing" in BERT, where "##" signals a word-internal continuation.

The difference matters in practice: WordPiece favors pairs whose parts occur together more often than their individual frequencies would predict, while BPE is purely frequency-driven.
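One way to see the two criteria diverge is to score the same pair counts both ways. The sketch below uses a commonly cited approximation of the WordPiece criterion, score = count(AB) / (count(A) × count(B)); treat that formula as an assumption rather than the exact procedure behind BERT's vocabulary. The word frequencies are invented, and the "##" continuation prefix is omitted for brevity.

```python
from collections import Counter

# Invented toy corpus: word -> frequency.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

symbol_counts = Counter()
pair_counts = Counter()
for word, freq in word_freqs.items():
    symbols = list(word)
    for symbol in symbols:
        symbol_counts[symbol] += freq
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

def wordpiece_score(pair):
    """Pair count normalized by the counts of its two parts (assumed formula)."""
    first, second = pair
    return pair_counts[pair] / (symbol_counts[first] * symbol_counts[second])

bpe_choice = max(pair_counts, key=pair_counts.get)
wordpiece_choice = max(pair_counts, key=wordpiece_score)

print(bpe_choice)        # ('u', 'g')
print(wordpiece_choice)  # ('g', 's')
```

BPE picks u, g because it is the most frequent pair. The normalized score picks g, s instead: "s" is rare and only ever follows "g", so that pair is far more strongly associated than its raw count suggests.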

SentencePiece (Kudo & Richardson 2018, used in LLaMA, T5) takes a different approach to the input. Rather than assuming the text has already been split into words (by whitespace), SentencePiece treats the raw text stream - including spaces - as a sequence of Unicode characters. It then applies either BPE or a unigram language model on this stream.

Why does this matter? Because whitespace-based pre-tokenization is English-centric. Chinese and Japanese don’t separate words with spaces. Arabic morphology doesn’t align neatly with whitespace. SentencePiece handles all of these uniformly, encoding the space character as a special symbol "▁" (so "▁hello" means “hello with a preceding space”). LLaMA’s tokenizer uses this approach, which is part of why it needs no language-specific word segmentation before tokenizing.
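The space-as-symbol idea can be illustrated without the SentencePiece library itself. The sketch below only mimics the convention of replacing each space with "▁" so that the text becomes one unbroken symbol stream; the helper names are hypothetical, and details such as SentencePiece's optional leading "▁" and its normalization rules are left out.

```python
SPACE_SYMBOL = "\u2581"  # "▁", the visible stand-in for a space

def escape_spaces(text):
    """Turn spaces into ordinary symbols so no word splitting is needed."""
    return text.replace(" ", SPACE_SYMBOL)

def unescape_spaces(text):
    """Invert escape_spaces (assumes the input never contained "▁" itself)."""
    return text.replace(SPACE_SYMBOL, " ")

original = "hello world"
stream = list(escape_spaces(original))
print(stream)

# Detokenizing is just concatenation followed by the inverse substitution.
restored = unescape_spaces("".join(stream))
print(restored == original)  # True
```

Because the space survives as a symbol, decoding is lossless, and text with no spaces at all (for example Japanese) goes through the same path unchanged.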

Byte-Level BPE: The GPT-4 Approach

GPT-4 uses tiktoken, which implements byte-level BPE. The idea: before doing anything with characters or words, convert all input to UTF-8 bytes. This gives a base vocabulary of exactly 256 symbols (one per byte value, 0 - 255).

BPE then merges these bytes into larger units, exactly as before. The result: any Unicode character can always be represented - at worst, as a sequence of its raw bytes. No unknown tokens. Ever.

This elegantly handles:

Emoji: "😀" is 4 UTF-8 bytes, and will often be merged into a single token after BPE.
Rare scripts: "日本語" encodes into bytes first, then BPE may or may not find merged tokens.
Typos and novel words: always representable, always encodeable.
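The byte-level starting point is just standard UTF-8 encoding, which Python exposes directly. The sketch below shows the base symbols (integers 0 - 255) that byte-level BPE would begin merging from; `to_byte_ids` is a hypothetical helper name.

```python
def to_byte_ids(text):
    """Return the UTF-8 bytes of `text` as a list of integers in 0..255."""
    return list(text.encode("utf-8"))

print(to_byte_ids("A"))            # [65]
print(to_byte_ids("\U0001F600"))   # 😀 -> [240, 159, 152, 128]
print(len(to_byte_ids("日本語")))   # 9 (three bytes per character)

# Any byte sequence produced this way decodes back to the original text.
print(bytes(to_byte_ids("日本語")).decode("utf-8"))
```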

Tiktoken also includes a pre-tokenization step that splits on whitespace and punctuation using a regex, which prevents the BPE from merging across word boundaries in most cases.
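A rough sketch of what regex pre-tokenization does, using only Python's standard `re` module. The pattern here is a deliberately simplified, ASCII-only stand-in written for this example; it is not the pattern tiktoken actually uses, which relies on Unicode character classes and treats digits and contractions differently.

```python
import re

# Simplified stand-in pattern (an assumption, not tiktoken's real one):
# an optional leading space followed by letters, digits, or punctuation,
# or else a run of whitespace.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pretokenize(text):
    """Split text into chunks; BPE merges are then applied inside each chunk only."""
    return PRETOKENIZE.findall(text)

chunks = pretokenize("Hello, world! 42")
print(chunks)  # ['Hello', ',', ' world', '!', ' 42']
```

Since merges never cross chunk boundaries, a token such as "world!" cannot form, no matter how often that sequence appears in the corpus.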

Vocabulary Size and Its Tradeoffs

The vocabulary size is a design parameter with real consequences in both directions.

Larger vocabulary (100k tokens, as in tiktoken):

Sequences are shorter. More content fits in the context window.
But each token is seen fewer times during training. Rare tokens get poor embeddings.
The embedding matrix and output projection layer are larger, adding parameters.

Smaller vocabulary (8k tokens):

Sequences are longer. Less content per context window. Higher inference cost.
But each token has more training data, so representations are better.
Smaller model footprint.

Most modern LLMs use vocabularies of roughly 32,000 - 100,000 tokens, with some multilingual models going larger:

LLaMA-2: 32,000 (SentencePiece BPE)
GPT-4 (tiktoken): ~100,256
Gemma: 256,000 (includes more multilingual coverage)

There’s no universal optimum - it depends on the languages covered, the training data mixture, and the intended use case.
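The parameter cost of a larger vocabulary is simple arithmetic: one vector of width d_model per token, and the same again for the output projection if its weights are not shared with the embedding table. The sketch below assumes a hidden size of 4,096 purely for illustration; it is not the actual hidden size of every model listed.

```python
def embedding_parameters(vocab_size, d_model, tied=True):
    """Parameters in the embedding table, plus the output projection if untied."""
    table = vocab_size * d_model
    return table if tied else 2 * table

D_MODEL = 4096  # assumed hidden size, for illustration only

for name, vocab_size in [
    ("8k vocabulary", 8_000),
    ("LLaMA-2", 32_000),
    ("GPT-4 (tiktoken)", 100_256),
    ("Gemma", 256_000),
]:
    params = embedding_parameters(vocab_size, D_MODEL)
    print(f"{name}: {params:,} embedding parameters")
```

At this width, going from 32,000 to 256,000 tokens grows the embedding table from about 131 million to just over 1 billion parameters.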

Tokenization Pathologies

The choices made during tokenization have surprising consequences for what models can and cannot do well.

Numbers and arithmetic. The number 1,000,000 might tokenize as ["1", ",", "000", ",", "000"] - five tokens with no structure that reflects place value. The model has to learn that "000" followed by a comma followed by "000" is a million, purely from examples. This is one of the core reasons that LLMs struggle with exact arithmetic. The numeric structure is destroyed before the model ever sees the input. Models that perform better on arithmetic often sidestep the problem, either by using a tokenizer that handles digits specially or by calling a calculator through tool use.
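The loss of place value can be imitated with a short regex. The sketch below is not a real tokenizer: it assumes digits are chunked left to right in runs of at most three and every other character stands alone, which is enough to reproduce the split described above.

```python
import re

def chunk_number(text):
    """Split into digit runs of up to three (left to right) and single other characters."""
    return re.findall(r"\d{1,3}|\D", text)

print(chunk_number("1,000,000"))  # ['1', ',', '000', ',', '000']
print(chunk_number("1000000"))    # ['100', '000', '0']
```

The same quantity yields different pieces depending on formatting, and without commas the chunks no longer line up with thousands groups at all.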

Rare languages. A concept that takes one token in English might take 6 - 10 tokens in Swahili, Tamil, or Yoruba, because those languages have fewer merges in the tokenizer’s training data (which was English-heavy). This has two concrete effects: the effective context window is shorter for those languages, and inference is more expensive. An English sentence of 100 tokens might require 300 tokens to say the same thing in a lower-resource language. Research on tokenizer fairness across languages has quantified this disparity extensively.
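Part of the disparity is visible before any merges are learned. For a byte-level tokenizer, the starting cost of a string is its UTF-8 byte count, and scripts outside the Latin range need more bytes per character. The sketch below compares a few greetings chosen for this example; byte count is only the worst case that merges start from, not the final token count of any particular tokenizer.

```python
# Sample greetings picked for illustration.
samples = {
    "English": "hello",
    "Tamil": "வணக்கம்",
    "Japanese": "こんにちは",
}

byte_counts = {}
for language, text in samples.items():
    byte_counts[language] = len(text.encode("utf-8"))
    print(f"{language}: {len(text)} code points, {byte_counts[language]} bytes")
```

A tokenizer trained mostly on English has many merges that compress English bytes and few for the other scripts, so this gap tends to persist or widen after BPE.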

Code. Code tokenization is sensitive to indentation (Python), bracket matching (Lisp, JSON), and operator sequences. Good code tokenizers handle leading spaces carefully - in Python, the indentation level carries syntactic meaning, and if spaces get merged unpredictably, the model sees garbled syntax. This is why OpenAI’s Codex, the model behind the original GitHub Copilot, extended the GPT-3 tokenizer with extra tokens for runs of whitespace.

The “SolidGoldMagikarp” problem. Some tokens exist in the vocabulary because they appeared in BPE training data, but appear almost never in language model training data. The embeddings for these tokens stay close to their random initialization because they are almost never updated. When you prompt a model with these tokens, behavior is unpredictable - the model has no learned behavior for them. This was discovered by probing for unusual tokens in GPT models and finding that certain odd strings, such as the Reddit username " SolidGoldMagikarp", produce bizarre outputs.

Discomfort check. Tokenization happens before the model and is not differentiable. The tokenizer is not trained end-to-end with the language model - it’s a preprocessing step designed and fixed before training begins. This means the model cannot “learn” to fix tokenization mistakes. It receives only integer IDs and never sees the original bytes, so any structure the tokenizer obscures has to be relearned from examples. Some research tries to bypass tokenization entirely - CANINE (Clark et al., 2022) operates directly on Unicode characters, and MegaByte (Yu et al., 2023) processes raw bytes - but these approaches have not yet matched subword-based models on standard benchmarks at scale. The tokenization bottleneck is real and not yet solved.

The Token Budget

When people say “GPT-4 has a 128,000-token context window,” what does that actually mean in terms of text?

A rough rule of thumb: for conversational English, one token ≈ 0.75 words. So 128,000 tokens ≈ 96,000 words ≈ a full-length novel. But this estimate degrades quickly for other content:

Dense technical writing (formulas, specialized terms): more tokens per concept.
Code: typically 1 token per ~3 - 4 characters, depending on the language.
Non-English natural language: can be 2 - 3× more tokens than English for the same semantic content.

This means the effective context length varies substantially by task. A 128k-token window can hold a novel in English but only a chapter in some other languages - not because the architecture changed, but because the tokenizer is less efficient for that content.

The token count also determines inference cost. Most LLM APIs charge per token (input + output). A query that requires 50,000 tokens of context costs more than the same query with 5,000 tokens - and the tokenizer is the first thing that determines how many tokens your input consumes.
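These rules of thumb translate into a back-of-the-envelope calculator. The sketch below uses the 0.75 words-per-token estimate from above; the prices are placeholders invented for the example, not any provider's actual rates.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for conversational English

def estimate_words(tokens, words_per_token=WORDS_PER_TOKEN):
    """Rough number of English words that fit in `tokens` tokens."""
    return round(tokens * words_per_token)

def estimate_cost(input_tokens, output_tokens, usd_per_million_in, usd_per_million_out):
    """Cost in USD under simple per-token pricing (prices are placeholders)."""
    total = input_tokens * usd_per_million_in + output_tokens * usd_per_million_out
    return total / 1_000_000

print(estimate_words(128_000))  # 96000

# Hypothetical prices: 1 USD per million input tokens, 2 USD per million output tokens.
large = estimate_cost(50_000, 500, 1.0, 2.0)
small = estimate_cost(5_000, 500, 1.0, 2.0)
print(f"{large:.3f} USD vs {small:.3f} USD")
```

For content that tokenizes less efficiently, lower `words_per_token` accordingly: both the text that fits in the window and the cost per word scale with it.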

Summary

Algorithm	Key idea	Used in
BPE	Merge most-frequent pairs iteratively	GPT-2, GPT-3
WordPiece	Merge pairs that maximize corpus likelihood	BERT, DistilBERT
SentencePiece BPE	BPE on the raw character stream, spaces included; language-agnostic	LLaMA
Byte-level BPE	BPE on UTF-8 bytes; 256-symbol base vocabulary	GPT-4 (tiktoken)
Unigram LM	Probabilistic; prune vocabulary to maximize likelihood	T5, XLNet, ALBERT

The fundamental tradeoffs:

Granularity	Vocabulary size	Sequence length	Handles unknowns?
Character / byte	256 (bytes) to 144k+ (Unicode)	Very long	Yes
Subword (BPE)	32k - 100k	Medium	Yes (fallback to characters or bytes)
Word	170k+	Short	No

Tokenization is the first transformation your data undergoes before any learning happens. The choices made here - vocabulary size, merge strategy, byte-level vs. character-level base - propagate through everything: what arithmetic the model can do, how much context it can see, which languages it handles well, and what it costs to run. The fact that these choices happen outside the training loop, and cannot be corrected by the model itself, makes getting them right unusually important.
