Tokenisation
Why Neither Characters Nor Words Work Well
Language models do not operate on raw text. Every input must first be converted to a sequence of discrete tokens that are looked up in an embedding table. The granularity of that tokenisation is a fundamental design choice that shapes the model’s efficiency, vocabulary coverage, and representational capacity.
Character-level tokenisation treats each Unicode character as a token. Vocabulary size is tiny (128 for ASCII, 256 for raw bytes, a few thousand for the common Unicode ranges), and every string is representable with no out-of-vocabulary issues. The cost is sequence length: a 1,000-character document becomes a 1,000-token sequence. Attention’s $O(n^2)$ complexity makes this prohibitively expensive, and meaningful linguistic units (words, morphemes) span many tokens, requiring the model to learn compositional structure from scratch at every layer.
Word-level tokenisation splits on whitespace and punctuation, treating each word form as a token. Sequences are short, and tokens often correspond to meaningful units. The problems are severe: vocabulary must be capped at some size $V$ (typically 50,000–100,000), and any word not seen during training maps to a special [UNK] token, destroying information. Morphologically rich languages (German, Finnish, Turkish) are particularly harmed - a single word can have dozens of inflected forms, each needing its own vocabulary entry.
Subword tokenisation is the practical compromise: split text into units that are larger than characters but smaller than words, covering common words whole while decomposing rare words into reusable subword pieces.
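To make the contrast concrete, here is a minimal sketch of the three granularities on a single phrase; the subword split shown is hypothetical, since the actual pieces depend on the learned vocabulary.

```python
text = "unbelievable results"

# Character-level: no out-of-vocabulary problem, but very long sequences.
char_tokens = list(text)             # 20 tokens for 20 characters

# Word-level: short sequences, but any unseen word would map to [UNK].
word_tokens = text.split()           # ['unbelievable', 'results']

# Subword-level (hypothetical split; the real pieces depend on the learned
# vocabulary): rare words decompose into reusable units.
subword_tokens = ["un", "believ", "able", "results"]

print(len(char_tokens), len(word_tokens), len(subword_tokens))   # 20 2 4
```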
Byte-Pair Encoding
BPE (Sennrich et al., 2016) builds a vocabulary iteratively from a corpus:
1. Initialise the vocabulary with all characters (or bytes) that appear in the corpus.
2. Count the frequency of every adjacent pair of symbols in the current segmentation of the corpus.
3. Merge the most frequent pair into a single new symbol. Add it to the vocabulary.
4. Repeat steps 2–3 until the vocabulary reaches the target size $V$.
Formally, if the corpus is treated as a multiset of symbol sequences, each merge operation replaces all occurrences of a bigram $(a, b)$ with the new symbol $ab$, reducing the total token count by the pair’s frequency and adding one entry to the vocabulary. The merge rules learned on the training corpus are then replayed at inference time to tokenise new text: starting from characters (or bytes), repeatedly apply the earliest-learned merge rule that matches anywhere in the sequence until no rule applies.
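The training loop is short enough to sketch directly. The following toy implementation follows steps 1–4 on a small word-frequency corpus; it is illustrative only - production tokenisers operate on bytes, handle pre-tokenisation, and are heavily optimised.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer over a list of words (Sennrich-style sketch)."""
    # Step 1: represent each word as a tuple of symbols, starting from characters.
    vocab = Counter()
    for word in corpus:
        vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Step 3: merge the most frequent pair into a single new symbol.
        (a, b), _ = pair_counts.most_common(1)[0]
        merges.append((a, b))
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

corpus = ["low", "low", "lower", "newest", "newest", "newest", "widest"]
print(train_bpe(corpus, num_merges=5))
```

Real implementations index pair positions incrementally instead of rescanning the whole corpus after every merge.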
BPE is used by GPT-2, GPT-3, GPT-4, LLaMA, and most GPT-family models. The vocabulary size $V$ is a key hyperparameter - GPT-2 uses $V = 50{,}257$, while newer models commonly use $V = 100{,}000$ to $128{,}000$.
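To see a production BPE vocabulary in use, here is a short sketch with the tiktoken library (assuming it is installed; cl100k_base is the roughly 100K-token encoding used by GPT-4-era models):

```python
import tiktoken  # assumes the tiktoken package is installed

enc = tiktoken.get_encoding("cl100k_base")   # GPT-4-era BPE vocabulary
print(enc.n_vocab)                           # vocabulary size, ~100K entries

tokens = enc.encode("Tokenisation is not a neutral preprocessing step.")
print(tokens)                                # list of integer token IDs
print(enc.decode(tokens))                    # round-trips to the original string
```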
WordPiece
WordPiece (Schuster & Nakajima, 2012; used in BERT) follows the same iterative merging procedure but uses a different criterion for selecting which pair to merge. Rather than merging the most frequent pair, it merges the pair $(a, b)$ that maximises:
$$\text{score}(a, b) = \frac{\text{freq}(ab)}{\text{freq}(a) \times \text{freq}(b)}$$
Up to a constant, maximising this score is equivalent to maximising the pointwise mutual information between $a$ and $b$: pairs that are highly collocated relative to their individual frequencies are preferred, even if they are not the single most frequent pair. In practice, WordPiece and BPE produce similar vocabularies on English text; WordPiece’s PMI-style criterion tends to handle morphological affixes more cleanly because common prefixes and suffixes co-occur with many stems, giving them high scores.
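A sketch of the selection criterion in isolation, reusing the word-frequency representation from the BPE sketch above; the corpus is invented specifically so that the highest-scoring pair is not the most frequent one.

```python
from collections import Counter

def best_wordpiece_pair(segmented_corpus):
    """Pick the merge pair maximising freq(ab) / (freq(a) * freq(b))."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for symbols, freq in segmented_corpus.items():
        for s in symbols:
            symbol_freq[s] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += freq
    return max(
        pair_freq,
        key=lambda p: pair_freq[p] / (symbol_freq[p[0]] * symbol_freq[p[1]]),
    )

# Invented corpus: piece sequence -> frequency.
corpus = {
    ("t", "h", "e"): 50,
    ("t", "h", "i", "s"): 20,
    ("t", "o"): 30,
    ("h", "i"): 20,
    ("q", "u", "i", "z"): 5,
}
print(best_wordpiece_pair(corpus))   # ('q', 'u')
```

Frequency-based BPE would merge ('t', 'h'), the most frequent pair, whereas the WordPiece score prefers ('q', 'u') because q and u never occur apart in this corpus.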
Unigram Language Model Tokenisation
SentencePiece’s Unigram LM method (Kudo, 2018) starts from the opposite direction: begin with a large candidate vocabulary and prune it. The algorithm maintains a vocabulary $V$ with log-probabilities $\log p(x)$ for each piece $x$, and defines the tokenisation of a string $s$ as the segmentation $\mathbf{x}^\ast = \arg\max_{\mathbf{x}} \sum_i \log p(x_i)$, computed via Viterbi decoding. The expected log-likelihood over the training corpus is:
$$\mathcal{L} = \sum_{s \in \mathcal{D}} \log P(s) = \sum_{s \in \mathcal{D}} \log \sum_{\mathbf{x} \in \text{seg}(s)} \prod_i p(x_i)$$
At each pruning step, the algorithm computes the impact on $\mathcal{L}$ of removing each vocabulary entry and discards the bottom $\eta\%$ (typically 10–20%) that contribute least. This continues until $|V|$ reaches the target. Because the method optimises a probabilistic objective, it naturally handles ambiguous segmentations through marginalisation.
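Given fixed piece log-probabilities, the Viterbi segmentation itself is a short dynamic program. A sketch over a hypothetical vocabulary (the pieces and scores are made up for illustration):

```python
import math

def viterbi_segment(s, log_prob):
    """Best segmentation under a unigram piece model: argmax of sum_i log p(x_i)."""
    n = len(s)
    best = [-math.inf] * (n + 1)   # best[i] = best score for s[:i]
    back = [0] * (n + 1)           # back[i] = start index of the last piece
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            piece = s[j:i]
            if piece in log_prob and best[j] + log_prob[piece] > best[i]:
                best[i] = best[j] + log_prob[piece]
                back[i] = j
    # Recover the segmentation by walking the backpointers.
    pieces, i = [], n
    while i > 0:
        pieces.append(s[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

# Hypothetical piece log-probabilities, for illustration only.
log_prob = {"un": -3.0, "related": -4.0, "u": -6.0, "n": -6.0, "re": -4.5,
            "rel": -6.5, "lated": -7.0, "ated": -6.0}
print(viterbi_segment("unrelated", log_prob))   # (['un', 'related'], -7.0)
```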
Vocabulary Size Trade-offs
Vocabulary size $V$ sits at the intersection of several competing pressures. A larger vocabulary means fewer tokens per document - shorter sequences that are cheaper to process under $O(n^2)$ attention - but requires a larger embedding table ($V \times d_{\text{model}}$ parameters) and a larger output projection layer. More importantly, rare vocabulary entries receive few gradient updates during training, leading to poorly estimated embeddings for low-frequency tokens. Most production models settle in the range $V \in [32\text{K}, 128\text{K}]$ as the practical sweet spot.
For a model with $d_{\text{model}} = 4096$ and $V = 128{,}000$, the embedding table alone contributes $128{,}000 \times 4096 \approx 524\text{M}$ parameters - a non-trivial fraction of total model size.
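This scaling is easy to tabulate; a quick sketch for a few common vocabulary sizes at the same $d_{\text{model}}$:

```python
# Embedding-table parameter counts (doubled again if the output projection is
# untied) for several vocabulary sizes at d_model = 4096.
d_model = 4096
for vocab_size in (32_000, 50_257, 100_000, 128_000):
    embedding_params = vocab_size * d_model
    print(f"V={vocab_size:>7,}: {embedding_params / 1e6:,.0f}M embedding parameters")
```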
Special Tokens
Tokenisers define several special tokens for control signals:
- [CLS] (BERT): prepended to input; its final representation is used for classification.
- [SEP] (BERT): separates two segments (e.g., question and context in QA).
- [MASK] (BERT): replaces randomly selected tokens during masked language model pre-training.
- [BOS] / <s>: beginning-of-sequence marker; cues the decoder to start generation.
- [EOS] / </s>: end-of-sequence marker; signals the decoder to stop.
- [PAD]: pads shorter sequences in a batch to a uniform length; corresponding positions receive an attention mask of 0 so they do not contribute to attention computations.
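How these tokens appear in practice can be inspected through the Hugging Face tokenizer API; a sketch assuming the transformers library and the bert-base-uncased checkpoint are available:

```python
from transformers import AutoTokenizer  # assumes transformers is installed

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Segment pair: [CLS] first-segment tokens [SEP] second-segment tokens [SEP]
pair = tok("Where is BPE used?", "BPE is used in GPT-style models.")
print(tok.convert_ids_to_tokens(pair["input_ids"]))

# Batched inputs of different lengths: the shorter one is padded with [PAD],
# and its attention mask is 0 at the padded positions.
batch = tok(["a short input", "a noticeably longer input sequence"], padding=True)
print(batch["attention_mask"][0])
```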
SentencePiece for Language-Agnostic Tokenisation
Standard BPE implementations assume Unicode text and whitespace-delimited words. SentencePiece (Kudo & Richardson, 2018) treats the raw input as a sequence of Unicode characters or bytes, without any language-specific pre-tokenisation. This makes it applicable to languages that do not use whitespace (Chinese, Japanese, Thai) and to code, where whitespace is syntactically significant. SentencePiece is used in T5, LLaMA, Gemma, and most multilingual models. It supports both BPE and Unigram LM as backend algorithms.
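A minimal training and encoding sketch with the sentencepiece Python package (corpus.txt is a hypothetical path to raw training text, one sentence per line):

```python
import sentencepiece as spm  # assumes the sentencepiece package is installed

# Train directly on raw text, with no language-specific pre-tokenisation.
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # hypothetical path to training text
    model_prefix="tok",
    vocab_size=8000,
    model_type="unigram",        # or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="tok.model")
print(sp.encode("Tokenisation without pre-tokenisation.", out_type=str))
print(sp.encode("Tokenisation without pre-tokenisation.", out_type=int))
```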
Tokenisation’s Impact on Model Behaviour
Tokenisation is not a neutral preprocessing step - it shapes what the model can and cannot represent.
Arithmetic. The number 12345 may be tokenised as “1”, “23”, “45” or “12345” depending on the tokeniser and training corpus distribution. The model must learn positional arithmetic over token sequences rather than over digits, which is a much harder inductive problem. This explains why LLMs make arithmetic errors that humans find trivial.
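A quick way to inspect how a particular tokeniser groups digits, using tiktoken as an example and without assuming any specific split:

```python
import tiktoken  # assumes the tiktoken package is installed

enc = tiktoken.get_encoding("cl100k_base")
for number in ["7", "12345", "1234567890"]:
    ids = enc.encode(number)
    print(number, [enc.decode([i]) for i in ids])  # how the digits are grouped
```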
Character-level tasks. “How many letters in the word strawberry?” requires the model to reconstruct a character sequence from token sequences - a non-trivial inversion when a word is a single opaque token. The model cannot count characters it has never seen individually.
Multilingual fairness. A tokeniser trained predominantly on English will assign many tokens to common English words (often one token per word) and few tokens to equivalent concepts in morphologically rich or low-resource languages. A sentence of comparable semantic content costs more tokens - and therefore more compute - in Yoruba or Finnish than in English.
Examples
GPT-4 tokeniser on code vs. prose. English prose near GPT-4’s training distribution averages roughly 4 characters per token (about 0.25 tokens per character), so a 1,000-character paragraph encodes to roughly 250 tokens. Python code, with its structured indentation and conventional identifier names, tokenises nearly as efficiently, because runs of whitespace and common keywords and naming patterns received their own merges. However, minified JavaScript or obfuscated code with unusual identifier names may tokenise very inefficiently - approaching one token per character - because such patterns are rare in the training data and received few merges.
Multilingual tokenisation fairness. With a tokeniser trained on English-dominated data (a common scenario), a sentence in English might encode to 15 tokens while the semantically equivalent sentence in Turkish (a highly agglutinative language) encodes to 40 tokens. This asymmetry means multilingual models process non-English text more slowly, at higher cost per semantic unit, and with more opportunity for error over the longer token sequence. Recent multilingual models address this by including proportionally more non-English text in tokeniser training, explicitly balancing tokens-per-character ratios across language groups.
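These ratios can be measured directly. A sketch with tiktoken over small, arbitrary sample strings (the exact numbers depend entirely on the texts chosen):

```python
import tiktoken  # assumes the tiktoken package is installed

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "english prose": "Tokenisation is a fundamental design choice for language models.",
    "python code":   "def add(a, b):\n    return a + b\n",
    "turkish":       "Bu cümle Türkçe yazılmış bir örnektir.",
}
for name, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{name:15s} {n_tokens / len(text):.2f} tokens per character")
```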