Word Embeddings - Meaning as Position in Space
Helpful context:
- Neural Networks & Perceptrons - Function Approximation, Layer by Layer
- Probability as a Language - The Grammar of Uncertainty
- Linear Transformations - Geometry Encoded as Arithmetic
king $-$ man $+$ woman $\approx$ queen.
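This arithmetic works, approximately. In a well-trained word embedding space, if you take the vector for “king,” subtract the vector for “man,” and add the vector for “woman,” the result lands close to the vector for “queen”: once the three input words are excluded, “queen” is typically the nearest neighbor by cosine similarity. The match is not exact, and not every analogy resolves this cleanly, but the effect is consistent enough to be measurable.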
That words have geometry - that language has spatial structure, that analogical relationships correspond to consistent vector offsets in a high-dimensional space - is one of the most surprising empirical findings in the history of NLP. How did we get there? And more importantly: why does it work?
The Input Representation Problem
Computers don’t understand words. Every machine learning system ultimately operates on numbers. To apply any mathematical model to language, you first need to turn words into vectors. This is the representation problem, and it is harder than it sounds.
One-hot encoding is the naive solution. Assign each word in your vocabulary an index from 1 to $|V|$, where $|V|$ is the vocabulary size. The one-hot vector for word $w$ is a vector of length $|V|$ with a 1 at position $w$ and 0 everywhere else.
With a vocabulary of 50,000 words, each word is a 50,000-dimensional vector. This has three serious problems:
- Dimensionality. Vectors of length 50,000 are expensive to store and compute with. Multiplying by a weight matrix means a $50000 \times d$ matrix multiplication.
- Sparsity. Every one-hot vector is 99.998% zeros. Most of the information capacity is wasted.
- No notion of similarity. The one-hot vectors for “cat” and “kitten” are orthogonal - their dot product is zero, just like “cat” and “democracy.” The representation cannot express that these words are semantically related.
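The sparsity and similarity problems are easy to check numerically. The sketch below assumes a three-word toy vocabulary (the word list and the helper name `one_hot` are illustrative choices, not from any library) and uses 0-based indices, as is usual in code.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration); indices are 0-based here.
vocab = ["cat", "kitten", "democracy"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

cat, kitten, democracy = (one_hot(w) for w in vocab)

# Any two distinct one-hot vectors are orthogonal, related or not.
cat_kitten = float(cat @ kitten)
cat_democracy = float(cat @ democracy)

# Fraction of zeros in a one-hot vector for the 50,000-word vocabulary above.
sparsity = 1 - 1 / 50_000
```

Both dot products come out as zero, so the representation gives "cat" no more in common with "kitten" than with "democracy".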
What we want is a dense, low-dimensional representation that encodes semantic and syntactic similarity. Words that are used similarly should have similar vectors.
The Distributional Hypothesis
The key insight predates neural networks by decades. J.R. Firth wrote in 1957: “you shall know a word by the company it keeps.”
Words that appear in similar contexts tend to have similar meanings. “Dog” and “cat” appear with “pet,” “feed,” “veterinarian,” “fur.” “Paris” and “London” appear with “capital,” “city,” “flight,” “visit.” The contextual distribution of a word is a fingerprint of its meaning.
This is the distributional hypothesis: distributional similarity predicts semantic similarity. It is an approximation - two words with identical contextual distributions might have opposite meanings (“good” and “bad” both follow “the movie was”). But as an approximation, it is remarkably powerful.
If you could encode the distributional information of each word into a dense vector, words with similar distributions would have similar vectors. This is exactly what word embeddings do.
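Before any neural network is involved, the idea can be sketched with raw context counts. The example below assumes a five-sentence toy corpus and a window radius of 2 (both invented for illustration), represents each word by the counts of its neighbors, and compares words by cosine similarity.

```python
from collections import Counter, defaultdict
import numpy as np

# Toy corpus (an assumption for illustration, not real data).
corpus = [
    "the dog chased the ball",
    "the cat chased the ball",
    "the dog ate the food",
    "the cat ate the food",
    "the senate passed the bill",
]

def context_counts(sentences, radius=2):
    """Count how often each word appears within `radius` positions of each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for t, word in enumerate(tokens):
            lo, hi = max(0, t - radius), min(len(tokens), t + radius + 1)
            for j in range(lo, hi):
                if j != t:
                    counts[word][tokens[j]] += 1
    return counts

counts = context_counts(corpus)
vocab = sorted(counts)

def vector(word):
    """The word's distributional fingerprint: its context counts over the vocabulary."""
    return np.array([counts[word][c] for c in vocab], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_dog_cat = cosine(vector("dog"), vector("cat"))
sim_dog_senate = cosine(vector("dog"), vector("senate"))
```

In this corpus "dog" and "cat" occur in exactly the same contexts, so their count vectors coincide, while "senate" shares only the function word "the" with them and scores lower. These count vectors are as long as the vocabulary; learned embeddings compress the same information into far fewer dimensions.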
Word2Vec: The Key Insight
Mikolov et al. (2013) introduced Word2Vec, a method for learning word embeddings by training a neural network on a prediction task. The prediction task is discarded after training; the embeddings are what you keep.
The two variants are:
Skip-gram: Given a center word, predict its surrounding context words.
CBOW (Continuous Bag of Words): Given the surrounding context words, predict the center word.
Skip-gram is more commonly used and tends to work better for rare words, so let’s focus on that.
The Skip-gram Objective
Given a corpus of text, for each word $w_t$ at position $t$, define a context window of radius $r$: the words at positions $t-r, \ldots, t-1, t+1, \ldots, t+r$. The objective is to maximize:
$$\mathcal{L} = \sum_{t=1}^{T} \sum_{-r \leq j \leq r, j \neq 0} \log P(w_{t+j} \mid w_t).$$
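The double sum ranges over (center, context) pairs, which can be enumerated directly. This is a minimal sketch: the function name `skipgram_pairs` is hypothetical, and it assumes offsets that fall outside the sentence are simply skipped, a boundary detail the formula leaves implicit.

```python
def skipgram_pairs(tokens, radius):
    """List (center, context) pairs for every position t and offset j != 0 with |j| <= radius."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-radius, radius + 1):
            # Skip the center itself and offsets that run off either end.
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, radius=1)
```

Each pair contributes one $\log P(w_{t+j} \mid w_t)$ term to the objective.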
The probability $P(c \mid w)$ (probability of context word $c$ given center word $w$) is modeled as a softmax over all vocabulary words:
$$P(c \mid w) = \frac{\exp(v_c^\top u_w)}{\sum_{c' \in V} \exp(v_{c'}^\top u_w)}.$$
Here $u_w \in \mathbb{R}^d$ is the embedding of word $w$ as a center word, and $v_c \in \mathbb{R}^d$ is the embedding of word $c$ as a context word. The model has two embedding matrices: one for center words ($U$) and one for context words (written $V$ here, not to be confused with the vocabulary, whose size $|V|$ gives the number of rows), each of size $|V| \times d$. The final embedding used for a word is typically $u_w$ (the center word embedding), or the average $(u_w + v_w)/2$.
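As a concrete reading of the softmax, the sketch below computes $P(c \mid w)$ for every context word at once. The matrices are randomly initialized and untrained, the sizes are toy values, and the context matrix is named `V_ctx` only to avoid clashing with the vocabulary symbol; all of these are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 6, 4                                   # toy sizes (assumption)
U = rng.normal(scale=0.1, size=(vocab_size, d))        # row w is the center embedding u_w
V_ctx = rng.normal(scale=0.1, size=(vocab_size, d))    # row c is the context embedding v_c

def context_distribution(w: int) -> np.ndarray:
    """Return P(c | w) for every context word c, given the index of center word w."""
    scores = V_ctx @ U[w]        # v_c^T u_w for all c at once
    scores -= scores.max()       # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

p = context_distribution(2)
log_prob = np.log(p[3])          # one log P(c | w) term of the objective
```

The normalizing sum runs over the whole vocabulary, so the cost of a single probability grows linearly with $|V|$.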
The network architecture is simple: a single hidden layer of size $d$, no nonlinearity. The input is a one-hot vector of size $|V|$; the input-to-hidden weights are $W^{\text{in}} \in \mathbb{R}^{|V| \times d}$; the hidden-to-output weights are $W^{\text{out}} \in \mathbb{R}^{d \times |V|}$, followed by a softmax. The rows of $W^{\text{in}}$ are the center-word embeddings $u_w$, and the columns of $W^{\text{out}}$ are the context embeddings $v_c$. Training updates these weights to predict context words better.
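Because the input is one-hot and there is no nonlinearity, the "hidden layer" is just a row lookup. A small check, assuming random toy weights named `W_in` and `W_out` after the matrices above:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5, 3                           # toy sizes (assumption)
W_in = rng.normal(size=(vocab_size, d))        # one embedding per row
W_out = rng.normal(size=(d, vocab_size))       # one output column per vocabulary word

word_index = 2
x = np.zeros(vocab_size)
x[word_index] = 1.0                            # one-hot input

hidden = x @ W_in                              # selects row `word_index` of W_in
logits = hidden @ W_out                        # one score per vocabulary word, fed to softmax
```

This is why practical implementations index into the embedding matrix instead of materializing one-hot vectors.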
The Softmax Problem
Computing the denominator of the softmax requires summing over all $|V|$ vocabulary words for every training example. With $|V| = 50000$ and millions of training examples, this is prohibitively expensive: every gradient step touches all $|V|$ output vectors.
Negative sampling is the standard solution. Instead of computing the full softmax, train a binary classifier: “is $(w, c)$ a real word-context pair, or is $c$ a randomly sampled ‘negative’ word?”
$$\mathcal{L}_{\text{NEG}} = \log \sigma(v_c^\top u_w) + \sum_{k=1}^{K} \mathbb{E}_{c_k \sim P_n}[\log \sigma(-v_{c_k}^\top u_w)],$$
where $P_n$ is a noise distribution (typically the unigram distribution raised to the 3/4 power) and $K$ is the number of negative samples (typically 5-20). This replaces the $|V|$-way softmax with $K+1$ logistic regressions - tractable with large vocabularies.
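The following sketch evaluates this objective for a single (center, context) pair. It assumes toy word counts and random, untrained vectors, and it replaces each expectation with one sampled negative, which is how the objective is usually estimated in practice. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to the 3/4 power, renormalized to sum to 1."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()

def log_sigmoid(x):
    """log sigma(x) = -log(1 + exp(-x)), computed stably."""
    return -np.logaddexp(0.0, -x)

def neg_sampling_objective(u_w, v_c, v_negs):
    """log sigma(v_c.u_w) plus the sum of log sigma(-v_k.u_w) over the sampled negatives."""
    positive = log_sigmoid(v_c @ u_w)
    negative = log_sigmoid(-(v_negs @ u_w)).sum()
    return float(positive + negative)

counts = np.array([1000, 100, 10, 1])          # toy word frequencies (assumption)
P_n = noise_distribution(counts)
unigram = counts / counts.sum()

d, K = 8, 5
U = rng.normal(scale=0.1, size=(len(counts), d))       # center embeddings
V_ctx = rng.normal(scale=0.1, size=(len(counts), d))   # context embeddings
neg_ids = rng.choice(len(counts), size=K, p=P_n)       # K negatives drawn from P_n
objective = neg_sampling_objective(U[0], V_ctx[1], V_ctx[neg_ids])
```

Raising counts to the 3/4 power flattens the distribution: compared with plain unigram sampling, the rarest word here is drawn more often and the most frequent word less often. This toy version does not filter out negatives that happen to equal the true context word.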
Why Word2Vec Works
The embeddings capture distributional information because the model is forced to encode it. To predict that “food” appears in the context of “dog,” the model must encode something about “dog” that makes it predict food-related words. That something, compressed into a $d$-dimensional vector, is the word’s distributional fingerprint.
Words that appear in similar contexts will end up with similar vectors because they must predict similar sets of context words. The model cannot afford to give “cat” and “dog” very different vectors if they appear in nearly identical contexts - it would have to learn separate prediction functions for nearly identical problems.
The reason word arithmetic works is deeper. Consider the relationship “is the capital of”: Paris is to France as Berlin is to Germany. In the training corpus, “Paris” appears with “France” in many of the same syntactic patterns as “Berlin” appears with “Germany” (“the capital of France,” “fly to Paris from France,” etc.). These similar syntactic relationships push the vector offsets to be similar: $v_{\text{Paris}} - v_{\text{France}} \approx v_{\text{Berlin}} - v_{\text{Germany}}$.
Analogical relationships show up as roughly constant vector offsets (strictly a translation rather than a linear map) because analogies are encoded as consistent syntactic patterns in text, syntactic patterns produce consistent distributional patterns, and distributional patterns get compressed into the geometry of the vectors.
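The mechanics of an analogy query can be sketched as follows. The vectors here are hand-built so that one coordinate loosely tracks royalty and another gender. They are an assumption for illustration, not trained embeddings, so the example shows only how the offset and nearest-neighbor search operate, not that training produces such structure.

```python
import numpy as np

# Hand-constructed 3-d vectors (NOT trained embeddings).
emb = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.8, 0.2]),
    "woman": np.array([0.1, -0.8, 0.2]),
    "apple": np.array([-0.5, 0.0, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' by nearest neighbor to emb[b] - emb[a] + emb[c]."""
    target = emb[b] - emb[a] + emb[c]
    # Exclude the query words; otherwise one of them is often the nearest vector.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

answer = analogy("man", "king", "woman")
```

With trained embeddings the same procedure is applied over the whole vocabulary, and excluding the three query words matters for the result.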
GloVe: Global Co-occurrence Statistics
Word2Vec uses local context windows: it sees a narrow window around each word. GloVe (Global Vectors, Pennington et al. 2014) uses global statistics: the full word-word co-occurrence matrix across the corpus.
Let $X_{ij}$ be the number of times word $j$ appears in the context of word $i$ across the entire corpus. GloVe trains word vectors $w_i$ and context vectors $\tilde{w}_j$ by minimizing:
$$\mathcal{L} = \sum_{i,j=1}^{|V|} f(X_{ij})\left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2,$$
where $f(X_{ij})$ is a weighting function that down-weights very frequent co-occurrences (common words like “the” and “is” co-occur with everything but are not informative). The target is $\log X_{ij}$: the model tries to make the dot product of two word vectors match the log co-occurrence count.
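A direct NumPy transcription of this loss is sketched below. The text above does not give a formula for $f$; the code assumes the form proposed in the GloVe paper, $f(x) = (x / x_{\max})^{\alpha}$ for $x < x_{\max}$ and $1$ otherwise, with $x_{\max} = 100$ and $\alpha = 0.75$. The co-occurrence matrix and the vectors are toy values.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows as (x / x_max)^alpha, capped at 1 (form assumed from the GloVe paper)."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted squared error between w_i.w~_j + b_i + b~_j and log X_ij."""
    observed = X > 0
    log_X = np.log(np.where(observed, X, 1.0))       # placeholder 1 avoids log(0)
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    weights = np.where(observed, glove_weight(X), 0.0)   # zero counts contribute nothing
    return float(np.sum(weights * (pred - log_X) ** 2))

rng = np.random.default_rng(0)
X = np.array([[0.0, 12.0, 3.0],
              [12.0, 0.0, 150.0],
              [3.0, 150.0, 0.0]])                    # toy co-occurrence counts (assumption)
d = 4
W = rng.normal(scale=0.1, size=(3, d))
W_tilde = rng.normal(scale=0.1, size=(3, d))
b = np.zeros(3)
b_tilde = np.zeros(3)
loss = glove_loss(X, W, W_tilde, b, b_tilde)
```

Since $f(0) = 0$, pairs that never co-occur drop out of the sum, so in practice only the nonzero entries of $X$ need to be stored and visited.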
GloVe has a cleaner mathematical story: the dot product of word vectors directly models log co-occurrence ratios, and this can be derived from a low-rank factorization of the co-occurrence matrix. In practice, GloVe and Word2Vec produce embeddings of similar quality; the choice often comes down to the corpus and computational budget.
Discomfort check. Word embeddings are not semantic. They are distributional. “Good” and “bad” often have similar embeddings - not because they mean the same thing, but because they appear in identical syntactic environments: “the movie was ___,” “it was really ___,” “I feel ___.” Distributional similarity is a proxy for semantic similarity that works well on average but fails badly for antonyms, negation, and sarcasm. When you use word embeddings for a downstream task, you are importing this limitation. If your task requires distinguishing good from bad, you need a model that goes beyond distributional co-occurrence - which is exactly what contextual language models (BERT, GPT) provide.
Subword Embeddings
Word2Vec and GloVe assign one vector per word. This creates problems:
- Rare words have very few training examples; their embeddings are noisy.
- Words not seen during training (out-of-vocabulary words) have no embedding at all.
- Morphological relationships are not captured: “unhappy,” “happiness,” “happy,” and “happily” each get separate vectors with no forced relationship.
Byte-pair encoding (BPE) addresses this by splitting words into subword units. Start with individual characters; repeatedly merge the most frequent adjacent pair into a new subword unit. After enough merges, frequent words become single tokens, rare words are split into subwords, and unknown words are split all the way to character sequences.
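The merge loop can be sketched in a few lines. This is a minimal illustration on an invented word-frequency table, not a production tokenizer: it ignores end-of-word markers and breaks ties by first occurrence, and the function names are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges; return the merge list and the segmented vocabulary."""
    # Start with every word split into individual characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break                                   # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)    # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges, vocab

# Toy word frequencies (an assumption for illustration).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(word_freqs, num_merges=4)
```

On this table the first merges build the suffix "est", shared by "newest" and "widest", and later merges turn the frequent word "low" into a single token while "lower" stays split as "low" + "e" + "r".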
FastText (Bojanowski et al. 2017) takes this further: each word is represented as the sum of embeddings of its character $n$-grams, extracted after wrapping the word in boundary markers (“<unhappy>”). The embedding of “unhappy” is influenced by the embeddings of “<un”, “hap”, “app”, “happy”, etc. This allows morphological information to be shared across words, and handles rare and unknown words gracefully.
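A rough sketch of the n-gram decomposition follows. It assumes the n-gram lengths 3 to 6 and the `<`/`>` boundary markers described in the FastText paper. The bucket count, the use of `zlib.crc32` as a stand-in hash, and the random n-gram table are illustrative assumptions, not fastText's actual internals.

```python
import zlib
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of the boundary-marked word, for n_min <= n <= n_max."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

BUCKETS, DIM = 1000, 8                         # toy sizes (assumption)
rng = np.random.default_rng(0)
ngram_table = rng.normal(scale=0.1, size=(BUCKETS, DIM))   # untrained n-gram embeddings

def word_vector(word):
    """Sum the embeddings of the word's n-grams; works for words never seen in training."""
    ids = [zlib.crc32(g.encode("utf-8")) % BUCKETS for g in char_ngrams(word)]
    return ngram_table[ids].sum(axis=0)

grams = char_ngrams("unhappy")
shared = set(grams) & set(char_ngrams("happy"))
vec = word_vector("unhappiest")                # an out-of-vocabulary word still gets a vector
```

Because "unhappy" and "happy" share n-grams such as "hap" and "app", their vectors share terms in the sum, which is how morphological information is passed between related words. In the paper, words in the training vocabulary also receive a whole-word vector that is added to the sum.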
SentencePiece is a language-independent subword tokenizer used in many modern models (including T5 and XLM-RoBERTa) that learns subword units from raw text without relying on whitespace. Multilingual BERT, by contrast, uses the related WordPiece scheme.
Static vs. Contextual Embeddings
The fundamental limitation of Word2Vec, GloVe, and FastText is that each word gets exactly one vector. The word “bank” has one embedding, whether it appears as a riverbank or a financial institution. The word “light” has one embedding whether it means a physical lamp or “not heavy.” These are completely different senses that happen to share a spelling.
Contextual language models solve this. BERT, GPT, and their descendants produce a different embedding for each occurrence of a word, based on the full surrounding context. The embedding of “bank” in “the river bank was muddy” is different from its embedding in “the bank approved my mortgage.” These models don’t look up a fixed vector; they compute a vector by running the full network over the context.
This is the key advance that makes modern NLP systems so much more capable. Word2Vec embeddings are a useful baseline and a good way to understand the distributional hypothesis, but for applications where a word's sense depends on its context you will generally want contextual embeddings.
Training Your Own vs. Using Pretrained
For most tasks, you should use pretrained embeddings rather than training from scratch. Word2Vec and GloVe embeddings trained on large corpora (Google News, Common Crawl) are publicly available. They encode knowledge from billions of words of text and generalize well to many downstream tasks.
Training from scratch makes sense when:
- Your domain is highly specialized (medical, legal, code) and standard corpora are not representative.
- Your corpus is large enough to support good estimation (hundreds of millions of tokens).
- You need subword-level representations that standard pretrained embeddings don’t provide.
In most settings, pretrained embeddings as initialization, fine-tuned on your downstream task, outperform both random initialization and frozen pretrained embeddings.
Summary
| Method | Key Idea | Input | Global Stats? | OOV Handling |
|---|---|---|---|---|
| One-hot | Index encoding | Word | No | No |
| Word2Vec | Predict context from word | Word | No | No |
| GloVe | Factorize co-occurrence matrix | Word | Yes | No |
| FastText | Sum of character n-gram vectors | Subword | No | Yes |
| BPE/SentencePiece | Subword tokenization | Subword | Varies | Yes |
| BERT/GPT | Context-dependent embeddings | Token | Yes (via pretraining) | Yes |
Word embeddings are the bridge between raw text and the mathematical operations of machine learning. They transform the question “what is a word?” into “what does this word predict, and what predicts it?” - a distributional question that is answerable from data. The surprising finding is that this distributional approach, applied at scale, recovers much of the structure we associate with meaning: analogies, synonymy, syntactic relationships.
The limitation is real: distributional similarity is not semantic similarity, and context matters. That limitation is what drove the field from static embeddings to contextual ones. But the intuition from word embeddings - that linguistic meaning can be encoded in geometric structure - carries through to every modern language model.