The Distributional Hypothesis

The foundation of word embeddings is a linguistic observation: “You shall know a word by the company it keeps” (Firth, 1957). Words that appear in similar contexts tend to have similar meanings. This distributional hypothesis gives us a way to learn word representations from raw text without any human annotation - we simply observe which words co-occur with which other words.

One-Hot Encoding and Its Problems

The naive representation assigns each word a vector of length $|V|$ (vocabulary size) with a single 1 at its index. One-hot vectors are orthogonal: $\mathbf{e}_{\text{cat}}^T \mathbf{e}_{\text{dog}} = 0$, just as $\mathbf{e}_{\text{cat}}^T \mathbf{e}_{\text{table}} = 0$. There is no notion of similarity. Additionally, for vocabularies of 50,000+ words, these vectors are extremely high-dimensional and sparse - any model operating on them must learn from scratch that “cat” and “dog” share properties, for every downstream task.

The goal of word embeddings is to map each word to a dense vector $\mathbf{v} \in \mathbb{R}^d$ (typically $d \in \{100, 200, 300\}$) such that geometrically similar vectors correspond to semantically similar words.
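A minimal numpy sketch of the contrast (the dense vectors below are made-up illustrative values, not trained embeddings):

```python
import numpy as np

V = 50_000                        # vocabulary size
cat_idx, dog_idx = 0, 1           # hypothetical vocabulary indices

# One-hot: a single 1 at the word's index, zeros elsewhere.
e_cat = np.zeros(V); e_cat[cat_idx] = 1.0
e_dog = np.zeros(V); e_dog[dog_idx] = 1.0
print(e_cat @ e_dog)              # 0.0 - every pair of distinct words is orthogonal

# Dense embeddings (toy values): similarity is now graded.
v_cat = np.array([0.8, 0.1, 0.3])
v_dog = np.array([0.7, 0.2, 0.4])
cos = v_cat @ v_dog / (np.linalg.norm(v_cat) * np.linalg.norm(v_dog))
print(round(cos, 3))              # close to 1: "cat" and "dog" land near each other
```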

Word2Vec

Word2Vec (Mikolov et al., 2013) learns embeddings by training a shallow neural network on a self-supervised prediction task. The skip-gram model predicts surrounding context words given a center word.

For center word $w_c$ and context word $w_o$, the model defines:

$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_{w_o}^T \mathbf{v}_{w_c})}{\sum_{w=1}^{|V|} \exp(\mathbf{u}_w^T \mathbf{v}_{w_c})}$$

where $\mathbf{v}_{w_c}$ is the center embedding of $w_c$ and $\mathbf{u}_{w_o}$ is the context embedding of $w_o$. Each word thus has two embedding vectors; the center embeddings are typically used after training.

The training objective over a corpus is:

$$J = -\frac{1}{T}\sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t)$$
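A short sketch of how skip-gram training pairs are extracted from a tokenized corpus, assuming a window size $m$ (the corpus here is a stand-in):

```python
def skipgram_pairs(tokens, m=2):
    """Yield (center, context) pairs for each position t and offset j with j != 0."""
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield center, tokens[t + j]

tokens = "the cat sat on the mat".split()   # stand-in corpus
pairs = list(skipgram_pairs(tokens, m=2))
# for t=2 this yields ('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')
```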

Negative Sampling

The softmax denominator requires summing over the entire vocabulary - $O(|V|)$ per step, which is prohibitively expensive. Negative sampling replaces the softmax with a binary classification objective: distinguish the true context word from $K$ randomly sampled “noise” words drawn from $p_n(w) \propto f(w)^{3/4}$ (unigram frequency raised to the 3/4 power, which down-weights very frequent words):

$$J_{\text{neg}} = \log \sigma(\mathbf{u}_o^T \mathbf{v}_c) + \sum_{k=1}^K \mathbb{E}_{w_k \sim p_n}\left[\log \sigma(-\mathbf{u}_k^T \mathbf{v}_c)\right]$$

Note the sign convention: unlike $J$ above, $J_{\text{neg}}$ is maximized; equivalently, one minimizes $-J_{\text{neg}}$.

This reduces per-step cost from $O(|V|)$ to $O(K)$, where $K$ is small - Mikolov et al. report that 5-20 negatives work well for small corpora, while 2-5 suffice for large ones. Despite this approximation, the learned embeddings are empirically comparable in quality to those trained with the full softmax.
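A minimal numpy sketch of one negative-sampling step; the matrices V_emb (center) and U_emb (context), and all names here, are our own stand-ins:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(V_emb, U_emb, center, context, noise_probs, K=5):
    """Negative of J_neg for one (center, context) pair, to be minimized."""
    v_c = V_emb[center]                         # center embedding
    u_o = U_emb[context]                        # true context embedding
    negatives = np.random.choice(len(noise_probs), size=K, p=noise_probs)
    pos = np.log(sigmoid(u_o @ v_c))            # pull the true pair together
    neg = np.log(sigmoid(-(U_emb[negatives] @ v_c))).sum()  # push noise words away
    return -(pos + neg)

# Noise distribution: unigram counts raised to the 3/4 power, renormalized.
counts = np.array([10.0, 5.0, 3.0, 1.0, 1.0])   # toy corpus frequencies
noise_probs = counts ** 0.75 / (counts ** 0.75).sum()

V_emb = 0.1 * np.random.randn(5, 8)             # |V| = 5 words, d = 8
U_emb = 0.1 * np.random.randn(5, 8)
print(neg_sampling_loss(V_emb, U_emb, center=0, context=1, noise_probs=noise_probs))
```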

GloVe

GloVe (Pennington et al., 2014) takes a different approach: directly factorize the log co-occurrence matrix. Let $X_{ij}$ be the number of times word $j$ appears in the context of word $i$. GloVe minimizes:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij})\left(\mathbf{v}_i^T \tilde{\mathbf{v}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

where $f(X_{ij}) = \min\left(1, (X_{ij}/X_{\max})^{3/4}\right)$ is a weighting function that down-weights rare co-occurrences and caps the influence of very frequent ones, and $b_i$, $\tilde{b}_j$ are bias terms. The key insight is that ratios of co-occurrence probabilities encode meaning: $P(\text{solid} \mid \text{ice}) / P(\text{solid} \mid \text{steam})$ is large, capturing that “solid” is more related to “ice” than to “steam.”
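A sketch of the weighting function and one summand of the loss (variable names are ours; the original implementation optimizes this with AdaGrad over all nonzero $X_{ij}$):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X_ij): down-weights rare counts and caps at 1 above x_max."""
    return np.minimum(1.0, (x / x_max) ** alpha)

def glove_term(W, W_tilde, b, b_tilde, i, j, X_ij):
    """One summand of J: weighted squared error against log X_ij."""
    diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X_ij)
    return glove_weight(X_ij) * diff ** 2
```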

Analogy Arithmetic

The most famous property of word embeddings is analogy solving: $\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}$.

Why does this work geometrically? If the embedding space encodes the “royalty” concept as a direction $\mathbf{r}$ and “gender” as a direction $\mathbf{g}$, then:

$$\mathbf{v}_{\text{king}} \approx \mathbf{v}_{\text{man}} + \mathbf{r}, \qquad \mathbf{v}_{\text{queen}} \approx \mathbf{v}_{\text{woman}} + \mathbf{r}$$

Subtracting $\mathbf{v}_{\text{man}}$ from the first and adding $\mathbf{v}_{\text{woman}}$ gives $\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{r} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}$. This relies on the distributional hypothesis: if “king” and “queen” appear in similar syntactic contexts (replacing “man” and “woman” respectively), their difference vectors will encode the same semantic shift. The approximation is not perfect - the embedding space is not perfectly linear - but it holds with remarkable frequency for well-trained embeddings.
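A sketch of analogy solving by nearest neighbor under cosine similarity, assuming a dict mapping words to numpy vectors; the three query words are excluded from the candidates, as is standard practice:

```python
import numpy as np

def analogy(emb, a, b, c, topn=1):
    """Solve a : b :: c : ?  via  v_b - v_a + v_c  (e.g. man : king :: woman : ?)."""
    query = emb[b] - emb[a] + emb[c]
    query = query / np.linalg.norm(query)
    scores = {
        w: (v / np.linalg.norm(v)) @ query
        for w, v in emb.items()
        if w not in {a, b, c}              # exclude the input words
    }
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# usage: analogy(glove_vectors, "man", "king", "woman")  ->  ["queen"]
```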

fastText and Subword Embeddings

Word2Vec and GloVe assign one vector per word type, giving them no mechanism for handling out-of-vocabulary (OOV) words. fastText (Bojanowski et al., 2017) addresses this by representing each word as the sum of its character $n$-gram embeddings. For example, “where” with $n=3$ uses the $n$-grams <wh, whe, her, ere, re> (where < and > are boundary markers), plus the special whole-word sequence <where>.

The embedding of a word $w$ is $\mathbf{v}_w = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g$, where $\mathcal{G}_w$ is the set of $n$-grams in $w$ and $\mathbf{z}_g$ is the embedding of $n$-gram $g$ (a short sketch follows the list below). This:

  • Handles OOV words by composing known $n$-gram embeddings
  • Captures morphological structure (“play”, “playing”, “played” share subword grams)
  • Works well for morphologically rich languages (German, Finnish, Turkish)
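A minimal sketch of the decomposition and summation (real fastText hashes $n$-grams into a fixed number of buckets to bound the table size; that detail is omitted here):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the boundary-padded word, plus the whole word itself."""
    padded = f"<{word}>"
    grams = {padded[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(padded) - n + 1)}
    grams.add(padded)                        # the special whole-word sequence
    return grams

def word_vector(word, gram_emb, dim=300):
    """Sum the embeddings of the word's n-grams; works even for unseen words."""
    vecs = [gram_emb[g] for g in char_ngrams(word) if g in gram_emb]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

print(sorted(char_ngrams("where", 3, 3)))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```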

Intrinsic vs Extrinsic Evaluation

Intrinsic evaluation assesses embeddings in isolation:

  • Word similarity: Pearson/Spearman correlation between embedding cosine similarity and human-annotated similarity scores (WordSim-353, SimLex-999 benchmarks; see the sketch after this list)
  • Analogy tasks: Accuracy on analogy sets like Google’s analogy dataset
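A sketch of the word-similarity protocol, assuming emb is a dict of vectors and triples holds (word1, word2, human_score) rows from a benchmark file:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(emb, triples):
    """Spearman correlation between cosine similarities and human judgments."""
    model_scores, human_scores = [], []
    for w1, w2, human in triples:
        if w1 in emb and w2 in emb:          # skip OOV pairs, as is conventional
            v1, v2 = emb[w1], emb[w2]
            cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
            model_scores.append(cos)
            human_scores.append(human)
    return spearmanr(model_scores, human_scores).correlation
```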

Extrinsic evaluation measures performance on downstream NLP tasks: named entity recognition, sentiment analysis, machine translation. This is ultimately what matters - embeddings that score well on intrinsic benchmarks do not always transfer best.

Static vs Contextual Embeddings

Word2Vec, GloVe, and fastText produce static embeddings: each word type maps to a fixed vector regardless of context. “Bank” gets the same representation in “river bank” and “investment bank.”

ELMo (Peters et al., 2018) was the first widely used contextual embedding - it runs a bidirectional LSTM over the sentence and uses intermediate layer representations. BERT (Devlin et al., 2019) later replaced the LSTM with a Transformer encoder trained via masked language modeling, producing embeddings that are deeply contextual: the representation of each token depends on all other tokens in the sequence. Contextual embeddings substantially outperform static embeddings on nearly all downstream benchmarks.
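A sketch of the “bank” contrast using the Hugging Face transformers library (illustrative, not a benchmark; assumes the word tokenizes to the single WordPiece “bank”, which it does in bert-base-uncased):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Last-layer contextual vector of the token 'bank'."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]     # (seq_len, 768)
    idx = inputs.input_ids[0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

v1 = bank_vector("He sat on the river bank.")
v2 = bank_vector("She deposited cash at the bank.")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1: context matters
```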

Examples

Visualizing embeddings via PCA and t-SNE. Reducing 300-dimensional GloVe vectors to 2D with PCA reveals broad clusters: countries, capitals, and verbs group loosely. t-SNE (which preserves local neighborhood structure) reveals tighter semantic clusters - number words, colors, and animals separate clearly. Gender pairs (king/queen, actor/actress) appear as parallel translations in PCA plots, visually confirming the analogy structure.
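A sketch of the reduction, with random stand-ins where real GloVe vectors and their word labels would go:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-ins: replace with real 300-d GloVe vectors and their word labels.
vecs = np.random.randn(200, 300).astype("float32")
words = [f"w{i}" for i in range(200)]

def plot_2d(points, words, title):
    plt.figure()
    plt.scatter(points[:, 0], points[:, 1], s=8)
    for (x, y), w in zip(points, words):
        plt.annotate(w, (x, y), fontsize=7)
    plt.title(title)

plot_2d(PCA(n_components=2).fit_transform(vecs), words, "PCA")
plot_2d(TSNE(n_components=2, perplexity=30).fit_transform(vecs), words, "t-SNE")
plt.show()
```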

Transfer learning baseline. Initializing the embedding layer of a text classifier with pretrained GloVe or fastText vectors, then fine-tuning, typically improves accuracy by 3-8 percentage points over randomly initialized embeddings on small datasets (under 10,000 examples). The gain diminishes as dataset size grows, because the model can learn adequate embeddings from scratch. This pattern motivates the move to pretrained language models like BERT, which transfer richer contextual knowledge and yield larger gains across dataset sizes.
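A sketch of the initialization in PyTorch, assuming pretrained maps words to 300-d numpy vectors and vocab maps words to row indices (both names are ours):

```python
import numpy as np
import torch
import torch.nn as nn

def build_embedding(vocab, pretrained, dim=300):
    """Embedding layer seeded with pretrained vectors; unseen words get small noise."""
    weight = np.random.normal(0.0, 0.1, size=(len(vocab), dim)).astype("float32")
    for word, idx in vocab.items():
        if word in pretrained:
            weight[idx] = pretrained[word]
    # freeze=False lets the vectors fine-tune along with the rest of the classifier
    return nn.Embedding.from_pretrained(torch.from_numpy(weight), freeze=False)
```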

