Recommendation Systems - How Platforms Know What You Want Before You Do
Helpful context:
- Linear Algebra for ML - The Language Beneath the Algorithms
- Dimensionality Reduction - Keeping the Signal, Dropping the Noise
Every major platform runs on recommendations. Netflix’s homepage is a recommendation. Spotify’s Discover Weekly is a recommendation. Amazon’s “customers also bought” is a recommendation. In aggregate, recommendations drive somewhere between 30% and 80% of user engagement depending on the platform. They are arguably the highest-value ML systems in existence.
The core problem: you have millions of users and millions of items. Most users have interacted with a tiny fraction of items - the matrix of interactions is sparse. You need to predict which items a given user will find valuable, in real time, for items they have never seen.
Collaborative Filtering: The Wisdom of Crowds
Collaborative filtering is based on a single powerful assumption: users who agreed in the past will agree in the future. If you and another person have both watched and enjoyed the same 15 obscure films, they are probably a better predictor of your next watch than any description of the film itself.
This works entirely from interaction data - clicks, watches, purchases, ratings - without needing to understand anything about the items themselves. A recommendation system using collaborative filtering does not need to know that a film is a noir thriller from 1944. It just needs to know that users who liked it also liked other specific films.
User-Based Collaborative Filtering
Find users similar to the target user. Recommend items those similar users liked that the target user has not seen yet.
Similarity is measured by looking at the overlap in rating histories. If user A and user B have both rated many of the same items, and their ratings correlate strongly, they are similar. The Pearson correlation is common for explicit ratings (1-5 stars). For implicit feedback (watched/not watched, clicked/not clicked), cosine similarity on the interaction vectors works well.
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# user_item_matrix: rows = users, columns = items, values = ratings (0 if unrated)
user_similarity = cosine_similarity(user_item_matrix)
def recommend_user_based(user_id, k_neighbors=20, n_recommendations=10):
similar_users = np.argsort(user_similarity[user_id])[::-1][1:k_neighbors+1]
# Aggregate items rated highly by similar users that target user hasn't rated
candidate_scores = {}
for neighbor in similar_users:
similarity = user_similarity[user_id][neighbor]
for item_id, rating in enumerate(user_item_matrix[neighbor]):
if rating > 0 and user_item_matrix[user_id][item_id] == 0:
candidate_scores[item_id] = candidate_scores.get(item_id, 0) + similarity * rating
return sorted(candidate_scores, key=candidate_scores.get, reverse=True)[:n_recommendations]
The problem with user-based CF at scale: you have 100 million users. To recommend for any user, you compute similarity against all other users - that is $O(n^2)$ similarities, and the user base changes constantly. Pre-computing all pairwise similarities requires enormous storage and becomes stale quickly.
Item-Based Collaborative Filtering
Amazon’s breakthrough (Sarwar et al., 2001 at Amazon) was to flip the perspective: instead of finding similar users, find similar items. Items change much more slowly than users. You can pre-compute all item-item similarities offline, store them, and serve recommendations in real time by looking up the user’s rated items and finding similar items.
“Customers who bought X also bought Y” is item-based CF. The model does not know what X or Y is - it knows only that the people who bought X have a high propensity to buy Y, based on the buying patterns of millions of past users.
item_similarity = cosine_similarity(user_item_matrix.T) # items x items
def recommend_item_based(user_id, n_recommendations=10):
rated_items = np.where(user_item_matrix[user_id] > 0)[0]
scores = np.zeros(user_item_matrix.shape[1])
for item in rated_items:
rating = user_item_matrix[user_id][item]
scores += rating * item_similarity[item]
# Zero out already-rated items
scores[rated_items] = 0
return np.argsort(scores)[::-1][:n_recommendations]
Item-based CF is more scalable because item similarities are stable (an action film stays similar to other action films), and you can pre-compute them in a batch job rather than on the fly.
Limitations of all neighborhood-based CF:
- Cold start: a new user with no history gets nothing useful. A new item with no interactions cannot be recommended to anyone.
- Popularity bias: items with many interactions (popular items) tend to dominate. A niche item with 50 extremely loyal fans gets lower similarity scores than a mediocre mainstream item.
- Sparsity: if users rate only 20 items out of 10 million, most user-user similarities are computed from near-zero overlap. The signal is weak.
Matrix Factorization: Finding the Hidden Dimensions
Matrix factorization solves the sparsity problem by learning a dense, low-dimensional representation for every user and every item. The idea: there are latent factors - things like “action film enthusiast,” “prefers slow-burn dramas,” “only watches short content” - that explain most of the variation in ratings. You do not know these factors upfront; you learn them from the data.
Represent the user-item rating matrix $R$ as the product of two low-rank matrices: $$R \approx U \cdot V^T$$
where $U \in \mathbb{R}^{m \times k}$ has one row per user (each row is a $k$-dimensional embedding), and $V \in \mathbb{R}^{n \times k}$ has one row per item. The predicted rating of user $u$ for item $i$ is the dot product of their embeddings: $$\hat{r}_{ui} = \mathbf{u}_u \cdot \mathbf{v}_i$$
You train by minimizing reconstruction error on observed ratings with regularization: $$\min_{U,V} \sum_{(u,i) \in \text{observed}} (r_{ui} - \mathbf{u}_u \cdot \mathbf{v}_i)^2 + \lambda(|\mathbf{u}_u|^2 + |\mathbf{v}_i|^2)$$
The regularization term $\lambda$ prevents the embeddings from growing arbitrarily large to fit noise.
ALS: Alternating Least Squares
ALS fixes one matrix and solves for the other analytically, then alternates:
- Fix $V$, solve for each row of $U$ (closed-form least squares per user).
- Fix $U$, solve for each row of $V$ (closed-form least squares per item).
- Repeat until convergence.
This parallelizes well - all user embeddings can be updated simultaneously given a fixed $V$, and vice versa. Spark MLlib’s ALS runs efficiently on clusters with billions of interactions. Netflix famously used ALS (under the name “weighted ALS”) as part of the ensemble that won the Netflix Prize.
Why Factorization Beats Neighborhood Methods
Factorization handles sparsity elegantly. Even if user A and user B share only 3 rated items, their embeddings are learned from the entire matrix jointly. User A’s embedding captures their taste in 50 dimensions; user B’s embedding captures their taste in the same 50 dimensions; the dot product measures their compatibility in latent taste space, even for items neither has rated.
The embeddings also capture transitivity: if A’s embedding is similar to B’s, and B’s is similar to C’s, A and C will have correlated embeddings even if they share no direct overlap.
Content-Based Filtering: The Item Itself
Content-based filtering recommends items similar to ones the user has liked, based on the item’s own features rather than other users' behavior.
A news recommendation system might represent each article as a TF-IDF vector of its words, compute cosine similarity between articles, and recommend articles similar to ones the user clicked. A music service might represent each song by audio features (tempo, energy, danceability) and recommend songs with similar feature profiles.
Pros: solves the item cold start problem - a new item can be recommended immediately once its features are known, without any interaction data.
Cons: creates filter bubbles. If you liked three jazz albums, you get jazz recommendations forever. It cannot surprise you with something outside your established taste profile. It also requires good item features, which are easy for text (use the content) but hard for other domains.
The Cold Start Problem
Every recommendation system hits the cold start problem from two directions:
New user cold start: the user has no history. Solutions include: asking for explicit preferences during onboarding (Spotify’s genre selection), using demographic information or geographic location as weak signals, recommending popular items as a safe default, and onboarding flows that collect a few explicit ratings.
New item cold start: the item has no interactions. Solutions include: content-based features to find similar items and inherit their distribution, manual curation for high-priority items (Netflix’s editorial team promotes new originals), exploration-exploitation strategies that deliberately show new items to some users to collect initial signal.
Two-Stage Retrieval and Ranking
In production, you almost never run a single model over all items. With millions of items, even a cheap per-item score computation takes too long. The standard architecture is a two-stage pipeline:
Stage 1 - Retrieval (Candidate Generation): select a few hundred or thousand candidate items from the millions available. This must be extremely fast - often done with approximate nearest neighbor search over embeddings. You pre-compute all item embeddings and build an ANN index (Faiss, ScaNN, HNSW). At serving time, look up items nearest to the user’s embedding in milliseconds. Retrieval models optimize for recall - it is okay to include some irrelevant items, but you must not miss good ones.
Stage 2 - Ranking: take the few hundred candidates from retrieval and score them carefully with a more expensive model that can use rich features. This is typically a LightGBM or neural network that incorporates: user features (demographics, session context), item features (content metadata, freshness), interaction features (user-item affinities), and context features (time of day, device, recent actions). Ranking optimizes for precision - you are selecting the top 10 to show from 200 candidates.
The retrieval-ranking split lets you use powerful models in stage 2 because you are only scoring hundreds of items, not millions.
Evaluation
Recommendation quality is hard to measure offline because the training data is not a random sample - it reflects what users were shown, which reflects previous recommendations.
Offline metrics:
- Precision@k: of the top-$k$ recommendations, what fraction did the user actually interact with?
- Recall@k: of the items the user interacted with, what fraction appear in the top-$k$?
- NDCG (Normalized Discounted Cumulative Gain): accounts for position - an interaction at rank 1 is worth more than one at rank 10.
- Mean Average Precision (MAP): average of precision at each relevant item’s position.
The fundamental offline problem: you can only evaluate on items the user was shown. If your model recommends item X and the user never saw it, you do not know whether they would have liked it. This leads to optimistic offline metrics that overestimate real performance.
Online metrics are the ground truth: click-through rate, watch time, purchase rate, long-term retention. A/B tests comparing recommendation variants are standard before deploying changes.
| Concept | Key point |
|---|---|
| Collaborative filtering | Recommend based on similar users' or items' behavior; no content features needed |
| User-based CF | Find similar users; doesn’t scale to millions of users |
| Item-based CF | Find similar items; pre-computable offline; scales well |
| Matrix factorization | Learn latent embeddings for users and items; handles sparsity |
| ALS | Alternating Least Squares; parallelizable; used at Netflix scale |
| Cold start | New users/items have no history; content features or onboarding solve item side |
| Two-stage retrieval-ranking | Retrieval: fast candidates via ANN; ranking: expensive model on small set |
| Precision@k / NDCG | Standard offline metrics; always complement with online A/B tests |
Read Next: