

The standard assumption in machine learning is comfortable: you have all your data, you train once, and you deploy. The model is frozen. The world stands still.

Now consider a fraud detection system that processes thousands of transactions per second. A news recommendation engine where yesterday’s model knows nothing about today’s breaking story. A stock price predictor operating in a market that shifts every millisecond. You cannot retrain from scratch every few minutes. The world does not wait for you to finish your training loop.

This is the setting that online learning is built for.

Batch vs Online Learning

Batch learning means training on the entire dataset at once, producing a fixed model, and deploying it. The model does not change after training ends. This is the default mode for almost every tutorial and course.

Online learning means updating the model parameters on each new example - or each small group of examples - as they arrive, without ever seeing the full dataset at once. The model is always changing. The model at time $t+1$ has seen one more example than the model at time $t$.

The trade-off is real. Batch learning allows multiple passes over the data, more stable gradient estimates, and careful validation before deployment. Online learning allows continuous adaptation, handles unbounded data streams, and can track a changing world. The right choice depends on whether your data arrives all at once or continuously, and whether the underlying distribution is stationary or shifting.

The Perceptron - the Original Online Learner

The perceptron algorithm (Rosenblatt, 1957) is one of the earliest machine learning algorithms, and it is inherently online. It classifies inputs into two classes: $+1$ and $-1$.

The setup: a weight vector $w \in \mathbb{R}^d$ and a bias $b$. Given an input $x$, the perceptron predicts $\hat{y} = \text{sign}(w \cdot x + b)$. That is the full model.

The online update rule is:

  • Receive example $(x_t, y_t)$ where $y_t \in \{+1, -1\}$.
  • Predict $\hat{y}_t = \text{sign}(w_t \cdot x_t + b)$.
  • If $\hat{y}_t = y_t$: do nothing. The model was right.
  • If $\hat{y}_t \neq y_t$: update $w_{t+1} \leftarrow w_t + \eta \cdot y_t \cdot x_t$.

Here $\eta > 0$ is the learning rate - a scalar controlling how large each update is. If the model predicted $-1$ but the true label was $+1$, then $y_t \cdot x_t$ points in the direction of $x_t$, pushing the weight vector toward classifying $x_t$ as $+1$ next time. If the model was right, the weights do not change at all. The perceptron is lazy - it only updates when it makes a mistake.

Worked example. Suppose $w = [0, 0]$, $\eta = 1$, and the bias is fixed at $b = 0$. We receive the example $x = [1, 2]$, $y = +1$. The prediction is $\text{sign}(0 \cdot 1 + 0 \cdot 2) = \text{sign}(0) = -1$ (treating ties as $-1$). We were wrong. The update is $w \leftarrow [0,0] + 1 \cdot (+1) \cdot [1,2] = [1, 2]$. Presented with the same $x$ again, the model now predicts $\text{sign}(1 \cdot 1 + 2 \cdot 2) = \text{sign}(5) = +1$. Correct.
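The update rule and the worked example can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `OnlinePerceptron` and its methods are made up for this example, and the bias is left fixed at zero because the rule above only updates $w$.

```python
import numpy as np

class OnlinePerceptron:
    """Mistake-driven perceptron for labels +1 / -1 (hypothetical helper class)."""

    def __init__(self, n_features, eta=1.0):
        self.w = np.zeros(n_features)
        self.b = 0.0          # kept fixed here; the rule above only updates w
        self.eta = eta

    def predict(self, x):
        # Ties (score == 0) are assigned to -1, matching the worked example
        return 1 if np.dot(self.w, x) + self.b > 0 else -1

    def update(self, x, y):
        """Process one example; return True if a mistake triggered an update."""
        if self.predict(x) == y:
            return False      # correct prediction: weights stay untouched
        self.w = self.w + self.eta * y * np.asarray(x, dtype=float)
        return True

model = OnlinePerceptron(n_features=2, eta=1.0)
made_mistake = model.update([1, 2], +1)   # the example from the text
prediction_after = model.predict([1, 2])
```

After the single update, `model.w` holds the weights from the worked example and the same input is classified as $+1$.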

The perceptron convergence theorem says that if the training data is linearly separable, the perceptron is guaranteed to converge to a separating hyperplane in a finite number of updates. If the data is not linearly separable, the guarantee is lost: some example is always misclassified, so the perceptron keeps updating indefinitely and never settles on a final weight vector.
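The "finite number of updates" can be made concrete. A standard form of the bound (often attributed to Novikoff) is stated below for the bias-free case with weights initialized to zero. The symbols $R$, $\gamma$, and $w^*$ are notation introduced here for illustration. Suppose every input satisfies $\|x_t\| \le R$, and some unit vector $w^*$ separates the data with margin $\gamma > 0$, meaning $y_t \, (w^* \cdot x_t) \ge \gamma$ for all $t$. Then:

$$\text{number of mistakes} \le \left( \frac{R}{\gamma} \right)^2$$

The bound does not depend on the number of examples or on the dimension $d$, only on how large the inputs are relative to the margin.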

SGD as Online Learning

Stochastic gradient descent (SGD) with batch size 1 is online learning. For each incoming example $(x_t, y_t)$, compute the gradient of the loss with respect to the parameters using only that single example, and take a gradient step. The parameters are updated after every single data point.

$$\theta_{t+1} \leftarrow \theta_t - \eta \cdot \nabla_\theta \mathcal{L}(\theta_t; x_t, y_t)$$

Here $\theta_t$ are the model parameters at step $t$, $\mathcal{L}$ is the loss function, and $\eta$ is the learning rate.
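As a concrete instance of this update, here is a minimal sketch of online logistic regression. The function names, the synthetic labeling rule, and the choice of labels in $\{0, 1\}$ are assumptions made for the example. For the log loss on a single example, the gradient is $(\sigma(\theta \cdot x) - y) \, x$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(theta, x, y, eta=0.1):
    """One online update for logistic regression with y in {0, 1}."""
    grad = (sigmoid(np.dot(theta, x)) - y) * x   # gradient from this example only
    return theta - eta * grad

# Simulated stream: each example is seen once and then discarded
rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(2000):
    x = rng.normal(size=2)
    y = 1.0 if x[0] - x[1] > 0 else 0.0   # synthetic rule, for illustration only
    theta = sgd_step(theta, x, y)

single_step = sgd_step(np.zeros(2), np.array([1.0, 2.0]), 1.0, eta=0.1)
```

No example is stored after its update, which is what makes the procedure usable on an unbounded stream.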

Mini-batch SGD (batch sizes of 32, 64, 256) is a middle ground - it averages the gradient over a small group of examples, reducing noise while still updating frequently. In an online setting, mini-batches correspond to processing a small buffer of recently arrived examples before updating.
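The buffering step might look like the sketch below. The helper name `mini_batches` is hypothetical; it simply groups whatever iterable the stream provides into lists of at most `batch_size` examples.

```python
from itertools import islice

def mini_batches(stream, batch_size):
    """Group an iterable of examples into lists of up to batch_size items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return          # stream exhausted
        yield batch

# range(10) stands in for ten arriving examples
batches = list(mini_batches(range(10), batch_size=4))
```

The final batch can be smaller than `batch_size`; whether to update on a partial batch or wait for more data is a design choice that depends on how stale the model is allowed to get.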

The important consequence: any algorithm that can be optimized with SGD can in principle be run online. Linear regression, logistic regression, and neural networks all fit this description.

Which Algorithms Support Online Learning Natively

Not every algorithm can be updated incrementally. The key question is whether the algorithm can incorporate a new example without reprocessing all previous examples.

Online-compatible algorithms:

  • Perceptron - updates one example at a time by design.
  • SGD-based models - linear regression, logistic regression, neural networks - any model with a gradient-computable loss.
  • Naive Bayes - internally maintains counts of features per class. Adding a new example means incrementing counts. No retraining required.
  • Passive-aggressive algorithms - an online learning framework that updates aggressively when a mistake is made and remains passive otherwise, with a configurable aggressiveness parameter.
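The counting idea behind Naive Bayes is easy to see in code. The sketch below is a simplified multinomial-style variant over token lists with Laplace smoothing; the class and method names are made up for this example and do not mirror any particular library.

```python
import math
from collections import defaultdict

class CountingNaiveBayes:
    """Naive Bayes over token lists, stored purely as counts (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                                   # Laplace smoothing
        self.class_counts = defaultdict(int)                 # examples per class
        self.token_counts = defaultdict(lambda: defaultdict(int))
        self.total_tokens = defaultdict(int)                 # tokens per class
        self.vocab = set()

    def learn_one(self, tokens, label):
        # Incorporating an example is just incrementing counters
        self.class_counts[label] += 1
        for tok in tokens:
            self.token_counts[label][tok] += 1
            self.total_tokens[label] += 1
            self.vocab.add(tok)

    def predict_one(self, tokens):
        n = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / n)                      # log prior
            denom = self.total_tokens[label] + self.alpha * len(self.vocab)
            for tok in tokens:
                seen = self.token_counts[label].get(tok, 0)
                score += math.log((seen + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = CountingNaiveBayes()
nb.learn_one(["win", "money", "now"], "spam")
nb.learn_one(["meeting", "at", "noon"], "ham")
nb.learn_one(["free", "money"], "spam")
```

Each call to `learn_one` touches only the counters for the tokens it sees, so the cost of an update does not depend on how many examples came before.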

Algorithms that require all data at once:

  • Decision trees - the splitting criterion requires computing statistics over the entire dataset. Incremental variants exist but are complex and less efficient.
  • Standard SVMs - the quadratic programming formulation is solved over the full training set, because which examples end up as support vectors is not known in advance.
  • k-NN - not a parametric model at all; it stores all training examples and uses them at inference time. Every new example is simply added to the store, but inference cost grows linearly with the number of stored examples.

In scikit-learn, the online learning interface is partial_fit. Algorithms that support incremental learning implement partial_fit(X_batch, y_batch) in addition to the usual fit. Calling partial_fit updates the model on the provided batch without forgetting previous updates.

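```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def data_stream(n_batches=20, batch_size=32, seed=0):
    # Hypothetical stand-in for a real stream: synthetic (X, y) batches
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 2))
        yield X, (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(loss="log_loss", learning_rate="optimal")

# Streaming loop - data arrives one batch at a time
for X_batch, y_batch in data_stream():
    # classes is required on the first call because a batch may lack a class
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])

# At any point the model can predict
X_new = np.array([[1.0, 1.0], [-1.0, -1.0]])  # hypothetical new inputs
predictions = clf.predict(X_new)
```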

Algorithms in scikit-learn that support partial_fit include: SGDClassifier, SGDRegressor, Perceptron, MultinomialNB, BernoulliNB, PassiveAggressiveClassifier, PassiveAggressiveRegressor, MiniBatchKMeans.

Concept Drift - the Enemy of Online Learning

Online learning solves the problem of continuous data arrival. But it introduces a new problem: the world changes. The distribution of the data at time $t + 1000$ may be very different from the distribution at time $t$.

Concept drift is when the conditional distribution $p(y \mid x)$ changes over time. The same input $x$ now maps to a different output $y$ than it did before. A model trained on pre-pandemic shopping patterns fails in 2023 because consumer behavior shifted. A fraud detection model trained before a new attack vector is introduced will not catch the new fraud because the distribution of fraud has changed.

Types of drift:

  • Sudden drift: the distribution changes abruptly. A new product launches and user preferences shift overnight.
  • Gradual drift: the distribution shifts slowly. Consumer tastes evolve over months.
  • Recurring drift: the distribution cycles. A seasonal pattern where shopping behavior in December differs from July, then returns.

Detecting drift. One approach is to monitor model performance on a rolling window of recent examples, scoring each prediction before the model trains on that example. If the accuracy on the last 1000 examples drops significantly relative to the previous window, drift has likely occurred. Two statistical methods designed specifically for streaming data are:

  • Page-Hinkley test: tracks the cumulative sum of deviations of a monitored statistic from its running mean. A large deviation signals drift.
  • ADWIN (ADaptive WINdowing): maintains a sliding window of recent data and automatically shrinks the window when the distribution of the old portion differs significantly from the recent portion. The window size adapts to the rate of drift.
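The Page-Hinkley idea is compact enough to sketch directly. The version below watches for an upward shift in a monitored value such as the per-example error. The parameter names `delta` (tolerated change) and `threshold` (alarm level) and their default values are choices made for this example, not recommended settings.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # magnitude of change that is tolerated
        self.threshold = threshold  # alarm level
        self.n = 0
        self.mean = 0.0             # running mean of the monitored value
        self.cum = 0.0              # cumulative deviation from the running mean
        self.cum_min = 0.0          # smallest cumulative deviation seen so far

    def update(self, value):
        """Feed one observation; return True if drift is signalled."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cum += value - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold

# Synthetic error stream: the model is always right, then always wrong
stream = [0.0] * 200 + [1.0] * 100
detector = PageHinkley()
first_alarm = None
for i, value in enumerate(stream):
    if detector.update(value):
        first_alarm = i
        break
```

On this artificial stream the alarm fires a handful of steps after the change at index 200. Real error streams are noisier, and `delta` and `threshold` trade detection delay against false alarms.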

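Handling drift. Plain online learning can adapt slowly: when the learning rate decays over time, each new example moves the weights less than the one before it, so old data dominates the current weights. Better strategies:

  • Weight recent examples more heavily by keeping the learning rate constant (or bounded below) rather than decaying it toward zero, or by using explicit example weights that shrink with age.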
  • Use a sliding window and discard examples older than $W$ steps. The model only reflects the last $W$ examples.
  • Ensemble approaches: maintain a pool of models trained on different time windows, blending them or selecting the best-performing one on recent data.
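The effect of recency weighting shows up even in the simplest possible model, an online estimate of a stream's mean. In the sketch below (function and variable names are illustrative), a step size of $1/t$ reproduces the plain running average, which weights every example equally, while a constant step size forgets old examples exponentially.

```python
def track_mean(stream, step):
    """Online mean estimate; step(t) is the learning rate at step t (1-based)."""
    estimate = 0.0
    for t, value in enumerate(stream, start=1):
        estimate += step(t) * (value - estimate)
    return estimate

# The true mean jumps from 0 to 1 after 500 steps
stream = [0.0] * 500 + [1.0] * 100

equal_weight = track_mean(stream, lambda t: 1.0 / t)   # plain running average
forgetful = track_mean(stream, lambda t: 0.05)         # constant step size
```

After the jump, the equal-weight estimate is still dominated by the 500 old examples, while the constant-step estimate has moved almost all the way to the new mean. The price is higher variance when the distribution is actually stable.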

The Forgetting Problem

Naive online learning accumulates all updates forever. If the distribution shifts, the old examples continue pulling the weights in the wrong direction. This is a slow, persistent form of interference.

In neural networks, this becomes catastrophic forgetting: when you fine-tune a network on new data, gradient descent moves the weights toward the new data’s objective, which can erase the patterns learned from old data. A network fine-tuned to recognize a new class of images might lose accuracy on classes it learned earlier.

One principled solution is elastic weight consolidation (EWC). The idea: identify which weights are most important for previously learned tasks (using the Fisher information matrix as a proxy), and add a regularization term that penalizes large changes to those important weights. The model can still adapt to new data but is constrained from moving away from its previous knowledge on the dimensions that mattered most.
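Written out, the EWC objective commonly takes the following form. The notation is introduced here for illustration: $\theta^*$ are the weights after training on the earlier task, $F_i$ is the diagonal Fisher information entry for weight $i$, $\lambda$ controls how strongly old knowledge is protected, and $\mathcal{L}_{\text{new}}$ is the loss on the new data.

$$\mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_{\text{new}}(\theta) + \frac{\lambda}{2} \sum_i F_i \, (\theta_i - \theta^*_i)^2$$

A weight with large $F_i$ is expensive to move, while a weight with $F_i$ near zero is free to adapt to the new data.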

A simpler practical approach: keep a small replay buffer of old examples and mix them into each training batch. This way, the model is never trained only on new data; it always sees a mix of old and new.
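A replay buffer can be sketched as follows. Reservoir sampling is used here to decide which old examples to keep. That is one reasonable choice rather than something the approach requires, and the class and method names are made up for this example.

```python
import random

class ReplayBuffer:
    """Fixed-size store of past examples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # total examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / seen
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def mixed_batch(self, new_examples, n_old):
        """Return the new examples plus up to n_old sampled old ones."""
        k = min(n_old, len(self.items))
        return list(new_examples) + self.rng.sample(self.items, k)

buffer = ReplayBuffer(capacity=100)
for i in range(1000):
    buffer.add(("old", i))

batch = buffer.mixed_batch([("new", 0), ("new", 1)], n_old=6)
```

Reservoir sampling keeps every past example with equal probability, so the buffer stays a uniform sample of the whole history while using constant memory.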

Real Deployments

Ad click-through rate prediction at scale (Google, Meta) is one of the highest-stakes applications of online learning. Hundreds of thousands of model updates per minute. The model must capture the fact that a newly viral topic at 2pm today is irrelevant by 6pm, and a completely new topic will dominate tomorrow. Batch retraining would always be stale. The models used are typically large logistic regression models or gradient-boosted trees with online variants, updated on every click event in real time.

Recommendation systems use implicit feedback (did the user click? did they watch the video to completion?) to update their models continuously. A recommendation made at 9am reflects user behavior through 8:59am. By noon, the model has already absorbed three hours of new signal.

Autonomous vehicle perception improves as the fleet encounters edge cases. When a specific driving scenario causes prediction errors, the model is updated incrementally with that scenario. The fleet effectively learns from its own mistakes in deployment.


| Concept | Key point |
| --- | --- |
| Batch learning | Train once on all data; model is frozen after deployment |
| Online learning | Update on each new example or mini-batch; model evolves continuously |
| Perceptron update | $w \leftarrow w + \eta \cdot y \cdot x$ when wrong; no update when right |
| partial_fit | scikit-learn interface for online-compatible algorithms |
| Concept drift | $p(y \mid x)$ changes over time; the model becomes stale |
| ADWIN | Adaptive windowing for drift detection in streams |
| Catastrophic forgetting | Fine-tuning erases old knowledge; EWC and replay buffers help |
