Transfer learning is the practice of initialising a model with weights learned on a source task and then adapting those weights - fully or partially - to a target task. It is the dominant paradigm in modern deep learning: virtually every practical deployment of a large model begins with a pretrained checkpoint rather than random initialisation.

The Feature Reuse Hypothesis

Why does transfer work? The feature reuse hypothesis proposes that lower layers of deep networks learn general-purpose representations that are useful across many tasks, while higher layers learn task-specific features that are more readily replaced.

In CNNs, visualisation studies confirm this hierarchy: early layers learn oriented edge detectors resembling Gabor filters, middle layers learn texture and part detectors, and late layers learn class-specific patterns. A similar hierarchy appears in transformers: earlier layers encode syntactic structure, middle layers encode semantic relationships, and later layers encode task-specific distributions.

Formally, let $\theta = (\theta_{\text{low}}, \theta_{\text{high}})$ be the parameter partition. Transfer learning conjectures that:

$$\theta_{\text{low}}^{(S)} \approx \theta_{\text{low}}^{(T)}$$

where superscripts denote source and target tasks - i.e., the low-level representations are approximately task-invariant.

Fine-Tuning and Feature Extraction

Feature extraction freezes the pretrained backbone $f_\theta$ and trains only a new task head $g_\phi$:

$$\hat{y} = g_\phi(f_\theta(x)), \quad \min_\phi \; \mathcal{L}_{\text{task}}(g_\phi(f_\theta(x)), y)$$

This is appropriate when the target dataset is small (preventing overfitting) and the source and target domains are similar enough that frozen features are informative.
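A minimal PyTorch sketch of the feature-extraction setup, assuming the torchvision ≥ 0.13 weights API and an illustrative 10-class target task:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a pretrained backbone f_theta and freeze all of its parameters.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in backbone.parameters():
    p.requires_grad = False

# Replace the final layer with a new, trainable task head g_phi
# (10 target classes is an illustrative choice).
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# Only the new head's parameters are handed to the optimiser.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    optimizer.zero_grad()
    loss = criterion(backbone(x), y)   # frozen features, trainable head
    loss.backward()
    optimizer.step()
    return loss.item()
```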

Full fine-tuning initialises from $\theta_0$ (the pretrained parameters) and minimises the task loss over all parameters:

$$\min_\theta \; \mathcal{L}_{\text{task}}(\theta), \quad \theta \leftarrow \theta_0$$

Typically a lower learning rate is used for the pretrained layers (often $10\times$ lower) than for the task head, to avoid destroying the pretrained representations in early training steps.
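In PyTorch, this is usually expressed with optimiser parameter groups. A sketch, assuming a model object with a pretrained `encoder` and a freshly initialised `head` submodule (hypothetical names):

```python
import torch

# Hypothetical modules: `model.encoder` holds the pretrained weights,
# `model.head` is the newly initialised task head.
optimizer = torch.optim.AdamW([
    {"params": model.encoder.parameters(), "lr": 2e-5},  # pretrained layers: low LR
    {"params": model.head.parameters(),    "lr": 2e-4},  # task head: 10x higher LR
])
```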

The choice between the two depends on two axes: data size (more data favours fine-tuning all layers) and domain similarity (dissimilar domains may require deeper adaptation).

Catastrophic Forgetting

A fundamental problem with naive fine-tuning is catastrophic forgetting: gradient updates on the target task overwrite the weights that encode source-task knowledge, degrading performance on the source task. More subtly, even if performance on the source task is not evaluated, fine-tuning can destroy the generality of the representations, leading to worse performance on downstream tasks that were not explicitly targeted.

Elastic Weight Consolidation (EWC) addresses this by adding a regularisation term that penalises moving parameters that were important for the source task. The importance of parameter $\theta_i$ is estimated by the diagonal of the Fisher information matrix $F$, computed on the source task data:

$$F_i = \mathbb{E}\!\left[\left(\frac{\partial \log p(y \mid x, \theta)}{\partial \theta_i}\right)^2\right]$$

The EWC loss for the target task is:

$$L_{\text{EWC}} = L_{\text{task}}(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^\ast)^2$$

where $\theta_i^\ast$ are the source-task optimal parameters. Parameters with high Fisher information (important for the source task) are penalised heavily for changing, while unimportant parameters are free to adapt.
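A minimal sketch of EWC in PyTorch, assuming the source-task data loader and loss function are available. The diagonal Fisher is estimated from squared gradients of the log-likelihood; using observed labels here (the empirical Fisher) is a common approximation:

```python
import torch

def estimate_diagonal_fisher(model, data_loader, loss_fn, n_batches=100):
    """Approximate F_i by averaging squared gradients of the (negative)
    log-likelihood over source-task batches (empirical-Fisher approximation)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    seen = 0
    for i, (x, y) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()   # gradient of -log p(y | x, theta)
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        seen += 1
    return {n: f / max(seen, 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher, theta_star, lam=100.0):
    """(lambda / 2) * sum_i F_i (theta_i - theta_i^*)^2"""
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - theta_star[n]) ** 2).sum()
    return 0.5 * lam * penalty

# theta_star = {n: p.detach().clone() for n, p in model.named_parameters()}
# Target-task objective: loss = task_loss + ewc_penalty(model, fisher, theta_star)
```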

BERT-Style Fine-Tuning

BERT demonstrated that a single pretrained transformer encoder, fine-tuned with only a task-specific classification head, could achieve state-of-the-art results across a wide range of NLP benchmarks with very little target-task data.

The key design choices enabling this:

  • Masked Language Modelling (MLM) pretraining creates rich bidirectional contextual representations.
  • A special $[\text{CLS}]$ token aggregates sentence-level information into a single vector.
  • Fine-tuning for 3–5 epochs at a learning rate of $2 \times 10^{-5}$ typically suffices.
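A condensed sketch of this recipe using the Hugging Face transformers library; the checkpoint name, two-class head, and data handling are illustrative:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # new classification head on [CLS]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(texts, labels):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=torch.tensor(labels))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```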

The success of BERT-style fine-tuning validated the hypothesis that a single generalist encoder, trained at sufficient scale, encodes enough knowledge to serve as a near-universal starting point for language understanding tasks.

CLIP: Contrastive Cross-Modal Transfer

CLIP (Radford et al., 2021) extends transfer learning to the cross-modal setting, learning joint image-text representations via a contrastive objective. Given a batch of $N$ (image, text) pairs $\{(x_i^I, x_i^T)\}_{i=1}^{N}$, CLIP trains image encoder $f^I$ and text encoder $f^T$ to minimise:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\!\left(\langle z_i^I, z_i^T \rangle / \tau\right)}{\sum_{j=1}^{N} \exp\!\left(\langle z_i^I, z_j^T \rangle / \tau\right)}$$

where $z^I = f^I(x^I) / \|f^I(x^I)\|$ and $z^T = f^T(x^T) / \|f^T(x^T)\|$ are unit-normalised embeddings, and $\tau$ is a learned temperature. The symmetric version averages this image-to-text loss with the analogous text-to-image term, in which the roles of images and texts are swapped.
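A sketch of the symmetric contrastive loss given batches of image and text embeddings; the encoders are omitted, and storing the temperature as a learnable log-parameter is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_feats, txt_feats, log_temperature):
    """img_feats, txt_feats: (N, d) outputs of the two encoders."""
    # Unit-normalise so the dot product is a cosine similarity.
    z_img = F.normalize(img_feats, dim=-1)
    z_txt = F.normalize(txt_feats, dim=-1)

    # (N, N) similarity matrix scaled by the learned temperature tau.
    logits = z_img @ z_txt.t() / log_temperature.exp()

    # The matching pair for row i is column i.
    targets = torch.arange(len(logits), device=logits.device)

    # Symmetric loss: image-to-text and text-to-image cross-entropy.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```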

This objective aligns matching image-text pairs in the embedding space while pushing apart non-matching pairs. The resulting representations transfer remarkably well to downstream vision tasks via zero-shot classification: to classify an image, compute its embedding and find the nearest text embedding from a set of class descriptions ("a photo of a {class}").
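Zero-shot classification then reduces to a nearest-neighbour lookup in the shared embedding space. A sketch, assuming hypothetical `encode_image` and `encode_text` helpers that return unit-normalised embeddings:

```python
classes = ["cat", "dog", "car"]                      # illustrative label set
prompts = [f"a photo of a {c}" for c in classes]

text_emb = encode_text(prompts)                      # (C, d), unit-normalised
img_emb = encode_image(image)                        # (1, d), unit-normalised

scores = img_emb @ text_emb.t()                      # cosine similarities
predicted_class = classes[scores.argmax(dim=-1).item()]
```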

Multi-Task Learning

Multi-task learning (MTL) trains a single model on several tasks simultaneously, sharing most parameters with task-specific heads. The joint loss is a weighted sum:

$$L_{\text{MTL}} = \sum_t w_t L_t(\theta_{\text{shared}}, \phi_t)$$
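A sketch of this weighted joint objective with a shared encoder and per-task heads; the module and weight names are illustrative:

```python
def multitask_loss(shared_encoder, heads, losses, weights, batches):
    """heads, losses, weights, batches are dicts keyed by task name."""
    total = 0.0
    for task, (x, y) in batches.items():
        features = shared_encoder(x)            # theta_shared
        predictions = heads[task](features)     # phi_t
        total = total + weights[task] * losses[task](predictions, y)
    return total
```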

MTL acts as a regulariser: tasks constrain each other, preventing overfitting to any single task and encouraging the shared encoder to learn more general representations. It also reduces the risk of catastrophic forgetting since all tasks are trained together.

The main failure mode is negative transfer: if tasks are dissimilar or their gradients conflict, joint training can hurt performance on one or more tasks relative to single-task baselines. This is mitigated by gradient surgery (projecting conflicting gradients) or by only sharing lower layers.
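The gradient-surgery idea (PCGrad, Yu et al., 2020) can be sketched for a pair of flattened task gradients: if they conflict (negative dot product), the conflicting component of one is projected out along the other:

```python
import torch

def project_conflicting(g_i, g_j, eps=1e-12):
    """If g_i conflicts with g_j (negative dot product), remove the
    component of g_i that points against g_j."""
    dot = torch.dot(g_i, g_j)
    if dot < 0:
        g_i = g_i - dot / (g_j.norm() ** 2 + eps) * g_j
    return g_i
```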

Examples

ImageNet pretraining to medical imaging. A ResNet-50 pretrained on ImageNet and then fine-tuned on a chest X-ray classification dataset (CheXNet) with 112k images achieves radiologist-level performance on pneumonia detection. Training from scratch on the same 112k images yields substantially lower accuracy, demonstrating the value of the pretrained feature hierarchy even across domains as dissimilar as natural images and X-rays.

BERT to NLP tasks. Fine-tuning BERT-large on SQuAD 2.0 for 2 epochs achieves an F1 score of 83.1, compared with 78.0 for the then state-of-the-art model trained from scratch. The pattern repeats across GLUE tasks: fine-tuned BERT dominates every prior task-specific architecture, demonstrating that pretraining at scale subsumes task-specific architectural engineering.
