Vision & Multimodal ML - From Pixels to Cross-Modal Understanding // Megha Bose

An image is not a sequence of tokens. It is a 2D grid of pixels, each a triplet of RGB values, with no natural left-to-right order, rich with spatial structure at multiple scales, and containing objects that can appear at any position, size, and orientation. Teaching a model to understand images required a fundamentally different inductive bias than language modeling. Then, once that was solved, the next question was how to connect vision and language in a single model - which is what multimodal ML is about.

The Convolutional Inductive Bias

The dominant architecture in vision for a decade before transformers was the Convolutional Neural Network (CNN). A CNN applies the same filter (a small weight matrix) at every position in the image - this is weight sharing. The filter slides across the image computing dot products:

$$(f \ast I)[i, j] = \sum_{p} \sum_{q} f[p, q] \cdot I[i+p, j+q]$$

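where $f$ is the filter and $I$ is the input. (Strictly, this formula is cross-correlation, since the filter is not flipped; deep learning libraries implement it under the name convolution.) Stacking convolutions builds a hierarchy: early layers detect edges and corners; middle layers detect textures and parts; deep layers detect objects.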
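The sliding dot product above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming no padding and a stride of 1; the function name `correlate2d_valid` and the toy edge filter are made up for the example.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the map of dot products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same weights kernel[p][q] are reused at every (i, j): weight sharing.
            row.append(sum(kernel[p][q] * image[i + p][j + q]
                           for p in range(kh) for q in range(kw)))
        out.append(row)
    return out


# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
# Hypothetical 2x2 filter that responds to a left-to-right intensity increase.
edge_filter = [[-1, 1],
               [-1, 1]]
response = correlate2d_valid(image, edge_filter)
print(response)
```

The response is nonzero only in the column where the filter straddles the edge, which is the sense in which a filter "detects" a local pattern.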

Why does convolution work so well? Two inductive biases are baked in:

Translation equivariance: if an object moves in the image, the activation map shifts by the same amount. A filter that detects a dog’s ear works regardless of where in the image the ear appears. This is not the case for a fully connected network, which would need to relearn the detector for every position.
Locality: each filter has a small receptive field (3×3, 5×5). The assumption is that nearby pixels are more related than distant ones - true of natural images. Fully connected layers ignore this structure.

Pooling (max or average over a local region) progressively reduces spatial resolution, building translation invariance (not just equivariance) and compressing the representation.
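A small sketch of how max pooling turns a shifted activation into the same output. It assumes non-overlapping 2×2 windows, and the invariance only holds for shifts that stay inside one pooling window; the helper name `max_pool2d` is illustrative.

```python
def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling with a size x size window (stride equal to size)."""
    return [[max(fmap[i + di][j + dj] for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


# A single strong activation, and the same activation shifted one pixel to the right.
a = [[0, 0, 0, 0],
     [9, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
b = [[0, 0, 0, 0],
     [0, 9, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
pooled_a = max_pool2d(a)
pooled_b = max_pool2d(b)
print(pooled_a, pooled_b)
```

Both inputs pool to the same 2×2 map, at half the spatial resolution of the input.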

ResNets (He et al., 2016) addressed the difficulty of optimizing very deep CNNs via residual connections: $\mathbf{h}_{l+1} = \mathbf{h}_l + F(\mathbf{h}_l)$. The gradient of the loss with respect to $\mathbf{h}_l$ always includes the identity term $I$ alongside the contribution of $F$, which mitigates vanishing gradients even in very deep networks. ResNet-50 and ResNet-101 remain widely used as image backbones.
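To see where the identity term comes from, here is a short derivation. It treats gradients as row vectors (a convention assumed here), applies the chain rule to one residual block, and then unrolls the recursion from layer $l$ to a deeper layer $L$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_{l+1}} \left(I + \frac{\partial F(\mathbf{h}_l)}{\partial \mathbf{h}_l}\right)$$

$$\mathbf{h}_L = \mathbf{h}_l + \sum_{k=l}^{L-1} F(\mathbf{h}_k) \quad\Rightarrow\quad \frac{\partial \mathcal{L}}{\partial \mathbf{h}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L} \left(I + \frac{\partial}{\partial \mathbf{h}_l} \sum_{k=l}^{L-1} F(\mathbf{h}_k)\right)$$

The $I$ term passes the gradient at layer $L$ straight back to layer $l$ without multiplying it by any layer's weights, so it does not shrink with depth on its own. The signal vanishes only if the second term happens to cancel it.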

Discomfort check. If CNNs have such useful inductive biases, why replace them with transformers? CNNs have a limited receptive field at each layer - a 3×3 filter only sees 9 pixels. To relate distant pixels you need many layers, and even then the effective receptive field grows slowly. Transformers have global attention: every patch can directly attend to every other patch in a single layer. For tasks requiring global context (counting objects, understanding scene composition, relating distant parts of an image), this direct access is a real advantage. The tradeoff is that transformers need more data to learn spatial inductive biases from scratch.

Vision Transformers (ViT)

The Vision Transformer (Dosovitskiy et al., 2020) applies the standard transformer encoder directly to images by treating patches as tokens.

Patch embedding: an image of size $H \times W$ is divided into $N = \frac{H \cdot W}{P^2}$ non-overlapping patches of size $P \times P$ (typically $P = 16$). Each patch is linearly projected from $\mathbb{R}^{P^2 \cdot C}$ (where $C$ is the number of channels) to $\mathbb{R}^d$:

$$\mathbf{z}_0 = [\mathbf{x}_{\text{cls}}; \mathbf{E}\mathbf{x}_1; \mathbf{E}\mathbf{x}_2; \ldots; \mathbf{E}\mathbf{x}_N] + \mathbf{E}_{\text{pos}}$$

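A learnable [CLS] token $\mathbf{x}_{\text{cls}}$ is prepended; its final representation is used for classification (analogous to BERT). Learned positional embeddings $\mathbf{E}_{\text{pos}} \in \mathbb{R}^{(N+1) \times d}$ are added so the model can tell patches apart by position. The original ViT uses 1D embeddings indexed by patch order and leaves the 2D layout to be learned from data.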
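The patch-splitting step can be sketched as follows. This is only the "cut and flatten" part, assuming a row-major patch order; the learned projection $\mathbf{E}$, the [CLS] token, and the positional embeddings are left out, and `patchify` is a name chosen for the example.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened P x P patches, row-major."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by P"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [value
                    for i in range(top, top + patch)
                    for j in range(left, left + patch)
                    for value in image[i][j]]  # C channel values per pixel
            patches.append(flat)
    return patches


# Toy 4x4 image with 3 channels; each pixel stores (row, col, 0) so patches are easy to inspect.
H, W, C, P = 4, 4, 3, 2
image = [[[i, j, 0] for j in range(W)] for i in range(H)]
patches = patchify(image, P)
print(len(patches), len(patches[0]))
```

Each flattened patch has $P^2 \cdot C$ entries and would be multiplied by $\mathbf{E}$ to obtain a $d$-dimensional token.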

The patch sequence is then processed by a standard transformer encoder: alternating layers of multi-head self-attention and MLP blocks, with LayerNorm before each and residual connections around each.

Why $P = 16$? A $224 \times 224$ image with $P = 16$ gives $14 \times 14 = 196$ patches, so each attention head computes $196^2 \approx 38k$ pairwise scores per layer. At $P = 8$ you get 784 patches and $784^2 \approx 614k$ scores, a 16× increase. $P = 16$ is a common compromise between resolution and compute.
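The arithmetic is easy to reproduce for other patch sizes. The sketch below counts patches and pairwise attention scores per head per layer; it ignores the extra [CLS] token, and the function name is illustrative.

```python
def vit_sequence_stats(height, width, patch):
    """Return (number of patches, number of pairwise attention scores) for one image."""
    num_patches = (height // patch) * (width // patch)
    return num_patches, num_patches ** 2


for p in (32, 16, 8):
    n, pairs = vit_sequence_stats(224, 224, p)
    print(f"P={p:>2}: {n:>3} patches, {pairs:>7,} attention pairs")
```

Halving the patch size quadruples the sequence length and multiplies the attention cost by 16.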

ViT requires large-scale pretraining (JFT-300M, ImageNet-21k) to outperform CNNs. With only ImageNet-1k, it underperforms ResNets. The inductive biases of CNNs (locality, translation equivariance) matter when data is limited; transformers learn them from data given enough of it.

Object Detection: From Classification to Localization

Image classification asks “what is in the image?” Object detection asks “what is in the image, and where?” The output is a set of bounding boxes with class labels.

Two-stage detectors (R-CNN family): Region CNN (Girshick et al., 2014) and its descendants (Fast R-CNN, Faster R-CNN) operate in two stages: (1) propose candidate regions likely to contain objects, (2) classify and refine each region. Faster R-CNN uses a Region Proposal Network (RPN) - a small convolutional network that slides over the feature map, scoring and refining predefined anchor boxes into region proposals. Accurate but slower than single-stage detectors.

Single-stage detectors (YOLO, SSD, RetinaNet): predict class and box coordinates in a single forward pass. YOLO divides the image into a grid; each cell predicts a fixed number of bounding boxes and class probabilities directly. Much faster than two-stage; competitive accuracy. YOLO is the go-to for real-time detection (robotics, surveillance, autonomous driving).

DETR (Detection Transformer): applies the encoder-decoder transformer to object detection. The encoder processes image features; the decoder attends to them with a fixed set of learned object queries, outputting one prediction per query. No anchor boxes, no non-maximum suppression - just set prediction with bipartite matching loss. Slower convergence than YOLO but elegant and extensible to segmentation.
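The bipartite matching step can be illustrated with a brute-force search over one-to-one assignments. This is a toy sketch, not DETR's implementation: it assumes a square cost matrix with made-up costs, whereas in practice the ground-truth set is padded with "no object" entries and the assignment is solved with the Hungarian algorithm.

```python
from itertools import permutations


def best_matching(cost):
    """Brute-force minimum-cost one-to-one assignment of queries (rows) to targets (columns)."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[q][perm[q]] for q in range(n)))
    return list(best), sum(cost[q][best[q]] for q in range(n))


# Hypothetical matching costs: cost[q][t] says how badly query q's prediction fits target t.
cost = [[0.9, 0.1, 0.8],
        [0.2, 0.7, 0.9],
        [0.8, 0.6, 0.3]]
assignment, total = best_matching(cost)
print(assignment, round(total, 3))
```

Because each target is matched to exactly one query, duplicate predictions of the same object are penalized during training, which is why no non-maximum suppression is needed afterwards.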

Discomfort check. Why is object detection so much harder than classification? Classification collapses the whole image to a single vector, losing spatial information. Detection must output a variable number of objects at arbitrary positions and scales. This requires either: (1) searching all possible locations (sliding window - expensive), (2) proposing candidate regions (R-CNN), or (3) predicting all boxes in parallel from a feature grid (YOLO). The “variable number of outputs” is the crux - it is a fundamentally different problem from classification, which always outputs exactly one vector.

Image Segmentation

Segmentation goes further than detection: instead of bounding boxes, assign a class label to every pixel.

Semantic segmentation: every pixel gets a class label (sky, road, car, person). Instance segmentation: distinguish individual objects of the same class (car 1 vs car 2). Panoptic segmentation: unifies both.

U-Net: encoder-decoder with skip connections. The encoder (contracting path) downsamples and extracts features; the decoder (expanding path) upsamples back to full resolution. Skip connections concatenate encoder feature maps at each resolution to the decoder, preserving spatial detail that would otherwise be lost in the bottleneck. Dominant in medical image segmentation.

Segment Anything Model (SAM): a foundation model for segmentation from Meta. Takes a prompt (a point, a box, or a coarse mask; text prompts were explored in the paper as a proof of concept) and segments the corresponding region. Trained on the SA-1B dataset of about 1.1 billion masks over 11 million images. A ViT image encoder processes the image; a lightweight decoder generates masks conditioned on the prompt. SAM demonstrates that segmentation can be formulated as a promptable task, enabling zero-shot transfer to novel segmentation tasks.

CLIP: Learning Visual Representations from Text

CLIP (Contrastive Language-Image Pretraining) (Radford et al., 2021) learns aligned image and text representations by training on 400 million image-text pairs scraped from the web.

Architecture: a vision encoder (ViT or ResNet) maps images to embeddings; a text encoder (transformer) maps text to embeddings. Both encoders project to the same $d$-dimensional space.

Contrastive loss: given a batch of $N$ (image, text) pairs, the loss maximizes the similarity between matched pairs and minimizes similarity between unmatched pairs:

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log \frac{e^{s_{ii}/\tau}}{\sum_j e^{s_{ij}/\tau}} + \log \frac{e^{s_{ii}/\tau}}{\sum_j e^{s_{ji}/\tau}}\right]$$

where $s_{ij} = \mathbf{v}_i^\top \mathbf{t}_j$ is the cosine similarity between image embedding $\mathbf{v}_i$ and text embedding $\mathbf{t}_j$ (both L2-normalized, so the dot product equals the cosine), and $\tau$ is a learned temperature.
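A direct transcription of the loss into plain Python may help in reading the two softmax terms. It is a sketch that takes a precomputed $N \times N$ similarity matrix as input (rows are images, columns are texts); the function name `clip_loss` and the toy matrices are assumptions for illustration.

```python
import math


def clip_loss(sim, tau):
    """Symmetric contrastive loss for an N x N similarity matrix (rows: images, cols: texts)."""
    n = len(sim)

    def sum_log_prob_of_match(rows):
        # Sum over i of log softmax(rows[i] / tau)[i]: the log-probability of the true pair.
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / tau for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += logits[i] - log_z
        return total

    image_to_text = sum_log_prob_of_match(sim)                      # normalizes over s_ij
    text_to_image = sum_log_prob_of_match([list(c) for c in zip(*sim)])  # normalizes over s_ji
    return -(image_to_text + text_to_image) / (2 * n)


aligned = [[1.0, 0.0], [0.0, 1.0]]    # each image is most similar to its own caption
uniform = [[0.5, 0.5], [0.5, 0.5]]    # the model cannot tell the pairs apart
confused = [[0.0, 1.0], [1.0, 0.0]]   # each image prefers the wrong caption
loss_aligned = clip_loss(aligned, tau=0.07)
loss_uniform = clip_loss(uniform, tau=0.07)
loss_confused = clip_loss(confused, tau=0.07)
print(loss_aligned, loss_uniform, loss_confused)
```

The uninformative matrix gives a loss of $\log N$, the aligned one a loss near zero, and the mismatched one a large loss.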

Why CLIP is powerful: by aligning image and text representations, CLIP enables zero-shot classification. To classify an image into $k$ categories, compute the similarity of the image embedding to each category’s text embedding (“a photo of a cat”, “a photo of a dog”, …) and take the argmax. No task-specific fine-tuning required. CLIP achieves competitive ImageNet accuracy without seeing a single labeled ImageNet example during training.
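The zero-shot recipe reduces to a nearest-neighbor lookup in embedding space. The sketch below uses made-up 3-dimensional vectors as stand-ins for encoder outputs (real CLIP embeddings have hundreds of dimensions and come from the trained encoders); `zero_shot_classify` is a hypothetical helper name.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the prompt whose embedding has the highest cosine similarity to the image."""
    v = normalize(image_embedding)
    scores = {label: sum(a * b for a, b in zip(v, normalize(t)))
              for label, t in text_embeddings.items()}
    return max(scores, key=scores.get), scores


# Made-up embeddings standing in for the text encoder's outputs on each prompt.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
label, scores = zero_shot_classify([0.2, 0.8, 0.1], text_embeddings)
print(label)
```

Changing the set of classes only requires encoding new prompts; no weights are updated.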

CLIP embeddings are used as image or text features in many subsequent multimodal models. The text-image alignment makes them a convenient connector between vision and language.

Discomfort check. What is the temperature $\tau$ doing? Because $s_{ij}$ is a cosine similarity, it is bounded in $[-1, 1]$, so an unscaled softmax over a batch would be nearly uniform and give a weak training signal. Dividing by $\tau$ stretches the logits and controls the sharpness: small $\tau$ → sharp, large $\tau$ → flat. Learnable $\tau$ allows the model to calibrate how “confident” to be in its similarities. In CLIP, $\tau$ is initialized to 0.07 (a logit scale $1/\tau \approx 14.3$). The logit scale is parameterized in log space, which keeps it positive, and is clipped at 100 (that is, $\tau \geq 0.01$) to keep training stable.
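A quick numeric sketch of how $\tau$ changes the sharpness of the softmax. The three similarity values are made up for illustration.

```python
import math


def softmax(scores, tau):
    """Softmax over scores / tau, computed stably by subtracting the max logit."""
    logits = [s / tau for s in scores]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


sims = [0.30, 0.25, 0.10]  # hypothetical similarities of one image to three captions
flat = softmax(sims, tau=1.0)
sharp = softmax(sims, tau=0.07)
print([round(p, 3) for p in flat])
print([round(p, 3) for p in sharp])
```

With $\tau = 1$ the three probabilities stay close to one third each; with $\tau = 0.07$ the best match takes well over half of the probability mass.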

Generating Images: Diffusion Models

Diffusion models are the current state of the art in image generation, powering Stable Diffusion, DALL-E 3, Midjourney, and Imagen.

Forward process: add Gaussian noise to an image over $T$ timesteps until it becomes pure noise:

$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$$

Reverse process: a neural network (U-Net or DiT - Diffusion Transformer) learns to denoise: given noisy $\mathbf{x}_t$ and timestep $t$, predict the noise $\epsilon$ that was added:

$$p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})$$

The training objective is simply $\mathbb{E}_{t, \mathbf{x}_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(\mathbf{x}_t, t)\|^2\right]$ - predict the added noise.
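One training example can be sketched without any neural network. The code relies on the standard closed form $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ with $\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)$, which follows from composing the Gaussian forward steps. The linear schedule endpoints are the DDPM paper's defaults and are an assumption here, as are all function names.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (endpoints assumed, following common DDPM defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t): how much of the original signal survives."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def noisy_sample(x0, alpha_bar_t, rng):
    """Draw x_t directly from x_0 and return it together with the noise that was added."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
          for x, e in zip(x0, eps)]
    return xt, eps


def noise_prediction_loss(eps, eps_pred):
    """Squared error between true and predicted noise (the term inside the expectation)."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred))


rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
x0 = [0.5, -0.2, 0.9, 0.0]          # a toy 4-pixel "image"
t = rng.randrange(1000)              # training samples the timestep uniformly
xt, eps = noisy_sample(x0, abar[t], rng)
loss_if_perfect = noise_prediction_loss(eps, eps)
loss_if_zero = noise_prediction_loss(eps, [0.0] * len(eps))
print(abar[0], abar[-1], loss_if_perfect, loss_if_zero)
```

In real training, `eps_pred` would come from the denoising network evaluated at `(xt, t)`; here the two extreme predictions only show the range of the loss.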

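Text conditioning is injected via cross-attention. In Stable Diffusion, token embeddings from a CLIP text encoder serve as keys and values for cross-attention layers inside the U-Net: the denoising network attends to the text representation while denoising, steering the generation toward the text description. Other systems use different text encoders (Imagen, for example, uses T5).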

Latent diffusion (Stable Diffusion): instead of diffusing in pixel space (expensive at $512 \times 512 \times 3$), first encode images to a lower-dimensional latent space with a VAE, diffuse there, then decode. The latent space is $64 \times 64 \times 4$, which is 48× fewer values per image and therefore much cheaper.

Discomfort check. Why are diffusion models better than GANs at generation? GANs use a generator-discriminator game that is notoriously unstable: mode collapse (generating only a few types of images), training instability, and sensitivity to hyperparameters are common. Diffusion models optimize a simple regression objective (predict the noise), which is stable and well-conditioned. The tradeoff is that sampling needs one network evaluation per denoising step (up to $T \approx 1000$ with the original DDPM sampler), whereas GANs generate in a single pass. DDIM and other accelerated samplers reduce the required steps to roughly 20-50.

Vision-Language Models (VLMs)

VLMs combine a vision encoder with a language model to enable image understanding and generation in a single model. The key engineering challenge is the modality gap: image features and text tokens live in different spaces.

LLaVA (Large Language and Vision Assistant): a simple and effective design. A CLIP ViT encoder encodes the image; a lightweight projection (a single linear layer in the original LLaVA, a two-layer MLP in LLaVA-1.5) maps image features to the LLM’s token embedding space; these visual tokens are prepended to the text tokens and processed by a decoder-only LLM (LLaMA, Vicuna). Instruction tuning teaches the model to answer questions about images. The projection layer is the critical bridge - it must map CLIP’s visual feature space to the LLM’s token space.
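The projection-and-prepend step can be sketched with toy dimensions. This is an illustration of the data flow only, assuming a single linear layer with made-up weights; real models use feature widths in the thousands and learn the weights during training.

```python
def project(visual_features, weight, bias):
    """Linear map from vision-encoder width d_v to LLM embedding width d_llm, per token."""
    return [[sum(w * x for w, x in zip(row, feat)) + b
             for row, b in zip(weight, bias)]
            for feat in visual_features]


def build_llm_input(visual_features, text_embeddings, weight, bias):
    """Prepend projected visual tokens to the text token embeddings."""
    return project(visual_features, weight, bias) + text_embeddings


# Toy sizes: 2 visual tokens of width 3, LLM embedding width 2, 3 text tokens.
visual_features = [[1.0, 0.0, 2.0],
                   [0.0, 1.0, 1.0]]
weight = [[0.5, 0.0, 0.5],   # shape (d_llm, d_v); hypothetical values, normally learned
          [0.0, 1.0, 0.0]]
bias = [0.0, 0.1]
text_embeddings = [[0.3, 0.3], [0.1, 0.2], [0.0, 0.5]]
sequence = build_llm_input(visual_features, text_embeddings, weight, bias)
print(len(sequence), sequence[0])
```

After this step the LLM sees one sequence of same-width vectors and treats visual tokens like any other input embedding.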

GPT-4V, Claude Sonnet (multimodal): proprietary VLMs with undisclosed architectures but likely similar principles: a vision encoder (possibly CLIP-based), a projection to align visual features with text embeddings, and a decoder-only LLM.

Flamingo (DeepMind): keeps a pretrained LLM frozen and inserts gated cross-attention layers between its blocks. These layers attend to a fixed number of visual tokens that a Perceiver Resampler produces from the vision encoder's features. Supports interleaved image-text inputs in the context window (multiple images and text, any order).

The alignment tax: all VLMs face the question of how much visual information the LLM can actually use. Simply projecting CLIP features to the LLM’s embedding space often works well for object recognition but struggles with fine-grained spatial reasoning, counting, and text in images (OCR). More capable VLMs use higher-resolution vision encoders, more projection parameters, or specialized visual tokenizers.

Key Multimodal Concepts

Contrastive pretraining (CLIP, ALIGN): align two modalities by maximizing similarity between matched pairs. The resulting embeddings are universal connectors used downstream.

Generative alignment (DALL-E, Stable Diffusion): train a model to generate one modality conditioned on another. Image generation from text; audio from text; video from text.

Fusion strategies:

Early fusion: concatenate raw features before any processing. Simple but requires aligned feature spaces.
Late fusion: process each modality independently, combine predictions. Ignores cross-modal interactions.
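Cross-attention fusion (most common): let one modality’s features query another’s via cross-attention. Used in Flamingo, Whisper (audio encoder → text decoder), and text-conditioned diffusion U-Nets. LLaVA, by contrast, concatenates projected visual tokens with the text tokens and relies on the LLM’s self-attention.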
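Cross-attention fusion can be sketched as single-head scaled dot-product attention in which queries come from one modality and keys and values from the other. The learned query, key, and value projections are omitted, and all vectors are made up for the example.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: each query mixes values by key similarity."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs


# Queries from one modality (e.g. text states); keys and values from another (e.g. image features).
text_states = [[4.0, 0.0], [0.0, 4.0]]
image_keys = [[1.0, 0.0], [0.0, 1.0]]
image_values = [[10.0, 0.0], [0.0, 10.0]]
fused = cross_attention(text_states, image_keys, image_values)
print([[round(x, 2) for x in row] for row in fused])
```

Each text position ends up with a weighted mix of image values, dominated by the image feature whose key it matches best.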

Tokenization of non-text modalities: VQ-VAE or similar discretization maps continuous signals (images, audio) to discrete tokens, allowing generative models to treat them like text tokens. Used in DALL-E 1, AudioLM, SoundStream.
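The discretization step at the heart of vector quantization is a nearest-neighbor lookup in a codebook. The sketch below uses a hypothetical 4-entry codebook; in a real VQ-VAE the codebook is learned jointly with the encoder and decoder.

```python
def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry (squared L2)."""
    def nearest(v):
        return min(range(len(codebook)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])))
    return [nearest(v) for v in vectors]


# Hypothetical codebook and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.7]]
tokens = quantize(latents, codebook)
reconstruction = [codebook[k] for k in tokens]
print(tokens)
```

The resulting integer indices play the role of token ids, so a sequence model can be trained on them the same way it is trained on text.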

Summary

Concept	Core idea	Key model
CNN	Local filters + pooling; translation equivariance	ResNet, EfficientNet
ViT	Image patches as tokens; global self-attention	ViT-B/16, ViT-L/14
Object detection	Predict boxes + classes; anchor-based or DETR	YOLO, Faster R-CNN, DETR
Segmentation	Per-pixel classification	U-Net, SAM
CLIP	Contrastive image-text alignment on 400M pairs	CLIP ViT-L/14
Diffusion models	Iterative denoising; U-Net or DiT; text via cross-attention	Stable Diffusion, DALL-E 3
VLMs	Vision encoder + projection or cross-attention + decoder LLM	LLaVA, GPT-4V, Flamingo
Contrastive loss	Maximize matched similarity, minimize unmatched	CLIP, ALIGN, SigLIP
