LoRA & Quantization - Fine-Tuning at a Fraction of the Cost // Megha Bose

Helpful context:

GPT-3 has 175 billion parameters. Each one is stored as a 32-bit float - four bytes. That’s 700 GB of memory just to hold the weights, before you’ve done any computation at all.

Fine-tuning GPT-3 requires more. You need gradients: another 700 GB. You need optimizer states: Adam stores first and second moment estimates, so another 1.4 TB. Add them up and you’re looking at roughly 2.8 TB of memory for training - which means at least 35 high-end A100 GPUs (each with 80 GB of memory) just to hold that state, before counting activations.

Very few organizations have that. Most people don’t.

LoRA (Low-Rank Adaptation) changes this equation dramatically. The mathematical idea is simple enough to write on a napkin, yet it makes fine-tuning 175B-parameter models possible on a handful of consumer GPUs. Understanding why it works requires nothing more than linear algebra you already know - specifically, the idea of rank.

The Fine-Tuning Memory Problem

Let’s be precise about the numbers.

Suppose your model has $N$ parameters. Fine-tuning in full precision (FP32) requires:

Weights: $4N$ bytes (4 bytes per FP32 parameter)
Gradients: $4N$ bytes (one gradient per parameter)
Adam optimizer states: $8N$ bytes (4 bytes each for first moment $m$ and second moment $v$)

Total: $16N$ bytes. For GPT-3 with $N = 175 \times 10^9$:

$$16 \times 175 \times 10^9 \approx 2.8 \text{ TB}.$$
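The accounting above is simple enough to check in a few lines. This is a minimal sketch; the function name and its byte-size defaults are illustrative assumptions, and activations are left out, as in the text.

```python
def full_finetune_memory_bytes(n_params, weight_bytes=4, grad_bytes=4, adam_bytes=8):
    """Bytes for weights, gradients, and Adam's two moments (activations excluded)."""
    return {
        "weights": weight_bytes * n_params,
        "gradients": grad_bytes * n_params,
        "adam_states": adam_bytes * n_params,
    }

n_params = 175 * 10**9  # GPT-3
breakdown = full_finetune_memory_bytes(n_params)
total_bytes = sum(breakdown.values())

for name, size in breakdown.items():
    print(f"{name:12s} {size / 1e9:8.0f} GB")
print(f"{'total':12s} {total_bytes / 1e12:8.1f} TB")
```

Changing the defaults (for example, 2 bytes for FP16 weights) shows how the total depends on the precision of each component.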

Even if you only want to do inference (no training), storing the model in FP32 costs $4N = 700$ GB. In FP16 (half precision), that’s $2N = 350$ GB. Still roughly five A100-80GB GPUs just to load the thing.

The fundamental question is: do all $N$ parameters actually need to change during fine-tuning?

The empirical answer, backed by decades of transfer learning research, is: no. When you fine-tune a large pre-trained model on a new task, you’re not changing it fundamentally - you’re nudging it. The pre-trained weights already encode most of the relevant knowledge. Adaptation is a small perturbation on top of a very good starting point.

This observation can be stated mathematically: the weight updates during fine-tuning tend to live in a low-dimensional subspace. Most singular values of the update matrix $\Delta W$ are small or zero, so the update has low effective rank.

LoRA exploits this.

LoRA: The Idea

Consider a single weight matrix $W \in \mathbb{R}^{d \times k}$ inside your transformer (say, the query projection in an attention layer). During fine-tuning, we want to learn an update $\Delta W$ such that the effective weight becomes $W + \Delta W$.

Full fine-tuning: train $\Delta W$ directly. Memory cost: $d \times k$ extra parameters.

LoRA’s proposal (Hu et al., 2021): instead of learning $\Delta W$ freely, constrain it to be a product of two low-rank matrices:

$$\Delta W = BA,$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$.

The parameter $r$ is called the rank. The full weight $W$ is frozen. Only $A$ and $B$ are trained.

The parameter count comparison is stark. Let $d = k = 4096$ (typical for large transformers) and $r = 16$:

Approach	Parameters
Full $\Delta W$	$d \times k = 4096^2 = 16{,}777{,}216$
LoRA ($r=16$)	$r(d + k) = 16 \times 8192 = 131{,}072$
Reduction	$\approx 128\times$ fewer
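The numbers in the table can be reproduced directly. A small sketch, with a hypothetical helper name:

```python
def lora_param_counts(d, k, r):
    """Return (full update parameters, LoRA parameters) for a d x k weight matrix."""
    full = d * k          # every entry of Delta W is trainable
    lora = r * (d + k)    # B is d x r, A is r x k
    return full, lora

full, lora = lora_param_counts(d=4096, k=4096, r=16)
print(full, lora, full // lora)  # 16777216 131072 128
```

For square matrices the ratio is $d / (2r)$, so halving the rank doubles the saving.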

For an entire model with many such matrices, the savings compound. Instead of storing $16N$ bytes for training, you store the frozen weights (in low precision, as we’ll see shortly) plus a tiny fraction of trainable parameters.

Initialization: Starting from the Pre-Trained Weights

There’s a subtlety in how $A$ and $B$ are initialized that makes LoRA work cleanly.

We want: at the start of training, $\Delta W = BA = 0$. This ensures the model begins exactly at the pre-trained checkpoint, with no random perturbation.

The initialization is:

$A \sim \mathcal{N}(0, \sigma^2)$ - drawn from a Gaussian
$B = 0$ - the zero matrix

Then $\Delta W = BA = 0 \cdot A = 0$. Training begins from the pre-trained weights.

As training proceeds, $B$ and $A$ both move from their initial values. The update $\Delta W = BA$ picks up a low-rank structure.

Discomfort check. Why not initialize both $A$ and $B$ randomly? If both are nonzero at initialization, $\Delta W = BA \neq 0$ and the model starts somewhere other than the pre-trained checkpoint. For fine-tuning (where the pre-trained weights are already good), this is wasteful. The asymmetric initialization - $A$ random, $B$ zero - is a deliberate choice that preserves the starting point while allowing $B$ to develop nontrivially during training.
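Writing out the gradients makes the asymmetry concrete. This is a sketch for a single adapted layer $h = Wx + BAx$ with input $x \in \mathbb{R}^{k}$; the loss $\mathcal{L}$ and the upstream gradient $g = \partial \mathcal{L} / \partial h \in \mathbb{R}^{d}$ are symbols introduced here for illustration:

$$\frac{\partial \mathcal{L}}{\partial B} = g \, (Ax)^T \in \mathbb{R}^{d \times r}, \qquad \frac{\partial \mathcal{L}}{\partial A} = B^T g \, x^T \in \mathbb{R}^{r \times k}.$$

At initialization $B = 0$ and $A \neq 0$, so the gradient for $B$ is generally nonzero (because $Ax \neq 0$) while the gradient for $A$ is exactly zero. $B$ moves on the first step, and $A$ starts to receive gradient as soon as $B$ is nonzero. If both matrices started at zero, both gradients would vanish and the adapter would never train.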

The Scaling Factor $\alpha$

In practice, the LoRA update is scaled:

$$\Delta W = \frac{\alpha}{r} BA.$$

The hyperparameter $\alpha$ controls the effective learning rate of the LoRA update relative to the frozen weights. Common practice: set $\alpha = 2r$, so the scaling factor $\alpha/r = 2$.

Why does this matter? The magnitude of $BA$ grows with $r$ (more parameters, more potential signal). The $\alpha/r$ scaling keeps the update magnitude roughly independent of the rank choice, making it easier to transfer hyperparameters between different $r$ settings.
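Putting the pieces so far together (frozen $W$, trainable $A$ and $B$, the asymmetric initialization, and the $\alpha/r$ scaling), a forward pass might look like the following. This is a minimal NumPy sketch, not any library's implementation; the class name, `sigma`, and the toy shapes are assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W (d x k) plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r, alpha, sigma=0.01, seed=0):
        rng = np.random.default_rng(seed)
        d, k = W.shape
        self.W = W                                    # frozen, never updated
        self.A = rng.normal(0.0, sigma, size=(r, k))  # Gaussian init
        self.B = np.zeros((d, r))                     # zero init, so the update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # x has shape (batch, k). The low-rank path is computed as two thin
        # matmuls, so the d x k product B @ A is never materialized.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 32))
layer = LoRALinear(W, r=4, alpha=8)
x = rng.normal(size=(5, 32))

print(np.allclose(layer.forward(x), x @ W.T))  # True: at initialization B = 0
```

In training, only `A` and `B` would receive gradient updates; `W` stays exactly as loaded.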

Why Low Rank Works: The SVD Perspective

Let’s think about this more carefully. Any matrix $W \in \mathbb{R}^{d \times k}$ has a singular value decomposition:

$$W = U \Sigma V^T,$$

where $U \in \mathbb{R}^{d \times d}$, $\Sigma \in \mathbb{R}^{d \times k}$ is diagonal with non-negative entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$, and $V \in \mathbb{R}^{k \times k}$.

The rank of $W$ is the number of nonzero singular values. A low-rank approximation keeps only the top $r$ singular values:

$$W \approx U_r \Sigma_r V_r^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T,$$

where $U_r, V_r$ contain the first $r$ columns of $U, V$ and $\Sigma_r = \text{diag}(\sigma_1, \ldots, \sigma_r)$. This is exactly the form $\Delta W = BA$: if you let $B = U_r \Sigma_r^{1/2}$ and $A = \Sigma_r^{1/2} V_r^T$, you recover the rank-$r$ approximation.

The empirical claim underlying LoRA is that the weight update $\Delta W$ needed for task adaptation has low effective rank. That is: most of the “useful” change in $W$ is captured by just a few singular directions. The rest is noise. LoRA constrains $\Delta W$ to live in this low-rank subspace, which is where the update would have ended up anyway.

This has been verified empirically: when you fine-tune a large model fully and then examine the singular values of $\Delta W$, the spectrum drops off sharply. A rank-8 or rank-16 approximation captures most of the useful update.
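The factorization $B = U_r \Sigma_r^{1/2}$, $A = \Sigma_r^{1/2} V_r^T$ can be tried out on a toy matrix. The sketch below uses a synthetic "update" (a rank-$r$ signal plus small noise), which is an assumption made for illustration and not data from a real fine-tuning run.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 8

# Synthetic update: exactly rank-r signal plus small dense noise.
delta_W = rng.normal(size=(d, r)) @ rng.normal(size=(r, k)) + 0.01 * rng.normal(size=(d, k))

U, s, Vt = np.linalg.svd(delta_W, full_matrices=False)

# Split the top-r singular values evenly between the two factors.
B = U[:, :r] * np.sqrt(s[:r])           # d x r, equals U_r Sigma_r^{1/2}
A = np.sqrt(s[:r])[:, None] * Vt[:r]    # r x k, equals Sigma_r^{1/2} V_r^T

rel_error = np.linalg.norm(delta_W - B @ A) / np.linalg.norm(delta_W)
tail = np.sqrt(np.sum(s[r:] ** 2)) / np.linalg.norm(delta_W)
print(rel_error, tail)  # equal: the error is exactly the discarded singular values
```

The Frobenius error of the truncation equals the root sum of squares of the singular values that were dropped, which is why a sharply decaying spectrum means a small rank loses little.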

Which Layers and What Rank?

LoRA is typically applied to the attention weight matrices: the query, key, value, and output projection matrices ($W_Q, W_K, W_V, W_O$) in each transformer layer. Sometimes the feed-forward network (FFN) matrices are included too.

The rank $r$ is a hyperparameter:

$r = 1$ to $r = 4$: very aggressive compression, works for simple tasks
$r = 8$ or $r = 16$: typical for instruction tuning and chat fine-tuning
$r = 32$ to $r = 64$: more expressive, more parameters, used when task requires more adaptation

For most practical fine-tuning tasks, $r = 8$ to $r = 16$ with LoRA on attention layers gives results competitive with full fine-tuning while using a small fraction of the memory.

QLoRA: LoRA Meets 4-Bit Quantization

LoRA alone reduces the number of trainable parameters dramatically, but the frozen base model still needs to live in GPU memory. For a 65B parameter model in FP16, that’s still 130 GB.

QLoRA (Dettmers et al., 2023) solves this by quantizing the frozen base model to 4 bits, reducing its memory footprint by roughly 4× compared to FP16 (8× compared to FP32).

The setup:

Load the frozen base model in 4-bit NF4 quantization (explained below). Memory: $\frac{1}{2} N$ bytes ≈ 87 GB for a 175B model.
Apply LoRA adapters on top in 16-bit precision. Memory: tiny (a few hundred MB).
During the forward pass, dequantize weights on the fly to 16-bit for computation.

For LLaMA-65B: the quantized model uses ~35 GB, LoRA adapters add ~0.3 GB, activations and optimizer states for the small LoRA parameters are negligible. Total: roughly 40 GB - fits on a single A100-80GB or two consumer-grade RTX 3090s.

Discomfort check. The LoRA adapters change which parameters are optimized, not the model’s architecture. A LoRA-fine-tuned model is mathematically equivalent to a full model whose update is $\Delta W = \frac{\alpha}{r} BA$ - the low-rank constraint restricts the update learned during training, not the capabilities the model already has. At inference time, you can merge the LoRA update into the base weights: $W_{\text{new}} = W + \frac{\alpha}{r} BA$. The merged model has no extra overhead - it has the same shape and inference cost as a fully fine-tuned model. The low-rank structure was just a training efficiency trick.
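Merging is easy to check numerically. In this sketch the matrices are random stand-ins for trained adapter factors, and the scaling $\alpha/r$ from the earlier section is folded into the merged weight; shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 32, 24, 4, 8

W = rng.normal(size=(d, k))   # frozen base weight
B = rng.normal(size=(d, r))   # stand-ins for trained adapter factors
A = rng.normal(size=(r, k))
scale = alpha / r

x = rng.normal(size=(k,))

# Adapter kept separate: two extra thin matmuls per forward pass.
y_adapter = W @ x + scale * (B @ (A @ x))

# Adapter merged once, offline: a single d x k matrix, same shape as W.
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))  # True
print(np.linalg.matrix_rank(B @ A))      # 4: the merged-in update still has rank r
```

The merged matrix is dense and full-size, so nothing about the deployed model reveals that the update was learned through a rank-$r$ bottleneck, even though the difference from the base weights still has rank at most $r$.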

4-Bit Quantization: NF4 Format

Quantization maps continuous weight values to a discrete set of representable values. With 4 bits, you have $2^4 = 16$ levels.

The key question: where do you put those 16 levels?

Naive uniform quantization: evenly space the 16 levels between $[\min(W), \max(W)]$. This wastes levels in the tails of the weight distribution where few weights live.

NF4 (NormalFloat4) is designed for weights that follow an approximately normal distribution (which neural network weights typically do after training). It places the 16 quantization levels at the quantiles of a standard normal distribution, so each interval contains an equal fraction of weights:

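$$q_i \propto \Phi^{-1}\left(\frac{i + 1/2}{16}\right), \quad i = 0, 1, \ldots, 15,$$

where $\Phi^{-1}$ is the inverse normal CDF and the levels are rescaled so that they span $[-1, 1]$. The half-step offset keeps every argument strictly inside $(0, 1)$, where $\Phi^{-1}$ is finite. This is a simplified form: the exact construction in the QLoRA paper averages adjacent quantiles and arranges for one level to be exactly zero. The QLoRA authors describe the data type as information-theoretically optimal for normally distributed inputs, in the sense that each of the 16 codes is used about equally often. That is not the same as minimizing the expected squared quantization error.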

To quantize a weight $w$:

Normalize: $\hat{w} = w / \max(|W|)$ so values lie in $[-1, 1]$.
Map to nearest $q_i$.
Store the 4-bit index.

To dequantize: look up $q_i$, multiply by $\max(|W|)$.
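The quantize and dequantize steps above can be sketched in NumPy. This is a simplified illustration and not the official NF4 table: the 16 levels sit at midpoint quantiles $(i + 1/2)/16$ of a standard normal (the offset avoids the infinite value of $\Phi^{-1}$ at 0), they are rescaled to $[-1, 1]$, and a single normalization constant is used for the whole tensor. Function names are hypothetical and the weights are synthetic.

```python
import numpy as np
from statistics import NormalDist

def normal_quantile_levels(bits=4):
    """Levels at midpoint quantiles of N(0, 1), rescaled to span [-1, 1]."""
    n = 2 ** bits
    q = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])
    return q / np.abs(q).max()

def quantize(w, levels):
    absmax = np.abs(w).max()                      # normalization constant
    normalized = w / absmax                       # values now lie in [-1, 1]
    idx = np.abs(normalized[:, None] - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax           # 4-bit codes, held in uint8 here

def dequantize(idx, absmax, levels):
    return levels[idx] * absmax                   # table lookup, then rescale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)              # synthetic, roughly Gaussian weights

levels = normal_quantile_levels()
idx, absmax = quantize(w, levels)
w_hat = dequantize(idx, absmax, levels)
err_q = np.sqrt(np.mean((w - w_hat) ** 2))

# Same pipeline with 16 evenly spaced levels, for comparison.
uniform = np.linspace(-1.0, 1.0, 16)
idx_u, absmax_u = quantize(w, uniform)
err_u = np.sqrt(np.mean((w - dequantize(idx_u, absmax_u, uniform)) ** 2))

print("quantile levels RMS error:", err_q)
print("uniform levels  RMS error:", err_u)
```

On Gaussian-like data the quantile-based levels spend more codes near zero, where most weights are, and fewer in the tails.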

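Double quantization (a QLoRA contribution): QLoRA normalizes weights in blocks of 64, so each block carries its own FP32 constant $\max(|W|)$, an overhead of $32/64 = 0.5$ bits per parameter. These normalization constants are themselves quantized from FP32 to 8-bit. This meta-quantization saves another ~0.37 bits per parameter on average.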

The combined effect: NF4 + double quantization uses approximately 4.13 bits per parameter (4.5 bits without double quantization), compared to 16 bits for FP16. That is roughly a 4× memory reduction for the base model, with quantization error small enough that fine-tuning quality is nearly indistinguishable from FP16.
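The ~0.37-bit saving from double quantization can be reproduced with a little arithmetic. The block sizes below (64 weights per normalization constant, 256 constants per second-level constant) are the ones reported in the QLoRA paper; the function itself is a hypothetical helper.

```python
def bits_per_param(weight_bits=4, block=64, const_bits=32,
                   double_quant=False, dq_bits=8, dq_block=256, dq_const_bits=32):
    """Average storage per weight, counting the per-block normalization constants."""
    if not double_quant:
        return weight_bits + const_bits / block
    # Constants are stored in dq_bits each, plus one second-level
    # constant for every dq_block of them.
    return weight_bits + dq_bits / block + dq_const_bits / (block * dq_block)

plain = bits_per_param()
dq = bits_per_param(double_quant=True)

print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {dq:.3f} bits/param")
print(f"saving:                      {plain - dq:.3f} bits/param")
print(f"reduction vs FP16:           {16 / dq:.2f}x")
```

The overhead of the constants drops from 0.5 bits to about 0.127 bits per parameter, which is where the saving comes from.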

The Full Memory Picture

Let’s make the memory accounting concrete for a practical example: fine-tuning LLaMA-13B.

Component	Full FP16 Fine-Tuning	QLoRA
Model weights	26 GB (FP16)	7 GB (NF4)
Gradients	26 GB	~0 (frozen)
Optimizer states	52 GB	~0.1 GB (LoRA params only)
LoRA adapters	-	~0.1 GB
Total	~104 GB	~7.2 GB

QLoRA reduces training memory from ~104 GB to ~7.2 GB - a 14× reduction. The table leaves out activations, which grow with batch size and sequence length, but even with those added LLaMA-13B training can fit on a single consumer RTX 3090 (24 GB).
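The table's columns follow from per-parameter byte counts. In this rough sketch the adapter and adapter-optimizer sizes are simply the table's approximate figures passed in as defaults, not derived values, and the function names are hypothetical. The 4-bit weights come out at 6.5 GB, which the table rounds up to 7 GB.

```python
GB = 1e9

def full_fp16_gb(n_params):
    # 2 bytes each for weights and gradients, 2 + 2 bytes for Adam's two moments.
    return {
        "weights": 2 * n_params / GB,
        "gradients": 2 * n_params / GB,
        "optimizer": 4 * n_params / GB,
    }

def qlora_gb(n_params, adapter_gb=0.1, adapter_optimizer_gb=0.1):
    # 4-bit base weights take half a byte each; only the adapters are trained.
    return {
        "weights": 0.5 * n_params / GB,
        "adapters": adapter_gb,
        "optimizer": adapter_optimizer_gb,
    }

n = 13e9  # LLaMA-13B
full_total = sum(full_fp16_gb(n).values())
qlora_total = sum(qlora_gb(n).values())
print(f"full FP16: {full_total:.0f} GB")
print(f"QLoRA:     {qlora_total:.1f} GB")
```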

Practical Implications

Per-task adapters. LoRA fine-tuning produces a small set of adapter weights ($A$ and $B$ matrices for each LoRA layer). For a 7B model with $r=16$ LoRA on all attention layers, the adapters are roughly 30 MB - a file you can email. The base model is shared. Switching between tasks means swapping adapters, not reloading the full model.

Serving multiple tasks. If you have 100 specialized fine-tunes (one for legal, one for medical, one for coding, …), you need one copy of the base model in GPU memory plus 100 × 30 MB of adapter weights. Much more efficient than 100 full model copies.
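The adapter size and the multi-task saving can both be estimated from the model shape. The sketch assumes a 7B-class model with 32 layers and hidden size 4096, LoRA on the four attention projections, and 16-bit storage; the helper name is hypothetical.

```python
def adapter_params(num_layers, d_model, r, matrices_per_layer=4):
    """Trainable parameters when every adapted matrix is d_model x d_model."""
    return num_layers * matrices_per_layer * r * (d_model + d_model)

# Assumed 7B-class shape: 32 layers, hidden size 4096, LoRA on W_Q, W_K, W_V, W_O.
params = adapter_params(num_layers=32, d_model=4096, r=16)
adapter_mb = params * 2 / 1e6              # 2 bytes per parameter in 16-bit

base_gb = 7e9 * 2 / 1e9                    # one FP16 copy of the base model
n_tasks = 100
shared_gb = base_gb + n_tasks * adapter_mb / 1e3   # one base + 100 adapters
copies_gb = n_tasks * base_gb                      # 100 full fine-tuned copies

print(f"adapter: {params:,} params, about {adapter_mb:.1f} MB")
print(f"100 tasks, shared base: {shared_gb:.1f} GB")
print(f"100 tasks, full copies: {copies_gb:.0f} GB")
```

Under these assumptions the adapter comes out in the low tens of megabytes, in line with the rough figure quoted above.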

Merging adapters. Because $W_{\text{new}} = W + \frac{\alpha}{r} BA$ is a simple matrix addition, you can combine multiple LoRA adapters via weight arithmetic. Merging techniques such as SLERP and task arithmetic combine capabilities of different fine-tunes - conceptually like “averaging” experts in weight space.

Limitations. LoRA constrains the update to low rank, which may not suffice for tasks that require fundamentally new behaviors (as opposed to adapting existing ones). Very low ranks ($r=1$ or $r=2$) may underfit complex tasks. The choice of which layers to apply LoRA to matters and is often determined empirically.

Summary

Concept	What It Means
LoRA	Parameterize weight update as $\Delta W = BA$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d,k)$
Rank $r$	Controls expressivity vs. parameter count; typical values 8 - 16
Initialization	$B = 0$, $A \sim \mathcal{N}(0, \sigma^2)$ so $\Delta W = 0$ at start
Scaling	Update is $(\alpha/r) \cdot BA$; $\alpha = 2r$ is common
Why it works	Fine-tuning updates are empirically low-rank; LoRA exploits this
QLoRA	Load base model in NF4 4-bit; apply LoRA in FP16; dequantize on the fly
NF4	4-bit quantization with levels at quantiles of a normal distribution; designed for approximately Gaussian weights
Memory reduction	Full fine-tune: $16N$ bytes. QLoRA: $\approx 0.5N$ bytes. $\approx 32\times$ reduction
Inference merging	$W_{\text{new}} = W + BA$; LoRA adapters can be merged, leaving a single model

LoRA and QLoRA are currently the dominant methods for fine-tuning large language models on consumer and research hardware. The mathematics is just SVD and rank - familiar from linear algebra - applied to the observation that adaptation is inherently low-dimensional.
