LoRA & 4-Bit Quantization
Fine-tuning large language models is prohibitively expensive in its naive form: a 70B parameter model in fp16 requires 140 GB of GPU memory just for the weights, and storing the optimiser states for Adam doubles or triples that. LoRA and quantization are the two main tools for making fine-tuning and inference tractable on commodity hardware.
LoRA: Low-Rank Adaptation
LoRA (Hu et al., 2021) addresses the memory and compute cost of fine-tuning by observing that the update to a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ during adaptation has low intrinsic rank. Rather than training $W_0$ directly, LoRA freezes it and learns a low-rank decomposition of the update:
$$\Delta W = BA$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r \ll \min(d, k)$.
The forward pass becomes:
$$h = W_0 x + \frac{\alpha}{r} B A x$$
where $\alpha$ is a scaling hyperparameter (typically set equal to $r$, making the scaling factor 1, but sometimes tuned separately). $A$ is initialised with a Gaussian, $B$ with zeros, so $\Delta W = 0$ at the start of training and training begins from the pretrained model’s predictions.
Parameter efficiency. A full weight matrix has $dk$ parameters. The LoRA factorisation has $r(d + k)$ parameters, which for $r = 8$, $d = k = 4096$ is $8 \times 8192 = 65{,}536$ vs $16{,}777{,}216$ - a $256\times$ reduction. In practice $r \in \{4, 8, 16, 32\}$ and LoRA is applied to the attention projection matrices $W_Q$, $W_K$, $W_V$, and $W_O$.
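A minimal PyTorch sketch of a LoRA linear layer, to make the shapes, initialisation, and scaling concrete. The class name `LoRALinear` and the exact Gaussian scale for $A$ are illustrative choices, not a reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_out: int, d_in: int, r: int = 8, alpha: int = 8):
        super().__init__()
        self.scaling = alpha / r
        # Frozen pretrained weight W_0 (in practice loaded from the checkpoint).
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02, requires_grad=False)
        # Trainable low-rank factors: A is Gaussian, B is zero, so Delta W = 0 at init.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0 x + (alpha / r) * B A x
        return x @ self.weight.T + self.scaling * ((x @ self.lora_A.T) @ self.lora_B.T)

layer = LoRALinear(4096, 4096, r=8, alpha=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 = r(d + k), vs 16,777,216 for the full matrix
```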
Merging at inference. Once training is complete, the adaptation can be merged into the original weights at zero additional inference cost:
$$W = W_0 + \frac{\alpha}{r} BA$$
This means LoRA-fine-tuned models have identical inference latency to the base model.
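Merging is a single in-place update of the frozen weight; a sketch continuing the `LoRALinear` module above.

```python
import torch

@torch.no_grad()
def merge_lora(layer: "LoRALinear") -> None:
    # Fold Delta W = (alpha / r) B A into the base weight; the adapter path can then
    # be dropped, so inference cost is identical to the base model.
    layer.weight += layer.scaling * (layer.lora_B @ layer.lora_A)
```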
LoRA+
LoRA+ (Hayou et al., 2024) observes that the optimal learning rates for $A$ and $B$ differ. In the standard LoRA parametrisation, the feature learning signal is dominated by $B$ (which starts at zero and must grow), while $A$ (which starts with significant magnitude) needs a lower learning rate to avoid overshooting. LoRA+ sets $\eta_B = \lambda \eta_A$ with $\lambda > 1$ (typically $\lambda \approx 16$), improving convergence speed without changing the parameter count.
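In optimiser terms, LoRA+ is just two parameter groups with different learning rates; a sketch assuming the `LoRALinear` module above and $\lambda = 16$.

```python
import torch

layer = LoRALinear(4096, 4096, r=8, alpha=8)
eta_A, lam = 1e-4, 16.0                     # eta_B = lam * eta_A
optimizer = torch.optim.AdamW([
    {"params": [layer.lora_A], "lr": eta_A},
    {"params": [layer.lora_B], "lr": lam * eta_A},
])
```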
QLoRA: 4-Bit Fine-Tuning
QLoRA (Dettmers et al., 2023) combines LoRA with aggressive quantization, enabling fine-tuning of 65B-parameter models on a single 48 GB GPU.
NF4: NormalFloat Quantization
The key innovation is the NormalFloat data type (NF4 in its 4-bit form), designed for weights that follow an approximately normal distribution (as pretrained LLM weights typically do). For $k$-bit quantization, NormalFloat assigns $2^k$ quantization levels such that each level covers an equal probability mass under the standard normal distribution.
Formally, the quantization levels $q_1 < q_2 < \cdots < q_{2^k}$ satisfy:
$$\Phi(q_{i+1}) - \Phi(q_i) = \frac{1}{2^k}$$
where $\Phi$ is the standard normal CDF, so the levels are equally spaced quantiles of $\mathcal{N}(0,1)$ (implementations offset the outermost quantiles so they remain finite, rescale the levels to $[-1, 1]$, and reserve an exact zero code).
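A small sketch of computing such levels from normal quantiles (each level sits at the midpoint of an equal-mass bin, then the levels are rescaled to $[-1, 1]$); the real NF4 code book additionally reserves an exact zero, which is omitted here.

```python
import torch
from torch.distributions import Normal

def normalfloat_levels(bits: int = 4) -> torch.Tensor:
    # Quantiles of N(0, 1) at the midpoints of 2^bits equal-probability-mass bins
    # (midpoints keep the outermost levels finite), rescaled to [-1, 1].
    k = 2 ** bits
    probs = (2 * torch.arange(1, k + 1, dtype=torch.float32) - 1) / (2 * k)
    q = Normal(0.0, 1.0).icdf(probs)
    return q / q.abs().max()
```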
This is information-theoretically optimal for normally distributed weights in the sense that every quantization bin receives the same expected number of values. A weight block is normalised by its absolute maximum before quantization (block-wise quantization), and the scale factors are themselves quantized - double quantization - reducing the memory overhead of the quantization constants from 32 bits per block to about 8 bits per block.
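A sketch of block-wise absmax quantization to the nearest level, assuming the `normalfloat_levels` helper above; double quantization, which would further compress the per-block `scale` values, is omitted.

```python
import torch

def quantize_block(w: torch.Tensor, levels: torch.Tensor):
    # Normalise the block by its absolute maximum, then store the index of the
    # nearest quantization level (4-bit code) plus the per-block scale.
    scale = w.abs().max()
    idx = torch.argmin((w / scale).unsqueeze(-1).sub(levels).abs(), dim=-1)
    return idx.to(torch.uint8), scale

def dequantize_block(idx: torch.Tensor, scale: torch.Tensor, levels: torch.Tensor):
    # Look up each stored code and rescale back to the original range.
    return levels[idx.long()] * scale
```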
QLoRA Training
The pretrained model is loaded in NF4 (4-bit), but forward and backward passes are computed in bf16. Dequantization happens on the fly: for each weight block, the stored 4-bit integer is mapped to the corresponding NF4 level and rescaled. Gradients flow through the LoRA adapters (in bf16) but not through the frozen quantized weights.
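For reference, this is how a typical QLoRA setup looks with the Hugging Face stack (transformers, peft, bitsandbytes); the model ID is illustrative and the argument names reflect recent library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit base weights
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the bf16 adapters are trainable
model.print_trainable_parameters()
```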
GPTQ: Post-Training Weight Quantization
GPTQ (Frantar et al., 2022) is a post-training quantization method that quantizes each weight matrix layer by layer, using the Optimal Brain Surgeon (OBS) framework.
For a weight matrix $W$ and a batch of calibration inputs, let $H = 2XX^T$ be the layer’s Hessian (where $X$ is the matrix of layer inputs). When quantizing weight $w_q$ to $\hat{w}_q$, the quantization error $\delta_q = \hat{w}_q - w_q$ propagates a correction to the remaining unquantized weights:
$$\delta_F = -\frac{w_q - \hat{w}_q}{[H^{-1}]_{qq}} \, H^{-1}_{:,q}$$
This weight update minimises the increase in output error due to quantizing $w_q$, as measured by the Hessian. The algorithm processes weights column by column, updating the remaining weights after each quantization step. In practice, a Cholesky decomposition of $H^{-1}$ is precomputed and rows are processed in blocks for efficiency.
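A heavily simplified sketch of the per-column loop (no blocking, no Cholesky factorisation, a generic element-wise quantizer), just to show how the error feedback uses $H^{-1}$.

```python
import torch

def gptq_quantize(W: torch.Tensor, Hinv: torch.Tensor, quant):
    # W: (rows, cols) weights; Hinv: inverse of H = 2 X X^T; quant: rounding function.
    W, Q = W.clone(), torch.zeros_like(W)
    for q in range(W.shape[1]):
        Q[:, q] = quant(W[:, q])
        err = (W[:, q] - Q[:, q]) / Hinv[q, q]
        # Correct the not-yet-quantized columns to compensate for this column's error.
        W[:, q + 1:] -= err.unsqueeze(1) * Hinv[q, q + 1:].unsqueeze(0)
    return Q
```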
GPTQ achieves near-lossless 4-bit quantization on models with 30B+ parameters and can reach 3-bit quantization with acceptable degradation.
AWQ: Activation-Aware Weight Quantization
AWQ (Lin et al., 2023) observes that not all weights are equally important: salient weights - those corresponding to large input activations - cause disproportionately large quantization errors. AWQ identifies these weights by examining activation magnitudes $|x_j|$ across a calibration set.
Rather than keeping salient weights at higher precision, AWQ scales them before quantization: if weight column $j$ is multiplied by $s_j > 1$ before quantization and the corresponding input is divided by $s_j$, the quantization error for that weight is reduced by a factor of $s_j$ while the computation remains equivalent. The optimal scale $s_j$ is found by grid search over the calibration set to minimise the output error.
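A conceptual sketch of the scale search: a small grid over an exponent $\gamma$, with $s_j = (\text{mean}\,|x_j|)^{\gamma}$, keeping whichever scales minimise the layer's output error. The parametrisation follows the spirit of the method, but the details here are illustrative.

```python
import torch

def awq_search_scales(W, X, quant, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    # W: (out, in) weights; X: (tokens, in) calibration activations; quant: weight quantizer.
    act_mag = X.abs().mean(dim=0)              # per-input-channel activation magnitude
    ref = X @ W.T                              # full-precision reference output
    best_err, best_s = float("inf"), torch.ones_like(act_mag)
    for gamma in grid:
        s = act_mag.clamp(min=1e-5) ** gamma
        W_q = quant(W * s) / s                 # scale up, quantize, fold the scale back
        err = (X @ W_q.T - ref).pow(2).mean()
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```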
AWQ is hardware-friendly because it requires no mixed-precision arithmetic at inference time - all weights are at the same bit-width.
Quantization-Aware Training vs Post-Training Quantization
Post-training quantization (PTQ) - including GPTQ and AWQ - quantizes a fully trained model without further gradient updates. It is fast but accuracy degrades at very low bit-widths (below 4 bits).
Quantization-aware training (QAT) simulates quantization during training using the straight-through estimator: the forward pass rounds weights to their quantized values, but the backward pass passes gradients through as if the weights were continuous. QAT consistently achieves better accuracy than PTQ at the same bit-width but requires full training, which is expensive for LLMs. QLoRA sits between the two: the frozen base weights are quantized post hoc as in PTQ, while the bf16 LoRA adapters are trained on top of the quantized model, letting training compensate for the quantization error.
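A minimal straight-through estimator in PyTorch: the forward pass sees the rounded weights, while the `detach` trick gives the rounding an identity gradient.

```python
import torch

def ste_quantize(w: torch.Tensor, step: float = 2 ** -3) -> torch.Tensor:
    # Fake-quantize to a uniform grid; gradients flow through as if w were continuous.
    w_q = torch.round(w / step) * step
    return w + (w_q - w).detach()
```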
Examples
Fine-tuning LLaMA 2 70B on a consumer GPU. With QLoRA at 4-bit NF4, the 70B model occupies about 35 GB in memory. Adding LoRA adapters ($r=64$ applied to all attention projections) adds roughly 500M trainable parameters (0.7% of total). Fine-tuning on a single A100 80 GB GPU is feasible, with throughput of about 200 tokens/second. The resulting model matches or approaches full fine-tuning quality on instruction-following benchmarks.
Inference latency and memory savings. GPTQ 4-bit quantization of LLaMA 2 13B reduces memory from 26 GB (fp16) to 7 GB - fitting on a single consumer-grade GPU - with a throughput increase of approximately $1.5\times$ due to reduced memory bandwidth requirements, and perplexity degradation of less than 0.5 on Wikitext-2.