LoRA & qLoRA
How a one-page math trick let people fine-tune 70-billion-parameter models on a single consumer GPU — and what they’re really doing when they do.
The five-bullet version
- Full fine-tuning updates every weight — billions of parameters, all needing optimizer state. It costs more memory than the model itself.
- LoRA freezes the model and inserts small trainable matrices alongside each weight to be adapted. Train only those.
- The trick: weight updates during fine-tuning tend to be low-rank, so a tiny pair of matrices is enough.
- qLoRA quantizes the frozen base to 4 bits while keeping the adapter in higher precision. Same recipe, a quarter the base memory.
- For ~1% of the params, you get most of the fine-tune quality. The headline result of the last three years of practical AI.
§ 00 · WHY FULL FINE-TUNING IS SO EXPENSIVE
The accountant’s view of training
A model with N parameters needs roughly 16N bytes to train. That isn’t a typo. The model weights themselves are 2 bytes each (bf16). You also need a copy in fp32 for stable updates (4 bytes), gradients (2 bytes), and the optimizer’s running averages — Adam stores two of them in fp32 (4 + 4 bytes). Add it up: roughly 16× the parameter count, in bytes, just to hold a training step in memory, before counting activations.
For a 70-billion-parameter model, that’s about 1.1 terabytes of VRAM. No single GPU exists with that much memory. Even cutting corners aggressively — bf16 throughout, gradient checkpointing, smaller batches — fully fine-tuning a 70B model needs a cluster of eight H100s at minimum, and a serious networking budget. That price tag stopped most teams from fine-tuning at all.
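That accounting is easy to sanity-check. A minimal sketch in plain Python (activations are excluded on purpose: they scale with batch size and sequence length, not parameter count):

```python
# Bytes per parameter under mixed-precision Adam training, as itemized above.
def full_finetune_bytes(n_params: int) -> int:
    weights_bf16 = 2 * n_params   # the model itself
    master_fp32  = 4 * n_params   # fp32 copy for stable updates
    grads_bf16   = 2 * n_params   # one gradient per parameter
    adam_fp32    = 8 * n_params   # two running averages, 4 bytes each
    return weights_bf16 + master_fp32 + grads_bf16 + adam_fp32  # 16 bytes/param

print(full_finetune_bytes(70_000_000_000) / 1e12)  # ~1.12 (terabytes)
```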
§ 01 · THE LOW-RANK INSIGHT
What a fine-tune is really doing
When you fine-tune a pretrained model on a specific task — say, a medical chatbot or a code-completion specialist — what are you actually changing? You’re shifting weights from their pretrained values to new task-specific values. Call the original weight matrix W and the change ΔW. Full fine-tuning learns ΔW as a full matrix the same size as W.
The empirical observation behind LoRA: that change is almost always low-rank. The fine-tuning signal lives in a small subspace. You don’t need a 4096×4096 matrix to express it — a much smaller pair of matrices, when multiplied, recovers almost the same thing.
A matrix ΔW of shape d × d has d² parameters. If you write ΔW = B·A where A is r × d and B is d × r, the product has rank at most r, and the parameter count drops to 2·r·d. For d = 4096 and r = 8: a full matrix is 16,777,216 params; the low-rank pair is 65,536 — a 256× reduction, with essentially the same accuracy on the tasks people care about.
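The arithmetic, plus the rank bound itself, checked in a few lines (a toy verification, not code from the paper):

```python
import numpy as np

d, r = 4096, 8
print(d * d, 2 * r * d)   # 16777216 vs 65536: the 256x reduction

# The rank bound in action (smaller d so the check runs instantly):
d_demo = 256
A = np.random.randn(r, d_demo)       # r x d
B = np.random.randn(d_demo, r)       # d x r
print(np.linalg.matrix_rank(B @ A))  # 8: the product can never exceed rank r
```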
§ 02 · WHAT LORA ACTUALLY DOES
Freeze the base. Inject adapters. Train only those.
The full algorithm fits in three rules:
- For each weight matrix you want to adapt (typically the Q/K/V/O projections in attention), freeze the original W.
- Add two new matrices alongside: A, initialized to small random values, and B, initialized to zero. Because B·A = 0 at step 0, the model output starts out exactly the same as the frozen base.
- At every layer, the new forward pass is y = W·x + B·A·x. Train with backprop on the usual loss, but only let gradients flow into A and B.
That’s it. The model has the same number of forward-pass parameters (the frozen W stays the size it was), but only the adapters get updated. Optimizer state, gradients, and activations for backprop are sized by the adapter, not by W. The 16-byte-per-parameter accounting collapses on the part of the model you’re actually training.
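Here is a minimal sketch of those three rules in PyTorch. The class name and init scale are my own; real implementations (e.g. the peft library) also handle dropout, dtypes, and per-module targeting:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus trainable low-rank adapter: y = W.x + (alpha/r).B.A.x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # rule 1: freeze W (and bias)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # rule 2: A small random
        self.B = nn.Parameter(torch.zeros(d_out, r))        #         B zero -> exact no-op at step 0
        self.scale = alpha / r

    def forward(self, x):
        # rule 3: frozen path + adapter path; gradients only reach A and B
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Wrapping a 4096×4096 nn.Linear with r = 8 adds 65,536 trainable parameters next to 16.8M frozen ones.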
[Interactive chart in the original, with sliders for rank and model size. Numbers are approximate — they assume adapters on Q/K/V/O only and Adam optimizer state. The takeaway: rank moves trainable params from billions to millions, and qLoRA cuts the frozen base by a further 4×. Sliding the rank up grows trainable params, but they stay tiny next to the base; sliding model size up explodes the cost of a full fine-tune, while LoRA barely notices.]
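The same calculator in plain Python, for readers without the interactive version (the 70B-class shape below is illustrative; real architectures vary, and grouped-query attention shrinks the K/V projections):

```python
# Trainable-parameter count with adapters on the four attention projections,
# each treated as a square d x d matrix (an approximation, per the note above).
def lora_trainable(d_model: int, n_layers: int, r: int, n_proj: int = 4) -> int:
    return n_layers * n_proj * 2 * r * d_model

for r in (8, 16, 64):
    print(r, f"{lora_trainable(8192, 80, r):,}")
# 8  ->  41,943,040   (~42M trainable vs 70B frozen)
# 16 ->  83,886,080
# 64 -> 335,544,320
```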
§ 03 · QLORA = LORA + 4-BIT BASE
The trick that makes 70B fit on one card
LoRA already cut the trainable cost. The remaining bottleneck is the frozen base itself — even at bf16 (2 bytes per parameter), a 70B model needs ~140 GB. That doesn’t fit on a 24 GB consumer GPU. Or 48 GB. Or 80 GB.
qLoRA (Dettmers et al., 2023) is LoRA with a more aggressive base: quantize the frozen weights to 4 bits per parameter using a format called NF4 (a custom 4-bit float designed for the bell-curve distribution that trained weights actually have). The adapter stays in higher precision — usually bf16 — because that’s where the gradients flow.
4 bits is a quarter of bf16’s 16 bits. A 70B base goes from 140 GB to ~35 GB. Add a few GB of adapter and optimizer state, and you can train on a single 48 GB A100 — or, with one more trick (paging optimizer state to CPU), on a 24 GB consumer card.
Three subtleties that make qLoRA work in practice rather than just on paper:
- Double quantization. The quantization constants themselves take a few hundred MB. qLoRA quantizes those too. A small saving on its own, but it adds up at scale.
- Paged optimizers. Adam states for the (tiny) adapter can spill to CPU RAM when the GPU is busy, and stream back on demand. Lets you survive memory spikes from long sequences.
- Compute happens in bf16. The base weights are stored in 4-bit and dequantized on-the-fly for each matmul. You pay a small throughput cost for storage savings — a trade-off that lines up well with consumer-GPU economics.
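All three show up as config flags in the Hugging Face stack. A sketch assuming transformers and bitsandbytes are installed; the model name is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the 4-bit float shaped for bell-curve weights
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for each matmul
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",            # placeholder: any causal LM
    quantization_config=bnb_config,
    device_map="auto",
)
# Paged optimizer states: optim="paged_adamw_32bit" in TrainingArguments.
```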
§ 04 · PICKING R AND WHEN TO SHIP
Practical defaults
Three knobs decide a LoRA run: the rank r, the scaling factor α (the contribution of the adapter is scaled by α / r), and which layers to adapt. Defaults that work most of the time:
- Rank. Start at r = 8 for instruction tuning, r = 16 for harder domains (code, math). Going higher rarely helps; going lower is fine for style adaptation.
- Alpha. Set α = 2r as a default. So r = 8, α = 16. Tune later if needed.
- Targets. Q and V projections in attention are the minimum that works. Adding K, O, and the MLP up/down projections (everything in the “target_modules” list) helps for harder tasks — at a 3–5× cost in trainable params, still tiny (see the config sketch below).
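Those defaults, spelled out with the peft library. Module names below follow Llama-style conventions; check your model’s actual layer names before copying:

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,                        # alpha = 2r
    target_modules=["q_proj", "v_proj"],  # the minimum that works; add "k_proj",
                                          # "o_proj", "up_proj", "down_proj" for harder tasks
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)  # base_model: e.g. the 4-bit model above
model.print_trainable_parameters()          # typically well under 1% of the total
```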
Shipping a LoRA-trained model: you don’t have to merge the adapter into the base. Many serving stacks support running the base and adapter separately, swapping adapters per request — useful for multi-tenant fine-tunes. If you do merge (W′ = W + (α/r)·B·A), the result is a plain dense checkpoint: from the user’s point of view, indistinguishable from a model fine-tuned conventionally, with zero extra inference cost.
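The merge is just adding the scaled product back into the frozen weight, and the equivalence is easy to check numerically (peft exposes the real thing as merge_and_unload()):

```python
import torch

d, r, scale = 64, 8, 2.0            # toy shapes; scale = alpha / r
W = torch.randn(d, d)
A, B = torch.randn(r, d), torch.randn(d, r)
x = torch.randn(d)

y_adapter = W @ x + scale * (B @ (A @ x))  # base and adapter served separately
W_merged = W + scale * (B @ A)             # one-time merge into a plain checkpoint
y_merged = W_merged @ x

print(torch.allclose(y_adapter, y_merged, atol=1e-4))  # True
```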
§ 05 · TAKING THIS FORWARD
Where this goes
LoRA is the foundation. Two threads run downstream of it:
- DoRA, AdaLoRA, LoftQ — variants that allocate rank per-layer rather than uniformly, or initialize differently. Marginal gains on hard benchmarks; pick if your evals justify it.
- Adapter merging & routing. If you train many adapters on the same base, you can serve them as a library and route requests to the right one — or interpolate between two for a blended behavior. Powers most modern character/persona AI products.
The headline: pretraining stays expensive and centralized. Fine-tuning is now something a serious individual can do on their own hardware. The last few years of practical, useful, niche models — the medical chatbots, the legal RAGs, the code reviewers — exist because the cost of adapting a great base model went from “university grant” to “a weekend.”
§ · GOING DEEPER
What rank to pick and where to apply LoRA
Three knobs determine LoRA quality, and they interact. Rank sets the capacity of the adapter — 8 is a common default, but tasks that require new knowledge (rather than new style) benefit from 16–64. Scale α (the multiplier on the B·A product) controls how loud the adapter is at inference; the standard recipe sets α/r ≈ 2. Target modules — which layers actually get adapters — usually matter more than rank. Hu et al. (2021) showed Q and V projections in attention are enough for many tasks; modern recipes also adapt K, O, and the MLP projections.
qLoRA (Dettmers et al. 2023) was the trick that made fine-tuning a 65B model on a single 48 GB GPU practical: keep the base model in 4-bit NF4 quantization, train the LoRA adapter in higher precision, and use paged optimizers for memory spikes. The cost is some throughput per step. The benefit is fitting at all. DoRA (2024) decomposes the weight update into magnitude and direction components and gets meaningfully better quality at the same rank.
§ · FURTHER READING
References & deeper sources
- Hu et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models · ICLR
- Dettmers et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs · NeurIPS
- Liu et al. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation · ICML
- Aghajanyan et al. (2020). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning · ACL
- Kalajdzievski (2023). A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA (rsLoRA) · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.