Core Concepts · Module 04 · 11 min read

LoRA & qLoRA

How a one-page math trick let people fine-tune 70-billion-parameter models on a single consumer GPU — and what they’re really doing when they do.

The five-bullet version

  • Full fine-tuning updates every weight — billions of parameters, all needing optimizer state. It costs more memory than the model itself.
  • LoRA freezes the model and inserts small trainable matrices alongside each weight to be adapted. Train only those.
  • The trick: weight updates during fine-tuning tend to be low-rank, so a tiny pair of matrices is enough.
  • qLoRA quantizes the frozen base to 4 bits while keeping the adapter in higher precision. Same recipe, a quarter the base memory.
  • For ~1% of the params, you get most of the fine-tune quality. The headline result of the last three years of practical AI.

§ 00 · WHY FULL FINE-TUNING IS SO EXPENSIVE
The accountant’s view of training

A model with N parameters needs roughly 16N bytes to train. That isn’t a typo. The model weights themselves are 2 bytes each (bf16). You also need a copy in fp32 for stable updates (4 bytes), gradients (2 bytes), and the optimizer’s running averages — Adam stores two of them in fp32 (4 + 4 bytes). Add it up: roughly 16× the parameter count, in bytes, just to hold a training step in memory.
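To make the accounting concrete, here is the same arithmetic as a small Python sketch (the byte counts follow the breakdown above; real runs add activation memory and framework overhead on top):

    def full_finetune_bytes(n_params: float) -> float:
        """Approximate training memory for full fine-tuning with Adam,
        following the per-parameter breakdown in the text."""
        weights_bf16 = 2 * n_params   # the model itself
        master_fp32  = 4 * n_params   # fp32 copy for stable updates
        grads_bf16   = 2 * n_params
        adam_m_fp32  = 4 * n_params   # first-moment running average
        adam_v_fp32  = 4 * n_params   # second-moment running average
        return weights_bf16 + master_fp32 + grads_bf16 + adam_m_fp32 + adam_v_fp32

    print(f"{full_finetune_bytes(70e9) / 1e12:.2f} TB")  # -> 1.12 TB for a 70B model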

For a 70-billion-parameter model, that’s about 1.1 terabytes of VRAM. No single GPU exists with that much memory. Even cutting corners aggressively — bf16 throughout, gradient checkpointing, smaller batches — fully fine-tuning a 70B model needs a cluster of eight H100s, minimum, and a serious networking budget. That price tag stopped most teams from fine-tuning at all.

§ 01 · THE LOW-RANK INSIGHT
What a fine-tune is really doing

When you fine-tune a pretrained model on a specific task — say, a medical chatbot, or a code-completion specialist — what are you actually changing? You’re shifting weights from their pretrained values to new task-specific values. Call the original weight matrix W and the change ΔW. Full fine-tuning learns ΔW as a full matrix the same size as W.

The empirical observation behind LoRA: that change is almost always low-rank. The fine-tuning signal lives in a small subspace. You don’t need a 4096×4096 matrix to express it — a much smaller pair of matrices, when multiplied, recovers almost the same thing.

[Figure: B (d × r) · A (r × d) = ΔW (d × d), rank ≤ r · 2·d·r params vs d² params · for d = 4096, r = 8 → 256× reduction]
Fig 1 · Low-rank decomposition. Two thin matrices, multiplied, can reproduce a much larger square one — exactly when the larger one has low rank to begin with.

A matrix ΔW of shape d × d has d² parameters. If you write ΔW = B·A where A is r × d and B is d × r, the product has rank at most r, and the parameter count drops to 2·r·d. For d = 4096 and r = 8: a full matrix is 16,777,216 params; the low-rank pair is 65,536 — a 256× reduction. And, empirically, with almost no loss of accuracy on the tasks people care about.
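The same numbers, plus a sanity check that a product B·A really has rank at most r — a quick sketch in PyTorch:

    import torch

    d, r = 4096, 8
    print(d * d)                   # 16,777,216 params in the full matrix
    print(2 * d * r)               # 65,536 params in the low-rank pair
    print(d * d // (2 * d * r))    # 256x reduction

    # Sanity check at a smaller size: B @ A has at most r nonzero singular values
    d_small = 512
    B = torch.randn(d_small, r)
    A = torch.randn(r, d_small)
    S = torch.linalg.svdvals(B @ A)
    print(int((S > 1e-4).sum()))   # 8 -- everything past the r-th value is ~0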

§ 02 · WHAT LORA ACTUALLY DOES
Freeze the base. Inject adapters. Train only those.

The full algorithm fits in three rules:

  1. For each weight matrix you want to adapt (typically the Q/K/V/O projections in attention), freeze the original W.
  2. Add two new matrices alongside: A initialized to small random values, B initialized to zero. The model output at step 0 is exactly the same as the frozen base.
  3. At every adapted layer, the new forward pass is y = W·x + B·A·x. Train against the loss with backprop, but only let gradients flow into A and B.

That’s it. The model has the same number of forward-pass parameters (the frozen W stays the size it was), but only the adapters get updated. Optimizer state and weight gradients are sized by the adapter, not by W (activations still scale with the full model, since backprop has to pass through the frozen layers to reach the adapters). The 16-byte-per-parameter accounting collapses on the part of the model you’re actually training.
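The three rules, as a minimal PyTorch sketch — an illustration, not the peft library’s implementation (real adapters add dropout, dtype handling, and merge utilities; the α/r scaling shown here is introduced in § 04):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen linear layer plus a trainable low-rank adapter."""

        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)                         # rule 1: freeze W
            d_out, d_in = base.weight.shape
            self.A = nn.Parameter(0.01 * torch.randn(r, d_in))  # rule 2: small random
            self.B = nn.Parameter(torch.zeros(d_out, r))        # rule 2: zero, so B·A·x = 0 at step 0
            self.scale = alpha / r                              # standard α/r scaling

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # rule 3: y = W·x + B·A·x -- gradients reach only A and B
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(4096, 4096))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(trainable)  # 65,536 -- the adapter, nothing else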

Lab · cost of training
Memory for a fine-tune · model size × adapter rank

Base model ≈ 13B params · LoRA rank r = 8 → trainable params: 8.1 M (0.063% of base)

  Setup            Memory     What’s in it
  Full fine-tune   156 GB     all params trainable · bf16 + Adam states
  LoRA             26.1 GB    frozen bf16 base + tiny adapter
  qLoRA            6.60 GB    frozen 4-bit base + tiny adapter

Numbers are approximate — they assume adapters on Q/K/V/O only and Adam optimizer states (the full fine-tune row works out to about 12 bytes per parameter: bf16 weights and gradients plus fp32 Adam moments, without the fp32 master copy). The point: rank moves trainable params from billions to millions, and qLoRA cuts the frozen base by a further 4×. Push the rank up and trainable params grow, but stay tiny next to the base; push the model size up and the cost of a full fine-tune explodes, while LoRA barely notices.

§ 03 · QLORA = LORA + 4-BIT BASE
The trick that makes 70B fit on one card

LoRA already cut the trainable-parameter cost. The remaining bottleneck is the frozen base itself — even at bf16 (2 bytes), a 70B model needs ~140 GB. That doesn’t fit on a 24 GB consumer GPU. Or 48 GB. Or 80 GB.

qLoRA (Dettmers et al., 2023) is LoRA with a more aggressive base: quantize the frozen weights to 4 bits per parameter using a format called NF4 (a custom 4-bit data type designed for the bell-curve distribution that trained weights actually have), page optimizer state to CPU to absorb memory spikes, and train the LoRA adapters on top. The adapter stays in higher precision — usually bf16 — because that’s where the gradients flow. Combined, these tricks make single-GPU fine-tuning of 65B+ models tractable.

4 bits is one quarter of 2 bytes (16 bits). A 70B base goes from 140 GB to ~35 GB. Add a few GB of adapter and optimizer state, and you can train on a single 48 GB card — or, with one more trick (paging optimizer state to CPU), on a 24 GB consumer card.

Three subtleties make qLoRA work in practice rather than just on paper (the trio from the Dettmers et al. paper):

  • NF4 places its 16 quantization levels at the quantiles of a standard normal distribution, which is roughly the distribution trained weights actually follow, so each 4-bit code carries as much information as possible.
  • Double quantization: the per-block scaling constants are themselves quantized, clawing back a few tenths of a bit per parameter.
  • Paged optimizers spill optimizer state to CPU RAM during memory spikes instead of crashing the run.
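A toy sketch of the blockwise 4-bit idea (illustrative only — it uses evenly spaced levels where real NF4 uses normal-distribution quantiles, and it stores one code per byte where real kernels pack two):

    import torch

    LEVELS = torch.linspace(-1.0, 1.0, 16)   # stand-in codebook; NF4 uses normal quantiles

    def quantize_4bit(w: torch.Tensor, block: int = 64):
        blocks = w.reshape(-1, block)
        scale = blocks.abs().amax(dim=1, keepdim=True)       # one fp scale per block
        codes = (blocks / scale).unsqueeze(-1).sub(LEVELS).abs().argmin(dim=-1)
        return codes.to(torch.uint8), scale                  # 4-bit codes + per-block scales

    def dequantize_4bit(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return (LEVELS[codes.long()] * scale).reshape(-1)    # what each matmul sees, in higher precision

    w = torch.randn(1024 * 1024)
    codes, scale = quantize_4bit(w)
    print((dequantize_4bit(codes, scale) - w).abs().mean())  # small reconstruction error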

§ 04 · PICKING R AND WHEN TO SHIP
Practical defaults

Three knobs decide a LoRA run: the rank r, the scaling factor α (the contribution of the adapter is scaled by α/r), and which layers to adapt. Defaults that work most of the time: r = 8, α = 16 (so α/r = 2), adapters on the attention projections. Going Deeper below covers when to deviate.

[Figure: quality (accuracy %) vs LoRA rank r, from 0 to 64 — the curve climbs, then flattens into a plateau zone around the r = 8 default.]
Fig 2 · The empirical rank–quality curve. For most fine-tuning tasks, anything past r = 8 buys you almost nothing — until your task is closer to retraining than adapting.

Shipping a LoRA-trained model: you don’t have to merge the adapter into the base. Many serving stacks support running the base and adapter separately, swapping adapters per request — useful for multi-tenant fine-tunes. If you do merge (W + B·A), the result is a single dense checkpoint that, from the user’s point of view, looks exactly like a conventionally fine-tuned model.
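Continuing the LoRALinear sketch from § 02, the merge is one in-place add (again a sketch, not a particular library’s API):

    import torch
    from torch import nn

    @torch.no_grad()
    def merge_lora(layer) -> nn.Linear:
        # Fold the adapter into the frozen weight: W' = W + (α/r)·B·A.
        # The result is a plain nn.Linear; serving needs no LoRA support at all.
        layer.base.weight += layer.scale * (layer.B @ layer.A)
        return layer.base

Unmerging is the same subtraction in reverse, which is part of why per-request adapter swapping is cheap.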

CHECK · You want to fine-tune a 13B model on a 24 GB consumer GPU. Which configuration is most likely to fit and train successfully? (Work it out from the lab numbers above: a frozen 4-bit 13B base is ~6.5 GB, leaving headroom for the adapter, optimizer state, and activations, while even the frozen bf16 base that plain LoRA needs is ~26 GB, already over budget. qLoRA is the one that fits.)

§ 05 · TAKING THIS FORWARD
Where this goes

LoRA is the foundation. Two threads run downstream of it: squeezing the frozen base harder (qLoRA and 4-bit quantization) and restructuring the update itself (DoRA and its relatives). The Going Deeper section below picks up both.

The headline: pretraining stays expensive and centralized. Fine-tuning is now something a serious individual can do on their own hardware. The last few years of practical, useful, niche models — the medical chatbots, the legal RAGs, the code reviewers — exist because the cost of adapting a great base model went from “university grant” to “a weekend.”

§ · GOING DEEPER
What rank to pick and where to apply LoRA

Three knobs determine LoRA quality, and they interact:

  • Rank sets the capacity of the adapter — 8 is a common default, but tasks that require new knowledge (rather than new style) benefit from 16–64.
  • Scale α (the multiplier on the B·A product) controls how loud the adapter is at inference; the standard recipe sets α/r ≈ 2.
  • Target modules — which layers actually get adapters — usually matter more than rank. Hu et al. (2021) showed Q and V projections in attention are enough for many tasks; modern recipes also adapt K, O, and the MLP projections.
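In code, the three knobs map directly onto a config object — here sketched with the Hugging Face peft library (module names like "q_proj" are model-specific; these are Llama-style names and an assumption here):

    from peft import LoraConfig, get_peft_model

    config = LoraConfig(
        r=8,                                  # rank: adapter capacity
        lora_alpha=16,                        # α/r = 2, the standard recipe
        target_modules=["q_proj", "v_proj"],  # Hu et al.'s minimal set
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base_model, config)  # base_model: any loaded HF transformer
    model.print_trainable_parameters()          # reports trainable vs total params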

qLoRA (Dettmers et al. 2023) was the trick that made fine-tuning a 65B model on a single 48 GB GPU practical: keep the base model in 4-bit NF4 quantization, train the LoRA adapter in higher precision, use paged optimizers for memory spikes. The cost is some per-step throughput. The benefit is fitting at all. DoRA (2024) decomposes the weight update into magnitude and direction components and gets meaningfully better quality at the same rank.
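The qLoRA recipe as loading code, sketched with transformers + bitsandbytes (the model id is a placeholder; flag names assume recent versions of both libraries):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # the NF4 data type from the paper
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls dequantize into bf16
        bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    )
    model = AutoModelForCausalLM.from_pretrained(
        "your-base-model",                      # placeholder id
        quantization_config=bnb,
    )
    # LoRA adapters (see the peft sketch above) then train on top in bf16.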

§ · FURTHER READING
References & deeper sources

  1. Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen (2021). LoRA: Low-Rank Adaptation of Large Language Models · ICLR
  2. Dettmers, Pagnoni, Holtzman, Zettlemoyer (2023). QLoRA: Efficient Finetuning of Quantized LLMs · NeurIPS
  3. Liu et al. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation · ICML
  4. Aghajanyan, Zettlemoyer, Gupta (2020). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning · ACL
  5. Kalajdzievski (2023). A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA (rsLoRA) · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.