
Fine-Tuning LLMs with LoRA and QLoRA in Python — A Complete Guide

Written by Selva Prabhakaran | 27 min read

GPT-4 doesn’t know your company’s internal vocabulary. Llama 3 can’t answer questions about your proprietary dataset. Full retraining would take tens of thousands of GPU-hours and cost more than most teams can spend.

LoRA and QLoRA close that gap. Fine-tune a 7B model on a single consumer GPU in a few hours — and outperform the base model on your specific task. I’ve seen this pattern work reliably across dozens of domain adaptation projects: the key is knowing which technique to use, and why.


Before we write a single line of code, here’s how the pieces fit together.

You start with a massive pre-trained model — billions of frozen weights that already understand language. The model’s knowledge isn’t the problem. It’s that the model has never seen your domain.

Full fine-tuning unfreezes every weight and retrains them all. For a 7B model, that means roughly 98 GB of GPU memory for weights, gradients, and optimizer states. Most teams can’t afford that hardware.

LoRA takes a different approach. It asks: what is the smallest possible change that teaches the model your task? It injects two tiny matrices next to each frozen weight layer. During training, only those tiny matrices update. The base model never changes.

QLoRA goes one step further. It compresses the frozen base model to 4-bit precision — cutting its memory footprint by 75%. The LoRA adapters still train in 16-bit precision, just as before. The result: a 7B model fits under 6 GB of GPU memory.

By the end of this guide, you’ll understand the math behind both techniques, configure them from scratch, train a model end to end, and know when to choose LoRA over QLoRA.


Prerequisites

  • Python version: 3.9+
  • Required libraries: torch (2.0+), transformers (4.40+), peft (0.10+), trl (0.9+ — SFTConfig requires it), bitsandbytes (0.43+), datasets (2.18+), accelerate (0.28+)
  • Install:
bash
pip install torch transformers peft trl bitsandbytes datasets accelerate
  • Hardware: LoRA — GPU with 12+ GB VRAM. QLoRA — 6+ GB VRAM (free Colab T4 works).
  • Background: Python, basic PyTorch tensors, familiarity with transformer models.
  • Time to complete: 60 minutes

python
# All imports for this tutorial — run this cell first
import os
import torch
import numpy as np

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    PeftModel,
)
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
python
PyTorch version: 2.2.0
CUDA available: True
GPU: NVIDIA GeForce RTX 4090
GPU memory: 24.0 GB

Why Full Fine-Tuning Is Impractical (And What LoRA Does About It)

Here’s a useful way to think about this. Imagine a world-class general physician who knows everything about medicine. You need them to specialise in rare pediatric diseases found only in your region. Two options:

  1. Put them through another full medical school — 8 years, enormous cost, risk of forgetting general knowledge.
  2. Give them a 3-month intensive rotation in your pediatric ward — targeted, cheap, and they keep their existing knowledge.

Full fine-tuning is option 1. LoRA is option 2.

The memory problem is concrete. Let’s calculate it for a 7-billion-parameter model — the most common size for fine-tuning experiments today.

python
# Calculate GPU memory needed for full fine-tuning vs LoRA
# Pure arithmetic — no GPU needed to run this cell

model_params = 7_000_000_000  # 7 billion parameters

bytes_per_param_bf16 = 2  # 16-bit = 2 bytes
bytes_per_param_fp32 = 4  # 32-bit = 4 bytes

# Full fine-tuning: weights + gradients + Adam optimizer states
# Adam stores momentum + variance = 2 extra fp32 values per parameter
weights_gb = model_params * bytes_per_param_bf16 / 1e9
gradients_gb = model_params * bytes_per_param_fp32 / 1e9
optimizer_gb = model_params * bytes_per_param_fp32 * 2 / 1e9
total_full_ft_gb = weights_gb + gradients_gb + optimizer_gb

print("=== Memory for Full Fine-Tuning a 7B Model ===")
print(f"  Weights (bf16):         {weights_gb:.1f} GB")
print(f"  Gradients (fp32):       {gradients_gb:.1f} GB")
print(f"  Adam optimizer states:  {optimizer_gb:.1f} GB")
print(f"  TOTAL:                  {total_full_ft_gb:.1f} GB")

# LoRA fine-tuning: only adapter matrices update
# 32 transformer layers × 4 attention matrices = 128 adapter pairs
lora_rank = 16
num_adapter_pairs = 32 * 4
hidden_dim = 4096

lora_params = num_adapter_pairs * 2 * (hidden_dim * lora_rank)

lora_weights_gb = lora_params * bytes_per_param_bf16 / 1e9
lora_grads_gb = lora_params * bytes_per_param_fp32 / 1e9
lora_optim_gb = lora_params * bytes_per_param_fp32 * 2 / 1e9
base_model_gb = model_params * bytes_per_param_bf16 / 1e9
total_lora_gb = base_model_gb + lora_weights_gb + lora_grads_gb + lora_optim_gb

print(f"\n=== Memory for LoRA Fine-Tuning (rank={lora_rank}, attention only) ===")
print(f"  Frozen base model (bf16):  {base_model_gb:.1f} GB")
print(f"  LoRA trainable params:     {lora_params:,}")
print(f"  LoRA weights + grad + opt: {(lora_weights_gb + lora_grads_gb + lora_optim_gb):.2f} GB")
print(f"  TOTAL:                     {total_lora_gb:.1f} GB")
print(f"  Reduction vs full FT:      {total_full_ft_gb / total_lora_gb:.1f}x less memory")
python
=== Memory for Full Fine-Tuning a 7B Model ===
  Weights (bf16):         14.0 GB
  Gradients (fp32):       28.0 GB
  Adam optimizer states:  56.0 GB
  TOTAL:                  98.0 GB

=== Memory for LoRA Fine-Tuning (rank=16, attention only) ===
  Frozen base model (bf16):  14.0 GB
  LoRA trainable params:     16,777,216
  LoRA weights + grad + opt: 0.24 GB
  TOTAL:                     14.2 GB
  Reduction vs full FT:      6.9x less memory

That 6.9x reduction is from LoRA alone — and we’re only targeting attention layers. Include the MLP layers and the adapter overhead is still well under 1 GB.

With QLoRA’s 4-bit compression of the base model, total memory drops to under 6 GB for a 7B model.

LoRA does NOT compress the base model — it keeps training memory low by having far fewer trainable parameters. The base model still occupies its full memory. QLoRA is what actually compresses the base model. They solve different parts of the memory problem.

Quick check: If you apply LoRA to a 7B model with rank=8 instead of rank=16, would the base model memory change? (Answer: No. The frozen base model is unchanged regardless of rank. Only the tiny adapter matrices change in size.)

How LoRA Works — The Math in Plain Terms

In a transformer, the weight matrices that matter most are the attention projections — Query (Q), Key (K), Value (V), and Output (O). For a 7B model, each is 4096 × 4096, which is about 16.8 million values. Full fine-tuning adjusts all of them.

LoRA’s insight: you don’t need to change all 16.8 million values. You only need to change a low-rank approximation of the update.

When fine-tuning shifts a weight matrix $W$ by some change $\Delta W$, LoRA approximates that change as:

$$\Delta W = B \cdot A$$

Where:
– $W$ is the original frozen weight matrix of shape $(d_{out} \times d_{in})$ — e.g., (4096 × 4096)
– $A$ is a new trainable matrix of shape $(r \times d_{in})$ — e.g., (16 × 4096)
– $B$ is a new trainable matrix of shape $(d_{out} \times r)$ — e.g., (4096 × 16)
– $r$ is the rank — a small number like 4, 8, 16, or 32

Instead of training 16.8M parameters, you train $r \times d_{in} + d_{out} \times r$ = 131,072 parameters. That’s a 128x reduction for this single layer alone.

Not a math person? Skip ahead — the practical setup is all you need.

The simulation below shows this concretely. Watch the parameter count reduction and understand why B must start at zero.

python
# Simulate LoRA's low-rank decomposition with numpy
np.random.seed(42)

d_out, d_in = 512, 512

# What full fine-tuning would produce as a weight update
delta_W_full = np.random.randn(d_out, d_in).astype(np.float32) * 0.01

# LoRA: approximate delta_W with two small matrices A and B
# B initialized to ZERO — model starts identical to the pre-trained base
rank = 16
A = np.random.randn(rank, d_in).astype(np.float32) * 0.01
B = np.zeros((d_out, rank), dtype=np.float32)

# Simulate trained adapter using SVD — keep only top 'rank' singular vectors
# U and Vt capture the most important directions; S contains their magnitudes
U, S, Vt = np.linalg.svd(delta_W_full, full_matrices=False)
A_trained = np.diag(np.sqrt(S[:rank])) @ Vt[:rank, :]
B_trained = U[:, :rank] @ np.diag(np.sqrt(S[:rank]))
delta_W_lora = B_trained @ A_trained

# How closely does the rank-16 product approximate the full update?
rel_error = np.linalg.norm(delta_W_full - delta_W_lora) / np.linalg.norm(delta_W_full)

print(f"Original weight matrix: {d_out} × {d_in} = {d_out * d_in:,} values")
print(f"\nLoRA decomposition (rank={rank}):")
print(f"  Matrix A: {A_trained.shape}  →  {A_trained.size:,} params")
print(f"  Matrix B: {B_trained.shape}  →  {B_trained.size:,} params")
print(f"  Total LoRA params: {A_trained.size + B_trained.size:,}")
print(f"  vs full update:    {delta_W_full.size:,}")
print(f"  Parameter reduction: {delta_W_full.size / (A_trained.size + B_trained.size):.0f}x")
print(f"  Relative reconstruction error: {rel_error:.1%}")

This part is genuinely surprising at first: the reconstruction error looks high. LoRA doesn’t perfectly replicate a full update — it approximates the most important directions of change. The original LoRA paper [1] showed that rank 4–16 matches or exceeds full fine-tuning quality on most language tasks, despite this approximation. The parameter count is what you’re really here for: 16,384 trainable values instead of 262,144 for a 512×512 layer.

B is always initialized to zero. At training start, $\Delta W = B \cdot A = 0$. The model behaves identically to the pre-trained base. Training begins from a stable, known starting point. Matrix A is initialized with small random values so gradients flow from the first step.
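You can verify the zero-init property in a few lines of numpy. This is a toy sketch with made-up dimensions, not a real model — the point is only that the adapter path contributes nothing at step 0:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8

W = rng.standard_normal((d, d)).astype(np.float32)            # frozen pre-trained weight
A = (rng.standard_normal((r, d)) * 0.01).astype(np.float32)   # small random init
B = np.zeros((d, r), dtype=np.float32)                        # zero init

x = rng.standard_normal((1, d)).astype(np.float32)
y_base = x @ W.T
y_lora = x @ (W + B @ A).T   # B @ A is exactly zero at step 0

print("Max difference at init:", float(np.abs(y_base - y_lora).max()))  # 0.0
```

Because the adapter’s contribution is exactly zero, the first training step computes gradients against the unmodified base model’s outputs.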

LoRA Hyperparameters — What Each One Does

Four parameters control your LoRA setup. Getting these right matters more than most people realise.

r (rank) — the size of the adapter matrices. Higher rank = more parameters = more expressive adaptation. Common values: 4, 8, 16, 32. I almost always start with r=16 — it’s the right default for 90% of tasks, and there’s rarely a reason to go above 32.

lora_alpha — a scaling factor applied during the forward pass. The model computes $W + (\alpha / r) \cdot B \cdot A$. The ratio $\alpha / r$ controls how much the adapter influences the output. Use alpha = 2 × r as your starting point. If r=16, set alpha=32. This rule comes from hundreds of experiments at Lightning AI [3] and matches the original paper’s recommendation [2].

lora_dropout — dropout probability on LoRA layers. Prevents overfitting on small datasets. Use 0.05–0.1 for datasets under 10,000 samples. Set to 0 for large datasets (50,000+).

target_modules — which weight matrices get LoRA adapters. Targeting all linear layers (Q, K, V, O, plus MLP gates) consistently outperforms targeting only Q and V [4]. The extra memory cost is tiny; the quality gain is real. I stopped targeting only Q+V after seeing consistent improvements from full-layer coverage.

Sound familiar? Most tutorials still recommend Q+V only. That was the original paper’s approach. The experimental evidence since then points clearly to full-layer targeting.
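Before configuring a real model, the $\alpha / r$ scaling is easy to sanity-check in numpy. A minimal sketch of the LoRA forward pass follows (toy sizes; the variable names are mine, not PEFT’s):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 64, 16, 32
scaling = alpha / r   # alpha / r = 2.0; PEFT multiplies the adapter output by this

W = rng.standard_normal((d, d)).astype(np.float32)
A = (rng.standard_normal((r, d)) * 0.01).astype(np.float32)
B = (rng.standard_normal((d, r)) * 0.01).astype(np.float32)
x = rng.standard_normal((1, d)).astype(np.float32)

# LoRA forward pass: frozen base path plus scaled adapter path
y = x @ W.T + scaling * (x @ A.T) @ B.T

# Equivalent merged form: W* = W + (alpha / r) * B @ A
y_merged = x @ (W + scaling * (B @ A)).T

print("Paths agree:", bool(np.allclose(y, y_merged, atol=1e-5)))  # True
```

The two paths agreeing is exactly what makes adapter merging possible later: the scaled product can be folded into the base weight with no change in outputs.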

python
# How LoRA rank and target choice affect trainable parameter count
hidden_dim = 4096
intermediate_dim = 11008
num_layers = 32

def count_lora_params(rank, target="all"):
    attn_matrices = 4
    attn_params_per_layer = attn_matrices * 2 * (hidden_dim * rank)
    # Llama-style MLP has three linear layers: gate_proj and up_proj map
    # hidden -> intermediate, down_proj maps intermediate -> hidden
    mlp_params_per_layer = 3 * (hidden_dim * rank + rank * intermediate_dim)
    if target == "qv_only":
        params_per_layer = 2 * 2 * (hidden_dim * rank)
    elif target == "attention":
        params_per_layer = attn_params_per_layer
    else:
        params_per_layer = attn_params_per_layer + mlp_params_per_layer
    return params_per_layer * num_layers

print(f"{'Rank':>6} | {'QV only':>12} | {'All attn':>12} | {'All linear':>12}")
print("-" * 55)
for rank in [4, 8, 16, 32, 64]:
    qv = count_lora_params(rank, "qv_only")
    attn = count_lora_params(rank, "attention")
    all_lin = count_lora_params(rank, "all")
    print(f"{rank:>6} | {qv/1e6:>10.2f}M | {attn/1e6:>10.2f}M | {all_lin/1e6:>10.2f}M")

total_rank16_all = count_lora_params(16, 'all')
print(f"\n7B model total params: ~7,000M")
print(f"LoRA rank=16, all linear: {total_rank16_all/1e6:.1f}M trainable")
print(f"That's {total_rank16_all / 7e9 * 100:.2f}% of total parameters")
python
  Rank |      QV only |     All attn |   All linear
-------------------------------------------------------
     4 |       2.10M |       4.19M |       9.99M
     8 |       4.19M |       8.39M |      19.99M
    16 |       8.39M |      16.78M |      39.98M
    32 |      16.78M |      33.55M |      79.95M
    64 |      33.55M |      67.11M |     159.91M

7B model total params: ~7,000M
LoRA rank=16, all linear: 40.0M trainable
That's 0.57% of total parameters

[TRY IT YOURSELF] Exercise 1: Calculate LoRA Parameters for a 13B Model

The Llama 2 13B model has: hidden_dim = 5120, intermediate_dim = 13824, num_layers = 40.

Your task: Modify the count_lora_params function for the 13B model. Calculate trainable parameters with rank=16 targeting all linear layers. Calculate the percentage of 13 billion total parameters.

Hint 1: Copy the function and update the three dimension constants at the top.
Hint 2: Expected result is under 1% of total parameters.

python
# Exercise 1: Count LoRA params for Llama 2 13B
hidden_dim_13b = ___
intermediate_dim_13b = ___
num_layers_13b = ___
rank = 16

def count_lora_params_13b(rank):
    attn_params = 4 * 2 * (hidden_dim_13b * rank)
    # gate_proj, up_proj, down_proj: three MLP matrices per layer
    mlp_params = 3 * (hidden_dim_13b * rank + rank * intermediate_dim_13b)
    return (attn_params + mlp_params) * num_layers_13b

total_lora = count_lora_params_13b(rank)
total_model = 13_000_000_000
print(f"Trainable LoRA params: {total_lora / 1e6:.1f}M")
print(f"LoRA percentage: {total_lora / total_model * 100:.2f}%")

# Solution: hidden_dim_13b=5120, intermediate_dim_13b=13824, num_layers_13b=40
# Expected: ~62.6M trainable, 0.48% of 13B

Fine-Tuning with LoRA — Step by Step

Theory is useful. Working code is better. Let’s fine-tune a real model.

We’ll use facebook/opt-125m — a small 125M-parameter model — on an instruction-following dataset. OPT-125M isn’t state-of-the-art, but it’s small enough to run on a free Colab GPU. The code you write here carries over to Llama 3.1 8B almost unchanged: you swap the model name and adjust target_modules to Llama’s layer names.

The pipeline: load model → format your dataset → configure LoRA → train → save adapter.

Step 1 — Load the Base Model and Tokenizer

We load in bfloat16 precision — half the memory of float32, with better numerical stability than float16. The device_map="auto" argument places the model on GPU automatically, splitting across multiple GPUs if needed.

python
model_name = "facebook/opt-125m"  # swap with "meta-llama/Llama-3.1-8B" for production

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # OPT has no separate pad token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

total_params = sum(p.numel() for p in model.parameters())
print(f"Model: {model_name}")
print(f"Total parameters: {total_params / 1e6:.1f}M")
print(f"Trainable (before LoRA): {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.1f}M")
print(f"Model dtype: {next(model.parameters()).dtype}")
python
Model: facebook/opt-125m
Total parameters: 125.2M
Trainable (before LoRA): 125.2M
Model dtype: torch.bfloat16

Step 2 — Prepare Your Dataset

For instruction fine-tuning, each training example needs a prompt-response pair formatted as a single text string. By default, SFTTrainer computes the loss over the entire sequence, prompt included; if you want the model trained only on responses, TRL’s DataCollatorForCompletionOnlyLM masks the prompt tokens for you.

The simplest format for your own data: a CSV with a text column where each row contains the full conversation. The code below shows how to convert a raw Q&A list into this format, then load a pre-formatted example dataset.

python
# Converting your own raw data to the SFTTrainer format
# Each text string combines prompt + response in one field
raw_qa_pairs = [
    {"question": "What is LoRA?", "answer": "LoRA is a parameter-efficient fine-tuning method..."},
    {"question": "How does QLoRA work?", "answer": "QLoRA combines 4-bit quantization with LoRA..."},
]

def format_instruction(sample):
    return f"### Human: {sample['question']}\n### Assistant: {sample['answer']}"

# Show what the formatted text looks like
print("Custom dataset format:")
print(format_instruction(raw_qa_pairs[0])[:200])

# For this tutorial, use a pre-formatted public dataset
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
print(f"\nDataset size: {len(dataset)} examples")
print(f"\nPre-formatted example (first 250 chars):")
print(dataset[0]["text"][:250])
python
Custom dataset format:
### Human: What is LoRA?
### Assistant: LoRA is a parameter-efficient fine-tuning method...

Dataset size: 9846 examples

Pre-formatted example (first 250 chars):
### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics?
### Assistant: "Monopsony" refers to a market structure where there is only one buyer...

Step 3 — Configure LoRA with PEFT

This is the heart of the setup. LoraConfig tells PEFT exactly how to inject adapter matrices. get_peft_model() wraps the model in a PeftModel — freezing all original weights and attaching LoRA adapters.

Watch the parameter counts change after applying LoRA. That shift from 125M trainable to about 1.2M is the efficiency gain in action.

python
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                 # alpha/r = 2.0 — standard rule
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # OPT names its output projection out_proj; Llama uses o_proj
)

model = get_peft_model(model, lora_config)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())

print("After applying LoRA:")
print(f"  Trainable:  {trainable:,}  ({trainable / total * 100:.2f}%)")
print(f"  Frozen:     {total - trainable:,}")
model.print_trainable_parameters()
python
After applying LoRA:
  Trainable:  1,179,648  (0.93%)
  Frozen:     125,239,296

trainable params: 1,179,648 || all params: 126,418,944 || trainable%: 0.9331

Step 4 — Train with SFTTrainer

SFTTrainer handles the instruction-tuning workflow: tokenization, text-field formatting, optional sequence packing, and clean PEFT integration.

Key training arguments: num_train_epochs (1–3 is standard; more risks overfitting), gradient_accumulation_steps (simulates larger batches: effective batch = batch_size × this), learning_rate (2e-4 is the reliable starting point for LoRA fine-tuning).

python
training_args = SFTConfig(
    output_dir="./opt-125m-lora",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    save_steps=100,
    logging_steps=10,
    bf16=True,
    max_seq_length=512,
    dataset_text_field="text",
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# Expected: ~15 minutes on a T4 GPU for 1 epoch
trainer.train()
print("Training complete!")

Step 5 — Save and Evaluate

LoRA saves only the adapter weights — not the full model. For OPT-125M with rank=16, the adapter folder is about 3 MB versus 250 MB for the full model.

A quick sanity check after training: compare the model’s response on a test prompt before and after fine-tuning. If the output meaningfully improves on your task, training worked.

python
# Save only the adapter (tiny)
adapter_save_path = "./opt-125m-lora-adapter"
model.save_pretrained(adapter_save_path)
tokenizer.save_pretrained(adapter_save_path)

adapter_size_mb = sum(
    os.path.getsize(os.path.join(root, f))
    for root, dirs, files in os.walk(adapter_save_path)
    for f in files
) / 1e6
print(f"Adapter saved: {adapter_size_mb:.1f} MB  (vs ~250 MB for the full model)")
python
Adapter saved: 3.1 MB  (vs ~250 MB for the full model)
python
# Load adapter and run a quick evaluation
base = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
fine_tuned = PeftModel.from_pretrained(base, adapter_save_path)
fine_tuned.eval()

test_prompt = "### Human: Explain the concept of gradient descent.\n### Assistant:"
inputs = tokenizer(test_prompt, return_tensors="pt").to(fine_tuned.device)

with torch.no_grad():
    output_ids = fine_tuned.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print("Response:", generated)

Quick check: Why do we decode only output_ids[0][inputs["input_ids"].shape[1]:] instead of the full output? Because generate() includes the original prompt in its output — we slice it off to show only the newly generated tokens.

QLoRA — Fine-Tuning Billion-Parameter Models on Consumer Hardware

LoRA reduces training memory dramatically — but the base model itself still takes 14 GB for a 7B model in bfloat16. That’s too large for most consumer GPUs.

QLoRA compresses the frozen base model to 4-bit precision — cutting it from 14 GB to roughly 3.5 GB. The LoRA adapters remain in 16-bit precision. You get 4-bit storage with 16-bit training quality.

Three techniques make this work without sacrificing model quality:

1. 4-bit NormalFloat (NF4)
Standard 4-bit integer quantization spreads levels evenly across the full number range. But neural network weights cluster around zero, following a normal distribution. NF4 allocates more quantization levels near zero (where most weights live) and fewer at the extremes. The same 4 bits represent the distribution far more accurately.
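A toy experiment shows why distribution-aware levels win. This is not the real NF4 codebook (NF4 uses a fixed set of 16 normal-quantile values); it just compares evenly spaced levels against quantile-placed levels on bell-shaped weights:

```python
import numpy as np

rng = np.random.default_rng(7)
w = rng.standard_normal(100_000).astype(np.float32) * 0.02   # weights cluster near zero

def quantize(values, levels):
    # map each value to its nearest codebook level
    idx = np.abs(values[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

# (a) uniform 4-bit: 16 evenly spaced levels across the full range
uniform_levels = np.linspace(w.min(), w.max(), 16).astype(np.float32)

# (b) quantile-placed levels: more codebook points where weights are dense,
#     the same idea NF4 bakes in with a fixed normal-distribution codebook
quantile_levels = np.quantile(w, (np.arange(16) + 0.5) / 16).astype(np.float32)

err_uniform = float(np.sqrt(np.mean((w - quantize(w, uniform_levels)) ** 2)))
err_quantile = float(np.sqrt(np.mean((w - quantize(w, quantile_levels)) ** 2)))

print(f"RMS error, uniform levels:  {err_uniform:.6f}")
print(f"RMS error, quantile levels: {err_quantile:.6f}")
```

The quantile codebook lands noticeably closer to the original weights with the same 4 bits per value.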

2. Double Quantization
Quantization requires “calibration constants” that map quantized weights back to real values. These constants take memory too. Double quantization quantizes those constants as well, saving ~0.37 bits per parameter across the full model.
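The ~0.37 bits figure falls out of simple arithmetic, assuming the block sizes reported in the QLoRA paper (64 weights per first-level constant, 256 first-level constants per second-level constant):

```python
# Memory cost of the quantization constants, per model parameter
block_size = 64        # weights per first-level fp32 absmax constant
second_block = 256     # first-level constants per second-level fp32 constant

# Without double quantization: one 32-bit constant per 64 weights
bits_naive = 32 / block_size                               # 0.5 bits/param

# With double quantization: constants stored in 8-bit, plus one fp32
# constant per group of 256 first-level constants
bits_double = 8 / block_size + 32 / (block_size * second_block)

print(f"Constants, naive:  {bits_naive:.3f} bits/param")
print(f"Constants, double: {bits_double:.3f} bits/param")
print(f"Saving:            {bits_naive - bits_double:.3f} bits/param")  # ~0.373
```

Half a bit per parameter sounds small, but across 7 billion parameters it is roughly 0.3 GB of pure bookkeeping recovered.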

3. Paged Optimizers
When GPU memory spikes unexpectedly (long sequences, variable batch lengths), NVIDIA’s unified memory pages optimizer states to CPU RAM. This prevents out-of-memory crashes at the worst moment — like virtual memory for your GPU.

Together, a 7B model fits under 6 GB. A 65B model fits on a single 48 GB GPU.

QLoRA = 4-bit base model + 16-bit LoRA adapters. Frozen weights are stored in 4-bit, dequantized to bfloat16 for each forward pass computation, then discarded. Adapter gradients flow entirely in bfloat16. You get 4-bit memory with bfloat16 training stability.
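Here is a stylized sketch of block-wise absmax storage, the idea underlying that scheme. The codes below are plain symmetric 4-bit integers rather than actual NF4 values, and the byte packing is only counted, not performed:

```python
import numpy as np

rng = np.random.default_rng(5)
w = rng.standard_normal(1024).astype(np.float32) * 0.02
block_size = 64

# Quantize: per block, store one fp32 absmax scale plus 4-bit codes
blocks = w.reshape(-1, block_size)
scales = np.abs(blocks).max(axis=1, keepdims=True)       # one constant per 64 weights
codes = np.round(blocks / scales * 7).astype(np.int8)    # symmetric 4-bit range [-7, 7]

# Dequantize on the fly for a forward pass, then discard
dequant = (codes.astype(np.float32) / 7 * scales).reshape(-1)

rel_err = float(np.abs(w - dequant).max() / np.abs(w).max())
print(f"Stored: {codes.size // 2} bytes of packed codes + {scales.size * 4} bytes of scales")
print(f"Original fp32: {w.size * 4} bytes")
print(f"Max relative reconstruction error: {rel_err:.3f}")
```

Roughly 7x less storage for this tensor, and the frozen weights only ever exist in 16-bit form transiently, one block at a time.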

The Only Difference from LoRA: How You Load the Model

The QLoRA setup is almost identical to the LoRA setup. The only change is adding a BitsAndBytesConfig when loading the model — it compresses the model to 4-bit on the fly.

Two extra lines are required after loading a quantized model. use_cache = False disables the KV cache, which conflicts with gradient checkpointing. enable_input_require_grads() ensures gradients can flow through the quantized model into the LoRA adapters. Forgetting either one causes confusing errors during training.

python
# QLoRA Step 1: Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                 # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,     # dequantize to bfloat16 for compute
    bnb_4bit_use_double_quant=True,            # double quantization: saves ~0.37 bits/param
)

# QLoRA Step 2: Load with quantization — only line that differs from LoRA
model_qlora = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Always set these two lines after loading a quantized model
model_qlora.config.use_cache = False        # incompatible with gradient checkpointing
model_qlora.enable_input_require_grads()    # lets gradients reach LoRA adapters

print("Model loaded in 4-bit precision")
print(f"Model weight dtype: {model_qlora.model.decoder.layers[0].self_attn.q_proj.weight.dtype}")  # inspect a quantized linear layer; embeddings stay in higher precision
python
Model loaded in 4-bit precision
Model weight dtype: torch.uint8
python
# QLoRA Step 3: Apply LoRA — identical config to standard LoRA
qlora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # OPT naming; Llama uses o_proj
)
model_qlora = get_peft_model(model_qlora, qlora_config)
model_qlora.print_trainable_parameters()

# QLoRA Step 4: Training — same as LoRA, plus paged_adamw_32bit
qlora_args = SFTConfig(
    output_dir="./opt-125m-qlora",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    max_seq_length=512,
    dataset_text_field="text",
    optim="paged_adamw_32bit",   # paged optimizer — prevents OOM during training spikes
    report_to="none",
)

qlora_trainer = SFTTrainer(
    model=model_qlora,
    args=qlora_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
qlora_trainer.train()
print("QLoRA training complete!")
python
trainable params: 1,179,648 || all params: 83,951,616 || trainable%: 1.4052

Speed up training with Flash Attention 2. If your GPU supports it (Ampere/Hopper or newer), add attn_implementation="flash_attention_2" to your from_pretrained() call. This uses a memory-efficient attention algorithm that can reduce training time by 2–4x for long sequences with no loss in quality.

python
model = AutoModelForCausalLM.from_pretrained(
model_name,
attn_implementation="flash_attention_2", # requires: pip install flash-attn
torch_dtype=torch.bfloat16,
device_map="auto",
)

[TRY IT YOURSELF] Exercise 2: Configure QLoRA for Llama 3.1 8B

You want to fine-tune meta-llama/Llama-3.1-8B-Instruct on a customer support dataset. Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.

Your task: Create a BitsAndBytesConfig for 4-bit NF4 with double quantization, and a LoraConfig with r=64, targeting all 7 modules.

Hint 1: bnb_4bit_quant_type="nf4" and bnb_4bit_use_double_quant=True.
Hint 2: If r=64, then lora_alpha=128 (the 2×r rule).

python
bnb_config_llama = BitsAndBytesConfig(
    load_in_4bit=___,
    bnb_4bit_quant_type=___,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=___,
)

qlora_config_llama = LoraConfig(
    r=___,
    lora_alpha=___,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=___,
)

print(f"r: {qlora_config_llama.r}, alpha: {qlora_config_llama.lora_alpha}")
print(f"target_modules: {sorted(qlora_config_llama.target_modules)}")

# Solution:
# bnb_config_llama = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True)
# qlora_config_llama = LoraConfig(r=64, lora_alpha=128, ...,
#     target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"])

Merging LoRA Adapters into the Base Model

For production deployment, you can bake the adapter into the base model permanently. merge_and_unload() computes $W^* = W + (\alpha / r) \cdot B \cdot A$ for every adapted layer and stores the result. The output is a standard HuggingFace model — no PEFT wrapper, no runtime overhead.

One limitation to know: you cannot merge directly onto a QLoRA model. The 4-bit base model’s weights aren’t precise enough for the merge arithmetic. Reload the base model in bfloat16 first, then merge.

python
# Reload base model in full precision
base_for_merge = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Stack the saved adapter
peft_model = PeftModel.from_pretrained(base_for_merge, adapter_save_path)

# Bake adapter weights into base model (W* = W + alpha/r * B @ A)
merged_model = peft_model.merge_and_unload()

print(f"Merged model type: {type(merged_model).__name__}")
print(f"Is PEFT model: {hasattr(merged_model, 'peft_config')}")

# Save like any standard HuggingFace model
merged_model.save_pretrained("./opt-125m-merged")
tokenizer.save_pretrained("./opt-125m-merged")
print("Merged model saved.")
python
Merged model type: OPTForCausalLM
Is PEFT model: False
Merged model saved.

LoRA vs QLoRA — When to Use Which

Both work. The right choice depends on your hardware.

Feature                      | LoRA               | QLoRA
---------------------------- | ------------------ | ------------------------------------
Base model precision         | bfloat16 (full)    | 4-bit NF4 (quantized)
VRAM for 7B model training   | ~18+ GB            | ~6 GB
Training speed               | Baseline           | ~30% slower (dequant overhead) [2]
Final model quality          | Slightly higher    | Matches LoRA on most benchmarks [3]
Can merge adapter?           | Yes, directly      | Yes, after reloading in bf16
Best hardware fit            | ≥16 GB VRAM        | 6–16 GB VRAM

Decision guide:
– Free Colab T4 (15 GB) + 7B model → QLoRA
– RTX 3080 10 GB + 7B model → QLoRA
– RTX 4090 24 GB + 7B model → LoRA
– Any setup + 13B model → QLoRA

Avoid double-quantizing. If you fine-tune with QLoRA, merge, and then requantize the merged model to 4-bit for deployment, you compound quantization errors from two rounds of lossy compression. Instead: fine-tune with LoRA in bfloat16, merge, then quantize once using a dedicated tool like llama.cpp/GGUF format.
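A stylized numpy experiment illustrates the compounding. It uses uniform 4-bit round-trips on synthetic weights — not real GGUF quantization — with a synthetic fine-tuning delta standing in for the adapter update:

```python
import numpy as np

rng = np.random.default_rng(3)
w_base = rng.standard_normal(50_000).astype(np.float32) * 0.02   # base weights
delta = rng.standard_normal(50_000).astype(np.float32) * 0.005   # fine-tuning update

def quantize_roundtrip(x, n_levels=16):
    # uniform 4-bit round-trip: quantize over the tensor's range, dequantize
    lo, hi = float(x.min()), float(x.max())
    step = (hi - lo) / (n_levels - 1)
    return (lo + np.round((x - lo) / step) * step).astype(np.float32)

w_ideal = w_base + delta   # full-precision fine-tuned weights

# Path A: fine-tune in bf16, merge, quantize ONCE for deployment
once = quantize_roundtrip(w_ideal)

# Path B: base already lossy (QLoRA-style), merge, quantize AGAIN
twice = quantize_roundtrip(quantize_roundtrip(w_base) + delta)

err_once = float(np.sqrt(np.mean((once - w_ideal) ** 2)))
err_twice = float(np.sqrt(np.mean((twice - w_ideal) ** 2)))
print(f"RMS error, quantized once:  {err_once:.6f}")
print(f"RMS error, quantized twice: {err_twice:.6f}")
```

The exact numbers depend on the toy distributions, but the twice-quantized path consistently lands farther from the ideal fine-tuned weights — and small updates below the quantization step can be erased entirely.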

When NOT to Use LoRA or QLoRA

LoRA isn’t always the right tool. Three scenarios where you should reconsider:

1. Your dataset covers a completely new domain the base model has never seen.
LoRA adapts existing knowledge — it amplifies and redirects what the model already knows. If you’re teaching a model about a specialised domain with entirely unique terminology and concepts it has never encountered, a small-rank adapter may lack capacity. Consider full fine-tuning or increasing rank to 64–128.

2. You need bit-for-bit reproducible model weights.
Merging LoRA adapters introduces floating-point rounding that differs from a fully fine-tuned model. In practice this is imperceptible. But in regulated environments requiring exact weight checksums, this matters.

3. Your task requires substantially overwriting base model behavior.
LoRA preserves most of the base model’s behavior. If fine-tuning needs to fundamentally change how the model behaves — not refine it — full fine-tuning gives you more control. The practical test: if LoRA outputs show the base model’s style bleeding into your task, increase rank or switch to full fine-tuning.

Common Mistakes and How to Fix Them

Mistake 1: Setting lora_alpha too low (wrong results)

Wrong:

python
LoraConfig(r=16, lora_alpha=1, ...)  # alpha/r = 0.0625

Why it fails: The scaling factor $\alpha / r = 1/16 = 0.0625$. The adapter updates are so small they barely affect the model. Training loss decreases, but the fine-tuned model behaves nearly identically to the base model.

Correct:

python
LoraConfig(r=16, lora_alpha=32, ...)  # alpha/r = 2.0

Mistake 2: Forgetting use_cache = False for QLoRA (crashes training)

Wrong: Loading a quantized model without disabling the KV cache.

Why it crashes: Gradient checkpointing is incompatible with the KV cache. The error message is confusing and points nowhere near the root cause.

Correct:

python
model.config.use_cache = False
model.enable_input_require_grads()

Mistake 3: Targeting only Q and V projections (suboptimal quality)

Suboptimal:

python
LoraConfig(target_modules=["q_proj", "v_proj"], ...)

Experiments across hundreds of LoRA runs [3] showed that targeting all linear layers consistently beats partial targeting by 1–3 percentage points on downstream benchmarks.

Better:

python
LoraConfig(target_modules=["q_proj","k_proj","v_proj","o_proj",
                            "gate_proj","up_proj","down_proj"], ...)
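If you'd rather not hard-code a module list per architecture, you can enumerate the linear layers yourself. The toy `nn.ModuleDict` below is a stand-in for one transformer block — real Hugging Face models expose the same structure through `model.named_modules()`, so the same set comprehension works on them directly:

```python
import torch.nn as nn

# A toy stand-in for one transformer block; real Hugging Face models expose
# the same structure through model.named_modules().
block = nn.ModuleDict({
    "q_proj": nn.Linear(64, 64), "k_proj": nn.Linear(64, 64),
    "v_proj": nn.Linear(64, 64), "o_proj": nn.Linear(64, 64),
    "gate_proj": nn.Linear(64, 128), "up_proj": nn.Linear(64, 128),
    "down_proj": nn.Linear(128, 64),
})

# Leaf names of every nn.Linear are exactly the valid target_modules entries.
linear_names = sorted({name.split(".")[-1]
                       for name, mod in block.named_modules()
                       if isinstance(mod, nn.Linear)})
print(linear_names)
```

Run this against your actual model to confirm the names before writing the `LoraConfig` — architectures differ (BERT uses `query`/`value`, for example).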

Mistake 4: Training too many epochs on a small dataset (overfitting)

Wrong: 10 epochs on a 1,000-sample dataset.

Why it fails: The model memorises training examples exactly. For datasets under 10,000 samples, use 1–2 epochs maximum. Watch the evaluation loss — if it rises while training loss falls, stop immediately.
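One way to automate the "stop immediately" rule is early stopping on evaluation loss. This is a hedged sketch using transformers' `EarlyStoppingCallback`; it assumes you hold out an eval split yourself, and the `SFTConfig` field names follow transformers 4.41+ (older versions call the first one `evaluation_strategy`):

```python
from transformers import EarlyStoppingCallback

# Sketch: stop when eval loss fails to improve for 3 consecutive evals.
# Assumes these fields are added to SFTConfig alongside the ones shown above:
#   eval_strategy="steps", eval_steps=50, save_strategy="steps", save_steps=50,
#   load_best_model_at_end=True, metric_for_best_model="eval_loss",
#   greater_is_better=False
callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
# SFTTrainer(..., eval_dataset=eval_split, callbacks=callbacks)
```

With `load_best_model_at_end=True`, you keep the checkpoint from before overfitting began rather than the final one.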

[TRY IT YOURSELF] Exercise 3: Debug a Misconfigured QLoRA Setup

The code below has three bugs — one in BitsAndBytesConfig, one in LoraConfig, and one in model loading. Find and fix all three.

Hint 1: Check the compute dtype and the alpha/rank relationship.
Hint 2: The third bug is a missing line — look at what always follows quantized model loading.

python
# BUGGY CODE — identify the three bugs

bnb_config_buggy = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float32,  # Bug 1: wrong — use bfloat16
    bnb_4bit_use_double_quant=True,
)

lora_config_buggy = LoraConfig(
    r=32,
    lora_alpha=8,           # Bug 2: alpha/r = 0.25 — should be 64 (= 2 × r)
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "v_proj"],
)

# model = AutoModelForCausalLM.from_pretrained(...)
# Bug 3: missing model.config.use_cache = False and enable_input_require_grads()

# Solution:
# Bug 1: bnb_4bit_compute_dtype=torch.bfloat16
# Bug 2: lora_alpha=64  (= 2 × r=32)
# Bug 3: add model.config.use_cache = False and model.enable_input_require_grads()

Complete Code

Full script (copy-paste and run):
python
# Complete code: Fine-Tuning LLMs with LoRA and QLoRA in Python
# pip install torch transformers peft trl bitsandbytes datasets accelerate
# Python 3.9+ | GPU with 6+ GB VRAM required

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

MODEL_NAME = "facebook/opt-125m"
ADAPTER_PATH = "./opt-125m-lora-adapter"
MERGED_PATH = "./opt-125m-merged"

# ── LoRA fine-tuning ─────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))
model.print_trainable_parameters()

SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    args=SFTConfig(
        output_dir="./opt-125m-lora", num_train_epochs=1,
        per_device_train_batch_size=4, gradient_accumulation_steps=4,
        learning_rate=2e-4, bf16=True, max_seq_length=512,
        dataset_text_field="text", report_to="none",
    ),
).train()
model.save_pretrained(ADAPTER_PATH)
tokenizer.save_pretrained(ADAPTER_PATH)

# ── Merge adapter into base model ────────────────────────────────
base = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto")
PeftModel.from_pretrained(base, ADAPTER_PATH).merge_and_unload().save_pretrained(MERGED_PATH)
tokenizer.save_pretrained(MERGED_PATH)

# ── QLoRA fine-tuning ────────────────────────────────────────────
model_q = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True,
    ),
)
model_q.config.use_cache = False
model_q.enable_input_require_grads()
model_q = get_peft_model(model_q, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))
SFTTrainer(
    model=model_q, tokenizer=tokenizer, train_dataset=dataset,
    args=SFTConfig(
        output_dir="./opt-125m-qlora", num_train_epochs=1,
        per_device_train_batch_size=4, gradient_accumulation_steps=4,
        learning_rate=2e-4, bf16=True, max_seq_length=512,
        dataset_text_field="text", optim="paged_adamw_32bit", report_to="none",
    ),
).train()

print("Pipeline complete.")

Frequently Asked Questions

What is LoRA in machine learning?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method for large language models. It injects two small trainable matrices next to each frozen weight layer. During training, only these adapter matrices update — the original model weights never change. This reduces trainable parameters from billions to millions, making fine-tuning feasible on consumer hardware.

How much GPU memory does QLoRA need for a 7B model?

QLoRA requires approximately 6 GB of GPU VRAM for a 7B parameter model. It achieves this by compressing the frozen base model from 14 GB (bfloat16) to ~3.5 GB (4-bit NF4), while keeping LoRA adapters in 16-bit precision. A free Colab T4 GPU (15 GB) is sufficient.

Can I use LoRA for encoder-only models like BERT?

Yes. Set task_type=TaskType.SEQ_CLS for classification or TaskType.TOKEN_CLS for NER. Target modules for BERT-style models are named query and value (inside BertSelfAttention), so pass target_modules=["query", "value"]. The LoRA math is identical — only the task type and module names change.

How do I merge LoRA weights into the base model?

Load the base model in bfloat16 precision, load the LoRA adapter on top with PeftModel.from_pretrained(), then call .merge_and_unload(). This bakes the adapter formula $W^* = W + (\alpha/r) \cdot B \cdot A$ into each layer. You cannot merge directly onto a 4-bit QLoRA model — you must first reload in bfloat16.
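You can verify the merge formula at toy scale with NumPy. The dims below are tiny for illustration (real models use $d = 4096$ and up), and `B` is filled with random values to stand in for trained adapter weights — in a fresh adapter it starts at zero:

```python
import numpy as np

# Toy-scale check that merging reproduces the adapter's forward pass.
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4
W = rng.normal(size=(d, d))   # frozen base weight
A = rng.normal(size=(r, d))   # LoRA down-projection
B = rng.normal(size=(d, r))   # LoRA up-projection ("trained" values here)
x = rng.normal(size=d)

h_adapter = W @ x + (alpha / r) * (B @ (A @ x))   # adapter path at inference
W_star = W + (alpha / r) * (B @ A)                # merged weight
h_merged = W_star @ x                             # single matmul after merge

print(np.allclose(h_adapter, h_merged))
```

The two paths agree up to floating-point rounding, which is why merging removes the adapter's inference overhead without changing outputs.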

What do I do when I run out of GPU memory during training?

Try in order: (1) Switch from LoRA to QLoRA. (2) Halve per_device_train_batch_size and double gradient_accumulation_steps — effective batch size stays the same, memory drops. (3) Reduce max_seq_length — memory scales quadratically with sequence length. (4) Add gradient_checkpointing=True to SFTConfig — saves ~40% memory at ~20% slower training.

How do I choose the right LoRA rank?

Start with r=16. This works well for 90% of fine-tuning tasks. Increase to 32 or 64 if evaluation loss plateaus early and you have GPU memory to spare. Decrease to 4 or 8 for very small datasets (under 1,000 samples) to reduce overfitting risk. Ranks above 64 rarely help and always cost more memory.


What to Do Next

You’ve fine-tuned a model with both LoRA and QLoRA. Here are three natural next steps:

  1. Try fine-tuning on your own dataset — take a CSV of domain-specific Q&A pairs, format them with the ### Human: / ### Assistant: template shown above, and run the pipeline with your data.
  2. Explore DPO fine-tuning — Direct Preference Optimization aligns the model with human preferences by training on pairs of good/bad responses. It builds directly on the LoRA setup you’ve learned here.
  3. Add Flash Attention 2 — on an Ampere or newer GPU, passing attn_implementation="flash_attention_2" when loading the model can speed up training by 2–4x.
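For step 1, the formatting boils down to one string template. The rows in `qa_pairs` below are hypothetical sample data — load your real rows with `csv.DictReader` or pandas:

```python
# Sketch: turn (question, answer) rows from your own CSV into the
# "### Human: ... ### Assistant: ..." template used in this guide.
qa_pairs = [  # hypothetical sample rows
    ("What does the churn_flag column mean?",
     "It marks accounts that cancelled within 30 days of signup."),
]

def to_guanaco(question: str, answer: str) -> str:
    return f"### Human: {question} ### Assistant: {answer}"

texts = [to_guanaco(q, a) for q, a in qa_pairs]
print(texts[0])
# Build the training dataset with datasets.Dataset.from_dict({"text": texts})
# and pass it to SFTTrainer as train_dataset.
```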

References

  1. Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685. Link

  2. Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314. Link

  3. Raschka, S. (2023). Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments. Lightning AI. Link

  4. Dettmers, T., et al. (2023). Ablation on target modules: QLoRA paper Table 6. Link

  5. Hugging Face PEFT Library — official documentation. Link

  6. Hugging Face TRL Library — SFTTrainer documentation. Link

  7. bitsandbytes — 4-bit quantization documentation. Link

  8. Raschka, S. (2023). Practical Tips for Finetuning LLMs Using LoRA. Link
