How to Fine-Tune LLMs with LoRA in Python — A Complete Guide
Fine-tuning a 7B-parameter model used to mean renting a cluster of A100 GPUs for a week. LoRA changes this entirely — by freezing every original weight and injecting just a handful of trainable matrices, you can adapt a billion-parameter LLM on a single consumer GPU in a few hours, not weeks.
Why LoRA Changes Everything for LLM Fine-Tuning
The first time I tried to fine-tune a language model on domain-specific data, I gave up after seeing the GPU memory requirements. Full fine-tuning updates every parameter in the model. For a 7-billion-parameter LLM, that means storing fp32 weights, gradients, and Adam optimizer states for 7 billion numbers simultaneously. A standard fp32 training run needs roughly 16 bytes per parameter. That puts a 7B model at around 112GB during training — more GPU memory than most research labs have in a single machine.
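That 16-bytes-per-parameter figure breaks down as fp32 weights plus gradients plus Adam's two moment buffers. A quick sanity check of the arithmetic (activations and framework overhead excluded):

```python
# fp32 training memory per parameter:
#   weights (4 B) + gradients (4 B) + Adam 1st moment (4 B) + Adam 2nd moment (4 B)
BYTES_PER_PARAM = 4 + 4 + 4 + 4  # 16 bytes

def full_finetune_gb(n_params: float) -> float:
    """Approximate training-time memory in GB, ignoring activations."""
    return n_params * BYTES_PER_PARAM / 1e9

print(f"7B model: ~{full_finetune_gb(7e9):.0f} GB")  # prints "7B model: ~112 GB"
```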
LoRA (Low-Rank Adaptation) takes a different path. It freezes every original weight and injects small, trainable matrices alongside specific layers. The base model stays completely intact; only the injected adapter matrices receive gradient updates.
In practical terms:
- Train a 7B model on a single 24GB GPU (RTX 3090 or RTX 4090)
- Train a 13B model with 4-bit quantization (QLoRA) on the same hardware
- Keep the base model untouched so multiple adapters can share it and be swapped at inference time
- Adapter files are typically 10–50MB instead of 14–26GB for full model copies
You’ll need Python 3.9+, a CUDA-compatible GPU (or CPU for TinyLlama), and these libraries:
pip install transformers peft trl bitsandbytes datasets accelerate -q
The Intuition Behind LoRA
Before any math, here’s the mental picture. Imagine updating a 2048×2048 weight matrix — that’s over four million numbers. But what if the “important part” of that update could be captured with just two small matrices: one of size 2048×8 and another of size 8×2048? Their product is still 2048×2048, but you only need to train 32,768 numbers instead of four million.
That is the core idea behind LoRA. Most weight updates during fine-tuning have low intrinsic rank — they don’t need all four million values to express the meaningful adaptation. LoRA exploits this by forcing the update to live in a much lower-dimensional subspace.
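That subspace constraint is easy to see in a toy tensor sketch (plain PyTorch, no PEFT involved):

```python
import torch

d = k = 2048
r = 8

W = torch.randn(d, k)          # frozen pretrained weight: 4,194,304 values
B = torch.zeros(d, r)          # trainable, zero-initialised (so B @ A = 0 at step 0)
A = torch.randn(r, k) * 0.01   # trainable

delta_W = B @ A                # full 2048x2048 update, but rank at most 8
trainable = B.numel() + A.numel()
print(trainable, W.numel())    # prints "32768 4194304" -- a 128x reduction
```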
The result is remarkable. You train 0.1–1% of the model’s parameters and get 95–98% of the quality of full fine-tuning. And because the base model never changes, you can load the same base weights for 20 different tasks and swap adapters in milliseconds. I think this is genuinely one of the more elegant ideas in recent deep learning research — it feels obvious in retrospect, but it unlocked an entire new workflow.
The Math Behind LoRA
If you want to jump straight to code, skip to “Setting Up Your Environment.” This section explains why the code works the way it does.
During full fine-tuning, the weight matrix $W$ is updated directly:
$$W_{\text{new}} = W + \Delta W$$
LoRA keeps $W$ frozen and adds a learnable low-rank decomposition alongside it:
$$W' = W + \Delta W = W + BA$$
Where:
- $W \in \mathbb{R}^{d \times k}$ is the frozen pretrained weight matrix
- $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the small trainable LoRA matrices
- $r \ll \min(d, k)$ is the rank — the key hyperparameter controlling parameter count
The forward pass becomes:
$$h = Wx + \frac{\alpha}{r} \cdot BAx$$
Where $\frac{\alpha}{r}$ is the scaling factor. lora_alpha ($\alpha$) controls how strongly the LoRA update influences the output. The heuristic alpha = 2 × r gives a scaling factor of 2, which empirically works well as a starting point.
Why initialize B to zero? At the start of training, $B$ is all zeros, so $BA = 0$ — the model’s output is identical to the pretrained base. Training then learns the residual update needed for your specific task. This clean initialization avoids disrupting pretrained knowledge before a single gradient step has been taken. PEFT handles this automatically; you don’t manage it yourself.
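As a sketch of how the forward pass above plays out in code, here's a minimal LoRA wrapper around a frozen nn.Linear. This is illustrative only; PEFT's real implementation additionally handles dropout, dtypes, and merging:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: h = Wx + (alpha / r) * B A x, with W frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weight W
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64), r=8, alpha=16)
x = torch.randn(2, 64)
# B is all zeros at init, so the output equals the frozen base layer's output
assert torch.allclose(layer(x), layer.base(x))
```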
If you want to go deeper on the mathematics, the original LoRA paper by Hu et al. (2022) has an excellent discussion of intrinsic dimensionality in fine-tuning.
Setting Up Your Environment
I’ll use TinyLlama/TinyLlama-1.1B-Chat-v1.0 throughout this guide. It uses the same transformer architecture as Llama 2 — identical attention and MLP layers — making it a perfect low-cost stand-in for testing LoRA configurations before scaling up to 7B or 13B models. It runs on any GPU with 4GB+ VRAM, or on CPU.
import torch
from datasets import Dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
BitsAndBytesConfig,
)
from peft import (
LoraConfig,
TaskType,
get_peft_model,
prepare_model_for_kbit_training,
PeftModel,
)
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if torch.cuda.is_available():
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
Preparing Your Dataset
The key decision in fine-tuning is data quality, not quantity. I’ve seen models fine-tuned on 200 carefully curated examples outperform models trained on 5,000 noisy ones — format and consistency matter as much as volume.
Fine-tuning for instruction following requires (instruction, response) pairs formatted in the model’s chat template. Here’s a small Python Q&A dataset for a domain-specific assistant that gives concise, code-focused answers:
raw_data = [
{
"instruction": "What does the enumerate() function do in Python?",
"response": "enumerate() adds a counter to an iterable and returns it as enumerate objects. "
"Use it when you need both the index and value: `for i, val in enumerate(my_list): ...`"
},
{
"instruction": "Explain the difference between a list and a tuple in Python.",
"response": "Lists are mutable (you can add, remove, or modify elements) while tuples are immutable. "
"Use tuples for data that should not change and lists when you need to modify the collection."
},
{
"instruction": "How do you handle exceptions in Python?",
"response": "Use try/except blocks. Wrap the risky code in `try`, then catch specific exceptions "
"with `except ExceptionType as e`. Always catch specific types — bare `except:` hides bugs."
},
{
"instruction": "What is a list comprehension in Python?",
"response": "A list comprehension creates a new list by applying an expression to each item in an iterable, "
"optionally filtering with a condition: `[expr for item in iterable if condition]`."
},
{
"instruction": "How do you read a file in Python?",
"response": "Use `with open('filename.txt', 'r') as f: content = f.read()`. "
"The context manager closes the file automatically when the block exits."
},
{
"instruction": "What is the difference between deepcopy and copy in Python?",
"response": "`copy.copy()` creates a shallow copy — nested objects still reference the same memory. "
"`copy.deepcopy()` recursively copies all nested objects. Use deepcopy when modifying "
"nested structures independently."
},
{
"instruction": "How do you sort a list of dictionaries by a key in Python?",
"response": "Use `sorted()` with a `key` argument: "
"`sorted(data, key=lambda x: x['age'])`. "
"For reverse order, add `reverse=True`."
},
{
"instruction": "What are *args and **kwargs in Python functions?",
"response": "`*args` collects extra positional arguments as a tuple. "
"`**kwargs` collects extra keyword arguments as a dictionary. "
"They allow functions to accept any number of arguments."
},
]
dataset = Dataset.from_list(raw_data)
print(dataset)
which gives us:
Dataset({
features: ['instruction', 'response'],
num_rows: 8
})
TinyLlama uses the Zephyr chat format (the <|user|> / <|assistant|> delimiters shown below). The apply_chat_template() method handles the conversion automatically — this is model-specific, so other models (Llama 3, Mistral) produce different delimiter strings, but the call is identical:
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer.pad_token = tokenizer.eos_token # TinyLlama has no separate pad token
def format_chat(example):
messages = [
{"role": "user", "content": example["instruction"]},
{"role": "assistant", "content": example["response"]},
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=False,
)
return {"text": text}
dataset = dataset.map(format_chat)
print(dataset[0]["text"])
Output:
<|system|>
</s>
<|user|>
What does the enumerate() function do in Python?</s>
<|assistant|>
enumerate() adds a counter to an iterable and returns it as enumerate objects. Use it when you need both the index and value: `for i, val in enumerate(my_list): ...`</s>
The <|assistant|> delimiter is critical — DataCollatorForCompletionOnlyLM will mask the loss on everything before it, so the model only learns from the response tokens.
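Conceptually, the collator's masking amounts to replacing prompt-token labels with -100, the index PyTorch's cross-entropy loss ignores. A stripped-down sketch of that idea (the real collator locates the delimiter in the tokenized sequence for you):

```python
IGNORE_INDEX = -100  # label value that cross-entropy loss skips

def mask_prompt(input_ids: list[int], response_start: int) -> list[int]:
    """Labels for completion-only training: -100 on every prompt token."""
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]

ids = [101, 102, 103, 104, 105, 106]   # hypothetical token ids
labels = mask_prompt(ids, response_start=4)
print(labels)  # [-100, -100, -100, -100, 105, 106]
```

Gradients flow only through positions whose labels survive, i.e. the response tokens.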
Configuring LoRA with LoraConfig
LoraConfig is where you decide how large and where to place the adapters. The five parameters that matter most — r, lora_alpha, target_modules, lora_dropout, and bias — are shown below with the values I use as defaults for most tasks:
lora_config = LoraConfig(
r=8, # Rank: controls adapter size
lora_alpha=16, # Scaling factor (2× rank is the standard heuristic)
target_modules=[ # Which weight matrices to adapt
"q_proj",
"v_proj",
],
lora_dropout=0.05, # Light regularisation on LoRA layers
bias="none", # Do not train bias terms
task_type=TaskType.CAUSAL_LM, # Causal language modeling
)
What each parameter controls:
r (rank): With r=8, a 2048×2048 weight matrix decomposes into 2048×8 and 8×2048 matrices — 32,768 trainable numbers instead of 4,194,304. Higher rank = more expressive adapter + more parameters. The hyperparameter guide at the end covers how to choose rank for different tasks.
lora_alpha: The actual scaling applied is $\alpha / r$. With alpha=16 and r=8, LoRA outputs are scaled by 2. Setting alpha = 2 × r is the heuristic I start with — if the model learns too slowly, try alpha = 4 × r; if it overfits fast, try alpha = r.
target_modules: q_proj and v_proj are the query and value projections in each attention head. These are the two most impactful layers for task adaptation. The hyperparameter section below has architecture-specific module names for Llama 3, Mistral, Falcon, and GPT-2.
lora_dropout: Dropout applied to LoRA layers during training. A value of 0.05 adds light regularisation. On small datasets (< 500 examples), increase to 0.1.
bias: Whether to train bias terms. "none" keeps adapter files small and is the standard. "lora_only" adds a few hundred KB if you need the biases to adapt.
Now apply LoRA to the base model:
base_model = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="auto",
)
base_model.config.pad_token_id = tokenizer.eos_token_id
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
Exercise 1: Expand LoRA to All Attention Layers
Modify the LoraConfig to also target k_proj and o_proj (key projection and output projection). Apply it to the model and call print_trainable_parameters() to see how the trainable percentage changes.
# Starter code — add k_proj and o_proj to target_modules
lora_config_v2 = LoraConfig(
r=8,
lora_alpha=16,
target_modules=[
"q_proj",
"v_proj",
# Add "k_proj" and "o_proj" here
],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
model_v2 = get_peft_model(
AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="auto",
),
lora_config_v2,
)
model_v2.print_trainable_parameters()
Solution
lora_config_v2 = LoraConfig(
r=8,
lora_alpha=16,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
model_v2 = get_peft_model(
AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="auto",
),
lora_config_v2,
)
model_v2.print_trainable_parameters()
Adding `k_proj` and `o_proj` roughly doubles the trainable parameter count compared to `q_proj` + `v_proj` only, since you’re adding two more adapter pairs per attention layer. For most instruction-tuning tasks on medium-sized datasets, this extra capacity isn’t necessary. For complex reasoning tasks or larger datasets (10K+ examples), it can meaningfully improve results.
Training with SFTTrainer
The most important detail to get right before writing any training code: we want the model to learn to generate responses, not to memorize instruction tokens. Without the right data collator, loss is computed on both the instruction and the response — which wastes model capacity and produces a model that sometimes emits instruction-style text in its responses.
DataCollatorForCompletionOnlyLM fixes this by masking the loss on everything before the <|assistant|> delimiter. Gradients only flow through the response tokens:
response_template = "<|assistant|>\n"
collator = DataCollatorForCompletionOnlyLM(
response_template=response_template,
tokenizer=tokenizer,
)
Five settings in TrainingArguments matter most for a LoRA run: gradient_accumulation_steps to simulate larger batches on limited VRAM, fp16 for mixed-precision training, lr_scheduler_type for smooth convergence, remove_unused_columns=False to prevent a silent data-dropping bug, and warmup_ratio to stabilize early training:
training_args = TrainingArguments(
output_dir="./tinyllama-lora-output",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # Effective batch size: 2 × 4 = 8
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
fp16=True, # Mixed precision — halves activation memory
logging_steps=10,
save_steps=100,
save_total_limit=2,
remove_unused_columns=False, # Critical: do not drop the "text" column
report_to="none",
)
The gradient_accumulation_steps=4 setting is one I almost always keep. With a per-device batch size of 2, it gives an effective batch of 8 — large enough for stable gradients without needing more VRAM. If you have a 40GB GPU, increasing the per-device batch size directly is faster, but on consumer cards this trick buys you the same effect.
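What gradient accumulation does under the hood, as a plain training-loop sketch (the Trainer handles all of this for you; the tiny linear model is just a stand-in):

```python
import torch

accum_steps = 4
model = torch.nn.Linear(10, 1)
opt = torch.optim.AdamW(model.parameters(), lr=2e-4)

updates = 0
opt.zero_grad()
for step in range(8):  # 8 micro-batches of size 2
    x, y = torch.randn(2, 10), torch.randn(2, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()     # scale so accumulated grads average out
    if (step + 1) % accum_steps == 0:   # one optimizer step per 4 micro-batches
        opt.step()                      # effective batch size: 2 x 4 = 8
        opt.zero_grad()
        updates += 1
print(updates)  # 2 optimizer updates for 8 micro-batches
```

Only the gradients accumulate between steps; peak activation memory stays that of a single micro-batch.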
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
data_collator=collator,
dataset_text_field="text",
max_seq_length=512,
)
trainer.train()
model.save_pretrained("./tinyllama-lora-adapter")
tokenizer.save_pretrained("./tinyllama-lora-adapter")
print("Adapter saved.")
Output:
Adapter saved.
The output directory contains adapter_config.json and adapter_model.safetensors — typically 5–50MB total. This is what you version-control, share, or deploy. The base model stays where it is.
Exercise 2: Add Validation Monitoring
Modify the setup to log validation loss by splitting the dataset into train/validation and adding evaluation_strategy="epoch" to catch overfitting early.
# Starter code
split = dataset.train_test_split(test_size=0.2, seed=42)
train_data = split["train"]
eval_data = split["test"]
training_args_eval = TrainingArguments(
output_dir="./tinyllama-lora-eval",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
fp16=True,
logging_steps=5,
# Add evaluation_strategy and load_best_model_at_end here
remove_unused_columns=False,
report_to="none",
)
trainer_eval = SFTTrainer(
model=model,
args=training_args_eval,
train_dataset=train_data,
# Pass eval_dataset here
data_collator=collator,
dataset_text_field="text",
max_seq_length=512,
)
Solution
split = dataset.train_test_split(test_size=0.2, seed=42)
train_data = split["train"]
eval_data = split["test"]
training_args_eval = TrainingArguments(
output_dir="./tinyllama-lora-eval",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
fp16=True,
logging_steps=5,
evaluation_strategy="epoch", # Evaluate at end of each epoch
save_strategy="epoch",
load_best_model_at_end=True, # Keep the checkpoint with lowest eval loss
remove_unused_columns=False,
report_to="none",
)
trainer_eval = SFTTrainer(
model=model,
args=training_args_eval,
train_dataset=train_data,
eval_dataset=eval_data,
data_collator=collator,
dataset_text_field="text",
max_seq_length=512,
)
Watching `eval_loss` alongside `train_loss` is the simplest overfitting detector. If training loss keeps dropping but eval loss rises or plateaus, you’re fitting noise — reduce epochs or increase `lora_dropout`.
QLoRA: Fine-Tuning on Consumer Hardware
LoRA reduces trainable parameters, but the base model still loads in fp16 — roughly 14GB for a 7B model. QLoRA adds a second optimization: load the base model in 4-bit quantization, cutting the base weights' memory to roughly a quarter of fp16.
With QLoRA, a single GPU can handle what previously required several:
- 7B models on a 12GB GPU (RTX 3060, RTX 4070)
- 13B models on a 16–24GB GPU
- 70B models on a 48GB GPU
The only addition to the standard LoRA setup is BitsAndBytesConfig. It tells the model loader to quantize weights as they load — the model activations and LoRA adapter still compute in bfloat16:
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4: optimal for normal-distribution weights
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bfloat16 for stability
bnb_4bit_use_double_quant=True, # Double quantization: saves extra ~0.4GB
)
I consistently use nf4 over fp4, the other 4-bit type bitsandbytes offers. NF4 (NormalFloat4) was specifically designed for normally-distributed values — which neural network weights tend to be. The QLoRA paper showed it achieves lower quantization error than standard 4-bit floats at the same bit width. In practice, it’s the standard choice and there’s rarely a reason to switch.
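A toy illustration of why levels placed where the weight mass actually sits beat uniformly spaced levels on Gaussian data. This shows the intuition behind NF4, not its actual codebook:

```python
import torch

torch.manual_seed(0)
w = torch.randn(100_000)  # pretrained weights are roughly normally distributed

def snap(values: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Round each value to the nearest of the 16 quantization levels."""
    idx = (values[:, None] - grid[None, :]).abs().argmin(dim=1)
    return grid[idx]

# 16 evenly spaced levels vs 16 levels at the empirical quantiles of the weights
uniform_grid = torch.linspace(w.min().item(), w.max().item(), 16)
quantile_grid = torch.quantile(w, torch.linspace(0, 1, 18)[1:-1])

err_uniform = (w - snap(w, uniform_grid)).abs().mean()
err_quantile = (w - snap(w, quantile_grid)).abs().mean()
print(f"uniform: {err_uniform:.4f}  quantile: {err_quantile:.4f}")
```

The quantile-based grid spends its 16 levels where values are dense (near zero) instead of wasting them on the sparse tails, so its mean error comes out lower.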
Load the model in 4-bit, then prepare it for gradient computation with LoRA:
model_4bit = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
quantization_config=bnb_config,
device_map="auto",
)
# Enable gradient flow through quantized layers
model_4bit = prepare_model_for_kbit_training(model_4bit)
# Apply LoRA on top of the quantized base
model_4bit = get_peft_model(model_4bit, lora_config)
model_4bit.print_trainable_parameters()
The rest of the training setup is identical to standard LoRA — pass model_4bit to SFTTrainer exactly as before.
A note on Unsloth: If training speed is your bottleneck, the Unsloth library wraps the QLoRA stack and delivers roughly 2× faster training and 60% less VRAM through custom CUDA kernels. The tradeoff is an additional dependency and a slightly different API. For most use cases, standard PEFT + bitsandbytes works well. Unsloth becomes worth it when you’re iterating on large datasets and every training run takes hours.
| Configuration | GPU Memory (7B model) | Trainable Params | Relative Quality |
|---|---|---|---|
| Full fine-tuning (fp32) | ~112GB | 100% | Baseline |
| LoRA (fp16) | ~16–20GB | 0.1–1% | 95–98% |
| QLoRA (4-bit + LoRA) | ~6–8GB | 0.1–1% | 90–95% |
| QLoRA + Unsloth | ~5–7GB | 0.1–1% | 90–95% (2× faster) |
Running Inference and Evaluating Results
Here’s where you see whether fine-tuning actually worked. The key detail that trips up almost everyone the first time: model.generate() returns the full sequence including the input prompt. The slice outputs[0][inputs["input_ids"].shape[1]:] strips the prompt and returns only the newly generated tokens.
PeftModel.from_pretrained() handles loading cleanly — the base model loads first, then the adapter layers are applied on top:
base = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="auto",
)
fine_tuned = PeftModel.from_pretrained(base, "./tinyllama-lora-adapter")
tokenizer_eval = AutoTokenizer.from_pretrained("./tinyllama-lora-adapter")
tokenizer_eval.pad_token = tokenizer_eval.eos_token
def generate_response(model, tokenizer, instruction, max_new_tokens=200):
messages = [{"role": "user", "content": instruction}]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
# Strip the input prompt from the output
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
return tokenizer.decode(new_tokens, skip_special_tokens=True)
response = generate_response(
fine_tuned,
tokenizer_eval,
"What is a list comprehension in Python?",
)
print(response)
Exercise 3: Side-by-Side Before/After Comparison
Load the base model (no adapter) and the fine-tuned model separately. Run both on the same question using generate_response() and print their answers side by side to see how the style changed.
# Starter code
question = "How do you sort a list of dictionaries by a key in Python?"
base_only = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer_base = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer_base.pad_token = tokenizer_base.eos_token
# Fill in: generate from base model
base_answer = ???
# Fill in: generate from fine-tuned model
ft_answer = ???
print("BASE MODEL:")
print(base_answer)
print("\nFINE-TUNED:")
print(ft_answer)
Solution
question = "How do you sort a list of dictionaries by a key in Python?"
base_only = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer_base = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer_base.pad_token = tokenizer_base.eos_token
base_answer = generate_response(base_only, tokenizer_base, question)
ft_answer = generate_response(fine_tuned, tokenizer_eval, question)
print("BASE MODEL:")
print(base_answer)
print("\nFINE-TUNED:")
print(ft_answer)
With fine-tuning data that includes this question, the fine-tuned model should mirror the concise, code-first style of the training examples. The base model typically gives longer, more varied responses. With only 8 training examples, the shift is subtle — but the mechanics are identical to production fine-tuning runs on thousands of examples.
Merging the LoRA Adapter into the Base Model
The adapter-based setup is flexible: many tasks share one base model, and adapters can be swapped at runtime. My general rule: keep adapters separate during experimentation (easy to iterate, nothing changes in the base), merge when you’re ready for production or want to share a standalone model.
merge_and_unload() folds the scaled update $W' = W + \frac{\alpha}{r} BA$ into every adapted layer and returns a standard HuggingFace model with no PEFT dependency:
# Load base model on CPU — avoids VRAM overflow during merge on large models
merge_base = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="cpu",
)
to_merge = PeftModel.from_pretrained(merge_base, "./tinyllama-lora-adapter")
merged_model = to_merge.merge_and_unload()
merged_model.save_pretrained("./tinyllama-merged")
tokenizer_eval.save_pretrained("./tinyllama-merged")
print("Merged model saved.")
Output:
Merged model saved.
To share the merged model on Hugging Face Hub, push it directly after merging:
from huggingface_hub import login
login() # Prompts for your HF token
merged_model.push_to_hub("your-username/tinyllama-python-qa")
tokenizer_eval.push_to_hub("your-username/tinyllama-python-qa")
print("Model pushed to Hugging Face Hub.")
After merging, load and use the model like any standard HuggingFace model — no peft import needed:
standalone = AutoModelForCausalLM.from_pretrained(
"./tinyllama-merged",
torch_dtype=torch.float16,
device_map="auto",
)
For multi-task deployments where you want to keep adapters separate, skip merging and use hot-swapping instead:
# Load multiple adapters once — no reloading needed
model.load_adapter("./adapter-summarize", adapter_name="summarize")
model.load_adapter("./adapter-code", adapter_name="code")
model.set_adapter("summarize")
summary = generate_response(model, tokenizer_eval, "Summarize this article...")
model.set_adapter("code")
code_answer = generate_response(model, tokenizer_eval, "Write a Python function that...")
Choosing the Right LoRA Hyperparameters
This is where I see most practitioners get stuck — not in the code, but in knowing what to change when results are disappointing. Let me walk through the most impactful settings.
Rank (r) — Start Low, Increase Deliberately
Rank controls the size of each adapter pair. Here’s how to choose:
| Rank | Use Case | Approx. Trainable Params (7B model) |
|---|---|---|
| 4 | Simple style adaptation, tone shift | ~5M (0.07%) |
| 8 | Standard task adaptation (my default) | ~10M (0.14%) |
| 16 | Complex task with diverse dataset | ~20M (0.28%) |
| 32 | Multi-domain reasoning, large dataset | ~40M (0.57%) |
| 64 | Near-full fine-tuning territory | ~80M (1.14%) |
I start at r=8 for every new task. I increase to 16 or 32 only if the model clearly fails to learn after a few hundred steps. Beyond r=64, you’re spending memory on capacity you almost certainly don’t need.
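The counts in the table follow directly from the decomposition: each adapted d×k matrix adds r·(d+k) trainable values. A quick estimator — a sketch with hypothetical 7B-class dimensions, since exact totals depend on which modules you target and the model's hidden sizes:

```python
def lora_params(r: int, shapes: list[tuple[int, int]], n_layers: int) -> int:
    """Trainable LoRA parameters: r * (d + k) per adapted (d, k) matrix."""
    return n_layers * sum(r * (d + k) for d, k in shapes)

# Hypothetical config: 32 layers, q_proj and v_proj both 4096x4096
for r in (4, 8, 16, 32, 64):
    n = lora_params(r, shapes=[(4096, 4096), (4096, 4096)], n_layers=32)
    print(f"r={r:>2}: {n / 1e6:.1f}M trainable params")
```

Doubling r doubles the adapter size exactly, which is why rank is the first knob to turn when trading capacity for memory.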
Alpha (lora_alpha) — Keep It at 2× Rank
Set alpha = 2 × r as your starting point. This produces a scaling factor of 2.0, which gives LoRA meaningful influence without overwhelming the pretrained weights. If training loss barely moves, try alpha = 4 × r. If the model overfits within the first epoch, try alpha = r (scaling factor of 1.0).
Target Modules by Architecture
Different architectures name their linear layers differently. Use attention-only targeting for simple tasks; add MLP layers for complex reasoning:
| Architecture | Attention Only | Attention + MLP |
|---|---|---|
| Llama 2/3, TinyLlama | q_proj, v_proj | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Mistral, Mixtral | q_proj, v_proj | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Falcon, GPT-NeoX | query_key_value | query_key_value, dense, dense_h_to_4h, dense_4h_to_h |
| GPT-2 | c_attn, c_proj | c_attn, c_proj, c_fc |
The QLoRA paper found that applying LoRA to all linear layers — including MLP — consistently outperforms attention-only LoRA for complex tasks. For simple style adaptation, attention-only is sufficient and trains faster.
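Recent PEFT releases offer a shortcut for the all-linear-layers setup so you don't have to list module names by hand. This is a sketch assuming a reasonably current PEFT version; older versions require the explicit list:

```python
from peft import LoraConfig, TaskType

# "all-linear" targets every linear layer except the output head.
# Assumption: supported in recent PEFT releases only.
all_linear_config = LoraConfig(
    r=16,                         # a little more capacity for the wider footprint
    lora_alpha=32,
    target_modules="all-linear",
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
```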
[UNDER THE HOOD]
How PEFT finds target modules. When you pass target_modules=["q_proj", "v_proj"], PEFT iterates every named module in the model and applies a LoRA adapter to any nn.Linear layer whose name ends with those strings. You can pass a regex pattern instead of a list. To see all linear layer names in your model: {name for name, mod in model.named_modules() if isinstance(mod, torch.nn.Linear)}.
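That inspection one-liner, expanded into a runnable helper. A tiny stand-in model is used here to avoid a download; pass any HuggingFace model in practice:

```python
import torch

def linear_module_names(model: torch.nn.Module) -> set[str]:
    """Unique trailing names of all nn.Linear modules — target_modules candidates."""
    return {
        name.split(".")[-1]
        for name, module in model.named_modules()
        if isinstance(module, torch.nn.Linear)
    }

# Stand-in model: for real models this prints names like "q_proj", "v_proj"
toy = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
print(linear_module_names(toy))
```

Note that GPT-2's attention layers are transformers.Conv1D rather than nn.Linear, so for that family check Conv1D modules as well.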
Common Mistakes and How to Fix Them
These are the five errors I see most often in LoRA implementations — each one causes a real failure, not just a style issue.
Mistake 1: Not Setting the Pad Token
❌ Wrong:
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
trainer = SFTTrainer(model=model, ...)
# ValueError: Asking to pad but the tokenizer does not have a padding token.
✅ Correct:
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id
Why it breaks: Causal language models were designed for left-to-right generation, not batched training. The Trainer needs a padding token to batch examples of different lengths. Using EOS as the padding token is the standard workaround.
Mistake 2: Training on the Full Prompt Instead of Just the Response
❌ Wrong — loss computed on instruction + response:
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=512,
# No data_collator — model learns to predict instruction tokens
)
✅ Correct — mask the instruction, train on response only:
response_template = "<|assistant|>\n"
collator = DataCollatorForCompletionOnlyLM(
response_template=response_template,
tokenizer=tokenizer,
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
data_collator=collator,
dataset_text_field="text",
max_seq_length=512,
)
Why it matters: Without the completion-only collator, loss is computed on both the instruction and response. The model learns to predict instruction tokens — wasting capacity and producing responses that sometimes mirror instruction phrasing.
Mistake 3: Wrong Target Module Names for the Architecture
❌ Wrong — using Llama names on a GPT-2 model:
# PEFT silently trains zero parameters when no modules match
lora_config = LoraConfig(
target_modules=["q_proj", "v_proj"], # Llama-style — doesn't exist in GPT-2
...
)
model = get_peft_model(gpt2_model, lora_config)
model.print_trainable_parameters()
# trainable params: 0 || all params: 124,439,808 || trainable%: 0.0000
✅ Correct — inspect first:
# Find all Linear layer names in your specific model
linear_names = {
name
for name, module in model.named_modules()
if isinstance(module, torch.nn.Linear)
}
print(linear_names)
# Use the correct names from this output in target_modules
This is the most insidious mistake. Depending on your PEFT version, non-matching module names either raise a ValueError about target modules not being found (recent releases) or, in older releases, train nothing at all while your loss stays flat with no indication of why. Always verify with print_trainable_parameters().
Mistake 4: Rank Too High on a Small Dataset
High rank + few examples = fast overfitting. A model with r=64 and 100 training examples often reaches near-zero training loss within one epoch while generating incoherent responses on new inputs. Fix: start at r=8, use lora_dropout=0.1, and monitor validation loss.
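One mechanical guard against this failure mode is transformers' EarlyStoppingCallback, which halts training once eval loss stops improving. A sketch assuming the train/eval split from Exercise 2 is in place:

```python
from transformers import EarlyStoppingCallback, TrainingArguments

# Stop once eval loss fails to improve for 2 consecutive evaluations
early_stop = EarlyStoppingCallback(early_stopping_patience=2)

training_args_es = TrainingArguments(
    output_dir="./tinyllama-lora-es",
    num_train_epochs=10,             # generous upper bound; early stopping cuts it short
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,         # lower eval loss is better
    remove_unused_columns=False,
    report_to="none",
)
# Then pass callbacks=[early_stop] to SFTTrainer alongside these arguments.
```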
Mistake 5: Skipping gradient_checkpointing for 7B+ Models
For 7B+ models, training without gradient checkpointing can exhaust VRAM even with LoRA:
# Required before gradient checkpointing when using PEFT
model.enable_input_require_grads()
training_args = TrainingArguments(
...,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={"use_reentrant": False}, # PEFT compatibility
)
Gradient checkpointing recomputes activations during the backward pass instead of storing them — roughly 20–30% slower training in exchange for ~40% less VRAM. On a 7B model, this is often the difference between a run that fits in 24GB of VRAM and one that crashes out of memory.
LoRA vs QLoRA vs Full Fine-Tuning — When to Use Which
| | Full Fine-Tuning | LoRA | QLoRA | QLoRA + Unsloth |
|---|---|---|---|---|
| GPU (7B model) | ~4× A100 (320GB) | 1–2× A100 (40–80GB) | 1× RTX 4090 (24GB) | 1× RTX 4090 (24GB) |
| Training Speed | Slowest | Moderate | Moderate | 2× faster than QLoRA |
| Quality | Baseline | 95–98% | 90–95% | 90–95% |
| Deployment | Full model copy | Shared base + adapters | Shared base + adapters | Shared base + adapters |
| Best For | Max quality, large budget | Good GPUs, fast iteration | Consumer hardware | Large datasets, speed matters |
Use full fine-tuning when you have the compute budget and maximum quality is critical. A 2–5% quality improvement over LoRA may justify the 10× cost increase for customer-facing production models.
Use LoRA when you have access to decent GPUs and want near-full quality with fast experimentation. The ability to swap adapters in production without maintaining multiple full model copies is also a meaningful operational advantage.
Use QLoRA when you’re on consumer hardware or a tight budget. The quality difference from LoRA is typically 1–3%, and you gain the ability to work with models that simply don’t fit in fp16.
Use QLoRA + Unsloth when training speed is the bottleneck and you’re iterating on many adapter versions or large datasets.
Frequently Asked Questions
Can I fine-tune LoRA on multiple tasks and switch between them at inference?
Yes. Train a separate adapter for each task and save each to its own directory. At inference, load all adapters once with load_adapter(), then switch between them instantly with set_adapter() — no model reload required. This is the architecture used in production serving systems that need to handle many specialized tasks from a single base model.
Does LoRA work with all LLM architectures?
LoRA works with any architecture that has linear layers — which includes every transformer-based LLM. PEFT officially supports Llama 1/2/3, Mistral, Mixtral, Falcon, GPT-2/J/NeoX, T5, BERT, and many more. For unsupported architectures, inspect module names manually using the linear layer inspection snippet from Common Mistakes.
How do I know if my fine-tuned model actually improved?
Decreasing training loss is necessary but not sufficient — the model might be memorizing training examples. Evaluate on a held-out validation set using task-specific metrics: perplexity for language modeling quality, ROUGE for summarization, exact match or F1 for extractive QA, pass@k for code generation, or accuracy for classification. For open-ended generation, human preference evaluation is often the most meaningful signal.
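As a concrete example of the simplest metric: perplexity is just the exponential of the mean per-token negative log-likelihood on held-out text. With the HF `Trainer`, the same `math.exp` applies to the `eval_loss` returned by `trainer.evaluate()`. The NLL values below are made up for illustration:

```python
import math

# Hypothetical per-token negative log-likelihoods from a validation set.
val_nll = [2.1, 1.8, 2.4, 1.9, 2.0]

# Perplexity = exp(mean NLL); lower is better. A drop after fine-tuning
# means the model assigns higher probability to held-out domain text.
perplexity = math.exp(sum(val_nll) / len(val_nll))
print(round(perplexity, 2))  # 7.69
```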
What is the difference between LoRA and prefix tuning?
Both are parameter-efficient fine-tuning methods, but they differ in mechanism. Prefix tuning prepends learnable “virtual token” embeddings to the input sequence and trains only those vectors. LoRA injects learnable matrices into specific weight layers. In practice, LoRA is more stable to train, more memory-efficient, and achieves higher quality — which is why it became the dominant PEFT method after 2022.
Can I use LoRA to fine-tune vision transformers or multimodal models?
Yes. LoRA works on any linear layer. For vision transformers, target the attention projections in the vision encoder. For multimodal models like LLaVA or Qwen-VL, you can apply LoRA to the language model backbone, the vision encoder, or both. The PEFT code is identical — only the target_modules list changes.
Does the quantized base model need to be re-quantized after merging?
No — merge_and_unload() merges the fp16 LoRA weights into the dequantized base. The resulting merged model is in fp16. If you want to re-quantize for Ollama or llama.cpp deployment, convert to GGUF format after merging using the llama.cpp conversion tools.
What to Learn Next
Fine-tuning is one piece of the LLM engineering stack. Here are the natural next steps:
- Evaluating LLM outputs — how to measure whether your fine-tuned model is actually better
- Hugging Face Transformers guide — deep dive into the base library powering everything in this article
- LLM inference optimization — quantization, speculative decoding, and batching for production serving
- Building LLM applications with LangChain — integrating your fine-tuned model into a full retrieval-augmented pipeline
Complete Code
The full LoRA fine-tuning script, ready to copy-paste and run:
# Complete code: How to Fine-Tune LLMs with LoRA in Python
# Requirements: pip install transformers peft trl bitsandbytes datasets accelerate
# Python 3.9+ | transformers>=4.44 | peft>=0.12 | trl>=0.11 | bitsandbytes>=0.44
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from peft import (
    LoraConfig,
    TaskType,
    get_peft_model,
    PeftModel,
)
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
# --- Configuration ---
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
ADAPTER_DIR = "./tinyllama-lora-adapter"
MERGED_DIR = "./tinyllama-merged"
# --- Section 1: Dataset ---
raw_data = [
    {"instruction": "What does the enumerate() function do in Python?",
     "response": "enumerate() adds a counter to an iterable and returns it as an enumerate object. "
                 "Use it when you need both the index and value: `for i, val in enumerate(my_list): ...`"},
    {"instruction": "Explain the difference between a list and a tuple.",
     "response": "Lists are mutable while tuples are immutable. Use tuples for fixed data "
                 "and lists when you need to modify the collection."},
    {"instruction": "How do you handle exceptions in Python?",
     "response": "Use try/except blocks. Wrap risky code in `try`, then catch specific exceptions "
                 "with `except ExceptionType as e`. Always catch specific types — bare `except:` hides bugs."},
    {"instruction": "What is a list comprehension in Python?",
     "response": "A list comprehension creates a new list by applying an expression to each item in an iterable: "
                 "`[expr for item in iterable if condition]`."},
    {"instruction": "How do you read a file in Python?",
     "response": "Use `with open('filename.txt', 'r') as f: content = f.read()`. "
                 "The context manager closes the file automatically."},
    {"instruction": "What is the difference between deepcopy and copy?",
     "response": "`copy.copy()` is a shallow copy — nested objects share memory. "
                 "`copy.deepcopy()` recursively copies all nested objects."},
    {"instruction": "How do you sort a list of dictionaries by a key?",
     "response": "Use `sorted(data, key=lambda x: x['age'])`. For reverse order, add `reverse=True`."},
    {"instruction": "What are *args and **kwargs?",
     "response": "`*args` collects extra positional arguments as a tuple. "
                 "`**kwargs` collects extra keyword arguments as a dictionary."},
]
dataset = Dataset.from_list(raw_data)
# --- Section 2: Tokenizer and Formatting ---
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
def format_chat(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )}
dataset = dataset.map(format_chat)
# --- Section 3: LoRA Configuration ---
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
# --- Section 4: Load Model + Apply LoRA ---
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
model.config.pad_token_id = tokenizer.eos_token_id
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# --- Section 5: Training ---
response_template = "<|assistant|>\n"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template,
    tokenizer=tokenizer,
)
training_args = TrainingArguments(
    output_dir="./tinyllama-lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
    remove_unused_columns=False,
    report_to="none",
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collator,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
model.save_pretrained(ADAPTER_DIR)
tokenizer.save_pretrained(ADAPTER_DIR)
# --- Section 6: Inference ---
def generate_response(model, tokenizer, instruction, max_new_tokens=200):
    messages = [{"role": "user", "content": instruction}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
base_reload = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
fine_tuned = PeftModel.from_pretrained(base_reload, ADAPTER_DIR)
tok_eval = AutoTokenizer.from_pretrained(ADAPTER_DIR)
tok_eval.pad_token = tok_eval.eos_token
print(generate_response(fine_tuned, tok_eval, "What is a list comprehension?"))
# --- Section 7: Merge Adapter ---
merge_base = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cpu"
)
merged = PeftModel.from_pretrained(merge_base, ADAPTER_DIR).merge_and_unload()
merged.save_pretrained(MERGED_DIR)
tok_eval.save_pretrained(MERGED_DIR)
print("Script completed successfully.")
References
- Hu, E. J., Shen, Y., Wallis, P., et al. — LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685
- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. — QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arXiv:2305.14314
- Hugging Face PEFT Documentation — Parameter-Efficient Fine-Tuning. Link
- Hugging Face PEFT — LoRA Conceptual Guide. Link
- Hugging Face TRL — SFTTrainer Documentation. Link
- Hugging Face Transformers — BitsAndBytesConfig. Link
- Databricks Blog — Efficient Fine-Tuning with LoRA: A Guide to Optimal Parameter Selection. Link
- He, J., Zhou, C., Ma, X., et al. — Towards a Unified View of Parameter-Efficient Transfer Learning. ICLR 2022. arXiv:2110.04366
- Lialin, V., Deshpande, V., & Rumshisky, A. — Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. 2023. arXiv:2303.15647