How to Fine-Tune LLMs with LoRA in Python — A Complete Guide
Fine-tuning a 7B-parameter model used to mean renting a cluster of A100 GPUs for a week. LoRA changes this entirely — by freezing every original weight and injecting just a handful of trainable matrices, you can adapt a billion-parameter LLM on a single consumer GPU in a few hours, not weeks.
Why LoRA Changes Everything for LLM Fine-Tuning
The first time I tried to fine-tune a language model on domain-specific data, I gave up after seeing the GPU memory requirements. Full fine-tuning updates every parameter in the model. For a 7-billion-parameter LLM, that means storing fp32 weights, gradients, and Adam optimizer states for 7 billion numbers simultaneously. A standard fp32 training run needs roughly 16 bytes per parameter. That puts a 7B model at around 112GB during training — more GPU memory than most research labs have in a single machine.
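That 16-bytes-per-parameter figure breaks down as fp32 weights plus gradients plus Adam's two moment buffers. A quick sanity check of the arithmetic (activations and framework overhead excluded):

```python
# fp32 training memory per parameter:
#   weights (4 B) + gradients (4 B) + Adam 1st moment (4 B) + Adam 2nd moment (4 B)
BYTES_PER_PARAM = 4 + 4 + 4 + 4  # 16 bytes

def full_finetune_gb(n_params: float) -> float:
    """Approximate training-time memory in GB, ignoring activations."""
    return n_params * BYTES_PER_PARAM / 1e9

print(f"7B model: ~{full_finetune_gb(7e9):.0f} GB")  # prints "7B model: ~112 GB"
```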
LoRA (Low-Rank Adaptation) takes a different path. It freezes every original weight and injects small, trainable matrices alongside specific layers. The base model stays completely intact; only the injected adapter matrices receive gradient updates.
In practical terms:
- Train a 7B model on a single 24GB GPU (RTX 3090 or RTX 4090)
- Train a 13B model with 4-bit quantization (QLoRA) on the same hardware
- Keep the base model untouched so multiple adapters can share it and be swapped at inference time
- Adapter files are typically 10–50MB instead of 14–26GB for full model copies
You’ll need Python 3.9+, a CUDA-compatible GPU (or CPU for TinyLlama), and these libraries:
pip install transformers peft trl bitsandbytes datasets accelerate -q
The Intuition Behind LoRA
Before any math, here’s the mental picture. Imagine updating a 2048×2048 weight matrix — that’s over four million numbers. But what if the “important part” of that update could be captured with just two small matrices: one of size 2048×8 and another of size 8×2048? Their product is still 2048×2048, but you only need to train 32,768 numbers instead of four million.
That is the core idea behind LoRA. Most weight updates during fine-tuning have low intrinsic rank — they don’t need all four million values to express the meaningful adaptation. LoRA exploits this by forcing the update to live in a much lower-dimensional subspace.
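That subspace constraint is easy to see in a toy tensor sketch (plain PyTorch, no PEFT involved):

```python
import torch

d = k = 2048
r = 8

W = torch.randn(d, k)          # frozen pretrained weight: 4,194,304 values
B = torch.zeros(d, r)          # trainable, zero-initialised (so B @ A = 0 at step 0)
A = torch.randn(r, k) * 0.01   # trainable

delta_W = B @ A                # full 2048x2048 update, but rank at most 8
trainable = B.numel() + A.numel()
print(trainable, W.numel())    # prints "32768 4194304" -- a 128x reduction
```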
The result is remarkable. You train 0.1–1% of the model’s parameters and get 95–98% of the quality of full fine-tuning. And because the base model never changes, you can load the same base weights for 20 different tasks and swap adapters in milliseconds. I think this is genuinely one of the more elegant ideas in recent deep learning research — it feels obvious in retrospect, but it unlocked an entire new workflow.
The Math Behind LoRA
If you want to jump straight to code, skip to “Setting Up Your Environment.” This section explains why the code works the way it does.
During full fine-tuning, the weight matrix $W$ is updated directly:
$$W_{\text{new}} = W + \Delta W$$
LoRA keeps $W$ frozen and adds a learnable low-rank decomposition alongside it:
$$W' = W + \Delta W = W + BA$$
Where:
- $W \in \mathbb{R}^{d \times k}$ is the frozen pretrained weight matrix
- $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the small trainable LoRA matrices
- $r \ll \min(d, k)$ is the rank — the key hyperparameter controlling parameter count
The forward pass becomes:
$$h = Wx + \frac{\alpha}{r} \cdot BAx$$
Where $\frac{\alpha}{r}$ is the scaling factor. lora_alpha ($\alpha$) controls how strongly the LoRA update influences the output. The heuristic alpha = 2 × r gives a scaling factor of 2, which empirically works well as a starting point.
Why initialize B to zero? At the start of training, $B$ is all zeros, so $BA = 0$ — the model’s output is identical to the pretrained base. Training then learns the residual update needed for your specific task. This clean initialization avoids disrupting pretrained knowledge before a single gradient step has been taken. PEFT handles this automatically; you don’t manage it yourself.
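As a sketch of how the forward pass above plays out in code, here's a minimal LoRA wrapper around a frozen nn.Linear. This is illustrative only; PEFT's real implementation additionally handles dropout, dtypes, and merging:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: h = Wx + (alpha / r) * B A x, with W frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weight W
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64), r=8, alpha=16)
x = torch.randn(2, 64)
# B is all zeros at init, so the output equals the frozen base layer's output
assert torch.allclose(layer(x), layer.base(x))
```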
If you want to go deeper on the mathematics, the original LoRA paper by Hu et al. (2022) has an excellent discussion of intrinsic dimensionality in fine-tuning.
Setting Up Your Environment
I’ll use TinyLlama/TinyLlama-1.1B-Chat-v1.0 throughout this guide. It uses the same transformer architecture as Llama 2 — identical attention and MLP layers — making it a perfect low-cost stand-in for testing LoRA configurations before scaling up to 7B or 13B models. It runs on any GPU with 4GB+ VRAM, or on CPU.
import torch
from datasets import Dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
BitsAndBytesConfig,
)
from peft import (
LoraConfig,
TaskType,
get_peft_model,
prepare_model_for_kbit_training,
PeftModel,
)
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if torch.cuda.is_available():
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
Preparing Your Dataset
The key decision in fine-tuning is data quality, not quantity. I’ve seen models fine-tuned on 200 carefully curated examples outperform models trained on 5,000 noisy ones — format and consistency matter as much as volume.
Fine-tuning for instruction following requires (instruction, response) pairs formatted in the model’s chat template. Here’s a small Python Q&A dataset for a domain-specific assistant that gives concise, code-focused answers:
raw_data = [
{
"instruction": "What does the enumerate() function do in Python?",
"response": "enumerate() adds a counter to an iterable and returns it as enumerate objects. "
"Use it when you need both the index and value: `for i, val in enumerate(my_list): ...`"
},
{
"instruction": "Explain the difference between a list and a tuple in Python.",
"response": "Lists are mutable (you can add, remove, or modify elements) while tuples are immutable. "
"Use tuples for data that should not change and lists when you need to modify the collection."
},
{
"instruction": "How do you handle exceptions in Python?",
"response": "Use try/except blocks. Wrap the risky code in `try`, then catch specific exceptions "
"with `except ExceptionType as e`. Always catch specific types — bare `except:` hides bugs."
},
{
"instruction": "What is a list comprehension in Python?",
"response": "A list comprehension creates a new list by applying an expression to each item in an iterable, "
"optionally filtering with a condition: `[expr for item in iterable if condition]`."
},
{
"instruction": "How do you read a file in Python?",
"response": "Use `with open('filename.txt', 'r') as f: content = f.read()`. "
"The context manager closes the file automatically when the block exits."
},
{
"instruction": "What is the difference between deepcopy and copy in Python?",
"response": "`copy.copy()` creates a shallow copy — nested objects still reference the same memory. "
"`copy.deepcopy()` recursively copies all nested objects. Use deepcopy when modifying "
"nested structures independently."
},
{
"instruction": "How do you sort a list of dictionaries by a key in Python?",
"response": "Use `sorted()` with a `key` argument: "
"`sorted(data, key=lambda x: x['age'])`. "
"For reverse order, add `reverse=True`."
},
{
"instruction": "What are *args and **kwargs in Python functions?",
"response": "`*args` collects extra positional arguments as a tuple. "
"`**kwargs` collects extra keyword arguments as a dictionary. "
"They allow functions to accept any number of arguments."
},
]
dataset = Dataset.from_list(raw_data)
print(dataset)
which gives us:
Dataset({
features: ['instruction', 'response'],
num_rows: 8
})
TinyLlama uses the Zephyr chat format (the <|user|> / <|assistant|> delimiters shown below). The apply_chat_template() method handles the conversion automatically — this is model-specific, so other models (Llama 3, Mistral) produce different delimiter strings, but the call is identical:
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer.pad_token = tokenizer.eos_token # TinyLlama has no separate pad token
def format_chat(example):
messages = [
{"role": "user", "content": example["instruction"]},
{"role": "assistant", "content": example["response"]},
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=False,
)
return {"text": text}
dataset = dataset.map(format_chat)
print(dataset[0]["text"])
Output:
<|system|>
</s>
<|user|>
What does the enumerate() function do in Python?</s>
<|assistant|>
enumerate() adds a counter to an iterable and returns it as enumerate objects. Use it when you need both the index and value: `for i, val in enumerate(my_list): ...`</s>
The <|assistant|> delimiter is critical — DataCollatorForCompletionOnlyLM will mask the loss on everything before it, so the model only learns from the response tokens.
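Conceptually, the collator's masking amounts to replacing prompt-token labels with -100, the index PyTorch's cross-entropy loss ignores. A stripped-down sketch of that idea (the real collator locates the delimiter in the tokenized sequence for you):

```python
IGNORE_INDEX = -100  # label value that cross-entropy loss skips

def mask_prompt(input_ids: list[int], response_start: int) -> list[int]:
    """Labels for completion-only training: -100 on every prompt token."""
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]

ids = [101, 102, 103, 104, 105, 106]   # hypothetical token ids
labels = mask_prompt(ids, response_start=4)
print(labels)  # [-100, -100, -100, -100, 105, 106]
```

Gradients flow only through positions whose labels survive, i.e. the response tokens.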
Configuring LoRA with LoraConfig
LoraConfig is where you decide how large and where to place the adapters. The five parameters that matter most — r, lora_alpha, target_modules, lora_dropout, and bias — are shown below with the values I use as defaults for most tasks:
lora_config = LoraConfig(
r=8, # Rank: controls adapter size
lora_alpha=16, # Scaling factor (2× rank is the standard heuristic)
target_modules=[ # Which weight matrices to adapt
"q_proj",
"v_proj",
],
lora_dropout=0.05, # Light regularisation on LoRA layers
bias="none", # Do not train bias terms
task_type=TaskType.CAUSAL_LM, # Causal language modeling
)
What each parameter controls:
r (rank): With r=8, a 2048×2048 weight matrix decomposes into 2048×8 and 8×2048 matrices — 32,768 trainable numbers instead of 4,194,304. Higher rank = more expressive adapter + more parameters. The hyperparameter guide at the end covers how to choose rank for different tasks.
lora_alpha: The actual scaling applied is $\alpha / r$. With alpha=16 and r=8, LoRA outputs are scaled by 2. Setting alpha = 2 × r is the heuristic I start with — if the model learns too slowly, try alpha = 4 × r; if it overfits fast, try alpha = r.
target_modules: q_proj and v_proj are the query and value projections in each attention head. These are the two most impactful layers for task adaptation. The hyperparameter section below has architecture-specific module names for Llama 3, Mistral, Falcon, and GPT-2.
lora_dropout: Dropout applied to LoRA layers during training. A value of 0.05 adds light regularisation. On small datasets (< 500 examples), increase to 0.1.
bias: Whether to train bias terms. "none" keeps adapter files small and is the standard. "lora_only" adds a few hundred KB if you need the biases to adapt.
Now apply LoRA to the base model:
base_model = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="auto",
)
base_model.config.pad_token_id = tokenizer.eos_token_id
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
Exercise 1: Expand LoRA to All Attention Layers
Modify the LoraConfig to also target k_proj and o_proj (key projection and output projection). Apply it to the model and call print_trainable_parameters() to see how the trainable percentage changes.
# Starter code — add k_proj and o_proj to target_modules
lora_config_v2 = LoraConfig(
r=8,
lora_alpha=16,
target_modules=[
"q_proj",
"v_proj",
# Add "k_proj" and "o_proj" here
],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
model_v2 = get_peft_model(
AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="auto",
),
lora_config_v2,
)
model_v2.print_trainable_parameters()
Solution
lora_config_v2 = LoraConfig(
r=8,
lora_alpha=16,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
model_v2 = get_peft_model(
AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="auto",
),
lora_config_v2,
)
model_v2.print_trainable_parameters()
Adding `k_proj` and `o_proj` roughly doubles the trainable parameter count compared to `q_proj` + `v_proj` only, since you’re adding two more adapter pairs per attention layer. For most instruction-tuning tasks on medium-sized datasets, this extra capacity isn’t necessary. For complex reasoning tasks or larger datasets (10K+ examples), it can meaningfully improve results.
Training with SFTTrainer
The most important detail to get right before writing any training code: we want the model to learn to generate responses, not to memorize instruction tokens. Without the right data collator, loss is computed on both the instruction and the response — which wastes model capacity and produces a model that sometimes emits instruction-style text in its responses.
DataCollatorForCompletionOnlyLM fixes this by masking the loss on everything before the <|assistant|> delimiter. Gradients only flow through the response tokens:
response_template = "<|assistant|>\n"
collator = DataCollatorForCompletionOnlyLM(
response_template=response_template,
tokenizer=tokenizer,
)
Five settings in TrainingArguments matter most for a LoRA run: gradient_accumulation_steps to simulate larger batches on limited VRAM, fp16 for mixed-precision training, lr_scheduler_type for smooth convergence, remove_unused_columns=False to prevent a silent data-dropping bug, and warmup_ratio to stabilize early training:
training_args = TrainingArguments(
output_dir="./tinyllama-lora-output",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # Effective batch size: 2 × 4 = 8
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
fp16=True, # Mixed precision — halves activation memory
logging_steps=10,
save_steps=100,
save_total_limit=2,
remove_unused_columns=False, # Critical: do not drop the "text" column
report_to="none",
)
The gradient_accumulation_steps=4 setting is one I almost always keep. With a per-device batch size of 2, it gives an effective batch of 8 — large enough for stable gradients without needing more VRAM. If you have a 40GB GPU, increasing the per-device batch size directly is faster, but on consumer cards this trick buys you the same effect.
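What gradient accumulation does under the hood, as a plain training-loop sketch (the Trainer handles all of this for you; the tiny linear model is just a stand-in):

```python
import torch

accum_steps = 4
model = torch.nn.Linear(10, 1)
opt = torch.optim.AdamW(model.parameters(), lr=2e-4)

updates = 0
opt.zero_grad()
for step in range(8):  # 8 micro-batches of size 2
    x, y = torch.randn(2, 10), torch.randn(2, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()     # scale so accumulated grads average out
    if (step + 1) % accum_steps == 0:   # one optimizer step per 4 micro-batches
        opt.step()                      # effective batch size: 2 x 4 = 8
        opt.zero_grad()
        updates += 1
print(updates)  # 2 optimizer updates for 8 micro-batches
```

Only the gradients accumulate between steps; peak activation memory stays that of a single micro-batch.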
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
data_collator=collator,
dataset_text_field="text",
max_seq_length=512,
)
trainer.train()
model.save_pretrained("./tinyllama-lora-adapter")
tokenizer.save_pretrained("./tinyllama-lora-adapter")
print("Adapter saved.")
Output:
Adapter saved.
The output directory contains adapter_config.json and adapter_model.safetensors — typically 5–50MB total. This is what you version-control, share, or deploy. The base model stays where it is.
Exercise 2: Add Validation Monitoring
Modify the setup to log validation loss by splitting the dataset into train/validation and adding evaluation_strategy="epoch" to catch overfitting early.
# Starter code
split = dataset.train_test_split(test_size=0.2, seed=42)
train_data = split["train"]
eval_data = split["test"]
training_args_eval = TrainingArguments(
output_dir="./tinyllama-lora-eval",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
fp16=True,
logging_steps=5,
# Add evaluation_strategy and load_best_model_at_end here
remove_unused_columns=False,
report_to="none",
)
trainer_eval = SFTTrainer(
model=model,
args=training_args_eval,
train_dataset=train_data,
# Pass eval_dataset here
data_collator=collator,
dataset_text_field="text",
max_seq_length=512,
)
Solution
split = dataset.train_test_split(test_size=0.2, seed=42)
train_data = split["train"]
eval_data = split["test"]
training_args_eval = TrainingArguments(
output_dir="./tinyllama-lora-eval",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
fp16=True,
logging_steps=5,
evaluation_strategy="epoch", # Evaluate at end of each epoch
save_strategy="epoch",
load_best_model_at_end=True, # Keep the checkpoint with lowest eval loss
remove_unused_columns=False,
report_to="none",
)
trainer_eval = SFTTrainer(
model=model,
args=training_args_eval,
train_dataset=train_data,
eval_dataset=eval_data,
data_collator=collator,
dataset_text_field="text",
max_seq_length=512,
)
Watching `eval_loss` alongside `train_loss` is the simplest overfitting detector. If training loss keeps dropping but eval loss rises or plateaus, you’re fitting noise — reduce epochs or increase `lora_dropout`.
QLoRA: Fine-Tuning on Consumer Hardware
LoRA reduces trainable parameters, but the base model still loads in fp16 — roughly 14GB for a 7B model. QLoRA adds a second optimization: load the base model in 4-bit quantization, cutting the base weights' memory to roughly a quarter of fp16.
With QLoRA, a single GPU can handle what previously required several:
- 7B models on a 12GB GPU (RTX 3060, RTX 4070)
- 13B models on a 16–24GB GPU
- 70B models on a 48GB GPU
The only addition to the standard LoRA setup is BitsAndBytesConfig. It tells the model loader to quantize weights as they load — the model activations and LoRA adapter still compute in bfloat16:
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4: optimal for normal-distribution weights
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bfloat16 for stability
bnb_4bit_use_double_quant=True, # Double quantization: saves extra ~0.4GB
)
I consistently use nf4 over fp4, the other 4-bit type bitsandbytes offers. NF4 (NormalFloat4) was specifically designed for normally-distributed values — which neural network weights tend to be. The QLoRA paper showed it achieves lower quantization error than standard 4-bit floats at the same bit width. In practice, it’s the standard choice and there’s rarely a reason to switch.
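A toy illustration of why levels placed where the weight mass actually sits beat uniformly spaced levels on Gaussian data. This shows the intuition behind NF4, not its actual codebook:

```python
import torch

torch.manual_seed(0)
w = torch.randn(100_000)  # pretrained weights are roughly normally distributed

def snap(values: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Round each value to the nearest of the 16 quantization levels."""
    idx = (values[:, None] - grid[None, :]).abs().argmin(dim=1)
    return grid[idx]

# 16 evenly spaced levels vs 16 levels at the empirical quantiles of the weights
uniform_grid = torch.linspace(w.min().item(), w.max().item(), 16)
quantile_grid = torch.quantile(w, torch.linspace(0, 1, 18)[1:-1])

err_uniform = (w - snap(w, uniform_grid)).abs().mean()
err_quantile = (w - snap(w, quantile_grid)).abs().mean()
print(f"uniform: {err_uniform:.4f}  quantile: {err_quantile:.4f}")
```

The quantile-based grid spends its 16 levels where values are dense (near zero) instead of wasting them on the sparse tails, so its mean error comes out lower.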
Load the model in 4-bit, then prepare it for gradient computation with LoRA:
model_4bit = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
quantization_config=bnb_config,
device_map="auto",
)
# Enable gradient flow through quantized layers
model_4bit = prepare_model_for_kbit_training(model_4bit)
# Apply LoRA on top of the quantized base
model_4bit = get_peft_model(model_4bit, lora_config)
model_4bit.print_trainable_parameters()
The rest of the training setup is identical to standard LoRA — pass model_4bit to SFTTrainer exactly as before.
A note on Unsloth: If training speed is your bottleneck, the Unsloth library wraps the QLoRA stack and delivers roughly 2× faster training and 60% less VRAM through custom CUDA kernels. The tradeoff is an additional dependency and a slightly different API. For most use cases, standard PEFT + bitsandbytes works well. Unsloth becomes worth it when you’re iterating on large datasets and every training run takes hours.
| Configuration | GPU Memory (7B model) | Trainable Params | Relative Quality |
|---|---|---|---|
| Full fine-tuning (fp32) | ~112GB | 100% | Baseline |
| LoRA (fp16) | ~16–20GB | 0.1–1% | 95–98% |
| QLoRA (4-bit + LoRA) | ~6–8GB | 0.1–1% | 90–95% |
| QLoRA + Unsloth | ~5–7GB | 0.1–1% | 90–95% (2× faster) |
Running Inference and Evaluating Results
Here’s where you see whether fine-tuning actually worked. The key detail that trips up almost everyone the first time: model.generate() returns the full sequence including the input prompt. The slice outputs[0][inputs["input_ids"].shape[1]:] strips the prompt and returns only the newly generated tokens.
PeftModel.from_pretrained() handles loading cleanly — the base model loads first, then the adapter layers are applied on top:
base = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="auto",
)
fine_tuned = PeftModel.from_pretrained(base, "./tinyllama-lora-adapter")
tokenizer_eval = AutoTokenizer.from_pretrained("./tinyllama-lora-adapter")
tokenizer_eval.pad_token = tokenizer_eval.eos_token
def generate_response(model, tokenizer, instruction, max_new_tokens=200):
messages = [{"role": "user", "content": instruction}]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
# Strip the input prompt from the output
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
return tokenizer.decode(new_tokens, skip_special_tokens=True)
response = generate_response(
fine_tuned,
tokenizer_eval,
"What is a list comprehension in Python?",
)
print(response)
Exercise 3: Side-by-Side Before/After Comparison
Load the base model (no adapter) and the fine-tuned model separately. Run both on the same question using generate_response() and print their answers side by side to see how the style changed.
# Starter code
question = "How do you sort a list of dictionaries by a key in Python?"
base_only = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer_base = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer_base.pad_token = tokenizer_base.eos_token
# Fill in: generate from base model
base_answer = ???
# Fill in: generate from fine-tuned model
ft_answer = ???
print("BASE MODEL:")
print(base_answer)
print("\nFINE-TUNED:")
print(ft_answer)
Solution
question = "How do you sort a list of dictionaries by a key in Python?"
base_only = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="auto",
)
tokenizer_base = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer_base.pad_token = tokenizer_base.eos_token
base_answer = generate_response(base_only, tokenizer_base, question)
ft_answer = generate_response(fine_tuned, tokenizer_eval, question)
print("BASE MODEL:")
print(base_answer)
print("\nFINE-TUNED:")
print(ft_answer)
With fine-tuning data that includes this question, the fine-tuned model should mirror the concise, code-first style of the training examples. The base model typically gives longer, more varied responses. With only 8 training examples, the shift is subtle — but the mechanics are identical to production fine-tuning runs on thousands of examples.
Merging the LoRA Adapter into the Base Model
The adapter-based setup is flexible: many tasks share one base model, and adapters can be swapped at runtime. My general rule: keep adapters separate during experimentation (easy to iterate, nothing changes in the base), merge when you’re ready for production or want to share a standalone model.
merge_and_unload() folds the scaled update $W' = W + \frac{\alpha}{r} BA$ into every adapted layer and returns a standard HuggingFace model with no PEFT dependency:
# Load base model on CPU — avoids VRAM overflow during merge on large models
merge_base = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16,
device_map="cpu",
)
to_merge = PeftModel.from_pretrained(merge_base, "./tinyllama-lora-adapter")
merged_model = to_merge.merge_and_unload()
merged_model.save_pretrained("./tinyllama-merged")
tokenizer_eval.save_pretrained("./tinyllama-merged")
print("Merged model saved.")
Output:
Merged model saved.
To share the merged model on Hugging Face Hub, push it directly after merging:
from huggingface_hub import login
login() # Prompts for your HF token
merged_model.push_to_hub("your-username/tinyllama-python-qa")
tokenizer_eval.push_to_hub("your-username/tinyllama-python-qa")
print("Model pushed to Hugging Face Hub.")
After merging, load and use the model like any standard HuggingFace model — no peft import needed:
standalone = AutoModelForCausalLM.from_pretrained(
"./tinyllama-merged",
torch_dtype=torch.float16,
device_map="auto",
)
For multi-task deployments where you want to keep adapters separate, skip merging and use hot-swapping instead:
# Load multiple adapters once — no reloading needed
model.load_adapter("./adapter-summarize", adapter_name="summarize")
model.load_adapter("./adapter-code", adapter_name="code")
model.set_adapter("summarize")
summary = generate_response(model, tokenizer_eval, "Summarize this article...")
model.set_adapter("code")
code_answer = generate_response(model, tokenizer_eval, "Write a Python function that...")
Choosing the Right LoRA Hyperparameters
This is where I see most practitioners get stuck — not in the code, but in knowing what to change when results are disappointing. Let me walk through the most impactful settings.
Rank (r) — Start Low, Increase Deliberately
Rank controls the size of each adapter pair. Here’s how to choose:
| Rank | Use Case | Approx. Trainable Params (7B model) |
|---|---|---|
| 4 | Simple style adaptation, tone shift | ~5M (0.07%) |
| 8 | Standard task adaptation (my default) | ~10M (0.14%) |
| 16 | Complex task with diverse dataset | ~20M (0.28%) |
| 32 | Multi-domain reasoning, large dataset | ~40M (0.57%) |
| 64 | Near-full fine-tuning territory | ~80M (1.14%) |
I start at r=8 for every new task. I increase to 16 or 32 only if the model clearly fails to learn after a few hundred steps. Beyond r=64, you’re spending memory on capacity you almost certainly don’t need.
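The counts in the table follow directly from the decomposition: each adapted d×k matrix adds r·(d+k) trainable values. A quick estimator — a sketch with hypothetical 7B-class dimensions, since exact totals depend on which modules you target and the model's hidden sizes:

```python
def lora_params(r: int, shapes: list[tuple[int, int]], n_layers: int) -> int:
    """Trainable LoRA parameters: r * (d + k) per adapted (d, k) matrix."""
    return n_layers * sum(r * (d + k) for d, k in shapes)

# Hypothetical config: 32 layers, q_proj and v_proj both 4096x4096
for r in (4, 8, 16, 32, 64):
    n = lora_params(r, shapes=[(4096, 4096), (4096, 4096)], n_layers=32)
    print(f"r={r:>2}: {n / 1e6:.1f}M trainable params")
```

Doubling r doubles the adapter size exactly, which is why rank is the first knob to turn when trading capacity for memory.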
Alpha (lora_alpha) — Keep It at 2× Rank
Set alpha = 2 × r as your starting point. This produces a scaling factor of 2.0, which gives LoRA meaningful influence without overwhelming the pretrained weights. If training loss barely moves, try alpha = 4 × r. If the model overfits within the first epoch, try alpha = r (scaling factor of 1.0).
Target Modules by Architecture
Different architectures name their linear layers differently. Use attention-only targeting for simple tasks; add MLP layers for complex reasoning:
| Architecture | Attention Only | Attention + MLP |
|---|---|---|
| Llama 2/3, TinyLlama | q_proj, v_proj | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Mistral, Mixtral | q_proj, v_proj | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Falcon, GPT-NeoX | query_key_value | query_key_value, dense, dense_h_to_4h, dense_4h_to_h |
| GPT-2 | c_attn, c_proj | c_attn, c_proj, c_fc |
The QLoRA paper found that applying LoRA to all linear layers — including MLP — consistently outperforms attention-only LoRA for complex tasks. For simple style adaptation, attention-only is sufficient and trains faster.
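Recent PEFT releases offer a shortcut for the all-linear-layers setup so you don't have to list module names by hand. This is a sketch assuming a reasonably current PEFT version; older versions require the explicit list:

```python
from peft import LoraConfig, TaskType

# "all-linear" targets every linear layer except the output head.
# Assumption: supported in recent PEFT releases only.
all_linear_config = LoraConfig(
    r=16,                         # a little more capacity for the wider footprint
    lora_alpha=32,
    target_modules="all-linear",
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
```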
[UNDER THE HOOD]
How PEFT finds target modules. When you pass target_modules=["q_proj", "v_proj"], PEFT iterates every named module in the model and applies a LoRA adapter to any nn.Linear layer whose name ends with those strings. You can pass a regex pattern instead of a list. To see all linear layer names in your model: {name for name, mod in model.named_modules() if isinstance(mod, torch.nn.Linear)}.
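That inspection one-liner, expanded into a runnable helper. A tiny stand-in model is used here to avoid a download; pass any HuggingFace model in practice:

```python
import torch

def linear_module_names(model: torch.nn.Module) -> set[str]:
    """Unique trailing names of all nn.Linear modules — target_modules candidates."""
    return {
        name.split(".")[-1]
        for name, module in model.named_modules()
        if isinstance(module, torch.nn.Linear)
    }

# Stand-in model: for real models this prints names like "q_proj", "v_proj"
toy = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
print(linear_module_names(toy))
```

Note that GPT-2's attention layers are transformers.Conv1D rather than nn.Linear, so for that family check Conv1D modules as well.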
Common Mistakes and How to Fix Them
These are the five errors I see most often in LoRA implementations — each one causes a real failure, not just a style issue.
Mistake 1: Not Setting the Pad Token
❌ Wrong:
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
trainer = SFTTrainer(model=model, ...)
# ValueError: Asking to pad but the tokenizer does not have a padding token.
✅ Correct:
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id
Why it breaks: Causal language models were designed for left-to-right generation, not batched training. The Trainer needs a padding token to batch examples of different lengths. Using EOS as the padding token is the standard workaround.
Mistake 2: Training on the Full Prompt Instead of Just the Response
❌ Wrong — loss computed on instruction + response:
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=512,
# No data_collator — model learns to predict instruction tokens
)
✅ Correct — mask the instruction, train on response only:
response_template = "<|assistant|>\n"
collator = DataCollatorForCompletionOnlyLM(
response_template=response_template,
tokenizer=tokenizer,
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
data_collator=collator,
dataset_text_field="text",
max_seq_length=512,
)
Why it matters: Without the completion-only collator, loss is computed on both the instruction and response. The model learns to predict instruction tokens — wasting capacity and producing responses that sometimes mirror instruction phrasing.
Mistake 3: Wrong Target Module Names for the Architecture
❌ Wrong — using Llama names on a GPT-2 model:
# PEFT silently trains zero parameters when no modules match
lora_config = LoraConfig(
target_modules=["q_proj", "v_proj"], # Llama-style — doesn't exist in GPT-2
...
)
model = get_peft_model(gpt2_model, lora_config)
model.print_trainable_parameters()
# trainable params: 0 || all params: 124,439,808 || trainable%: 0.0000
✅ Correct — inspect first:
# Find all Linear layer names in your specific model
linear_names = {
name
for name, module in model.named_modules()
if isinstance(module, torch.nn.Linear)
}
print(linear_names)
# Use the correct names from this output in target_modules
This is the most insidious mistake. Depending on your PEFT version, non-matching module names either raise a ValueError about target modules not being found (recent releases) or, in older releases, train nothing at all while your loss stays flat with no indication of why. Always verify with print_trainable_parameters().
Mistake 4: Rank Too High on a Small Dataset
High rank + few examples = fast overfitting. A model with r=64 and 100 training examples often reaches near-zero training loss within one epoch while generating incoherent responses on new inputs. Fix: start at r=8, use lora_dropout=0.1, and monitor validation loss.
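One mechanical guard against this failure mode is transformers' EarlyStoppingCallback, which halts training once eval loss stops improving. A sketch assuming the train/eval split from Exercise 2 is in place:

```python
from transformers import EarlyStoppingCallback, TrainingArguments

# Stop once eval loss fails to improve for 2 consecutive evaluations
early_stop = EarlyStoppingCallback(early_stopping_patience=2)

training_args_es = TrainingArguments(
    output_dir="./tinyllama-lora-es",
    num_train_epochs=10,             # generous upper bound; early stopping cuts it short
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,         # lower eval loss is better
    remove_unused_columns=False,
    report_to="none",
)
# Then pass callbacks=[early_stop] to SFTTrainer alongside these arguments.
```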
Mistake 5: Skipping gradient_checkpointing for 7B+ Models
For 7B+ models, training without gradient checkpointing can exhaust VRAM even with LoRA:
# Required before gradient checkpointing when using PEFT
model.enable_input_require_grads()
training_args = TrainingArguments(
...,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={"use_reentrant": False}, # PEFT compatibility
)
Gradient checkpointing recomputes activations during the backward pass instead of storing them — roughly 20–30% slower training in exchange for ~40% less VRAM. On a 7B model, this is often the difference between a run that fits in 24GB of VRAM and one that crashes out of memory.
LoRA vs QLoRA vs Full Fine-Tuning — When to Use Which
| | Full Fine-Tuning | LoRA | QLoRA | QLoRA + Unsloth |
|---|---|---|---|---|
| GPU (7B model) | ~4× A100 (320GB) | 1–2× A100 (40–80GB) | 1× RTX 4090 (24GB) | 1× RTX 4090 (24GB) |
| Training Speed | Slowest | Moderate | Moderate | 2× faster than QLoRA |
| Quality | Baseline | 95–98% | 90–95% | 90–95% |
| Deployment | Full model copy | Shared base + adapters | Shared base + adapters | Shared base + adapters |
| Best For | Max quality, large budget | Good GPUs, fast iteration | Consumer hardware | Large datasets, speed matters |
Use full fine-tuning when you have the compute budget and maximum quality is critical. A 2–5% quality improvement over LoRA may justify the 10× cost increase for customer-facing production models.
Use LoRA when you have access to decent GPUs and want near-full quality with fast experimentation. The ability to swap adapters in production without maintaining multiple full model copies is also a meaningful operational advantage.
Use QLoRA when you’re on consumer hardware or a tight budget. The quality difference from LoRA is typically 1–3%, and you gain the ability to work with models that simply don’t fit in fp16.
Use QLoRA + Unsloth when training speed is the bottleneck and you’re iterating on many adapter versions or large datasets.
Frequently Asked Questions
Can I fine-tune LoRA on multiple tasks and switch between them at inference?
Yes. Train a separate adapter for each task and save each to its own directory. At inference, load all adapters once with load_adapter(), then switch between them instantly with set_adapter() — no model reload required. This is the architecture used in production serving systems that need to handle many specialized tasks from a single base model.
Does LoRA work with all LLM architectures?
LoRA works with any architecture that has linear layers — which includes every transformer-based LLM. PEFT officially supports Llama 1/2/3, Mistral, Mixtral, Falcon, GPT-2/J/NeoX, T5, BERT, and many more. For unsupported architectures, inspect module names manually using the linear layer inspection snippet from Common Mistakes.
How do I know if my fine-tuned model actually improved?
Decreasing training loss is necessary but not sufficient — the model might be memorizing training examples. Evaluate on a held-out validation set using task-specific metrics: perplexity for language modeling quality, ROUGE for summarization, exact match or F1 for extractive QA, pass@k for code generation, or accuracy for classification. For open-ended generation, human preference evaluation is often the most meaningful signal.
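As a concrete example of the simplest metric: perplexity is just the exponential of the mean per-token negative log-likelihood on held-out text. With the HF `Trainer`, the same `math.exp` applies to the `eval_loss` returned by `trainer.evaluate()`. The NLL values below are made up for illustration:

```python
import math

# Hypothetical per-token negative log-likelihoods from a validation set.
val_nll = [2.1, 1.8, 2.4, 1.9, 2.0]

# Perplexity = exp(mean NLL); lower is better. A drop after fine-tuning
# means the model assigns higher probability to held-out domain text.
perplexity = math.exp(sum(val_nll) / len(val_nll))
print(round(perplexity, 2))  # 7.69
```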
What is the difference between LoRA and prefix tuning?
Both are parameter-efficient fine-tuning methods, but they differ in mechanism. Prefix tuning prepends learnable “virtual token” embeddings to the input sequence and trains only those vectors. LoRA injects learnable matrices into specific weight layers. In practice, LoRA is more stable to train, more memory-efficient, and achieves higher quality — which is why it became the dominant PEFT method after 2022.
Can I use LoRA to fine-tune vision transformers or multimodal models?
Yes. LoRA works on any linear layer. For vision transformers, target the attention projections in the vision encoder. For multimodal models like LLaVA or Qwen-VL, you can apply LoRA to the language model backbone, the vision encoder, or both. The PEFT code is identical — only the target_modules list changes.
Does the quantized base model need to be re-quantized after merging?
No — merge_and_unload() merges the fp16 LoRA weights into the dequantized base. The resulting merged model is in fp16. If you want to re-quantize for Ollama or llama.cpp deployment, convert to GGUF format after merging using the llama.cpp conversion tools.
What to Learn Next
Fine-tuning is one piece of the LLM engineering stack. Here are the natural next steps:
- Evaluating LLM outputs — how to measure whether your fine-tuned model is actually better
- Hugging Face Transformers guide — deep dive into the base library powering everything in this article
- LLM inference optimization — quantization, speculative decoding, and batching for production serving
- Building LLM applications with LangChain — integrating your fine-tuned model into a full retrieval-augmented pipeline
Complete Code
The full LoRA fine-tuning script, ready to copy-paste and run:
# Complete code: How to Fine-Tune LLMs with LoRA in Python
# Requirements: pip install transformers peft trl bitsandbytes datasets accelerate
# Python 3.9+ | transformers>=4.44 | peft>=0.12 | trl>=0.11 | bitsandbytes>=0.44
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from peft import (
    LoraConfig,
    TaskType,
    get_peft_model,
    PeftModel,
)
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
# --- Configuration ---
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
ADAPTER_DIR = "./tinyllama-lora-adapter"
MERGED_DIR = "./tinyllama-merged"
# --- Section 1: Dataset ---
raw_data = [
    {"instruction": "What does the enumerate() function do in Python?",
     "response": "enumerate() adds a counter to an iterable and returns it as an enumerate object. "
                 "Use it when you need both the index and value: `for i, val in enumerate(my_list): ...`"},
    {"instruction": "Explain the difference between a list and a tuple.",
     "response": "Lists are mutable while tuples are immutable. Use tuples for fixed data "
                 "and lists when you need to modify the collection."},
    {"instruction": "How do you handle exceptions in Python?",
     "response": "Use try/except blocks. Wrap risky code in `try`, then catch specific exceptions "
                 "with `except ExceptionType as e`. Always catch specific types — bare `except:` hides bugs."},
    {"instruction": "What is a list comprehension in Python?",
     "response": "A list comprehension creates a new list by applying an expression to each item in an iterable: "
                 "`[expr for item in iterable if condition]`."},
    {"instruction": "How do you read a file in Python?",
     "response": "Use `with open('filename.txt', 'r') as f: content = f.read()`. "
                 "The context manager closes the file automatically."},
    {"instruction": "What is the difference between deepcopy and copy?",
     "response": "`copy.copy()` is a shallow copy — nested objects share memory. "
                 "`copy.deepcopy()` recursively copies all nested objects."},
    {"instruction": "How do you sort a list of dictionaries by a key?",
     "response": "Use `sorted(data, key=lambda x: x['age'])`. For reverse order, add `reverse=True`."},
    {"instruction": "What are *args and **kwargs?",
     "response": "`*args` collects extra positional arguments as a tuple. "
                 "`**kwargs` collects extra keyword arguments as a dictionary."},
]
dataset = Dataset.from_list(raw_data)
# --- Section 2: Tokenizer and Formatting ---
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
def format_chat(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )}
dataset = dataset.map(format_chat)
# --- Section 3: LoRA Configuration ---
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
# --- Section 4: Load Model + Apply LoRA ---
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
model.config.pad_token_id = tokenizer.eos_token_id
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# --- Section 5: Training ---
response_template = "<|assistant|>\n"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template,
    tokenizer=tokenizer,
)
training_args = TrainingArguments(
    output_dir="./tinyllama-lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
    remove_unused_columns=False,
    report_to="none",
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collator,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
model.save_pretrained(ADAPTER_DIR)
tokenizer.save_pretrained(ADAPTER_DIR)
# --- Section 6: Inference ---
def generate_response(model, tokenizer, instruction, max_new_tokens=200):
    messages = [{"role": "user", "content": instruction}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
base_reload = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
fine_tuned = PeftModel.from_pretrained(base_reload, ADAPTER_DIR)
tok_eval = AutoTokenizer.from_pretrained(ADAPTER_DIR)
tok_eval.pad_token = tok_eval.eos_token
print(generate_response(fine_tuned, tok_eval, "What is a list comprehension?"))
# --- Section 7: Merge Adapter ---
merge_base = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cpu"
)
merged = PeftModel.from_pretrained(merge_base, ADAPTER_DIR).merge_and_unload()
merged.save_pretrained(MERGED_DIR)
tok_eval.save_pretrained(MERGED_DIR)
print("Script completed successfully.")
References
- Hu, E. J., Shen, Y., Wallis, P., et al. — LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685
- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. — QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arXiv:2305.14314
- Hugging Face PEFT Documentation — Parameter-Efficient Fine-Tuning. Link
- Hugging Face PEFT — LoRA Conceptual Guide. Link
- Hugging Face TRL — SFTTrainer Documentation. Link
- Hugging Face Transformers — BitsAndBytesConfig. Link
- Databricks Blog — Efficient Fine-Tuning with LoRA: A Guide to Optimal Parameter Selection. Link
- He, J., Zhou, C., Ma, X., et al. — Towards a Unified View of Parameter-Efficient Transfer Learning. ICLR 2022. arXiv:2110.04366
- Lialin, V., Deshpande, V., & Rumshisky, A. — Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. 2023. arXiv:2303.15647