Unsloth Fine-Tuning — Train LLMs 2x Faster with 70% Less Memory

Written by Selva Prabhakaran | 20 min read

Fine-tuning a 7-billion-parameter model normally eats 30+ GB of GPU memory. Most of us don’t have that kind of hardware sitting around. Unsloth cuts that to under 8 GB — and training finishes twice as fast.

Want to customize Llama 3.1 for your own chatbot, or train a domain-specific Q&A model? This guide walks you through every step, from loading your first model to exporting a deployment-ready file.

Before you write a single line of code, here’s how the pieces connect.

You start with a pre-trained model — Llama, Qwen, Mistral, or any supported architecture. The model already knows language, but it doesn’t know YOUR task. So you load it in 4-bit quantized form. This slashes memory from 14 GB down to roughly 5 GB.

Next, you attach thin LoRA adapter layers on top. These are small trainable matrices — only a few million parameters compared to the model’s 7 billion. The original model weights stay frozen. Only the adapters learn.


Then you feed in your dataset and let the SFTTrainer run. Question-answer pairs, instructions, conversations — whatever matches your use case.

When training finishes, you test the model and export it. Save the tiny adapter file (around 100 MB), or merge it into the full model and convert to GGUF for tools like Ollama.

That’s the whole pipeline. Six steps.

What Is Unsloth?

Unsloth is an open-source Python library that makes LLM fine-tuning 2-5x faster while using up to 70% less GPU memory.

How does it pull this off? The developers manually derived the backpropagation math for transformer layers. Then they rewrote those operations as hand-tuned Triton kernels (Triton is a language for writing GPU programs that compile to highly optimized machine code). Standard PyTorch uses generic kernels. Unsloth’s kernels are purpose-built for LoRA training patterns.

The result: identical math, less memory, faster execution. This isn’t an approximation — the computations are exactly the same.

Key Insight: **Unsloth doesn’t trade accuracy for speed.** It performs mathematically identical computations to standard training — just faster, because the GPU kernels are hand-optimized instead of generic.

I find the HuggingFace integration particularly well done. Unsloth works with transformers, PEFT, TRL, and the Hub. If you already fine-tune with HuggingFace tools, switching takes about three extra lines of code.

Supported model families: Llama 3/3.1/3.2, Qwen 2.5/3, Mistral, Gemma 2, DeepSeek, Phi-3/4, and many more. Vision-language models work through FastVisionModel.

Prerequisites and Unsloth Installation

You need the following:

  • Python: 3.10 or newer
  • GPU: Any NVIDIA GPU with CUDA support (GTX 1070 through H100)
  • VRAM: 8 GB minimum for 7B models with QLoRA. 16 GB gives you room to breathe.
  • OS: Linux, WSL2 on Windows, or Google Colab
  • CUDA: Version 11.8 or newer with compatible drivers

Install Unsloth and the training libraries:

bash
pip install unsloth
pip install --upgrade transformers trl datasets

Verify your GPU is recognized:

python
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
Output:
True
NVIDIA GeForce RTX 3090

Your device name will differ. What matters is that the first line prints True.

Note: **Running on Google Colab?** Select a T4 GPU runtime (free tier). The T4 has 15 GB VRAM — enough to fine-tune 7B models with QLoRA. Install with `!pip install unsloth` in a code cell.

LoRA vs QLoRA — Picking Your Unsloth Strategy

Before loading a model, you need to decide: LoRA or QLoRA?

Both use the same core idea. Instead of updating all 7 billion weights, you freeze the model and attach small adapter matrices. Only these adapters get trained.

The difference is how the frozen model is stored in memory.

LoRA keeps the base model in 16-bit precision. Maximum accuracy, but the model itself uses more VRAM.

QLoRA quantizes the base model to 4-bit precision before attaching adapters. This cuts memory by roughly 75%. The adapters still train in 16-bit, so training quality stays high.

| Feature | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|
| Base model precision | float16 / bfloat16 | 4-bit (NF4) |
| VRAM for 7B model | ~14 GB | ~5 GB |
| Training speed | Fast | Fast (with Unsloth) |
| Accuracy vs full fine-tune | ~99% | ~98-99% |
| Best for | 24+ GB GPUs | Consumer GPUs (8-16 GB) |

Which should you pick? Start with QLoRA. It works on almost any modern GPU. The quality difference is negligible with Unsloth’s dynamic 4-bit quantization. If you have a 24 GB+ GPU and want to maximize quality, try LoRA.
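The VRAM numbers above follow directly from parameter count and weight precision. Here is a quick back-of-the-envelope sketch covering the frozen base weights only (adapters, activations, and quantization constants add a few more GB on top):

```python
def base_model_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the frozen base weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B model stored in 16-bit vs 4-bit precision:
print(base_model_gb(7e9, 16))  # LoRA: base weights in float16
print(base_model_gb(7e9, 4))   # QLoRA: base weights quantized to 4-bit
```

The 4-bit figure lands at 3.5 GB for the weights themselves; quantization overhead and the 16-bit adapters bring real usage closer to the ~5 GB in the table.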

Tip: **Unsloth’s pre-quantized models skip the quantization step.** Models with `bnb-4bit` in the name (like `unsloth/llama-3.1-8b-bnb-4bit`) load 4x faster and use less memory during loading.

Step 1 — Load a Pre-Trained Model with Unsloth

Here’s where Unsloth replaces the standard HuggingFace loading code. Instead of AutoModelForCausalLM, you use FastLanguageModel. This wrapper applies the optimized Triton kernels automatically.

The from_pretrained method takes four key arguments. Here we load Llama 3.1 8B in 4-bit mode:

python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

What each parameter does:

  • model_name — The pre-quantized bnb-4bit version from Unsloth’s Hub. Loads faster than quantizing on the fly.
  • max_seq_length — Maximum tokens in a training example. 2048 is a solid default.
  • dtype=None — Unsloth auto-detects the best precision. Ampere+ GPUs get bfloat16, older ones get float16.
  • load_in_4bit=True — Activates QLoRA. Set to False for standard LoRA.

Warning: **Don’t set `max_seq_length` higher than you need.** Each doubling roughly doubles memory for attention. If your training examples max out at 1024 tokens, use 1024 — not 4096 “just in case.”

Step 2 — Configure LoRA Adapters

Now attach the trainable adapter layers. I recommend targeting all attention and MLP projections — this is the configuration the Unsloth team uses in their own notebooks.

The get_peft_model function takes your frozen model and wraps specific layers with small adapter matrices. Two parameters matter most: r (rank) controls adapter capacity, and target_modules picks which layers get adapters.

python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    use_rslora=False,
    random_state=3407,
)

Here’s what each setting controls:

  • r=16 — Adapter rank, or “capacity.” Values 8-64 work well. Start with 16.
  • target_modules — The attention projections (q, k, v, o) and MLP layers (gate, up, down). These are the layers where the model processes and routes information.
  • lora_alpha=16 — Scaling factor for adapter outputs. Common rule: set equal to r.
  • lora_dropout=0 — Unsloth’s kernels are optimized for zero dropout. Keep it at 0.
  • use_gradient_checkpointing="unsloth" — Custom checkpointing that saves more memory than PyTorch’s default.
  • use_rslora=False — Rank-Stabilized LoRA. When True, it changes the scaling factor from 1/r to 1/sqrt(r), which stabilizes training at higher ranks. Worth trying if you use r=32 or higher.
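To make the `lora_alpha` and `use_rslora` interaction concrete, here is a small sketch of the adapter scaling factor under both conventions. It mirrors how PEFT computes the scale (`alpha / r`, or `alpha / sqrt(r)` for rank-stabilized LoRA); treat it as an illustration rather than the library's code:

```python
import math

def lora_scaling(alpha: int, r: int, rslora: bool = False) -> float:
    # Standard LoRA scales adapter output by alpha / r.
    # rsLoRA scales by alpha / sqrt(r), which shrinks more slowly as r grows,
    # keeping adapter updates from vanishing at high ranks.
    return alpha / math.sqrt(r) if rslora else alpha / r

for r in (16, 64):
    print(r, lora_scaling(16, r), lora_scaling(16, r, rslora=True))
```

At r=16 the two conventions give 1.0 vs 4.0; at r=64 standard scaling has collapsed to 0.25 while rsLoRA holds at 2.0 — which is why rsLoRA is worth trying at higher ranks.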

Check how many parameters you’re actually training:

python
model.print_trainable_parameters()
Output:
trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196

Only about 0.5% of total parameters are trainable. That’s LoRA in action — a tiny fraction learns while the rest stays frozen.

Quick check: If you used r=16 and targeted 7 modules across 32 transformer layers, can you estimate why the trainable count is around 42 million? Each adapter adds two small matrices per targeted layer — that’s how a small r value translates into millions of parameters across an entire model.
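Here is that estimate worked out, assuming standard Llama 3.1 8B shapes (32 layers, hidden size 4096, GQA key/value dim 1024, MLP dim 14336):

```python
HIDDEN, KV, MLP, LAYERS, R = 4096, 1024, 14336, 32, 16

# Each LoRA adapter adds two matrices: A (in_dim x r) and B (r x out_dim),
# so each targeted module contributes r * (in_dim + out_dim) parameters.
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),     "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP), "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(R * (i + o) for i, o in module_shapes.values())
total = per_layer * LAYERS
print(total)  # 41,943,040 -- matches print_trainable_parameters()
```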

What if we change the rank?

What happens when you raise r from 16 to 64? More trainable parameters, which means more capacity to learn complex patterns. But also more memory and slower training.

| Rank (r) | Trainable Params (approx.) | VRAM Impact | Best For |
|---|---|---|---|
| 8 | ~21M | Lowest | Simple tasks, quick experiments |
| 16 | ~42M | Low | Most use cases (recommended) |
| 32 | ~84M | Medium | Complex tasks, large datasets |
| 64 | ~168M | Higher | Maximum adaptation capacity |

I typically start with 16 and only increase if the model isn’t learning well enough. Going above 64 rarely helps and starts eating into your memory budget. If you do bump up the rank, consider enabling use_rslora=True — it stabilizes the learning dynamics at higher values.

Key Insight: **The `r` parameter is a dial between quality and efficiency.** Low rank uses less memory and trains faster. High rank gives more learning capacity. For most tasks, r=16 hits the sweet spot.

Step 3 — Prepare Your Training Data

The model is wired up. It needs data to learn from.

Unsloth works with any HuggingFace Dataset object. The most common format for instruction fine-tuning is the Alpaca format — each example has an instruction, an optional input, and the expected output.

First, define the prompt template. This tells the model where the instruction ends and the response begins:

python
from datasets import load_dataset

alpaca_prompt = """Below is an instruction that describes a task, \
paired with an input that provides further context. \
Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

Next, write a formatting function that applies this template to every row. The crucial detail: append the EOS (end-of-sequence) token so the model learns when to stop generating:

python
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, inp, output in zip(
        instructions, inputs, outputs
    ):
        text = alpaca_prompt.format(instruction, inp, output)
        text += tokenizer.eos_token
        texts.append(text)
    return {"text": texts}
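You can sanity-check this logic without loading a model. The sketch below is self-contained: it uses a shortened template and a placeholder EOS string standing in for the real `tokenizer.eos_token`:

```python
EOS = "</s>"  # placeholder -- in the real script, use tokenizer.eos_token

alpaca_prompt = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"

def formatting_prompts_func(examples):
    # Works on a batch (dict of columns), as dataset.map(batched=True) expects.
    texts = []
    for instruction, inp, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        texts.append(alpaca_prompt.format(instruction, inp, output) + EOS)
    return {"text": texts}

batch = {"instruction": ["Add 2 and 3."], "input": [""], "output": ["5"]}
result = formatting_prompts_func(batch)
print(result["text"][0])
```

Every formatted example should end with the EOS marker — that is the signal the model learns to stop on.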

Finally, load and transform the dataset. We use alpaca-cleaned, which fixes known errors in the original Alpaca dataset:

python
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)

batched=True processes rows in batches — much faster than one at a time.

Want to use your own data? Structure it as a CSV or JSON with instruction, input, and output columns. Load it with load_dataset("csv", data_files="your_data.csv").

Tip: **Quality beats quantity for fine-tuning data.** A curated set of 1,000 high-quality examples often outperforms 50,000 noisy ones. If you’re generating synthetic training data, run it through a quality classifier first — even a simple one catches the worst examples.

Chat Format Alternative

If you’re fine-tuning a chat model (like Llama 3.1 Instruct), use the model’s built-in chat template instead of the Alpaca format. Unsloth provides a helper:

python
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

This applies the correct special tokens. Using the wrong template is one of the most common mistakes — it’s covered in the troubleshooting section below.

For multi-turn conversations in ShareGPT format, you can map the conversation fields:

python
tokenizer = get_chat_template(
    tokenizer,
    mapping={"role": "from", "content": "value",
             "user": "human", "assistant": "gpt"},
    chat_template="chatml",
)

This handles datasets where conversations use “human”/“gpt” labels instead of “user”/“assistant”.

Step 4 — Configure Training and Run

Training uses SFTTrainer from the TRL library. You configure it with TrainingArguments, which controls batch size, learning rate, and training duration.

Two settings control your effective batch size: per_device_train_batch_size is how many examples fit on the GPU at once, and gradient_accumulation_steps simulates a larger batch by accumulating gradients across multiple forward passes.

Set up the training arguments first. A learning rate of 2e-4 is the well-tested default for LoRA fine-tuning:

python
import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=60,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs",
)

What matters most here:

  • per_device_train_batch_size=2 — Keep this low on consumer GPUs (8-16 GB).
  • gradient_accumulation_steps=4 — Effective batch size = 2 x 4 = 8.
  • max_steps=60 — For a quick test. For production, use num_train_epochs=1 or 2.
  • learning_rate=2e-4 — Standard for LoRA. Too high and the model diverges.
  • optim="adamw_8bit" — Cuts optimizer memory in half with negligible quality impact.
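Those settings combine into two numbers worth computing before you hit run — the effective batch size and how many examples the run actually covers:

```python
per_device = 2    # per_device_train_batch_size
grad_accum = 4    # gradient_accumulation_steps
max_steps = 60

# Gradients are applied once per (per_device * grad_accum) examples.
effective_batch = per_device * grad_accum
# Total training examples processed over the whole run.
examples_seen = effective_batch * max_steps

print(effective_batch, examples_seen)
```

With the defaults above, a 60-step test run touches only 480 examples — fine for a smoke test, far too few for a production fine-tune over a 50K-example dataset.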

Now create the trainer and start training:

python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=training_args,
)

trainer_stats = trainer.train()

About packing: when set to True, Unsloth packs multiple short examples into one sequence. This speeds up training by 1.1-2x but can reduce quality for some tasks. I’d suggest trying it after your first successful run.

Training output shows loss at each step:

Output:
Step  Training Loss
1     2.510000
10    1.830000
20    1.420000
40    1.050000
60    0.870000

A healthy run shows loss steadily decreasing. If loss hits 0, you’re overfitting — reduce steps or add more data. A final loss between 0.5 and 1.0 is a good sign.

Warning: **Out-of-memory during training?** Reduce `per_device_train_batch_size` to 1 first. If still too tight, lower `max_seq_length` or use a smaller model.

Try It Yourself

Exercise 1: Adjust Training Hyperparameters

Modify the training configuration to simulate an effective batch size of 16 while keeping per_device_train_batch_size=2. Also switch the optimizer to full-precision AdamW.

Hints

1. Effective batch size = `per_device_train_batch_size` x `gradient_accumulation_steps`. You need accumulation steps = 16 / 2 = 8.
2. The full-precision optimizer name in HuggingFace is `"adamw_torch"`.

Solution
python
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # 2 x 8 = 16
    optim="adamw_torch",           # full-precision AdamW
    # ... other args stay the same
)

**Why this matters:** Larger effective batch sizes produce more stable gradients. The tradeoff: each “step” processes 16 examples instead of 8, so training takes longer per step. Full-precision AdamW uses more memory but can converge slightly better.

Step 5 — Test Your Fine-Tuned Model

Training is done. Before saving, does the model actually produce useful output?

Unsloth’s for_inference switches the model to optimized inference mode. This makes generation roughly 2x faster:

python
FastLanguageModel.for_inference(model)

inputs = tokenizer(
    [alpaca_prompt.format(
        "Explain what LoRA fine-tuning is in 2 sentences.",
        "",
        "",
    )],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    **inputs, max_new_tokens=128, temperature=0.7
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The model should produce a coherent answer about LoRA. If the output is gibberish or repetitive, check your data formatting and try again with more steps.

For a streaming response where tokens appear one at a time:

python
from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer)
model.generate(
    **inputs, streamer=text_streamer, max_new_tokens=128
)

I find streaming especially helpful for longer generations — you can spot problems early without waiting for the full output.

Try It Yourself

Exercise 2: Test with a Custom Prompt

Write code that generates a response to: “Write a Python function that checks if a number is prime.” Use temperature=0.3 for focused output and max_new_tokens=256.

Hints

1. Use the same `alpaca_prompt.format()` pattern. First argument is the instruction, second and third are empty strings.
2. Lower temperature means less randomness — the model picks more probable tokens.

Solution
python
FastLanguageModel.for_inference(model)

inputs = tokenizer(
    [alpaca_prompt.format(
        "Write a Python function that checks if a number is prime.",
        "",
        "",
    )],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    **inputs, max_new_tokens=256, temperature=0.3
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

**Why temperature=0.3?** For code generation, you want focused, deterministic output. Low temperature reduces randomness. For creative text, raise it to 0.7-1.0.

Step 6 — Save and Export Your Model

You have four export paths, depending on how you’ll deploy.

Option A: Save the LoRA Adapter Only

The lightest option — saves only the adapter weights (~100 MB). You need the base model to load it later.

python
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

To reload:

python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

Option B: Merged 16-bit Model

Merges adapters back into the base model. Good for sharing on the HuggingFace Hub:

python
model.save_pretrained_merged(
    "merged_model", tokenizer, save_method="merged_16bit"
)

Option C: Export to GGUF

GGUF is the format used by llama.cpp, Ollama, and other local inference tools. This is what I’d recommend for most deployment scenarios.

The q4_k_m quantization gives a solid balance between file size and quality:

python
model.save_pretrained_gguf(
    "gguf_model", tokenizer, quantization_method="q4_k_m"
)

Here’s how the quantization methods compare:

| Method | File Size (7B) | Quality | Speed |
|---|---|---|---|
| q4_k_m | ~4.4 GB | Good | Fast |
| q5_k_m | ~5.3 GB | Better | Medium |
| q8_0 | ~7.7 GB | Best | Slower |
| f16 | ~14 GB | Full | Slowest |
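The file sizes follow from effective bits per weight. This sketch estimates them from parameter count; the bits-per-weight values are rough averages I'm assuming here, and actual GGUF files vary a little by model:

```python
# Approximate effective bits per weight for common GGUF quantizations.
# These are rough averages (k-quants mix precisions), not exact values.
BITS_PER_WEIGHT = {"q4_k_m": 4.85, "q5_k_m": 5.7, "q8_0": 8.5, "f16": 16.0}

def gguf_size_gb(n_params: float, method: str) -> float:
    """Estimated GGUF file size in GB for a given quantization method."""
    return round(n_params * BITS_PER_WEIGHT[method] / 8 / 1e9, 1)

# Assuming a Mistral-7B-sized model (~7.24B parameters):
for method in BITS_PER_WEIGHT:
    print(method, gguf_size_gb(7.24e9, method), "GB")
```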

After exporting, run it with Ollama:

bash
ollama create my-model -f Modelfile
ollama run my-model

Where Modelfile points to your exported GGUF file.

Option D: Push to HuggingFace Hub

Share your fine-tuned model directly on the Hub. This makes it available to anyone:

python
model.push_to_hub_merged(
    "your-username/my-fine-tuned-model",
    tokenizer,
    save_method="merged_16bit",
)

You can also push GGUF quantizations to the Hub:

python
model.push_to_hub_gguf(
    "your-username/my-model-GGUF",
    tokenizer,
    quantization_method="q4_k_m",
)

This is how most open-source fine-tuned models get shared. The community can then download and use your model directly.

Try It Yourself

Exercise 3: Choose the Right Export Format

You’re building a customer service chatbot that will run on an RTX 3060 (12 GB VRAM). The model needs to be fast and fit comfortably alongside other processes. Which export format and quantization should you choose?

Hints

1. A 7B model in full 16-bit needs ~14 GB — too much for 12 GB.
2. GGUF works with Ollama and llama.cpp for efficient local inference.
3. Leave ~4 GB free for KV-cache and system processes.

Solution

**Use GGUF with `q4_k_m`.** At ~4.4 GB, it fits comfortably on 12 GB with room for KV-cache and other processes.

python
model.save_pretrained_gguf(
    "chatbot_model", tokenizer, quantization_method="q4_k_m"
)

You wouldn’t want `f16` (14 GB — won’t fit) or `q8_0` (7.7 GB — tight for long conversations). The `q4_k_m` method uses mixed precision, keeping the most important weights at higher precision.

Unsloth Performance Benchmarks

How much faster is Unsloth in practice? Here are benchmarks from 59 test runs across different hardware.

Speed and Memory Improvements

A100 GPU (40 GB):

| Model | Speed Improvement | VRAM Reduction |
|---|---|---|
| Llama 2 7B | 1.87x faster | -39.3% |
| Mistral 7B | 1.88x faster | -65.9% |
| Code Llama 34B | 1.94x faster | -22.7% |
| TinyLlama 1.1B | 2.74x faster | -57.8% |

T4 GPU (Free Colab):

| Model | Speed Improvement | VRAM Reduction |
|---|---|---|
| Llama 2 7B | 1.95x faster | -43.3% |
| Mistral 7B | 1.56x faster | -13.7% |
| TinyLlama 1.1B | 3.87x faster | -73.8% |

The pattern: smaller models see the biggest speedups. The overhead of standard PyTorch kernels is proportionally larger for smaller models. But even a 34B model gets nearly 2x speed.

With sequence packing enabled (packing=True), Unsloth reports an additional 1.1-2x speedup and 30% less memory on top of these numbers.

Estimated Training Times

How long will your fine-tuning actually take? Here are rough estimates for training a 7B-8B model on 100K examples (1 epoch):

| GPU | VRAM | Estimated Time | Cost (Cloud) |
|---|---|---|---|
| T4 (Colab free) | 15 GB | ~47 hours | Free |
| L4 | 24 GB | ~20 hours | ~$15 |
| A100 | 40 GB | ~5 hours | ~$10 |
| H100 | 80 GB | ~2.5 hours | ~$20 |

For a quick experiment with 60 steps on a smaller subset, even a free Colab T4 finishes in under 10 minutes. I wouldn’t worry about training time until you’re working with production datasets.

Beyond SFT — DPO, GRPO, and Vision Models

Supervised fine-tuning is the starting point. Unsloth supports three advanced training methods that all follow the same load → adapt → train → export workflow.

DPO (Direct Preference Optimization) teaches the model to prefer “good” responses over “bad” ones. You provide a dataset with chosen and rejected response pairs. I’ve found DPO particularly effective for reducing hallucinations after an initial SFT round.

Here’s the core setup — notice it’s nearly identical to SFT, just with a different trainer:

python
from trl import DPOTrainer

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # Unsloth handles this
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
    args=training_args,
    beta=0.1,
)
dpo_trainer.train()

The beta parameter controls how much the model deviates from the original behavior. Lower values produce stronger alignment. Start with 0.1.

GRPO (Group Relative Policy Optimization) is the reinforcement learning method behind DeepSeek-R1’s reasoning ability. Unsloth recently added QLoRA support for GRPO, which previously required full fine-tuning and massive VRAM. You can now train reasoning models on a single consumer GPU.

Vision model fine-tuning handles multimodal models like Qwen2.5-VL through FastVisionModel. The API mirrors the text model workflow.

Common Mistakes and How to Fix Them

Mistake 1: Forgetting the EOS Token

python
# Wrong — no end-of-sequence token
text = alpaca_prompt.format(instruction, inp, output)

What happens: The model never learns when to stop. At inference, it rambles until hitting max_new_tokens.

python
# Correct — always append EOS
text = alpaca_prompt.format(instruction, inp, output)
text += tokenizer.eos_token

Mistake 2: Setting max_seq_length Too High

python
# Wasteful — most examples are under 512 tokens
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=8192,
    load_in_4bit=True,
)

What happens: Attention memory scales with sequence length. Setting 8192 when your data peaks at 512 wastes gigabytes of VRAM.

Fix: Check your dataset’s token lengths first. Set max_seq_length to the 95th percentile.
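One way to find that 95th percentile, assuming you have already collected token lengths per example (the commented line shows the usual HuggingFace tokenizer call):

```python
# lengths = [len(tokenizer(row["text"])["input_ids"]) for row in dataset]
lengths = [180, 220, 256, 310, 400, 415, 430, 480, 505, 900]  # example values

lengths.sort()
p95 = lengths[int(0.95 * len(lengths)) - 1]  # 95th-percentile token length
print(p95)

# Round up to the next common bucket for max_seq_length.
max_seq_length = next(n for n in (512, 1024, 2048, 4096, 8192) if n >= p95)
print(max_seq_length)
```

In this toy data, one 900-token outlier would otherwise push you to a 1024+ setting; the 95th-percentile cutoff keeps `max_seq_length` at 512 and truncates only the rare long example.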

Mistake 3: Using the Wrong Chat Template

If the base model was pre-trained with Llama 3.1’s <|start_header_id|> format, feeding it Alpaca-style prompts confuses it. The model sees an unfamiliar pattern and produces worse results.

Fix: Use get_chat_template for instruct/chat models. Reserve the Alpaca format for base models.

Mistake 4: Training Too Long

A training loss of 0 means the model memorized your data. It won’t generalize.

Signs: loss drops to 0, model repeats training examples verbatim, poor performance on new prompts.

Fix: Fewer steps, more diverse data, or a slight increase in lora_dropout.

Tip: **Quick diagnostic:** If your model’s outputs look worse AFTER fine-tuning than before, the most common cause is a chat template mismatch. The second most common is too many training steps.

When NOT to Use Unsloth

You might not need fine-tuning at all. If prompt engineering or RAG solves your problem, save yourself the effort. I always try few-shot prompting before reaching for fine-tuning.

Non-NVIDIA hardware. Unsloth requires CUDA. For AMD GPUs or Apple Silicon, look at MLX (Apple) or use cloud GPUs.

Distributed multi-GPU training. Unsloth focuses on single-GPU efficiency. For 4+ GPUs, consider FSDP or DeepSpeed.

Unsupported architectures. Most popular models are covered, but niche architectures may lack optimized kernels. Check the Unsloth docs.

Complete Code

Click to expand the full script (copy-paste and run)
python
# Complete fine-tuning script using Unsloth
# Requires: pip install unsloth transformers trl datasets
# GPU: Any NVIDIA GPU with 8+ GB VRAM
# Python 3.10+

import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# --- Step 1: Load Model ---
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# --- Step 2: Configure LoRA ---
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    use_rslora=False,
    random_state=3407,
)

# --- Step 3: Prepare Dataset ---
alpaca_prompt = """Below is an instruction that describes a task, \
paired with an input that provides further context. \
Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, inp, output in zip(
        instructions, inputs, outputs
    ):
        text = alpaca_prompt.format(instruction, inp, output)
        text += tokenizer.eos_token
        texts.append(text)
    return {"text": texts}

dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)

# --- Step 4: Train ---
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

trainer_stats = trainer.train()

# --- Step 5: Test ---
FastLanguageModel.for_inference(model)

inputs = tokenizer(
    [alpaca_prompt.format(
        "Explain what LoRA fine-tuning is in 2 sentences.",
        "",
        "",
    )],
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    **inputs, max_new_tokens=128, temperature=0.7
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# --- Step 6: Save ---
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# Optional: Export to GGUF
# model.save_pretrained_gguf(
#     "gguf_model", tokenizer, quantization_method="q4_k_m"
# )

print("Fine-tuning complete!")

Frequently Asked Questions

Can I fine-tune on multiple GPUs with Unsloth?

Unsloth is optimized for single-GPU training. For multi-GPU setups, combine it with PyTorch’s DataParallel. For true distributed training across many GPUs, DeepSpeed or FSDP are better suited.

How much data do I need for fine-tuning?

For simple instruction-following, 500-1000 high-quality examples often suffice. For domain-specific knowledge, 5,000-10,000 examples work better. Quality always beats quantity.

Does Unsloth support continued pre-training?

Yes. Set full_finetuning=True in from_pretrained instead of using LoRA. This updates all weights — useful for knowledge injection. It requires roughly 4x more VRAM than QLoRA.

Can I merge the LoRA adapter into the base model?

Yes. Use save_pretrained_merged with save_method="merged_16bit". The merged model works with any HuggingFace tool without needing Unsloth or PEFT at inference.

What’s the difference between Unsloth and Axolotl?

Axolotl provides YAML-based config with multi-GPU support. Unsloth focuses on single-GPU kernel optimization. You can use both together — Axolotl supports Unsloth as a backend.

How do I know if my fine-tuning worked?

Compare the model’s responses before and after training on 5-10 test prompts that match your use case. Look for: coherent, on-topic responses that follow the format of your training data. If responses are worse than the base model, check your chat template and training steps.

References

  1. Unsloth Documentation — Fine-tuning LLMs Guide. Link
  2. Unsloth GitHub Repository — unslothai/unsloth. Link
  3. Hu, E. J. et al. — LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 (2021). Link
  4. Dettmers, T. et al. — QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314 (2023). Link
  5. HuggingFace Blog — Make LLM Fine-tuning 2x faster with Unsloth and TRL. Link
  6. TRL Documentation — SFTTrainer. Link
  7. PEFT Documentation — LoRA. Link
  8. NVIDIA Blog — How to Fine-Tune LLMs on RTX GPUs With Unsloth. Link
  9. Rafailov, R. et al. — Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290 (2023). Link
  10. Kalajdzievski, D. — A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA. arXiv:2312.03732 (2023). Link

Last reviewed: March 2026. Tested with Unsloth 2025.3, transformers 4.48, TRL 0.14, Python 3.11.
