Unsloth Fine-Tuning — Train LLMs 2x Faster with 70% Less Memory
Fine-tuning a 7-billion-parameter model normally eats 30+ GB of GPU memory. Most of us don’t have that kind of hardware sitting around. Unsloth cuts that to under 8 GB — and training finishes twice as fast.
Want to customize Llama 3.1 for your own chatbot, or train a domain-specific Q&A model? This guide walks you through every step, from loading your first model to exporting a deployment-ready file.
Before you write a single line of code, here’s how the pieces connect.
You start with a pre-trained model — Llama, Qwen, Mistral, or any supported architecture. The model already knows language, but it doesn’t know YOUR task. So you load it in 4-bit quantized form. This slashes memory from 14 GB down to roughly 5 GB.
Next, you attach thin LoRA adapter layers on top. These are small trainable matrices — only a few million parameters compared to the model’s 7 billion. The original model weights stay frozen. Only the adapters learn.
Then you feed in your dataset and let the SFTTrainer run. Question-answer pairs, instructions, conversations — whatever matches your use case.
When training finishes, you test the model and export it. Save the tiny adapter file (around 100 MB), or merge it into the full model and convert to GGUF for tools like Ollama.
That’s the whole pipeline. Six steps.
What Is Unsloth?
Unsloth is an open-source Python library that makes LLM fine-tuning 2-5x faster while using up to 70% less GPU memory.
How does it pull this off? The developers manually derived the backpropagation math for transformer layers. Then they rewrote those operations as hand-tuned Triton kernels (Triton is a language for writing GPU programs that compile to highly optimized machine code). Standard PyTorch uses generic kernels. Unsloth’s kernels are purpose-built for LoRA training patterns.
The result: identical math, less memory, faster execution. This isn’t an approximation — the computations are exactly the same.
I find the HuggingFace integration particularly well done. Unsloth works with transformers, PEFT, TRL, and the Hub. If you already fine-tune with HuggingFace tools, switching takes about three extra lines of code.
Supported model families: Llama 3/3.1/3.2, Qwen 2.5/3, Mistral, Gemma 2, DeepSeek, Phi-3/4, and many more. Vision-language models work through FastVisionModel.
Prerequisites and Unsloth Installation
You need the following:
- Python: 3.10 or newer
- GPU: Any NVIDIA GPU with CUDA support (GTX 1070 through H100)
- VRAM: 8 GB minimum for 7B models with QLoRA. 16 GB gives you room to breathe.
- OS: Linux, WSL2 on Windows, or Google Colab
- CUDA: Version 11.8 or newer with compatible drivers
Install Unsloth and the training libraries:
pip install unsloth
pip install --upgrade transformers trl datasets
Verify your GPU is recognized:
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
True
NVIDIA GeForce RTX 3090
Your device name will differ. What matters is that the first line prints True.
LoRA vs QLoRA — Picking Your Unsloth Strategy
Before loading a model, you need to decide: LoRA or QLoRA?
Both use the same core idea. Instead of updating all 7 billion weights, you freeze the model and attach small adapter matrices. Only these adapters get trained.
The difference is how the frozen model is stored in memory.
LoRA keeps the base model in 16-bit precision. Maximum accuracy, but the model itself uses more VRAM.
QLoRA quantizes the base model to 4-bit precision before attaching adapters. This cuts memory by roughly 75%. The adapters still train in 16-bit, so training quality stays high.
| Feature | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|
| Base model precision | float16 / bfloat16 | 4-bit (NF4) |
| VRAM for 7B model | ~14 GB | ~5 GB |
| Training speed | Fast | Fast (with Unsloth) |
| Accuracy vs full fine-tune | ~99% | ~98-99% |
| Best for | 24+ GB GPUs | Consumer GPUs (8-16 GB) |
Which should you pick? Start with QLoRA. It works on almost any modern GPU. The quality difference is negligible with Unsloth’s dynamic 4-bit quantization. If you have a 24 GB+ GPU and want to maximize quality, try LoRA.
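The VRAM figures in the table come from simple byte arithmetic. Here's the back-of-envelope version, keeping in mind that real usage adds CUDA context, activations, and quantization overhead on top of the raw weights:

```python
# Rough VRAM estimate for a 7B-parameter base model's weights alone.
# Real usage is higher: CUDA context, activations, optimizer state,
# and quantization bookkeeping all add overhead.
params = 7e9

lora_weights_gb = params * 2 / 1e9    # 16-bit precision = 2 bytes per weight
qlora_weights_gb = params * 0.5 / 1e9  # 4-bit precision = 0.5 bytes per weight

print(f"LoRA  (16-bit) weights: ~{lora_weights_gb:.0f} GB")   # ~14 GB
print(f"QLoRA (4-bit)  weights: ~{qlora_weights_gb:.1f} GB")  # ~3.5 GB raw
```

The 4-bit weights alone are about 3.5 GB; the ~5 GB figure in the table includes the extra buffers a real training run needs.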
Step 1 — Load a Pre-Trained Model with Unsloth
Here’s where Unsloth replaces the standard HuggingFace loading code. Instead of AutoModelForCausalLM, you use FastLanguageModel. This wrapper applies the optimized Triton kernels automatically.
The from_pretrained method takes four key arguments. Here we load Llama 3.1 8B in 4-bit mode:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
max_seq_length=2048,
dtype=None,
load_in_4bit=True,
)
What each parameter does:
- `model_name` — The pre-quantized `bnb-4bit` version from Unsloth's Hub. Loads faster than quantizing on the fly.
- `max_seq_length` — Maximum tokens in a training example. 2048 is a solid default.
- `dtype=None` — Unsloth auto-detects the best precision. Ampere+ GPUs get `bfloat16`, older ones get `float16`.
- `load_in_4bit=True` — Activates QLoRA. Set to `False` for standard LoRA.
Step 2 — Configure LoRA Adapters
Now attach the trainable adapter layers. I recommend targeting all attention and MLP projections — this is the configuration the Unsloth team uses in their own notebooks.
The get_peft_model function takes your frozen model and wraps specific layers with small adapter matrices. Two parameters matter most: r (rank) controls adapter capacity, and target_modules picks which layers get adapters.
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
use_rslora=False,
random_state=3407,
)
Here’s what each setting controls:
- `r=16` — Adapter rank, or "capacity." Values 8-64 work well. Start with 16.
- `target_modules` — The attention projections (q, k, v, o) and MLP layers (gate, up, down). These are the layers where the model processes and routes information.
- `lora_alpha=16` — Scaling factor for adapter outputs. Common rule: set equal to `r`.
- `lora_dropout=0` — Unsloth's kernels are optimized for zero dropout. Keep it at 0.
- `use_gradient_checkpointing="unsloth"` — Custom checkpointing that saves more memory than PyTorch's default.
- `use_rslora=False` — Rank-Stabilized LoRA. When `True`, it changes the adapter scaling factor from `alpha/r` to `alpha/sqrt(r)`, which stabilizes training at higher ranks. Worth trying if you use r=32 or higher.
Check how many parameters you’re actually training:
model.print_trainable_parameters()
trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196
Only about 0.5% of total parameters are trainable. That’s LoRA in action — a tiny fraction learns while the rest stays frozen.
Quick check: If you used r=16 and targeted 7 modules across 32 transformer layers, can you estimate why the trainable count is around 42 million? Each adapter adds two small matrices per targeted layer — that’s how a small r value translates into millions of parameters across an entire model.
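You can verify the estimate directly. Each adapter is a pair of matrices A (`in_dim × r`) and B (`r × out_dim`), adding `r * (in_dim + out_dim)` parameters per targeted layer. The dimensions below come from the Llama 3.1 8B config (hidden size 4096, 8 KV heads giving 1024-dim k/v projections, 14336-dim MLP, 32 layers):

```python
# Count LoRA trainable parameters for Llama 3.1 8B with r=16.
# Each adapted projection gets A (in_dim x r) and B (r x out_dim),
# i.e. r * (in_dim + out_dim) extra parameters per layer.
r = 16
layers = 32
shapes = {  # (in_dim, out_dim) per projection, from the Llama 3.1 8B config
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
total = per_layer * layers
print(f"{total:,} trainable parameters")  # 41,943,040
```

That reproduces the `41,943,040` figure from `print_trainable_parameters()` exactly.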
What if we change the rank?
What happens when you raise r from 16 to 64? More trainable parameters, which means more capacity to learn complex patterns. But also more memory and slower training.
| Rank (r) | Trainable Params (approx.) | VRAM Impact | Best For |
|---|---|---|---|
| 8 | ~21M | Lowest | Simple tasks, quick experiments |
| 16 | ~42M | Low | Most use cases (recommended) |
| 32 | ~84M | Medium | Complex tasks, large datasets |
| 64 | ~168M | Higher | Maximum adaptation capacity |
I typically start with 16 and only increase if the model isn’t learning well enough. Going above 64 rarely helps and starts eating into your memory budget. If you do bump up the rank, consider enabling use_rslora=True — it stabilizes the learning dynamics at higher values.
Step 3 — Prepare Your Training Data
The model is wired up. It needs data to learn from.
Unsloth works with any HuggingFace Dataset object. The most common format for instruction fine-tuning is the Alpaca format — each example has an instruction, an optional input, and the expected output.
First, define the prompt template. This tells the model where the instruction ends and the response begins:
from datasets import load_dataset
alpaca_prompt = """Below is an instruction that describes a task, \
paired with an input that provides further context. \
Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
Next, write a formatting function that applies this template to every row. The crucial detail: append the EOS (end-of-sequence) token so the model learns when to stop generating:
def formatting_prompts_func(examples):
instructions = examples["instruction"]
inputs = examples["input"]
outputs = examples["output"]
texts = []
for instruction, inp, output in zip(
instructions, inputs, outputs
):
text = alpaca_prompt.format(instruction, inp, output)
text += tokenizer.eos_token
texts.append(text)
return {"text": texts}
Finally, load and transform the dataset. We use alpaca-cleaned, which fixes known errors in the original Alpaca dataset:
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)
batched=True processes rows in batches — much faster than one at a time.
Want to use your own data? Structure it as a CSV or JSON with instruction, input, and output columns. Load it with load_dataset("csv", data_files="your_data.csv").
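If you're assembling that CSV programmatically, a minimal stdlib sketch looks like this. The filename and example rows are placeholders, so substitute your real data:

```python
# Build a small instruction dataset as a CSV with the three columns
# load_dataset("csv", ...) expects. Rows and filename are examples only.
import csv

rows = [
    {"instruction": "Summarize the ticket.",
     "input": "Customer reports login fails after password reset.",
     "output": "Login failure following a password reset."},
    {"instruction": "Classify the sentiment.",
     "input": "The product arrived broken.",
     "output": "negative"},
]

with open("your_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["instruction", "input", "output"])
    writer.writeheader()
    writer.writerows(rows)

# Then load it exactly like the Alpaca dataset:
# dataset = load_dataset("csv", data_files="your_data.csv")["train"]
```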
Chat Format Alternative
If you’re fine-tuning a chat model (like Llama 3.1 Instruct), use the model’s built-in chat template instead of the Alpaca format. Unsloth provides a helper:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template="llama-3.1",
)
This applies the correct special tokens. Using the wrong template is one of the most common mistakes — it’s covered in the troubleshooting section below.
For multi-turn conversations in ShareGPT format, you can map the conversation fields:
tokenizer = get_chat_template(
tokenizer,
mapping={"role": "from", "content": "value",
"user": "human", "assistant": "gpt"},
chat_template="chatml",
)
This handles datasets where conversations use "human"/"gpt" labels instead of "user"/"assistant".
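To make the field translation concrete, here is the conversion in plain Python. The actual remapping happens inside Unsloth when you pass `mapping`; this sketch only illustrates what that mapping means:

```python
# ShareGPT rows use "from"/"value" keys with "human"/"gpt" speakers;
# chat templates expect "role"/"content" with "user"/"assistant".
sharegpt_turns = [
    {"from": "human", "value": "What is LoRA?"},
    {"from": "gpt", "value": "A parameter-efficient fine-tuning method."},
]

speaker_map = {"human": "user", "gpt": "assistant"}
converted = [
    {"role": speaker_map[turn["from"]], "content": turn["value"]}
    for turn in sharegpt_turns
]
print(converted[0])  # {'role': 'user', 'content': 'What is LoRA?'}
```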
Step 4 — Configure Training and Run
Training uses SFTTrainer from the TRL library. You configure it with TrainingArguments, which controls batch size, learning rate, and training duration.
Two settings control your effective batch size: per_device_train_batch_size is how many examples fit on the GPU at once, and gradient_accumulation_steps simulates a larger batch by accumulating gradients across multiple forward passes.
Set up the training arguments first. A learning rate of 2e-4 is the well-tested default for LoRA fine-tuning:
import torch
from transformers import TrainingArguments
training_args = TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=5,
max_steps=60,
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=1,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
seed=3407,
output_dir="outputs",
)
What matters most here:
- `per_device_train_batch_size=2` — Keep this low on consumer GPUs (8-16 GB).
- `gradient_accumulation_steps=4` — Effective batch size = 2 x 4 = 8.
- `max_steps=60` — For a quick test. For production, use `num_train_epochs=1` or `2`.
- `learning_rate=2e-4` — Standard for LoRA. Too high and the model diverges.
- `optim="adamw_8bit"` — Cuts optimizer memory in half with negligible quality impact.
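Why does accumulation "simulate" a larger batch? Because averaging the gradients of equal-sized micro-batches gives the same result as one gradient over the full batch. A toy one-parameter example (no GPU needed) makes this visible:

```python
# Toy model with one weight w and per-example loss (w - x)^2,
# so d(loss)/dw = 2 * (w - x). Accumulating 4 micro-batches of 2
# reproduces the gradient of one full batch of 8 — which is why
# batch size 2 x accumulation 4 behaves like batch size 8.
w = 0.2
data = [0.1, 0.9, 0.4, 0.6, 0.2, 0.8, 0.3, 0.7]

def mean_grad(batch, w):
    return sum(2 * (w - x) for x in batch) / len(batch)

full = mean_grad(data, w)  # one big batch of 8

micro = [data[i:i + 2] for i in range(0, 8, 2)]  # 4 micro-batches of 2
accumulated = sum(mean_grad(b, w) for b in micro) / len(micro)

print(full, accumulated)  # equal up to float rounding
```

This equivalence holds when micro-batches are equal-sized, which is the normal case during training.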
Now create the trainer and start training:
from trl import SFTTrainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=2048,
dataset_num_proc=2,
packing=False,
args=training_args,
)
trainer_stats = trainer.train()
About packing: when set to True, Unsloth packs multiple short examples into one sequence. This speeds up training by 1.1-2x but can reduce quality for some tasks. I’d suggest trying it after your first successful run.
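Conceptually, packing greedily concatenates short tokenized examples into one sequence so fewer slots are wasted on padding. This sketch illustrates the idea only; it is not Unsloth's or TRL's actual implementation:

```python
# Greedy packing sketch: merge short tokenized examples into
# sequences of at most max_seq_length tokens (toy numbers).
max_seq_length = 10
tokenized = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10, 11, 12], [13]]

packed, current = [], []
for example in tokenized:
    if len(current) + len(example) > max_seq_length:
        packed.append(current)   # sequence is full, start a new one
        current = []
    current += example
if current:
    packed.append(current)

print(packed)  # [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13]]
```

Two sequences instead of five means fewer padded positions, which is where the speedup comes from.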
Training output shows loss at each step:
Step Training Loss
1 2.510000
10 1.830000
20 1.420000
40 1.050000
60 0.870000
A healthy run shows loss steadily decreasing. If loss hits 0, you’re overfitting — reduce steps or add more data. A final loss between 0.5 and 1.0 is a good sign.
Exercise 1: Adjust Training Hyperparameters
Modify the training configuration to simulate an effective batch size of 16 while keeping per_device_train_batch_size=2. Also switch the optimizer to full-precision AdamW.
Hints
1. Effective batch size = `per_device_train_batch_size` x `gradient_accumulation_steps`. You need accumulation steps = 16 / 2 = 8.
2. The full-precision optimizer name in HuggingFace is `"adamw_torch"`.
Solution
training_args = TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=8, # 2 x 8 = 16
optim="adamw_torch", # full-precision AdamW
# ... other args stay the same
)
**Why this matters:** Larger effective batch sizes produce more stable gradients. The tradeoff: each “step” processes 16 examples instead of 8, so training takes longer per step. Full-precision AdamW uses more memory but can converge slightly better.
Step 5 — Test Your Fine-Tuned Model
Training is done. Before saving, does the model actually produce useful output?
Unsloth’s for_inference switches the model to optimized inference mode. This makes generation roughly 2x faster:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[alpaca_prompt.format(
"Explain what LoRA fine-tuning is in 2 sentences.",
"",
"",
)],
return_tensors="pt",
).to("cuda")
outputs = model.generate(
**inputs, max_new_tokens=128, temperature=0.7
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The model should produce a coherent answer about LoRA. If the output is gibberish or repetitive, check your data formatting and try again with more steps.
For a streaming response where tokens appear one at a time:
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
model.generate(
**inputs, streamer=text_streamer, max_new_tokens=128
)
I find streaming especially helpful for longer generations — you can spot problems early without waiting for the full output.
Exercise 2: Test with a Custom Prompt
Write code that generates a response to: “Write a Python function that checks if a number is prime.” Use temperature=0.3 for focused output and max_new_tokens=256.
Hints
1. Use the same `alpaca_prompt.format()` pattern. First argument is the instruction, second and third are empty strings.
2. Lower temperature means less randomness — the model picks more probable tokens.
Solution
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[alpaca_prompt.format(
"Write a Python function that checks if a number is prime.",
"",
"",
)],
return_tensors="pt",
).to("cuda")
outputs = model.generate(
**inputs, max_new_tokens=256, temperature=0.3
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
**Why temperature=0.3?** For code generation, you want focused, deterministic output. Low temperature reduces randomness. For creative text, raise it to 0.7-1.0.
Step 6 — Save and Export Your Model
You have four export paths, depending on how you’ll deploy.
Option A: Save the LoRA Adapter Only
The lightest option — saves only the adapter weights (~100 MB). You need the base model to load it later.
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
To reload:
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="lora_model",
max_seq_length=2048,
dtype=None,
load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
Option B: Merged 16-bit Model
Merges adapters back into the base model. Good for sharing on the HuggingFace Hub:
model.save_pretrained_merged(
"merged_model", tokenizer, save_method="merged_16bit"
)
Option C: Export to GGUF
GGUF is the format used by llama.cpp, Ollama, and other local inference tools. This is what I’d recommend for most deployment scenarios.
The q4_k_m quantization gives a solid balance between file size and quality:
model.save_pretrained_gguf(
"gguf_model", tokenizer, quantization_method="q4_k_m"
)
Here’s how the quantization methods compare:
| Method | File Size (7B) | Quality | Speed |
|---|---|---|---|
| `q4_k_m` | ~4.4 GB | Good | Fast |
| `q5_k_m` | ~5.3 GB | Better | Medium |
| `q8_0` | ~7.7 GB | Best | Slower |
| `f16` | ~14 GB | Full | Slowest |
After exporting, run it with Ollama:
ollama create my-model -f Modelfile
ollama run my-model
Where Modelfile points to your exported GGUF file.
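A minimal Modelfile can be as short as one `FROM` line. The `.gguf` filename below is an assumption for illustration; check your `gguf_model/` directory for the name Unsloth actually wrote:

```
# Minimal Modelfile sketch: the FROM path is an example,
# point it at the .gguf file actually written to gguf_model/
FROM ./gguf_model/unsloth.Q4_K_M.gguf
PARAMETER temperature 0.7
```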
Option D: Push to HuggingFace Hub
Share your fine-tuned model directly on the Hub. This makes it available to anyone:
model.push_to_hub_merged(
"your-username/my-fine-tuned-model",
tokenizer,
save_method="merged_16bit",
)
You can also push GGUF quantizations to the Hub:
model.push_to_hub_gguf(
"your-username/my-model-GGUF",
tokenizer,
quantization_method="q4_k_m",
)
This is how most open-source fine-tuned models get shared. The community can then download and use your model directly.
Exercise 3: Choose the Right Export Format
You’re building a customer service chatbot that will run on an RTX 3060 (12 GB VRAM). The model needs to be fast and fit comfortably alongside other processes. Which export format and quantization should you choose?
Hints
1. A 7B model in full 16-bit needs ~14 GB — too much for 12 GB.
2. GGUF works with Ollama and llama.cpp for efficient local inference.
3. Leave ~4 GB free for KV-cache and system processes.
Solution
**Use GGUF with `q4_k_m`.** At ~4.4 GB, it fits comfortably on 12 GB with room for KV-cache and other processes.
model.save_pretrained_gguf(
"chatbot_model", tokenizer, quantization_method="q4_k_m"
)
You wouldn’t want `f16` (14 GB — won’t fit) or `q8_0` (7.7 GB — tight for long conversations). The `q4_k_m` method uses mixed precision, keeping the most important weights at higher precision.
Unsloth Performance Benchmarks
How much faster is Unsloth in practice? Here are benchmarks from 59 test runs across different hardware.
Speed and Memory Improvements
A100 GPU (40 GB):
| Model | Speed Improvement | VRAM Reduction |
|---|---|---|
| Llama 2 7B | 1.87x faster | -39.3% VRAM |
| Mistral 7B | 1.88x faster | -65.9% VRAM |
| Code Llama 34B | 1.94x faster | -22.7% VRAM |
| TinyLlama 1.1B | 2.74x faster | -57.8% VRAM |
T4 GPU (Free Colab):
| Model | Speed Improvement | VRAM Reduction |
|---|---|---|
| Llama 2 7B | 1.95x faster | -43.3% VRAM |
| Mistral 7B | 1.56x faster | -13.7% VRAM |
| TinyLlama 1.1B | 3.87x faster | -73.8% VRAM |
The pattern: smaller models see the biggest speedups. The overhead of standard PyTorch kernels is proportionally larger for smaller models. But even a 34B model gets nearly 2x speed.
With sequence packing enabled (packing=True), Unsloth reports an additional 1.1-2x speedup and 30% less memory on top of these numbers.
Estimated Training Times
How long will your fine-tuning actually take? Here are rough estimates for training a 7B-8B model on 100K examples (1 epoch):
| GPU | VRAM | Estimated Time | Cost (Cloud) |
|---|---|---|---|
| T4 (Colab free) | 15 GB | ~47 hours | Free |
| L4 | 24 GB | ~20 hours | ~$15 |
| A100 40GB | 40 GB | ~5 hours | ~$10 |
| H100 | 80 GB | ~2.5 hours | ~$20 |
For a quick experiment with 60 steps on a smaller subset, even a free Colab T4 finishes in under 10 minutes. I wouldn’t worry about training time until you’re working with production datasets.
Beyond SFT — DPO, GRPO, and Vision Models
Supervised fine-tuning is the starting point. Unsloth supports three advanced training methods that all follow the same load → adapt → train → export workflow.
DPO (Direct Preference Optimization) teaches the model to prefer “good” responses over “bad” ones. You provide a dataset with chosen and rejected response pairs. I’ve found DPO particularly effective for reducing hallucinations after an initial SFT round.
Here’s the core setup — notice it’s nearly identical to SFT, just with a different trainer:
from trl import DPOTrainer
dpo_trainer = DPOTrainer(
model=model,
ref_model=None, # Unsloth handles this
train_dataset=dpo_dataset,
tokenizer=tokenizer,
args=training_args,
beta=0.1,
)
dpo_trainer.train()
The beta parameter controls how much the model deviates from the original behavior. Lower values produce stronger alignment. Start with 0.1.
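A DPO dataset pairs each prompt with a preferred and a dispreferred response. The field names `prompt`, `chosen`, and `rejected` are what TRL's `DPOTrainer` expects; the example content below is made up:

```python
# Shape of a preference dataset row for DPO training.
# Field names match TRL's expected schema; content is illustrative.
preference_rows = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France's capital is Lyon.",
    },
]

# To turn this into a trainable dataset (via the `datasets` library):
# dpo_dataset = Dataset.from_list(preference_rows)
print(list(preference_rows[0]))  # ['prompt', 'chosen', 'rejected']
```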
GRPO (Group Relative Policy Optimization) is the reinforcement learning method behind DeepSeek-R1’s reasoning ability. Unsloth recently added QLoRA support for GRPO, which previously required full fine-tuning and massive VRAM. You can now train reasoning models on a single consumer GPU.
Vision model fine-tuning handles multimodal models like Qwen2.5-VL through FastVisionModel. The API mirrors the text model workflow.
Common Mistakes and How to Fix Them
Mistake 1: Forgetting the EOS Token
# Wrong — no end-of-sequence token
text = alpaca_prompt.format(instruction, inp, output)
What happens: The model never learns when to stop. At inference, it rambles until hitting max_new_tokens.
# Correct — always append EOS
text = alpaca_prompt.format(instruction, inp, output)
text += tokenizer.eos_token
Mistake 2: Setting max_seq_length Too High
# Wasteful — most examples are under 512 tokens
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
max_seq_length=8192,
load_in_4bit=True,
)
What happens: Attention memory scales with sequence length. Setting 8192 when your data peaks at 512 wastes gigabytes of VRAM.
Fix: Check your dataset’s token lengths first. Set max_seq_length to the 95th percentile.
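A quick way to find that 95th percentile, sketched with whitespace splitting as a crude stand-in for real token counts. For actual data, measure with `len(tokenizer(text)["input_ids"])` instead:

```python
# Estimate the 95th percentile of example lengths to pick max_seq_length.
# Whitespace-splitting is a rough proxy; use the real tokenizer in practice.
import statistics

texts = [
    "short example",
    "a somewhat longer training example with more words in it",
    "medium length example text here",
] * 40  # stand-in for a real dataset

lengths = [len(t.split()) for t in texts]
p95 = statistics.quantiles(lengths, n=20)[-1]  # last cut point = 95th percentile
print(f"95th percentile length: {p95:.0f} tokens")
```

Round the result up to a convenient value (say, the next multiple of 256) and use that as `max_seq_length`.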
Mistake 3: Using the Wrong Chat Template
If the base model was pre-trained with Llama 3.1’s <|start_header_id|> format, feeding it Alpaca-style prompts confuses it. The model sees an unfamiliar pattern and produces worse results.
Fix: Use get_chat_template for instruct/chat models. Reserve the Alpaca format for base models.
Mistake 4: Training Too Long
A training loss of 0 means the model memorized your data. It won’t generalize.
Signs: loss drops to 0, model repeats training examples verbatim, poor performance on new prompts.
Fix: Fewer steps, more diverse data, or a slight increase in lora_dropout.
When NOT to Use Unsloth
You might not need fine-tuning at all. If prompt engineering or RAG solves your problem, save yourself the effort. I always try few-shot prompting before reaching for fine-tuning.
Non-NVIDIA hardware. Unsloth requires CUDA. For Apple Silicon, look at Apple's MLX framework; for AMD GPUs, your most practical option is usually a cloud NVIDIA GPU.
Distributed multi-GPU training. Unsloth focuses on single-GPU efficiency. For 4+ GPUs, consider FSDP or DeepSpeed.
Unsupported architectures. Most popular models are covered, but niche architectures may lack optimized kernels. Check the Unsloth docs.
Complete Code
Click to expand the full script (copy-paste and run)
# Complete fine-tuning script using Unsloth
# Requires: pip install unsloth transformers trl datasets
# GPU: Any NVIDIA GPU with 8+ GB VRAM
# Python 3.10+
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
# --- Step 1: Load Model ---
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
max_seq_length=2048,
dtype=None,
load_in_4bit=True,
)
# --- Step 2: Configure LoRA ---
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
use_rslora=False,
random_state=3407,
)
# --- Step 3: Prepare Dataset ---
alpaca_prompt = """Below is an instruction that describes a task, \
paired with an input that provides further context. \
Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
def formatting_prompts_func(examples):
instructions = examples["instruction"]
inputs = examples["input"]
outputs = examples["output"]
texts = []
for instruction, inp, output in zip(
instructions, inputs, outputs
):
text = alpaca_prompt.format(instruction, inp, output)
text += tokenizer.eos_token
texts.append(text)
return {"text": texts}
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)
# --- Step 4: Train ---
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=2048,
dataset_num_proc=2,
packing=False,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=5,
max_steps=60,
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=1,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
seed=3407,
output_dir="outputs",
),
)
trainer_stats = trainer.train()
# --- Step 5: Test ---
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[alpaca_prompt.format(
"Explain what LoRA fine-tuning is in 2 sentences.",
"",
"",
)],
return_tensors="pt",
).to("cuda")
outputs = model.generate(
**inputs, max_new_tokens=128, temperature=0.7
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# --- Step 6: Save ---
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
# Optional: Export to GGUF
# model.save_pretrained_gguf(
# "gguf_model", tokenizer, quantization_method="q4_k_m"
# )
print("Fine-tuning complete!")
Frequently Asked Questions
Can I fine-tune on multiple GPUs with Unsloth?
Unsloth is optimized for single-GPU training. For multi-GPU setups, combine it with PyTorch’s DataParallel. For true distributed training across many GPUs, DeepSpeed or FSDP are better suited.
How much data do I need for fine-tuning?
For simple instruction-following, 500-1000 high-quality examples often suffice. For domain-specific knowledge, 5,000-10,000 examples work better. Quality always beats quantity.
Does Unsloth support continued pre-training?
Yes. Set full_finetuning=True in from_pretrained instead of using LoRA. This updates all weights — useful for knowledge injection. It requires roughly 4x more VRAM than QLoRA.
Can I merge the LoRA adapter into the base model?
Yes. Use save_pretrained_merged with save_method="merged_16bit". The merged model works with any HuggingFace tool without needing Unsloth or PEFT at inference.
What’s the difference between Unsloth and Axolotl?
Axolotl provides YAML-based config with multi-GPU support. Unsloth focuses on single-GPU kernel optimization. You can use both together — Axolotl supports Unsloth as a backend.
How do I know if my fine-tuning worked?
Compare the model’s responses before and after training on 5-10 test prompts that match your use case. Look for: coherent, on-topic responses that follow the format of your training data. If responses are worse than the base model, check your chat template and training steps.
References
- Unsloth Documentation — Fine-tuning LLMs Guide. Link
- Unsloth GitHub Repository — unslothai/unsloth. Link
- Hu, E. J. et al. — LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 (2021). Link
- Dettmers, T. et al. — QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314 (2023). Link
- HuggingFace Blog — Make LLM Fine-tuning 2x faster with Unsloth and TRL. Link
- TRL Documentation — SFTTrainer. Link
- PEFT Documentation — LoRA. Link
- NVIDIA Blog — How to Fine-Tune LLMs on RTX GPUs With Unsloth. Link
- Rafailov, R. et al. — Direct Preference Optimization. arXiv:2305.18290 (2023). Link
- Kalajdzievski, D. — A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA. arXiv:2312.03732 (2023). Link
Last reviewed: March 2026. Tested with Unsloth 2025.3, transformers 4.48, TRL 0.14, Python 3.11.
Build a strong Python foundation with hands-on exercises designed for aspiring Data Scientists and AI/ML Engineers.
Start Free Course →