Ollama Tutorial: Run LLMs Locally (Llama, Mistral)
Install open-source models on your laptop, call them from Python, and build a client that falls back to a cloud API when local quality drops.
You send a prompt to an LLM API. Three seconds later a reply shows up. But your data just left your machine. It crossed the web and landed on someone else’s server.
For private docs, patient records, or secret code — that’s a real problem.
Running LLMs on your own machine fixes it. Your data stays put. You pay nothing per token. And replies come back fast. Ollama makes this easy to set up. It works on a normal laptop with no GPU.
What Is Ollama?
Ollama pulls, manages, and runs open-source LLMs on your own machine. Think of it as Docker for language models. One command pulls a model. One more starts it.
Under the hood, Ollama wraps llama.cpp. This engine runs smaller versions of models on both CPUs and GPUs. You don’t build anything or set up CUDA. Ollama does it for you.
The result? A local server on http://localhost:11434 that speaks an OpenAI-style API. Any Python code that talks to OpenAI can talk to Ollama instead — just change one URL.
Key Insight: Ollama turns your laptop into a local LLM server with an OpenAI-style API. Change the base URL in your Python code — nothing else.
Install Ollama and Pull Your First Model
Go to ollama.com and grab the installer. It works on macOS, Linux, and Windows.
On macOS or Linux, one command handles everything:
bash
curl -fsSL https://ollama.com/install.sh | sh
On Windows, download the .exe installer and run it. Verify with:
bash
ollama --version
Output:
text
ollama version 0.6.2
The installer starts a background service on port 11434. Open http://localhost:11434 in your browser — you’ll see “Ollama is running.”
Note: You can also run Ollama in Docker. Pull the image with `docker pull ollama/ollama` and start it with `docker run -d -p 11434:11434 ollama/ollama`. This is handy on Linux servers where you want a clean setup.
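Before wiring Ollama into scripts, it can help to probe the server the same way the browser does. Here is a minimal standard-library sketch; the function name is mine, not part of Ollama:

```python
from urllib.request import urlopen
from urllib.error import URLError

def ollama_is_running(base_url="http://localhost:11434"):
    """Return True if a server answers on Ollama's default port."""
    try:
        # The root endpoint replies with the plain text "Ollama is running"
        with urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

print(ollama_is_running())
```

If this prints False, start the service before running any of the Python examples below.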
Prerequisites
- Python version: 3.9+
- Required libraries: openai (1.0+), requests
- Install:
pip install openai requests
- Hardware: 8 GB RAM minimum (16 GB recommended)
- Disk space: 2-8 GB per model
- Time to complete: 20-25 minutes (plus model download time)
Now pull a model. Phi-3 Mini from Microsoft has 3.8 billion parameters and runs on almost any machine:
bash
ollama pull phi3:mini
Output:
text
pulling manifest
pulling model...
verifying sha256 digest
writing manifest
success
That pulled a smaller version of Phi-3 Mini (~2.3 GB). Test it:
bash
ollama run phi3:mini "What is gradient descent in one sentence?"
The model replies in a few seconds. No API key. No web access needed. No cost per token.
Quick Check: What port does Ollama listen on by default? (Answer: 11434)
Download More Models for Comparison
Ollama has hundreds of models in its library. Here are five good ones to try:
| Model | Parameters | Disk Size | Best For |
|---|---|---|---|
| phi3:mini | 3.8B | ~2.3 GB | Fast tasks, light Q&A |
| llama3.1:8b | 8B | ~4.7 GB | General-purpose, balanced |
| mistral:7b | 7B | ~4.1 GB | Good at following prompts |
| qwen2.5:7b | 7B | ~4.4 GB | Many languages, strong coding |
| deepseek-r1:8b | 8B | ~4.9 GB | Step-by-step reasoning |
Pull them all:
bash
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull qwen2.5:7b
ollama pull deepseek-r1:8b
Each download takes a few minutes. Once pulled, models live in ~/.ollama/models/ on Linux/macOS or C:\Users\<you>\.ollama\models\ on Windows.
Tip: List your downloaded models with `ollama list`. Remove unused ones with `ollama rm <model-name>` to free disk space.
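The same inventory is available over HTTP: Ollama exposes a `/api/tags` endpoint that returns your local models as JSON. A small sketch that formats that payload; the helper name and the sample size are illustrative:

```python
def format_model_list(tags_json):
    """Turn the JSON from Ollama's /api/tags endpoint into readable lines.

    tags_json: dict like {"models": [{"name": "phi3:mini", "size": 2300000000}, ...]}
    """
    lines = []
    for m in tags_json.get("models", []):
        size_gb = m["size"] / 1e9  # bytes -> GB
        lines.append(f"{m['name']:<20} {size_gb:.1f} GB")
    return lines

# With a live server you would fetch the payload like this:
#   import requests
#   tags = requests.get("http://localhost:11434/api/tags").json()
sample = {"models": [{"name": "phi3:mini", "size": 2_300_000_000}]}
for line in format_model_list(sample):
    print(line)
```

This gives you the same information as `ollama list`, but in a form your own tooling can consume.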
What Is Quantization? GGUF, Q4, and Q8 Explained
How does an 8-billion-parameter model fit in just 4.7 GB? Let’s do the math.
At full precision (FP32), each weight takes 4 bytes, so an 8B model needs 32 GB of RAM. That’s far too much for most laptops. The fix is to store each weight in fewer bits — that’s called quantization.
| Quantization | Bits | 8B Model Size | Quality Impact |
|---|---|---|---|
| Full (FP32) | 32 | ~32 GB | Baseline |
| Half (FP16) | 16 | ~16 GB | Negligible |
| Q8 | 8 | ~8 GB | Very small |
| Q4_K_M | 4 | ~4.7 GB | Small; fine for most tasks |
| Q2 | 2 | ~2.5 GB | Noticeable — avoid for complex work |
GGUF is the file format for these shrunk models. It’s what llama.cpp (and Ollama) reads. When you pull llama3.1:8b, you get Q4_K_M by default.
Here’s a way to think about it. Q8 is like a high-end JPEG — almost the same as the raw file. Q4 is like a normal JPEG — good enough for most work. Q2 is like a crushed JPEG — you can see the damage.
Key Insight: Q4 shrinks memory use by about 8x with very little quality loss. Most users won’t spot a gap on daily tasks like summaries, Q&A, and code writing.
Quick Check: If you have 16 GB of RAM, can you run a full-size (FP32) 8B model? (Answer: No — it needs ~32 GB. Use Q4 instead.)
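The arithmetic behind the table is easy to reproduce. A quick sketch (real GGUF files such as Q4_K_M come out somewhat larger because some layers keep higher precision):

```python
def model_size_gb(params_billions, bits_per_weight):
    """Approximate weight memory: parameters x bits per weight, in gigabytes."""
    # params_billions * 1e9 weights * (bits/8) bytes, divided by 1e9 bytes/GB
    return params_billions * bits_per_weight / 8

for bits in (32, 16, 8, 4):
    print(f"8B model at {bits:>2}-bit: ~{model_size_gb(8, bits):.0f} GB")
```

Running this prints the 32/16/8/4 GB ladder from the table, which is why an 8B model becomes laptop-sized at Q4.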
Use the Ollama REST API from Python
Ollama has a REST API on localhost:11434. You don’t need the Ollama Python package. The requests library works fine for calling your local LLM.
Here’s a call to the chat endpoint. We send a system message and a user prompt, then print what comes back:
python
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain list comprehension in 2 sentences."}
        ],
        "stream": False
    }
)

result = response.json()
print(result["message"]["content"])
Setting `"stream": False` returns the full reply in one piece. Set it to `True` to receive tokens one by one, which is handy for chat apps.
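With streaming enabled, the REST endpoint sends one JSON object per line (newline-delimited JSON) rather than a single body. A sketch of reassembling the text; the helper function and sample chunks are mine, and the field names are based on Ollama's documented streaming format, so verify them against your Ollama version:

```python
import json

def extract_stream_text(ndjson_lines):
    """Reassemble a reply from Ollama's streaming /api/chat output.

    Each element is one line of newline-delimited JSON; the text lives in
    message.content, and the final object carries "done": true.
    """
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With a live server, feed it response.iter_lines() from a
# requests.post(..., json={..., "stream": True}, stream=True) call.
sample = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo"}, "done": true}',
]
print(extract_stream_text(sample))
```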
I like the REST API for quick scripts. But for bigger projects, the OpenAI-style endpoint is better. Here’s why.
Use the OpenAI-Compatible API
Ollama has an endpoint at /v1/chat/completions that follows the OpenAI API format. You can use the OpenAI Python library with zero code changes to run LLMs on your own machine.
Point the client at your local server. The api_key field is required by the library, but Ollama ignores it:
python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama"  # required but ignored
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a function to check if a number is prime."}
    ],
    temperature=0.7,
    max_tokens=200
)

print(response.choices[0].message.content)
That’s the exact same client.chat.completions.create() call you’d make to OpenAI. The only change is base_url.
Tip: Switch between local and cloud by changing two lines. For OpenAI: `base_url="https://api.openai.com/v1/"` and your real key. For Ollama: `base_url="http://localhost:11434/v1/"` and `api_key="ollama"`. Everything else stays the same.
Stream Responses Token by Token
For chat UIs, you want text to appear as the model generates it. Set stream=True and iterate over the chunks:
python
stream = client.chat.completions.create(
    model="mistral:7b",
    messages=[{"role": "user", "content": "What are Newton's three laws?"}],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()
Each chunk holds a small piece of the reply. The flush=True forces Python to print right away. You see a smooth typing effect in your terminal.
Compare Models on Quality and Speed
Each model has its own strengths. I’ve found Mistral follows prompts better than Llama. DeepSeek-R1 shines on tasks that need step-by-step thinking.
Let’s test all five on the same prompt. The script sends a coding task, tracks time, and prints a preview:
python
import time
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama"
)

models = [
    "phi3:mini", "llama3.1:8b", "mistral:7b",
    "qwen2.5:7b", "deepseek-r1:8b"
]
prompt = "Write a Python function that reverses a string without slicing."

for model_name in models:
    start = time.time()
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=150
    )
    elapsed = time.time() - start
    print(f"\n{'='*50}")
    print(f"Model: {model_name} | Time: {elapsed:.1f}s")
    print(f"{'='*50}")
    print(resp.choices[0].message.content[:300])
On a typical laptop with 16 GB RAM and no GPU, expect these rough ranges:
| Model | Response Time | Notes |
|---|---|---|
| phi3:mini | 2-5s | Fastest. Good for simple tasks. |
| llama3.1:8b | 5-15s | Balanced quality and speed. |
| mistral:7b | 5-12s | Detailed. Follows prompts well. |
| qwen2.5:7b | 5-14s | Strong at code. Many languages. |
| deepseek-r1:8b | 8-20s | Shows its thinking. Slower but deep. |
Warning: Speed depends on your hardware. These numbers assume CPU-only mode with 16 GB RAM. A GPU with 8+ GB VRAM cuts times by 3-5x. With only 8 GB RAM, big models may run very slowly or fail.
Customize Models with a Modelfile
Ollama lets you change how a model acts through a Modelfile. Think of it as a Dockerfile — but for LLMs. You set a system prompt, tweak the temperature, and lock in settings.
Create a file called Modelfile with these contents:
text
FROM llama3.1:8b
SYSTEM "You are a Python coding assistant. Always include type hints. Keep answers concise."
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
Then build your custom model:
bash
ollama create python-coder -f Modelfile
Now you can use it like any other model:
bash
ollama run python-coder "Write a function to flatten a nested list."
The model uses your system prompt and settings by default. No need to pass them in every API call. This helps keep things the same across your team or project.
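If you generate Modelfiles from code (say, one per team preset), a tiny templating helper keeps them consistent. A sketch; the function is hypothetical and the quoting is deliberately naive:

```python
def build_modelfile(base, system_prompt, temperature=0.3, num_ctx=4096):
    """Render a minimal Ollama Modelfile as a string (naive quoting)."""
    return (
        f"FROM {base}\n"
        f'SYSTEM "{system_prompt}"\n'
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
    )

text = build_modelfile("llama3.1:8b", "You are a terse SQL tutor.")
print(text)
# Save the string as a file named Modelfile, then:
#   ollama create sql-tutor -f Modelfile
```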
Local vs Cloud — When to Use Which
This is the top question I hear about running LLMs on your own machine. Short answer: use both.
Go local when:
- Data is sensitive (medical, legal, proprietary code)
- You need offline access
- You want zero API cost for bulk processing
- Latency matters and you have decent hardware
Go cloud when:
- You need top-tier reasoning (local 7-8B models can’t match GPT-4 on hard tasks)
- You’re serving many users at once (a laptop handles 1-2 at a time)
- You need long context windows (100K+ tokens)
- You need vision or audio features
The best path is to mix both. Route simple tasks to your local model. Send hard tasks to the cloud. That’s what we’ll build next.
Note: Local models keep getting better. Today’s 8B models match cloud models from 2 years ago. Check your local-vs-cloud split every 6 months.
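The routing idea can be made concrete with a heuristic. This sketch uses prompt length and a keyword list as the signal; the thresholds and keywords are illustrative, not tuned values:

```python
def route_prompt(prompt, hard_keywords=("prove", "analyze", "architecture", "multi-step")):
    """Toy router: long or reasoning-heavy prompts go to the cloud."""
    if len(prompt) > 800:          # long context -> cloud
        return "cloud"
    if any(kw in prompt.lower() for kw in hard_keywords):
        return "cloud"             # reasoning-heavy -> cloud
    return "local"

print(route_prompt("Summarize this paragraph."))        # local
print(route_prompt("Prove that the algorithm halts."))  # cloud
```

In practice you would tune the signal on your own traffic; the point is only that routing can start as a few lines of code.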
Build a Python Client with Local-to-Cloud Fallback
Here’s the hands-on project. We’ll build a SmartLLMClient that tries Ollama first. If the local model fails or sends back a very short reply, it falls back to a cloud API.
First, the class shell. It holds clients for both local and cloud:
python
import os
import time
from openai import OpenAI

class SmartLLMClient:
    """Tries local Ollama first, falls back to cloud."""

    def __init__(
        self,
        local_model="llama3.1:8b",
        cloud_model="gpt-4o-mini",
        min_response_length=20
    ):
        self.local_client = OpenAI(
            base_url="http://localhost:11434/v1/",
            api_key="ollama"
        )
        self.cloud_client = OpenAI(
            api_key=os.getenv("OPENAI_API_KEY", "your-key-here")
        )
        self.local_model = local_model
        self.cloud_model = cloud_model
        self.min_response_length = min_response_length
The min_response_length is our quality gate. Fewer than 20 chars means something broke.
The generate method wraps the local call in try/except. If it fails or the reply is too short, it tries the cloud:
python
    def generate(self, prompt, system_msg="You are a helpful assistant."):
        """Try local model first, fall back to cloud on failure."""
        messages = [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": prompt}
        ]
        # Attempt local inference
        try:
            start = time.time()
            resp = self.local_client.chat.completions.create(
                model=self.local_model,
                messages=messages,
                temperature=0.7,
                max_tokens=500
            )
            content = resp.choices[0].message.content
            elapsed = time.time() - start
            if len(content.strip()) >= self.min_response_length:
                return {"content": content, "source": "local",
                        "model": self.local_model, "time": round(elapsed, 2)}
            print("Local response too short. Falling back.")
        except Exception as e:
            print(f"Local failed: {e}. Falling back to cloud.")
If local works and the reply is long enough, we return right away. If not, the cloud takes over:
python
        # Cloud fallback
        start = time.time()
        resp = self.cloud_client.chat.completions.create(
            model=self.cloud_model,
            messages=messages,
            temperature=0.7,
            max_tokens=500
        )
        content = resp.choices[0].message.content
        elapsed = time.time() - start
        return {"content": content, "source": "cloud",
                "model": self.cloud_model, "time": round(elapsed, 2)}
Let’s use it. Create an instance and send a prompt:
python
# Requires: Ollama running locally + OPENAI_API_KEY for cloud fallback
llm = SmartLLMClient(
    local_model="llama3.1:8b",
    cloud_model="gpt-4o-mini"
)

result = llm.generate("Explain the difference between a list and a tuple in Python.")
print(f"Source: {result['source']} ({result['model']})")
print(f"Time: {result['time']}s")
print(f"\n{result['content']}")
If Ollama isn’t running, you’ll see:
text
Local failed: Connection refused. Falling back to cloud.
Source: cloud (gpt-4o-mini)
Key Insight: The local-first, cloud-backup pattern works in real apps. Route cheap, high-volume tasks to your local model. Save cloud APIs for hard reasoning. This cuts API costs a lot.
{
  type: 'exercise',
  id: 'ollama-fallback-ex1',
  title: 'Exercise 1: Add Response Logging',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Extend the SmartLLMClient by adding a log list attribute. Every call to generate should append a dictionary with keys "timestamp", "source", "model", and "prompt_length" to self.log. Then add a get_stats method that returns a dictionary with "total_calls", "local_calls", and "cloud_calls".\n\nThe starter code gives you the class skeleton. Fill in the logging and stats.',
  starterCode: 'import time\n\nclass SmartLLMClient:\n    def __init__(self):\n        self.log = []\n\n    def generate(self, prompt):\n        source = "local"\n        model = "llama3.1:8b"\n        # TODO: Append to self.log\n        self.log.append({\n            # fill in\n        })\n        return {"content": "response", "source": source}\n\n    def get_stats(self):\n        # TODO: Return dict with total_calls, local_calls, cloud_calls\n        pass\n\nclient = SmartLLMClient()\nclient.generate("Hello")\nclient.generate("World")\nstats = client.get_stats()\nprint(f"Total: {stats[\'total_calls\']}, Local: {stats[\'local_calls\']}, Cloud: {stats[\'cloud_calls\']}")',
  testCases: [
    { id: 'tc1', input: 'client = SmartLLMClient()\nclient.generate("Hello")\nclient.generate("World")\nstats = client.get_stats()\nprint(f"{stats[\'total_calls\']} {stats[\'local_calls\']} {stats[\'cloud_calls\']}")', expectedOutput: '2 2 0', description: 'Two local calls, zero cloud' },
    { id: 'tc2', input: 'client = SmartLLMClient()\nclient.generate("Test")\nprint(len(client.log))', expectedOutput: '1', description: 'Log has one entry after one call' },
  ],
  hints: [
    'Use time.time() for the timestamp and len(prompt) for prompt_length.',
    'For get_stats: sum(1 for e in self.log if e["source"] == "local")',
  ],
  solution: 'import time\n\nclass SmartLLMClient:\n    def __init__(self):\n        self.log = []\n\n    def generate(self, prompt):\n        source = "local"\n        model = "llama3.1:8b"\n        self.log.append({\n            "timestamp": time.time(),\n            "source": source,\n            "model": model,\n            "prompt_length": len(prompt)\n        })\n        return {"content": "response", "source": source}\n\n    def get_stats(self):\n        return {\n            "total_calls": len(self.log),\n            "local_calls": sum(1 for e in self.log if e["source"] == "local"),\n            "cloud_calls": sum(1 for e in self.log if e["source"] == "cloud")\n        }\n\nclient = SmartLLMClient()\nclient.generate("Hello")\nclient.generate("World")\nstats = client.get_stats()\nprint(f"Total: {stats[\'total_calls\']}, Local: {stats[\'local_calls\']}, Cloud: {stats[\'cloud_calls\']}")',
  solutionExplanation: 'Each generate() call appends a log entry. The get_stats() method counts entries by source using a generator expression inside sum().',
  xpReward: 15,
}
Common Mistakes and How to Fix Them
Mistake 1: Running Python code before starting Ollama
python
# This will fail with ConnectionError
response = requests.post("http://localhost:11434/api/chat", ...)
# ConnectionError: Connection refused
Ollama runs as a background service. On macOS and Windows, it starts at boot. On Linux, you may need to start it manually:
bash
ollama serve
Run your Python code in a separate terminal after the service is up.
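A startup script can also wait for the service instead of failing outright. A standard-library sketch that polls the port; the function name is mine:

```python
import socket
import time

def wait_for_ollama(host="localhost", port=11434, timeout_s=10):
    """Poll until something accepts TCP connections on the given port."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return True  # something is listening
        except OSError:
            time.sleep(0.5)  # not up yet; retry
    return False

if wait_for_ollama(timeout_s=2):
    print("Ollama is up")
else:
    print("Gave up waiting; run 'ollama serve' first")
```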
Mistake 2: Using a model you haven’t pulled
bash
ollama run llama3:70b
# Error: model "llama3:70b" not found, try pulling it first
Ollama doesn’t auto-download. Pull first, then run:
bash
ollama pull llama3:70b
ollama run llama3:70b
Mistake 3: Loading a model too large for your RAM
A 70B model on a 16 GB machine will crash or crawl. Ollama swaps to disk, and each reply takes minutes.
Rule of thumb: Free RAM should be at least 1.5x the model’s file size. A 4.7 GB model needs ~7 GB free. A 40 GB model needs ~60 GB.
Warning: Check free RAM before pulling big models. Linux/macOS: `free -h`. Windows: Task Manager. If a model won’t fit, use a smaller size or a smaller model.
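The 1.5x rule translates directly into code. A trivial sketch for planning downloads:

```python
def ram_needed_gb(model_file_gb, headroom=1.5):
    """Free RAM to aim for before loading a model (1.5x rule of thumb)."""
    return model_file_gb * headroom

for size_gb in (2.3, 4.7, 40):
    print(f"{size_gb} GB model -> plan for ~{ram_needed_gb(size_gb):.0f} GB free RAM")
```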
{
  type: 'exercise',
  id: 'ollama-model-select-ex2',
  title: 'Exercise 2: Build a Model Selector',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Write a function select_model(ram_gb) that returns the largest Ollama model the user can run.\n\nModel requirements: phi3:mini (4 GB), mistral:7b (6 GB), llama3.1:8b (7 GB), qwen2.5:7b (7 GB), deepseek-r1:8b (8 GB).\n\nReturn the model with the highest RAM requirement that fits. If nothing fits, return "none".',
  starterCode: 'def select_model(ram_gb):\n    models = [\n        ("deepseek-r1:8b", 8),\n        ("llama3.1:8b", 7),\n        ("qwen2.5:7b", 7),\n        ("mistral:7b", 6),\n        ("phi3:mini", 4),\n    ]\n    # TODO: Return first model that fits in ram_gb\n    pass\n\nprint(select_model(16))\nprint(select_model(5))\nprint(select_model(2))',
  testCases: [
    { id: 'tc1', input: 'print(select_model(16))', expectedOutput: 'deepseek-r1:8b', description: '16 GB fits the largest model' },
    { id: 'tc2', input: 'print(select_model(5))', expectedOutput: 'phi3:mini', description: '5 GB only fits phi3' },
    { id: 'tc3', input: 'print(select_model(2))', expectedOutput: 'none', description: '2 GB fits nothing' },
  ],
  hints: [
    'The list is sorted largest-first. Loop and return the first model where ram_gb >= min_ram.',
    'for name, min_ram in models:\n    if ram_gb >= min_ram:\n        return name\nreturn "none"',
  ],
  solution: 'def select_model(ram_gb):\n    models = [\n        ("deepseek-r1:8b", 8),\n        ("llama3.1:8b", 7),\n        ("qwen2.5:7b", 7),\n        ("mistral:7b", 6),\n        ("phi3:mini", 4),\n    ]\n    for name, min_ram in models:\n        if ram_gb >= min_ram:\n            return name\n    return "none"\n\nprint(select_model(16))\nprint(select_model(5))\nprint(select_model(2))',
  solutionExplanation: 'The models are sorted by RAM requirement descending. The first match is always the biggest model that fits. If nothing fits, return "none".',
  xpReward: 15,
}
Summary
You can now run LLMs locally with Ollama. Here’s what you covered:
- Ollama pulls and serves open-source LLMs through a local REST API
- Q4/Q8 shrinks models by up to 8x with very little quality loss. GGUF is the file format.
- The OpenAI-style API lets you switch local and cloud by changing one URL
- Modelfiles let you set system prompts and settings for steady results
- Local LLMs are best for privacy, offline use, and free bulk work
- Cloud LLMs win for hard reasoning, long context, and many users
- The local-first fallback gives you the best of both worlds
Practice Exercise:
Build a BatchProcessor class that takes a list of prompts and processes them through the SmartLLMClient. It should process each prompt, track local vs cloud counts, and print a summary with total prompts, local count, cloud count, and average response time.
Frequently Asked Questions
Can I run Ollama without a GPU?
Yes. Ollama runs on CPU by default. A GPU helps, but isn’t needed. With 16 GB RAM and a modern CPU, 7-8B models reply in 5-15 seconds. That’s fast enough for dev work and daily use.
How do I use Ollama with LangChain?
LangChain has a ChatOllama class. Install it with pip install langchain-ollama. Since Ollama works like OpenAI, you can also use ChatOpenAI(base_url="http://localhost:11434/v1/", api_key="ollama") — no extra package needed.
Can I fine-tune models through Ollama?
No. Ollama only serves ready-made models. For fine-tuning, use Unsloth, Axolotl, or Hugging Face PEFT. After you fine-tune, save to GGUF and load it into Ollama with a Modelfile.
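That last step can be sketched as a Modelfile that points at your exported weights. The file and model names below are placeholders for whatever your fine-tuning run produced:

```shell
# Write a Modelfile that loads a local GGUF file
# (my-finetuned.gguf is a placeholder name)
cat > Modelfile <<'EOF'
FROM ./my-finetuned.gguf
SYSTEM "You are a domain-specific assistant."
EOF

# Register it with Ollama, then run it like any pulled model
ollama create my-finetuned -f Modelfile
ollama run my-finetuned "Hello"
```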
How much disk space do multiple models need?
Each model takes 2-8 GB. Five 7-8B models need about 20-25 GB total. Ollama shares layers between model versions, so pulling variants uses less space than you’d think.
Is Ollama suitable for production APIs?
For one user or a small team, yes. Put it behind a proxy (nginx, Caddy) for internal use. For high-traffic production, look at vLLM or TGI instead. They handle batching and GPU sharing.
References
- Ollama official documentation
- Ollama GitHub repository
- Ollama model library
- Ollama blog — OpenAI compatibility
- llama.cpp — GGUF format specification
- Meta AI — Llama 3.1 model card
- Mistral AI — Mistral 7B technical report
- Microsoft Research — Phi-3 technical report. arXiv:2404.14219 (2024)
- DeepSeek AI — DeepSeek-R1 technical report
- OpenAI Python library documentation
