Ollama Tutorial: Run LLMs Locally (Llama, Mistral)
Install open-source models on your laptop, call them from Python, and build a client that falls back to a cloud API when local quality drops.
You send a prompt to an LLM API. Three seconds later a reply shows up. But your data just left your machine. It crossed the web and landed on someone else’s server.
For private docs, patient records, or secret code — that’s a real problem.
Running LLMs on your own machine fixes it. Your data stays put. You pay nothing per token. And replies come back fast. Ollama makes this easy to set up. It works on a normal laptop with no GPU.
What Is Ollama?
Ollama pulls, manages, and runs open-source LLMs on your own machine. Think of it as Docker for language models. One command pulls a model. One more starts it.
Under the hood, Ollama wraps llama.cpp. This engine runs smaller versions of models on both CPUs and GPUs. You don’t build anything or set up CUDA. Ollama does it for you.
The result? A local server on http://localhost:11434 that speaks an OpenAI-style API. Any Python code that talks to OpenAI can talk to Ollama instead — just change one URL.
Key Insight: Ollama turns your laptop into a local LLM server with an OpenAI-style API. Change the base URL in your Python code — nothing else.
Install Ollama and Pull Your First Model
Go to ollama.com and grab the installer. It works on macOS, Linux, and Windows.
On macOS or Linux, one command handles everything:
bash
curl -fsSL https://ollama.com/install.sh | sh
On Windows, download the .exe installer and run it. Verify with:
bash
ollama --version
Output:
text
ollama version 0.6.2
The installer starts a background service on port 11434. Open http://localhost:11434 in your browser — you’ll see “Ollama is running.”
Note: You can also run Ollama in Docker. Pull the image with `docker pull ollama/ollama` and start it with `docker run -d -p 11434:11434 ollama/ollama`. This is handy on Linux servers where you want a clean setup.
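Before wiring Ollama into scripts, it can help to probe the server the same way the browser does. Here is a minimal standard-library sketch; the function name is mine, not part of Ollama:

```python
from urllib.request import urlopen
from urllib.error import URLError

def ollama_is_running(base_url="http://localhost:11434"):
    """Return True if a server answers on Ollama's default port."""
    try:
        # The root endpoint replies with the plain text "Ollama is running"
        with urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

print(ollama_is_running())
```

If this prints False, start the service before running any of the Python examples below.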
Prerequisites
- Python version: 3.9+
- Required libraries: openai (1.0+), requests
- Install:
pip install openai requests
- Hardware: 8 GB RAM minimum (16 GB recommended)
- Disk space: 2-8 GB per model
- Time to complete: 20-25 minutes (plus model download time)
Now pull a model. Phi-3 Mini from Microsoft has 3.8 billion parameters and runs on almost any machine:
bash
ollama pull phi3:mini
Output:
text
pulling manifest
pulling model...
verifying sha256 digest
writing manifest
success
That pulled a smaller version of Phi-3 Mini (~2.3 GB). Test it:
bash
ollama run phi3:mini "What is gradient descent in one sentence?"
The model replies in a few seconds. No API key. No web access needed. No cost per token.
Quick Check: What port does Ollama listen on by default? (Answer: 11434)
Download More Models for Comparison
Ollama has hundreds of models in its library. Here are five good ones to try:
| Model | Parameters | Disk Size | Best For |
|---|---|---|---|
| phi3:mini | 3.8B | ~2.3 GB | Fast tasks, light Q&A |
| llama3.1:8b | 8B | ~4.7 GB | General-purpose, balanced |
| mistral:7b | 7B | ~4.1 GB | Good at following prompts |
| qwen2.5:7b | 7B | ~4.4 GB | Many languages, strong coding |
| deepseek-r1:8b | 8B | ~4.9 GB | Step-by-step reasoning |
Pull them all:
bash
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull qwen2.5:7b
ollama pull deepseek-r1:8b
Each download takes a few minutes. Once pulled, models live in ~/.ollama/models/ on Linux/macOS or C:\Users\<you>\.ollama\models\ on Windows.
Tip: List your downloaded models with `ollama list`. Remove unused ones with `ollama rm <model-name>` to free disk space.
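The same inventory is available over HTTP: Ollama exposes a `/api/tags` endpoint that returns your local models as JSON. A small sketch that formats that payload; the helper name and the sample size are illustrative:

```python
def format_model_list(tags_json):
    """Turn the JSON from Ollama's /api/tags endpoint into readable lines.

    tags_json: dict like {"models": [{"name": "phi3:mini", "size": 2300000000}, ...]}
    """
    lines = []
    for m in tags_json.get("models", []):
        size_gb = m["size"] / 1e9  # bytes -> GB
        lines.append(f"{m['name']:<20} {size_gb:.1f} GB")
    return lines

# With a live server you would fetch the payload like this:
#   import requests
#   tags = requests.get("http://localhost:11434/api/tags").json()
sample = {"models": [{"name": "phi3:mini", "size": 2_300_000_000}]}
for line in format_model_list(sample):
    print(line)
```

This gives you the same information as `ollama list`, but in a form your own tooling can consume.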
What Is Quantization? GGUF, Q4, and Q8 Explained
How does an 8-billion-parameter model fit in just 4.7 GB? Let’s do the math.
At full precision (FP32), each weight takes 4 bytes, so an 8B model needs 32 GB of RAM. That’s far too much for most laptops. The fix is to store each weight in fewer bits — that’s called quantization.
| Quantization | Bits | 8B Model Size | Quality Impact |
|---|---|---|---|
| Full (FP32) | 32 | ~32 GB | Baseline |
| Half (FP16) | 16 | ~16 GB | Negligible |
| Q8 | 8 | ~8 GB | Very small |
| Q4_K_M | 4 | ~4.7 GB | Small; fine for most tasks |
| Q2 | 2 | ~2.5 GB | Noticeable — avoid for complex work |
GGUF is the file format for these shrunk models. It’s what llama.cpp (and Ollama) reads. When you pull llama3.1:8b, you get Q4_K_M by default.
Here’s a way to think about it. Q8 is like a high-end JPEG — almost the same as the raw file. Q4 is like a normal JPEG — good enough for most work. Q2 is like a crushed JPEG — you can see the damage.
Key Insight: Q4 shrinks memory use by about 8x with very little quality loss. Most users won’t spot a gap on daily tasks like summaries, Q&A, and code writing.
Quick Check: If you have 16 GB of RAM, can you run a full-size (FP32) 8B model? (Answer: No — it needs ~32 GB. Use Q4 instead.)
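The arithmetic behind the table is easy to reproduce. A quick sketch (real GGUF files such as Q4_K_M come out somewhat larger because some layers keep higher precision):

```python
def model_size_gb(params_billions, bits_per_weight):
    """Approximate weight memory: parameters x bits per weight, in gigabytes."""
    # params_billions * 1e9 weights * (bits/8) bytes, divided by 1e9 bytes/GB
    return params_billions * bits_per_weight / 8

for bits in (32, 16, 8, 4):
    print(f"8B model at {bits:>2}-bit: ~{model_size_gb(8, bits):.0f} GB")
```

Running this prints the 32/16/8/4 GB ladder from the table, which is why an 8B model becomes laptop-sized at Q4.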
Use the Ollama REST API from Python
Ollama has a REST API on localhost:11434. You don’t need the Ollama Python package. The requests library works fine for calling your local LLM.
Here’s a call to the chat endpoint. We send a system message and a user prompt, then print what comes back:
python
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain list comprehension in 2 sentences."}
        ],
        "stream": False
    }
)

result = response.json()
print(result["message"]["content"])
Setting `"stream": False` returns the full reply in one piece. Set it to `True` to receive tokens one by one, which is handy for chat apps.
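With streaming enabled, the REST endpoint sends one JSON object per line (newline-delimited JSON) rather than a single body. A sketch of reassembling the text; the helper function and sample chunks are mine, and the field names are based on Ollama's documented streaming format, so verify them against your Ollama version:

```python
import json

def extract_stream_text(ndjson_lines):
    """Reassemble a reply from Ollama's streaming /api/chat output.

    Each element is one line of newline-delimited JSON; the text lives in
    message.content, and the final object carries "done": true.
    """
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With a live server, feed it response.iter_lines() from a
# requests.post(..., json={..., "stream": True}, stream=True) call.
sample = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo"}, "done": true}',
]
print(extract_stream_text(sample))
```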
I like the REST API for quick scripts. But for bigger projects, the OpenAI-style endpoint is better. Here’s why.
Use the OpenAI-Compatible API
Ollama has an endpoint at /v1/chat/completions that follows the OpenAI API format. You can use the OpenAI Python library with zero code changes to run LLMs on your own machine.
Point the client at your local server. The api_key field is required by the library, but Ollama ignores it:
python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama"  # required but ignored
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a function to check if a number is prime."}
    ],
    temperature=0.7,
    max_tokens=200
)

print(response.choices[0].message.content)
That’s the exact same client.chat.completions.create() call you’d make to OpenAI. The only change is base_url.
Tip: Switch between local and cloud by changing two lines. For OpenAI: `base_url="https://api.openai.com/v1/"` and your real key. For Ollama: `base_url="http://localhost:11434/v1/"` and `api_key="ollama"`. Everything else stays the same.
Stream Responses Token by Token
For chat UIs, you want text to appear as the model generates it. Set stream=True and iterate over the chunks:
python
stream = client.chat.completions.create(
    model="mistral:7b",
    messages=[{"role": "user", "content": "What are Newton's three laws?"}],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()
Each chunk holds a small piece of the reply. The flush=True forces Python to print right away. You see a smooth typing effect in your terminal.
Compare Models on Quality and Speed
Each model has its own strengths. I’ve found Mistral follows prompts better than Llama. DeepSeek-R1 shines on tasks that need step-by-step thinking.
Let’s test all five on the same prompt. The script sends a coding task, tracks time, and prints a preview:
python
import time
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama"
)

models = [
    "phi3:mini", "llama3.1:8b", "mistral:7b",
    "qwen2.5:7b", "deepseek-r1:8b"
]
prompt = "Write a Python function that reverses a string without slicing."

for model_name in models:
    start = time.time()
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=150
    )
    elapsed = time.time() - start
    print(f"\n{'='*50}")
    print(f"Model: {model_name} | Time: {elapsed:.1f}s")
    print(f"{'='*50}")
    print(resp.choices[0].message.content[:300])
On a typical laptop with 16 GB RAM and no GPU, expect these rough ranges:
| Model | Response Time | Notes |
|---|---|---|
| phi3:mini | 2-5s | Fastest. Good for simple tasks. |
| llama3.1:8b | 5-15s | Balanced quality and speed. |
| mistral:7b | 5-12s | Detailed. Follows prompts well. |
| qwen2.5:7b | 5-14s | Strong at code. Many languages. |
| deepseek-r1:8b | 8-20s | Shows its thinking. Slower but deep. |
Warning: Speed depends on your hardware. These numbers assume CPU-only mode with 16 GB RAM. A GPU with 8+ GB VRAM cuts times by 3-5x. With only 8 GB RAM, big models may run very slowly or fail.
Customize Models with a Modelfile
Ollama lets you change how a model acts through a Modelfile. Think of it as a Dockerfile — but for LLMs. You set a system prompt, tweak the temperature, and lock in settings.
Create a file called Modelfile with these contents:
text
FROM llama3.1:8b
SYSTEM "You are a Python coding assistant. Always include type hints. Keep answers concise."
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
Then build your custom model:
bash
ollama create python-coder -f Modelfile
Now you can use it like any other model:
bash
ollama run python-coder "Write a function to flatten a nested list."
The model uses your system prompt and settings by default. No need to pass them in every API call. This helps keep things the same across your team or project.
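If you generate Modelfiles from code (say, one per team preset), a tiny templating helper keeps them consistent. A sketch; the function is hypothetical and the quoting is deliberately naive:

```python
def build_modelfile(base, system_prompt, temperature=0.3, num_ctx=4096):
    """Render a minimal Ollama Modelfile as a string (naive quoting)."""
    return (
        f"FROM {base}\n"
        f'SYSTEM "{system_prompt}"\n'
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
    )

text = build_modelfile("llama3.1:8b", "You are a terse SQL tutor.")
print(text)
# Save the string as a file named Modelfile, then:
#   ollama create sql-tutor -f Modelfile
```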
Local vs Cloud — When to Use Which
This is the top question I hear about running LLMs on your own machine. Short answer: use both.
Go local when:
- Data is sensitive (medical, legal, proprietary code)
- You need offline access
- You want zero API cost for bulk processing
- Latency matters and you have decent hardware
Go cloud when:
- You need top-tier reasoning (local 7-8B models can’t match GPT-4 on hard tasks)
- You’re serving many users at once (a laptop handles 1-2 at a time)
- You need long context windows (100K+ tokens)
- You need vision or audio features
The best path is to mix both. Route simple tasks to your local model. Send hard tasks to the cloud. That’s what we’ll build next.
Note: Local models keep getting better. Today’s 8B models match cloud models from 2 years ago. Check your local-vs-cloud split every 6 months.
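The routing idea can be made concrete with a heuristic. This sketch uses prompt length and a keyword list as the signal; the thresholds and keywords are illustrative, not tuned values:

```python
def route_prompt(prompt, hard_keywords=("prove", "analyze", "architecture", "multi-step")):
    """Toy router: long or reasoning-heavy prompts go to the cloud."""
    if len(prompt) > 800:          # long context -> cloud
        return "cloud"
    if any(kw in prompt.lower() for kw in hard_keywords):
        return "cloud"             # reasoning-heavy -> cloud
    return "local"

print(route_prompt("Summarize this paragraph."))        # local
print(route_prompt("Prove that the algorithm halts."))  # cloud
```

In practice you would tune the signal on your own traffic; the point is only that routing can start as a few lines of code.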
Build a Python Client with Local-to-Cloud Fallback
Here’s the hands-on project. We’ll build a SmartLLMClient that tries Ollama first. If the local model fails or sends back a very short reply, it falls back to a cloud API.
First, the class shell. It holds clients for both local and cloud:
python
import os
import time
from openai import OpenAI

class SmartLLMClient:
    """Tries local Ollama first, falls back to cloud."""

    def __init__(
        self,
        local_model="llama3.1:8b",
        cloud_model="gpt-4o-mini",
        min_response_length=20
    ):
        self.local_client = OpenAI(
            base_url="http://localhost:11434/v1/",
            api_key="ollama"
        )
        self.cloud_client = OpenAI(
            api_key=os.getenv("OPENAI_API_KEY", "your-key-here")
        )
        self.local_model = local_model
        self.cloud_model = cloud_model
        self.min_response_length = min_response_length
The min_response_length is our quality gate. Fewer than 20 chars means something broke.
The generate method wraps the local call in try/except. If it fails or the reply is too short, it tries the cloud:
python
    def generate(self, prompt, system_msg="You are a helpful assistant."):
        """Try local model first, fall back to cloud on failure."""
        messages = [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": prompt}
        ]
        # Attempt local inference
        try:
            start = time.time()
            resp = self.local_client.chat.completions.create(
                model=self.local_model,
                messages=messages,
                temperature=0.7,
                max_tokens=500
            )
            content = resp.choices[0].message.content
            elapsed = time.time() - start
            if len(content.strip()) >= self.min_response_length:
                return {"content": content, "source": "local",
                        "model": self.local_model, "time": round(elapsed, 2)}
            print("Local response too short. Falling back.")
        except Exception as e:
            print(f"Local failed: {e}. Falling back to cloud.")
If local works and the reply is long enough, we return right away. If not, the cloud takes over:
python
        # Cloud fallback
        start = time.time()
        resp = self.cloud_client.chat.completions.create(
            model=self.cloud_model,
            messages=messages,
            temperature=0.7,
            max_tokens=500
        )
        content = resp.choices[0].message.content
        elapsed = time.time() - start
        return {"content": content, "source": "cloud",
                "model": self.cloud_model, "time": round(elapsed, 2)}
Let’s use it. Create an instance and send a prompt:
python
# Requires: Ollama running locally + OPENAI_API_KEY for cloud fallback
llm = SmartLLMClient(
    local_model="llama3.1:8b",
    cloud_model="gpt-4o-mini"
)

result = llm.generate("Explain the difference between a list and a tuple in Python.")
print(f"Source: {result['source']} ({result['model']})")
print(f"Time: {result['time']}s")
print(f"\n{result['content']}")
If Ollama isn’t running, you’ll see:
text
Local failed: Connection refused. Falling back to cloud.
Source: cloud (gpt-4o-mini)
Key Insight: The local-first, cloud-backup pattern works in real apps. Route cheap, high-volume tasks to your local model. Save cloud APIs for hard reasoning. This cuts API costs a lot.
{
  type: 'exercise',
  id: 'ollama-fallback-ex1',
  title: 'Exercise 1: Add Response Logging',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Extend the SmartLLMClient by adding a log list attribute. Every call to generate should append a dictionary with keys "timestamp", "source", "model", and "prompt_length" to self.log. Then add a get_stats method that returns a dictionary with "total_calls", "local_calls", and "cloud_calls".\n\nThe starter code gives you the class skeleton. Fill in the logging and stats.',
  starterCode: 'import time\n\nclass SmartLLMClient:\n    def __init__(self):\n        self.log = []\n\n    def generate(self, prompt):\n        source = "local"\n        model = "llama3.1:8b"\n        # TODO: Append to self.log\n        self.log.append({\n            # fill in\n        })\n        return {"content": "response", "source": source}\n\n    def get_stats(self):\n        # TODO: Return dict with total_calls, local_calls, cloud_calls\n        pass\n\nclient = SmartLLMClient()\nclient.generate("Hello")\nclient.generate("World")\nstats = client.get_stats()\nprint(f"Total: {stats[\'total_calls\']}, Local: {stats[\'local_calls\']}, Cloud: {stats[\'cloud_calls\']}")',
  testCases: [
    { id: 'tc1', input: 'client = SmartLLMClient()\nclient.generate("Hello")\nclient.generate("World")\nstats = client.get_stats()\nprint(f"{stats[\'total_calls\']} {stats[\'local_calls\']} {stats[\'cloud_calls\']}")', expectedOutput: '2 2 0', description: 'Two local calls, zero cloud' },
    { id: 'tc2', input: 'client = SmartLLMClient()\nclient.generate("Test")\nprint(len(client.log))', expectedOutput: '1', description: 'Log has one entry after one call' },
  ],
  hints: [
    'Use time.time() for the timestamp and len(prompt) for prompt_length.',
    'For get_stats: sum(1 for e in self.log if e["source"] == "local")',
  ],
  solution: 'import time\n\nclass SmartLLMClient:\n    def __init__(self):\n        self.log = []\n\n    def generate(self, prompt):\n        source = "local"\n        model = "llama3.1:8b"\n        self.log.append({\n            "timestamp": time.time(),\n            "source": source,\n            "model": model,\n            "prompt_length": len(prompt)\n        })\n        return {"content": "response", "source": source}\n\n    def get_stats(self):\n        return {\n            "total_calls": len(self.log),\n            "local_calls": sum(1 for e in self.log if e["source"] == "local"),\n            "cloud_calls": sum(1 for e in self.log if e["source"] == "cloud")\n        }\n\nclient = SmartLLMClient()\nclient.generate("Hello")\nclient.generate("World")\nstats = client.get_stats()\nprint(f"Total: {stats[\'total_calls\']}, Local: {stats[\'local_calls\']}, Cloud: {stats[\'cloud_calls\']}")',
  solutionExplanation: 'Each generate() call appends a log entry. The get_stats() method counts entries by source using a generator expression inside sum().',
  xpReward: 15,
}
Common Mistakes and How to Fix Them
Mistake 1: Running Python code before starting Ollama
python
# This will fail with ConnectionError
response = requests.post("http://localhost:11434/api/chat", ...)
# ConnectionError: Connection refused
Ollama runs as a background service. On macOS and Windows, it starts at boot. On Linux, you may need to start it manually:
bash
ollama serve
Run your Python code in a separate terminal after the service is up.
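A startup script can also wait for the service instead of failing outright. A standard-library sketch that polls the port; the function name is mine:

```python
import socket
import time

def wait_for_ollama(host="localhost", port=11434, timeout_s=10):
    """Poll until something accepts TCP connections on the given port."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return True  # something is listening
        except OSError:
            time.sleep(0.5)  # not up yet; retry
    return False

if wait_for_ollama(timeout_s=2):
    print("Ollama is up")
else:
    print("Gave up waiting; run 'ollama serve' first")
```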
Mistake 2: Using a model you haven’t pulled
bash
ollama run llama3:70b
# Error: model "llama3:70b" not found, try pulling it first
Ollama doesn’t auto-download. Pull first, then run:
bash
ollama pull llama3:70b
ollama run llama3:70b
Mistake 3: Loading a model too large for your RAM
A 70B model on a 16 GB machine will crash or crawl. Ollama swaps to disk, and each reply takes minutes.
Rule of thumb: Free RAM should be at least 1.5x the model’s file size. A 4.7 GB model needs ~7 GB free. A 40 GB model needs ~60 GB.
Warning: Check free RAM before pulling big models. Linux/macOS: `free -h`. Windows: Task Manager. If a model won’t fit, use a smaller size or a smaller model.
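The 1.5x rule translates directly into code. A trivial sketch for planning downloads:

```python
def ram_needed_gb(model_file_gb, headroom=1.5):
    """Free RAM to aim for before loading a model (1.5x rule of thumb)."""
    return model_file_gb * headroom

for size_gb in (2.3, 4.7, 40):
    print(f"{size_gb} GB model -> plan for ~{ram_needed_gb(size_gb):.0f} GB free RAM")
```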
{
  type: 'exercise',
  id: 'ollama-model-select-ex2',
  title: 'Exercise 2: Build a Model Selector',
  difficulty: 'intermediate',
  exerciseType: 'write',
  instructions: 'Write a function select_model(ram_gb) that returns the largest Ollama model the user can run.\n\nModel requirements: phi3:mini (4 GB), mistral:7b (6 GB), llama3.1:8b (7 GB), qwen2.5:7b (7 GB), deepseek-r1:8b (8 GB).\n\nReturn the model with the highest RAM requirement that fits. If nothing fits, return "none".',
  starterCode: 'def select_model(ram_gb):\n    models = [\n        ("deepseek-r1:8b", 8),\n        ("llama3.1:8b", 7),\n        ("qwen2.5:7b", 7),\n        ("mistral:7b", 6),\n        ("phi3:mini", 4),\n    ]\n    # TODO: Return first model that fits in ram_gb\n    pass\n\nprint(select_model(16))\nprint(select_model(5))\nprint(select_model(2))',
  testCases: [
    { id: 'tc1', input: 'print(select_model(16))', expectedOutput: 'deepseek-r1:8b', description: '16 GB fits the largest model' },
    { id: 'tc2', input: 'print(select_model(5))', expectedOutput: 'phi3:mini', description: '5 GB only fits phi3' },
    { id: 'tc3', input: 'print(select_model(2))', expectedOutput: 'none', description: '2 GB fits nothing' },
  ],
  hints: [
    'The list is sorted largest-first. Loop and return the first model where ram_gb >= min_ram.',
    'for name, min_ram in models:\n    if ram_gb >= min_ram:\n        return name\nreturn "none"',
  ],
  solution: 'def select_model(ram_gb):\n    models = [\n        ("deepseek-r1:8b", 8),\n        ("llama3.1:8b", 7),\n        ("qwen2.5:7b", 7),\n        ("mistral:7b", 6),\n        ("phi3:mini", 4),\n    ]\n    for name, min_ram in models:\n        if ram_gb >= min_ram:\n            return name\n    return "none"\n\nprint(select_model(16))\nprint(select_model(5))\nprint(select_model(2))',
  solutionExplanation: 'The models are sorted by RAM requirement descending. The first match is always the biggest model that fits. If nothing fits, return "none".',
  xpReward: 15,
}
Summary
You can now run LLMs locally with Ollama. Here’s what you covered:
- Ollama pulls and serves open-source LLMs through a local REST API
- Q4/Q8 shrinks models by up to 8x with very little quality loss. GGUF is the file format.
- The OpenAI-style API lets you switch local and cloud by changing one URL
- Modelfiles let you set system prompts and settings for steady results
- Local LLMs are best for privacy, offline use, and free bulk work
- Cloud LLMs win for hard reasoning, long context, and many users
- The local-first fallback gives you the best of both worlds
Practice Exercise:
Build a BatchProcessor class that takes a list of prompts and processes them through the SmartLLMClient. It should process each prompt, track local vs cloud counts, and print a summary with total prompts, local count, cloud count, and average response time.
Frequently Asked Questions
Can I run Ollama without a GPU?
Yes. Ollama runs on CPU by default. A GPU helps, but isn’t needed. With 16 GB RAM and a modern CPU, 7-8B models reply in 5-15 seconds. That’s fast enough for dev work and daily use.
How do I use Ollama with LangChain?
LangChain has a ChatOllama class. Install it with pip install langchain-ollama. Since Ollama works like OpenAI, you can also use ChatOpenAI(base_url="http://localhost:11434/v1/", api_key="ollama") — no extra package needed.
Can I fine-tune models through Ollama?
No. Ollama only serves ready-made models. For fine-tuning, use Unsloth, Axolotl, or Hugging Face PEFT. After you fine-tune, save to GGUF and load it into Ollama with a Modelfile.
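That last step can be sketched as a Modelfile that points at your exported weights. The file and model names below are placeholders for whatever your fine-tuning run produced:

```shell
# Write a Modelfile that loads a local GGUF file
# (my-finetuned.gguf is a placeholder name)
cat > Modelfile <<'EOF'
FROM ./my-finetuned.gguf
SYSTEM "You are a domain-specific assistant."
EOF

# Register it with Ollama, then run it like any pulled model
ollama create my-finetuned -f Modelfile
ollama run my-finetuned "Hello"
```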
How much disk space do multiple models need?
Each model takes 2-8 GB. Five 7-8B models need about 20-25 GB total. Ollama shares layers between model versions, so pulling variants uses less space than you’d think.
Is Ollama suitable for production APIs?
For one user or a small team, yes. Put it behind a proxy (nginx, Caddy) for internal use. For high-traffic production, look at vLLM or TGI instead. They handle batching and GPU sharing.
References
- Ollama official documentation
- Ollama GitHub repository
- Ollama model library
- Ollama blog — OpenAI compatibility
- llama.cpp — GGUF format specification
- Meta AI — Llama 3.1 model card
- Mistral AI — Mistral 7B technical report
- Microsoft Research — Phi-3 technical report. arXiv:2404.14219 (2024)
- DeepSeek AI — DeepSeek-R1 technical report
- OpenAI Python library documentation
