Ollama Tutorial: Run LLMs Locally (Llama, Mistral)

Learn to run LLMs locally with Ollama. Install Llama, Mistral, and DeepSeek, use the OpenAI-compatible Python API, and build a local-to-cloud fallback client.

Written by Selva Prabhakaran | 20 min read

Install open-source models on your laptop, call them from Python, and build a client that falls back to a cloud API when local quality drops.

You send a prompt to an LLM API. Three seconds later a reply shows up. But your data just left your machine. It crossed the web and landed on someone else’s server.

For private docs, patient records, or secret code — that’s a real problem.

Running LLMs on your own machine fixes it. Your data stays put. You pay nothing per token. And replies come back fast. Ollama makes this easy to set up. It works on a normal laptop with no GPU.

What Is Ollama?

Ollama pulls, manages, and runs open-source LLMs on your own machine. Think of it as Docker for language models. One command pulls a model. One more starts it.

Under the hood, Ollama wraps llama.cpp. This engine runs quantized models on both CPUs and GPUs. You don't build anything or set up CUDA. Ollama does it for you.

The result? A local server on http://localhost:11434 that speaks an OpenAI-style API. Any Python code that talks to OpenAI can talk to Ollama instead — just change one URL.

Key Insight: Ollama turns your laptop into a local LLM server with an OpenAI-style API. Change the base URL in your Python code — nothing else.

Install Ollama and Pull Your First Model

Go to ollama.com and grab the installer. It works on macOS, Linux, and Windows.

On macOS or Linux, one command handles everything:

bash
curl -fsSL https://ollama.com/install.sh | sh

On Windows, download the .exe installer and run it. Verify with:

bash
ollama --version

Output:

ollama version 0.6.2

The install starts a background service on port 11434. Open http://localhost:11434 in your browser — you’ll see “Ollama is running.”

Note: You can also run Ollama in Docker. Pull the image with `docker pull ollama/ollama` and start it with `docker run -d -p 11434:11434 ollama/ollama`. This is handy on Linux servers where you want a clean setup.

Prerequisites

  • Python version: 3.9+
  • Required libraries: openai (1.0+), requests
  • Install: pip install openai requests
  • Hardware: 8 GB RAM minimum (16 GB recommended)
  • Disk space: 2-8 GB per model
  • Time to complete: 20-25 minutes (plus model download time)

Now pull a model. Phi-3 Mini from Microsoft has 3.8 billion parameters and runs on almost any machine:

bash
ollama pull phi3:mini

Output:

pulling manifest
pulling model...
verifying sha256 digest
writing manifest
success

That pulled a quantized version of Phi-3 Mini (~2.3 GB). Test it:

bash
ollama run phi3:mini "What is gradient descent in one sentence?"

The model replies in a few seconds. No API key. No web access needed. No cost per token.

Quick Check: What port does Ollama listen on by default? (Answer: 11434)

Download More Models for Comparison

Ollama has hundreds of models in its library. Here are five good ones to try in this Ollama tutorial:

Model           Parameters  Disk Size  Best For
phi3:mini       3.8B        ~2.3 GB    Fast tasks, light Q&A
llama3.1:8b     8B          ~4.7 GB    General-purpose, balanced
mistral:7b      7B          ~4.1 GB    Good at following prompts
qwen2.5:7b      7B          ~4.4 GB    Many languages, strong coding
deepseek-r1:8b  8B          ~4.9 GB    Step-by-step reasoning

Pull them all:

bash
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull qwen2.5:7b
ollama pull deepseek-r1:8b

Each download takes a few minutes. Once pulled, models live in ~/.ollama/models/ on Linux/macOS or C:\Users\<you>\.ollama\models\ on Windows.

Tip: List your downloaded models with `ollama list`. Remove unused ones with `ollama rm <model>` to free disk space.

What Is Quantization? GGUF, Q4, and Q8 Explained

How does an 8-billion-parameter model fit in just 4.7 GB? Let’s do the math.

At full precision (FP32), each weight takes 4 bytes. An 8B model therefore needs about 32 GB of RAM, far too much for most laptops. The fix is to store each weight in fewer bits; that’s called quantization.

Quantization  Bits  8B Model Size  Quality Impact
Full (FP32)   32    ~32 GB         Baseline
Half (FP16)   16    ~16 GB         Negligible
Q8            8     ~8 GB          Very small
Q4_K_M        4     ~4.7 GB        Small for most tasks
Q2            2     ~2.5 GB        Noticeable; avoid for complex work

GGUF is the file format for these shrunk models. It’s what llama.cpp (and Ollama) reads. When you pull llama3.1:8b, you get Q4_K_M by default.

Here’s a way to think about it. Q8 is like a high-end JPEG — almost the same as the raw file. Q4 is like a normal JPEG — good enough for most work. Q2 is like a crushed JPEG — you can see the damage.

Key Insight: Q4 shrinks memory use by about 8x with very little quality loss. Most users won’t spot a gap on daily tasks like summaries, Q&A, and code writing.

Quick Check: If you have 16 GB of RAM, can you run a full-size (FP32) 8B model? (Answer: No — it needs ~32 GB. Use Q4 instead.)
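The sizes in the table above are just arithmetic. Here’s a quick sanity check in Python — a back-of-envelope sketch; real GGUF files add metadata and grouping overhead, which is why the Q4 file weighs ~4.7 GB rather than a clean 4.0:

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of a model's weights (decimal GB)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(model_size_gb(8, 32))  # FP32: 32.0 GB
print(model_size_gb(8, 16))  # FP16: 16.0 GB
print(model_size_gb(8, 4))   # Q4: 4.0 GB of weights; the file is ~4.7 GB with overhead
```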

Use the Ollama REST API from Python

Ollama has a REST API on localhost:11434. You don’t need the Ollama Python package. The requests library works fine for calling your local LLM.

Here’s a call to the chat endpoint. We send a system message and a user prompt, then print what comes back:

python
import requests
import json

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain list comprehension in 2 sentences."}
        ],
        "stream": False
    }
)

result = response.json()
print(result["message"]["content"])

The stream: False flag sends the full reply at once. Set it to True to get tokens one by one — handy for chat apps.
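With stream: True, the REST endpoint returns one JSON object per line. You can test the parsing logic without a running server by feeding in sample lines. The chunk shape used here ("message"/"content"/"done") follows Ollama’s chat stream format, but treat the exact field names as an assumption if your version differs:

```python
import json

def extract_stream_text(lines):
    """Join the content pieces from Ollama-style streaming JSON lines."""
    parts = []
    for line in lines:
        obj = json.loads(line)
        if not obj.get("done"):  # the final chunk has done=true and no content
            parts.append(obj["message"]["content"])
    return "".join(parts)

sample = [
    '{"message": {"content": "Hel"}, "done": false}',
    '{"message": {"content": "lo"}, "done": false}',
    '{"done": true}',
]
print(extract_stream_text(sample))  # Hello
```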

I like the REST API for quick scripts. But for bigger projects, the OpenAI-style endpoint is better. Here’s why.

Use the OpenAI-Compatible API

Ollama has an endpoint at /v1/chat/completions that follows the OpenAI API format. You can use the OpenAI Python library with zero code changes to run LLMs on your own machine.

Point the client at your local server. The api_key field is required by the library but ignored by Ollama:

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama"  # required but ignored
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a function to check if a number is prime."}
    ],
    temperature=0.7,
    max_tokens=200
)

print(response.choices[0].message.content)

That’s the exact same client.chat.completions.create() call you’d make to OpenAI. The only change is base_url.

Tip: Switch between local and cloud by changing two lines. For OpenAI: `base_url="https://api.openai.com/v1/"` and your real key. For Ollama: `base_url="http://localhost:11434/v1/"` and `api_key="ollama"`. Everything else stays the same.

Stream Responses Token by Token

For chat UIs, you want text to appear as the model generates it. Set stream=True and iterate over the chunks:

python
stream = client.chat.completions.create(
    model="mistral:7b",
    messages=[{"role": "user", "content": "What are Newton's three laws?"}],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()

Each chunk holds a small piece of the reply. The flush=True forces Python to print right away. You see a smooth typing effect in your terminal.

Compare Models on Quality and Speed

Each model has its own strengths. I’ve found Mistral follows prompts better than Llama. DeepSeek-R1 shines on tasks that need step-by-step thinking.

Let’s test all five on the same prompt. The script sends a coding task, tracks time, and prints a preview:

python
import time
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama"
)

models = [
    "phi3:mini", "llama3.1:8b", "mistral:7b",
    "qwen2.5:7b", "deepseek-r1:8b"
]
prompt = "Write a Python function that reverses a string without slicing."

for model_name in models:
    start = time.time()
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=150
    )
    elapsed = time.time() - start
    print(f"\n{'='*50}")
    print(f"Model: {model_name} | Time: {elapsed:.1f}s")
    print(f"{'='*50}")
    print(resp.choices[0].message.content[:300])

On a typical laptop with 16 GB RAM and no GPU, expect these rough ranges:

Model           Response Time  Notes
phi3:mini       2-5 s          Fastest. Good for simple tasks.
llama3.1:8b     5-15 s         Balanced quality and speed.
mistral:7b      5-12 s         Detailed. Follows prompts well.
qwen2.5:7b      5-14 s         Strong at code. Many languages.
deepseek-r1:8b  8-20 s         Shows its thinking. Slower but deep.

Warning: Speed depends on your hardware. These numbers assume CPU-only mode with 16 GB RAM. A GPU with 8+ GB VRAM cuts times by 3-5x. With only 8 GB RAM, big models may run very slowly or fail.

Customize Models with a Modelfile

Ollama lets you change how a model acts through a Modelfile. Think of it as a Dockerfile — but for LLMs. You set a system prompt, tweak the temperature, and lock in settings.

Create a file called Modelfile with these contents:

FROM llama3.1:8b

SYSTEM "You are a Python coding assistant. Always include type hints. Keep answers concise."

PARAMETER temperature 0.3
PARAMETER num_ctx 4096

Then build your custom model:

bash
ollama create python-coder -f Modelfile

Now you can use it like any other model:

bash
ollama run python-coder "Write a function to flatten a nested list."

The model uses your system prompt and settings by default. No need to pass them in every API call. This helps keep things the same across your team or project.
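If you maintain several custom personas, you can generate Modelfiles from Python instead of writing each by hand. This make_modelfile helper is a hypothetical convenience of my own, not part of Ollama; it just builds the text you’d pass to `ollama create -f`:

```python
def make_modelfile(base: str, system_prompt: str,
                   temperature: float = 0.3, num_ctx: int = 4096) -> str:
    """Build Modelfile text (FROM / SYSTEM / PARAMETER directives)."""
    return (
        f"FROM {base}\n\n"
        f'SYSTEM "{system_prompt}"\n\n'
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
    )

text = make_modelfile("llama3.1:8b", "You are a concise SQL tutor.")
print(text)
```

Write the result to a file named Modelfile, then run `ollama create sql-tutor -f Modelfile` as shown above.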

Local vs Cloud — When to Use Which

This is the top question I hear about running LLMs on your own machine. Short answer: use both.

Go local when:
– Data is sensitive (medical, legal, proprietary code)
– You need offline access
– You want zero API cost for bulk processing
– Latency matters and you have decent hardware

Go cloud when:
– You need top-tier reasoning (local 7-8B models can’t match GPT-4 on hard tasks)
– You’re serving many users at once (a laptop handles 1-2 at a time)
– You need long context windows (100K+ tokens)
– You need vision or audio features

The best path is to mix both. Route simple tasks to your local model. Send hard tasks to the cloud. That’s what we’ll build next.

Note: Local models keep getting better. Today’s 8B models match cloud models from 2 years ago. Check your local-vs-cloud split every 6 months.
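One way to automate that split is a tiny heuristic router. This sketch is purely illustrative; the keyword list and length cutoff are made-up thresholds you would tune for your own workload:

```python
def route(prompt: str,
          hard_keywords=("prove", "legal analysis", "multi-step")) -> str:
    """Toy heuristic: long or keyword-flagged prompts go to cloud."""
    if len(prompt) > 2000 or any(k in prompt.lower() for k in hard_keywords):
        return "cloud"
    return "local"

print(route("Summarize this paragraph."))         # local
print(route("Prove this theorem step by step."))  # cloud
```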

Build a Python Client with Local-to-Cloud Fallback

Here’s the hands-on project. We’ll build a SmartLLMClient that tries Ollama first. If the local model fails or sends back a very short reply, it falls back to a cloud API.

First, the class shell. It holds clients for both local and cloud:

python
import os
import time
from openai import OpenAI

class SmartLLMClient:
    """Tries local Ollama first, falls back to cloud."""

    def __init__(
        self,
        local_model="llama3.1:8b",
        cloud_model="gpt-4o-mini",
        min_response_length=20
    ):
        self.local_client = OpenAI(
            base_url="http://localhost:11434/v1/",
            api_key="ollama"
        )
        self.cloud_client = OpenAI(
            api_key=os.getenv("OPENAI_API_KEY", "your-key-here")
        )
        self.local_model = local_model
        self.cloud_model = cloud_model
        self.min_response_length = min_response_length

The min_response_length is our quality gate. Fewer than 20 chars means something broke.
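That quality gate is just a length check. Pulled out as a standalone predicate (a hypothetical helper mirroring the class logic, handy for unit testing):

```python
def passes_quality_gate(text: str, min_length: int = 20) -> bool:
    """Reject empty or near-empty replies so the caller can fall back."""
    return len(text.strip()) >= min_length

print(passes_quality_gate("Hi."))                                 # False
print(passes_quality_gate("A list is mutable; a tuple is not."))  # True
```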

The generate method wraps the local call in try/except. If it fails or the reply is too short, it tries the cloud:

python
    def generate(self, prompt, system_msg="You are a helpful assistant."):
        """Try local model first, fall back to cloud on failure."""
        messages = [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": prompt}
        ]

        # Attempt local inference
        try:
            start = time.time()
            resp = self.local_client.chat.completions.create(
                model=self.local_model,
                messages=messages,
                temperature=0.7,
                max_tokens=500
            )
            content = resp.choices[0].message.content
            elapsed = time.time() - start

            if len(content.strip()) >= self.min_response_length:
                return {"content": content, "source": "local",
                        "model": self.local_model, "time": round(elapsed, 2)}

            print(f"Local response too short. Falling back.")
        except Exception as e:
            print(f"Local failed: {e}. Falling back to cloud.")

If local works and the reply is long enough, we return right away. If not, the cloud takes over:

python
        # Cloud fallback
        start = time.time()
        resp = self.cloud_client.chat.completions.create(
            model=self.cloud_model,
            messages=messages,
            temperature=0.7,
            max_tokens=500
        )
        content = resp.choices[0].message.content
        elapsed = time.time() - start

        return {"content": content, "source": "cloud",
                "model": self.cloud_model, "time": round(elapsed, 2)}

Let’s use it. Create an instance and send a prompt:

python
# Requires: Ollama running locally + OPENAI_API_KEY for cloud fallback
llm = SmartLLMClient(
    local_model="llama3.1:8b",
    cloud_model="gpt-4o-mini"
)

result = llm.generate("Explain the difference between a list and a tuple in Python.")

print(f"Source: {result['source']} ({result['model']})")
print(f"Time: {result['time']}s")
print(f"\n{result['content']}")

If Ollama isn’t running, you’ll see:

Local failed: Connection refused. Falling back to cloud.
Source: cloud (gpt-4o-mini)

Key Insight: The local-first, cloud-backup pattern works in real apps. Route cheap, high-volume tasks to your local model. Save cloud APIs for hard reasoning. This cuts API costs a lot.

Exercise 1: Add Response Logging (intermediate)

Extend the SmartLLMClient by adding a log list attribute. Every call to generate should append a dictionary with keys "timestamp", "source", "model", and "prompt_length" to self.log. Then add a get_stats method that returns a dictionary with "total_calls", "local_calls", and "cloud_calls". The starter code gives you the class skeleton. Fill in the logging and stats.

Starter code:

python
import time

class SmartLLMClient:
    def __init__(self):
        self.log = []

    def generate(self, prompt):
        source = "local"
        model = "llama3.1:8b"
        # TODO: Append to self.log
        self.log.append({
            # fill in
        })
        return {"content": "response", "source": source}

    def get_stats(self):
        # TODO: Return dict with total_calls, local_calls, cloud_calls
        pass

Hints:

  • Use time.time() for the timestamp and len(prompt) for prompt_length.
  • For get_stats: sum(1 for e in self.log if e["source"] == "local")

Solution:

python
import time

class SmartLLMClient:
    def __init__(self):
        self.log = []

    def generate(self, prompt):
        source = "local"
        model = "llama3.1:8b"
        self.log.append({
            "timestamp": time.time(),
            "source": source,
            "model": model,
            "prompt_length": len(prompt)
        })
        return {"content": "response", "source": source}

    def get_stats(self):
        return {
            "total_calls": len(self.log),
            "local_calls": sum(1 for e in self.log if e["source"] == "local"),
            "cloud_calls": sum(1 for e in self.log if e["source"] == "cloud")
        }

client = SmartLLMClient()
client.generate("Hello")
client.generate("World")
stats = client.get_stats()
print(f"Total: {stats['total_calls']}, Local: {stats['local_calls']}, Cloud: {stats['cloud_calls']}")

Each generate() call appends a log entry. The get_stats() method counts entries by source using a generator expression inside sum().

Common Mistakes and How to Fix Them

Mistake 1: Running Python code before starting Ollama

python
# This will fail with ConnectionError
response = requests.post("http://localhost:11434/api/chat", ...)
# ConnectionError: Connection refused

Ollama runs as a background service. On macOS and Windows, it starts at boot. On Linux, you may need to start it on your own:

bash
ollama serve

Run your Python code in a separate terminal after the service is up.

Mistake 2: Using a model you haven’t pulled

bash
ollama run llama3:70b
# Error: model "llama3:70b" not found, try pulling it first

Ollama doesn’t auto-download. Pull first, then run:

bash
ollama pull llama3:70b
ollama run llama3:70b

Mistake 3: Loading a model too large for your RAM

A 70B model on a 16 GB machine will crash or crawl. Ollama swaps to disk, and each reply takes minutes.

Rule of thumb: Free RAM should be at least 1.5x the model’s file size. A 4.7 GB model needs ~7 GB free. A 40 GB model needs ~60 GB.

Warning: Check free RAM before pulling big models. Linux: `free -h`. macOS: Activity Monitor. Windows: Task Manager. If a model won’t fit, use a smaller quantization or a smaller model.
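You can encode that rule of thumb in a couple of lines. These helpers are illustrative, with the 1.5x headroom factor taken straight from the rule above:

```python
def ram_needed_gb(model_file_gb: float, headroom: float = 1.5) -> float:
    """Rule of thumb: free RAM should be ~1.5x the model's file size."""
    return model_file_gb * headroom

def fits(model_file_gb: float, free_ram_gb: float) -> bool:
    """True if the model should run comfortably in the given free RAM."""
    return free_ram_gb >= ram_needed_gb(model_file_gb)

print(fits(4.7, 16))   # True: a Q4 8B model is fine with 16 GB free
print(fits(40.0, 16))  # False: a 70B-class model needs ~60 GB free
```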

Exercise 2: Build a Model Selector (intermediate)

Write a function select_model(ram_gb) that returns the largest Ollama model the user can run. Model RAM requirements: phi3:mini (4 GB), mistral:7b (6 GB), llama3.1:8b (7 GB), qwen2.5:7b (7 GB), deepseek-r1:8b (8 GB). Return the model with the highest RAM requirement that fits. If nothing fits, return "none".

Starter code:

python
def select_model(ram_gb):
    models = [
        ("deepseek-r1:8b", 8),
        ("llama3.1:8b", 7),
        ("qwen2.5:7b", 7),
        ("mistral:7b", 6),
        ("phi3:mini", 4),
    ]
    # TODO: Return the first model that fits in ram_gb
    pass

print(select_model(16))  # deepseek-r1:8b
print(select_model(5))   # phi3:mini
print(select_model(2))   # none

Hint: The list is sorted largest-first. Loop and return the first model where ram_gb >= min_ram; if the loop finishes, return "none".

Solution:

python
def select_model(ram_gb):
    models = [
        ("deepseek-r1:8b", 8),
        ("llama3.1:8b", 7),
        ("qwen2.5:7b", 7),
        ("mistral:7b", 6),
        ("phi3:mini", 4),
    ]
    for name, min_ram in models:
        if ram_gb >= min_ram:
            return name
    return "none"

The models are sorted by RAM requirement descending, so the first match is always the biggest model that fits. If nothing fits, the function returns "none".

Complete Code

Full script (copy-paste and run):
python
# Complete code from: Run LLMs Locally with Ollama
# Requires: pip install openai requests
# Requires: Ollama running locally (ollama.com)
# Python 3.9+

import os
import time
import requests
from openai import OpenAI

# --- Section 1: Basic REST API Call ---
print("=== REST API Call ===")
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain list comprehension in 2 sentences."}
        ],
        "stream": False
    }
)
result = response.json()
print(result["message"]["content"])

# --- Section 2: OpenAI-Compatible API ---
print("\n=== OpenAI-Compatible API ===")
client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama"
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a function to check if a number is prime."}
    ],
    temperature=0.7,
    max_tokens=200
)
print(response.choices[0].message.content)

# --- Section 3: Streaming ---
print("\n=== Streaming ===")
stream = client.chat.completions.create(
    model="mistral:7b",
    messages=[{"role": "user", "content": "What are Newton's three laws?"}],
    stream=True
)
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()

# --- Section 4: Model Comparison ---
print("\n=== Model Comparison ===")
models = ["phi3:mini", "llama3.1:8b", "mistral:7b",
          "qwen2.5:7b", "deepseek-r1:8b"]
prompt = "Write a Python function that reverses a string without slicing."

for model_name in models:
    start = time.time()
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=150
    )
    elapsed = time.time() - start
    print(f"\nModel: {model_name} | Time: {elapsed:.1f}s")
    print(resp.choices[0].message.content[:200])

# --- Section 5: Smart LLM Client ---
print("\n=== Smart Client ===")

class SmartLLMClient:
    def __init__(self, local_model="llama3.1:8b",
                 cloud_model="gpt-4o-mini", min_response_length=20):
        self.local_client = OpenAI(
            base_url="http://localhost:11434/v1/", api_key="ollama")
        self.cloud_client = OpenAI(
            api_key=os.getenv("OPENAI_API_KEY", "your-key-here"))
        self.local_model = local_model
        self.cloud_model = cloud_model
        self.min_response_length = min_response_length

    def generate(self, prompt, system_msg="You are a helpful assistant."):
        messages = [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": prompt}
        ]
        try:
            start = time.time()
            resp = self.local_client.chat.completions.create(
                model=self.local_model, messages=messages,
                temperature=0.7, max_tokens=500)
            content = resp.choices[0].message.content
            elapsed = time.time() - start
            if len(content.strip()) >= self.min_response_length:
                return {"content": content, "source": "local",
                        "model": self.local_model, "time": round(elapsed, 2)}
            print("Local response too short. Falling back.")
        except Exception as e:
            print(f"Local failed: {e}. Falling back to cloud.")

        start = time.time()
        resp = self.cloud_client.chat.completions.create(
            model=self.cloud_model, messages=messages,
            temperature=0.7, max_tokens=500)
        content = resp.choices[0].message.content
        elapsed = time.time() - start
        return {"content": content, "source": "cloud",
                "model": self.cloud_model, "time": round(elapsed, 2)}

llm = SmartLLMClient()
result = llm.generate("Explain list vs tuple in Python.")
print(f"Source: {result['source']} ({result['model']})")
print(f"Time: {result['time']}s")
print(result['content'])

print("\nScript completed successfully.")

Summary

You can now run LLMs locally with Ollama. Here’s what you covered:

  • Ollama pulls and serves open-source LLMs through a local REST API
  • Q4/Q8 shrinks models by up to 8x with very little quality loss. GGUF is the file format.
  • The OpenAI-style API lets you switch local and cloud by changing one URL
  • Modelfiles let you set system prompts and settings for steady results
  • Local LLMs are best for privacy, offline use, and free bulk work
  • Cloud LLMs win for hard reasoning, long context, and many users
  • The local-first fallback gives you the best of both worlds

Practice Exercise:

Build a BatchProcessor class that takes a list of prompts and processes them through the SmartLLMClient. It should process each prompt, track local vs cloud counts, and print a summary with total prompts, local count, cloud count, and average response time.

Solution:
python
class BatchProcessor:
    def __init__(self, client):
        self.client = client
        self.results = []

    def process(self, prompts):
        for prompt in prompts:
            result = self.client.generate(prompt)
            self.results.append(result)
            print(f"[{result['source']}] {prompt[:50]}...")

        local = sum(1 for r in self.results if r["source"] == "local")
        cloud = sum(1 for r in self.results if r["source"] == "cloud")
        avg_time = sum(r["time"] for r in self.results) / len(self.results)

        print(f"\n--- Summary ---")
        print(f"Total: {len(self.results)}")
        print(f"Local: {local} | Cloud: {cloud}")
        print(f"Avg time: {avg_time:.2f}s")

# Usage
llm = SmartLLMClient()
processor = BatchProcessor(llm)
processor.process([
    "What is a decorator in Python?",
    "Explain the GIL in one paragraph.",
    "What does __init__ do?",
])

Frequently Asked Questions

Can I run Ollama without a GPU?

Yes. Ollama runs on CPU by default. A GPU helps, but isn’t needed. With 16 GB RAM and a modern CPU, 7-8B models reply in 5-15 seconds. That’s fast enough for dev work and daily use.

How do I use Ollama with LangChain?

LangChain has a ChatOllama class. Install it with pip install langchain-ollama. Since Ollama works like OpenAI, you can also use ChatOpenAI(base_url="http://localhost:11434/v1/", api_key="ollama") — no extra package needed.

Can I fine-tune models through Ollama?

No. Ollama only serves ready-made models. For fine-tuning, use Unsloth, Axolotl, or Hugging Face PEFT. After you fine-tune, save to GGUF and load it into Ollama with a Modelfile.

How much disk space do multiple models need?

Each model takes 2-8 GB. Five 7-8B models need about 20-25 GB total. Ollama shares layers between model versions, so pulling variants uses less space than you’d think.

Is Ollama suitable for production APIs?

For one user or a small team, yes. Put it behind a proxy (nginx, Caddy) for internal use. For high-traffic production, look at vLLM or TGI instead. They handle batching and GPU sharing.
