Hugging Face Inference API Tutorial in Python
Master Hugging Face inference in 20 minutes. Run LLMs locally with Pipeline API or serverless via HTTP — with Python examples you can copy and run.
Run LLMs locally with two lines of code, or call them over HTTP without any GPU — your choice.
This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.
You want to run an LLM. Maybe to generate text. Maybe to summarize a long document. You open the Hugging Face Model Hub and see 800,000+ models staring back at you. Which one do you pick? How do you run it? Do you need a GPU?
This article answers all of that. You’ll learn two ways to run models: the Pipeline API (local) and the Inference API (over HTTP). By the end, you’ll know when to use each one.
What Is Hugging Face Inference?
Hugging Face is the GitHub of machine learning. It hosts over 800,000 pre-trained models that you can download and use for free. Here’s a quick taste of what that looks like in code:
```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Machine learning is", max_new_tokens=20))
```
Two lines to load a model. One line to generate text. That’s the Pipeline API in action.
The platform has three parts you care about:
- Model Hub — where models live. You browse it at huggingface.co/models.
- Transformers library — Python tools to load and run models on your machine.
- Inference API — call models over HTTP. No downloads, no GPU, no setup.
Key Insight: Hugging Face gives you two paths to run any model: download it and run locally (Pipeline API), or call it over the internet (Inference API). Everything else is details.
How to Find Models on the Hugging Face Hub
Before you write any code, you need to find the right model. The Hub can feel overwhelming at first. Here’s how to cut through the noise.
Go to huggingface.co/models. Use the sidebar filters. Start with task — click “Text Generation” and the list narrows to models built for that job.
Each model has a model card. Think of it as a README for the model. Here’s what to check:
| Section | What It Tells You | Why You Care |
|---|---|---|
| Description | What the model does | Does it fit your task? |
| Parameters | 1B, 7B, 13B, 70B | Bigger = smarter but needs more RAM |
| License | MIT, Apache 2.0, etc. | Can you use it in production? |
| Usage example | Code snippet | Copy-paste starting point |
| Downloads | Popularity count | Popular = better community support |
Tip: Sort by “Most downloads” when starting out. Models like `meta-llama/Llama-3.1-8B-Instruct` and `google/gemma-2-2b-it` have large communities. More users means more help when you get stuck.
Understanding Model Sizes
Model size tells you what hardware you need. Here’s the rule of thumb.
One billion parameters needs about 2 GB of RAM in float16. A 7B model needs ~14 GB. A 13B model needs ~26 GB.
Got a laptop with 16 GB RAM? You can run models up to 7B. No GPU? Models under 3B run on CPU. They’re slow, but they work.
For anything bigger, skip the download. Use the Inference API instead.
Quick check: A model has 3 billion parameters. Roughly how much RAM does it need in float16? (Answer: about 6 GB.)
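The rule of thumb is easy to encode. Here's a rough sketch (the function name and byte counts are illustrative; real usage adds overhead for activations and the KV cache, so treat the result as a floor):

```python
def ram_needed_gb(params_billions, bytes_per_param=2):
    """Rough RAM floor: parameter count x bytes per parameter.

    float16 = 2 bytes/param, int8 = 1, float32 = 4.
    Ignores activation and KV-cache overhead.
    """
    return params_billions * bytes_per_param

print(ram_needed_gb(7))                      # 7B in float16 -> 14 GB
print(ram_needed_gb(3))                      # the quick-check answer -> 6 GB
print(ram_needed_gb(13, bytes_per_param=1))  # 13B in 8-bit -> 13 GB
```

Multiply by 2 for float32, or halve for 8-bit quantization, and you can size any model on the Hub before downloading it.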
Setting Up Your Environment
You need Python 3.9+ and a few packages.
```bash
pip install transformers huggingface_hub torch
```
The transformers and torch packages are for local pipeline use. If you only want the Inference API, huggingface_hub alone is enough.
You also need a Hugging Face token. Go to huggingface.co/settings/tokens. Create a token with “Read” access.
```python
import micropip
await micropip.install('requests')  # browser (Pyodide) runs only; skip this locally

import os
from transformers import pipeline
from huggingface_hub import InferenceClient

os.environ["HF_TOKEN"] = "hf_your_token_here"
```
Prerequisites
- Python version: 3.9+
- Required libraries: transformers (4.40+), huggingface_hub (0.23+), torch (2.0+)
- Install: `pip install transformers huggingface_hub torch`
- API token: free account + token with "Read" access (huggingface.co/settings/tokens)
- Time to complete: 20-25 minutes
- Local inference hardware: 8+ GB RAM (CPU) or GPU with 6+ GB VRAM
Hugging Face Pipeline API — Running Models Locally
Runs in browser? No. This section needs local Python with `transformers` and `torch`.
The Pipeline API is the simplest way to run a model on your machine. It needs two things: a task and a model. It downloads the model on first use, caches it, and handles all processing.
Here’s text generation with google/gemma-2-2b-it — a 2B instruction-tuned model that runs on CPU.
```python
from transformers import pipeline

generator = pipeline(
    task="text-generation",
    model="google/gemma-2-2b-it",
    device="cpu"  # Use "cuda" if you have a GPU
)

result = generator(
    "Explain what a neural network is in two sentences.",
    max_new_tokens=100,
    temperature=0.7
)
print(result[0]["generated_text"])
```
Your output will differ from mine — text generation is random by default. But you’ll see a clear explanation of neural networks.
What do these parameters do?
- `task` — tells the pipeline what kind of work to do
- `model` — which model to load from the Hub
- `device` — CPU or GPU
- `max_new_tokens` — caps how long the response can be
- `temperature` — higher = more creative, lower = more predictable
Warning: First run downloads the model. For `gemma-2-2b-it`, that’s ~5 GB. After that, it loads from cache in seconds.
Summarization with Pipeline
The Pipeline API handles dozens of tasks. You just change the task name and the model. Let me show you summarization.
We’ll use facebook/bart-large-cnn. This model takes long text and returns a short summary. The max_length and min_length parameters control how long the summary can be.
```python
summarizer = pipeline(
    task="summarization",
    model="facebook/bart-large-cnn"
)

long_text = """
Hugging Face is a company that develops tools for building
machine learning applications. The company is most notable
for its Transformers library, which provides thousands of
pretrained models. Founded in 2016, Hugging Face has grown
into one of the most important AI platforms, hosting over
800,000 models.
"""

summary = summarizer(long_text, max_length=50, min_length=20)
print(summary[0]["summary_text"])
```
Result:
```text
Hugging Face develops tools for building machine learning applications. The company is most notable for its Transformers library. It hosts over 800,000 models.
```
See the pattern? Same pipeline() call. Different task. Different model. Everything else is the same.
Supported Pipeline Tasks
Here are the most useful tasks for LLM work:
| Task | What It Does | Example Model |
|---|---|---|
| `text-generation` | Generate from a prompt | google/gemma-2-2b-it |
| `summarization` | Condense long text | facebook/bart-large-cnn |
| `text-classification` | Sentiment, topic | distilbert-base-uncased-finetuned-sst-2-english |
| `question-answering` | Answer from context | deepset/roberta-base-squad2 |
| `translation` | Translate text | Helsinki-NLP/opus-mt-en-fr |
Pick a task. Pick a model. Call pipeline(). That’s the whole workflow.
Key Insight: The Pipeline API hides tokenization, model loading, and decoding. You focus on WHAT you want — not HOW it works under the hood.
Hugging Face Inference API — Calling Models Over HTTP
Runs in browser? Yes. HTTP calls in this section work in Pyodide.
What if you don’t want to download anything? Maybe you have no GPU. Maybe you’re building a web app. Maybe you just want fast results.
The Inference API solves this. Send an HTTP request to Hugging Face’s servers. They run the model. You get the result. No downloads. No torch.
Using InferenceClient
The huggingface_hub library gives you InferenceClient. It wraps the HTTP API and handles auth, retries, and errors for you.
Here’s the same text generation task. Same model. But this time it runs on HF’s servers, not yours.
```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_your_token_here")

response = client.text_generation(
    prompt="Explain neural networks in two sentences.",
    model="google/gemma-2-2b-it",
    max_new_tokens=100,
    temperature=0.7
)
print(response)
```
Same model, same prompt — but nothing was downloaded. The GPU lives on Hugging Face’s side.
Chat Completions (OpenAI-Compatible)
Used OpenAI’s API before? The chat_completion method uses the exact same format. You can switch from OpenAI to Hugging Face by changing two lines: the client and the model name.
```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_your_token_here")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a function to reverse a string."}
]

response = client.chat_completion(
    messages=messages,
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_tokens=200
)
print(response.choices[0].message.content)
```
The response looks just like OpenAI’s: choices[0].message.content. Switching between the two is painless.
Streaming Responses
When generating long text, waiting for the full response feels slow. Streaming sends tokens one at a time — just like ChatGPT.
The stream=True parameter turns on streaming. You loop over tokens and print each one as it arrives.
```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_your_token_here")

stream = client.text_generation(
    prompt="Write a haiku about Python.",
    model="google/gemma-2-2b-it",
    max_new_tokens=50,
    stream=True
)

for token in stream:
    print(token, end="", flush=True)
print()
```
The flush=True pushes each token to screen right away. Without it, Python buffers output and you lose the streaming effect.
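You can see the effect without any API call by simulating a stream. The `fake_stream` helper below is a hypothetical stand-in for the token iterator, not an HF API:

```python
import time

def fake_stream(tokens, delay=0.05):
    """Print tokens one at a time, like a streaming LLM response."""
    for token in tokens:
        print(token, end="", flush=True)  # flush pushes each token out immediately
        time.sleep(delay)
    print()

fake_stream(["Code ", "flows ", "like ", "a ", "stream"])
```

Drop `flush=True` and the text appears all at once at the end, which is exactly what buffering does to real streamed responses.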
Raw HTTP Requests (No Library Needed)
Runs in browser? Yes. Swap `requests` for `pyodide.http.pyfetch`.
You don’t need any HF library at all. Plain HTTP works. The endpoint follows this pattern: https://api-inference.huggingface.co/models/{model_id}.
```python
import requests

API_URL = "https://api-inference.huggingface.co/models/google/gemma-2-2b-it"
headers = {"Authorization": "Bearer hf_your_token_here"}

payload = {
    "inputs": "Explain neural networks in two sentences.",
    "parameters": {"max_new_tokens": 100, "temperature": 0.7}
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()[0]["generated_text"])
```
One URL. One POST. One JSON response. I prefer InferenceClient for production because it handles retries. But for quick tests or browser code, raw HTTP is fine.
Quick check: What HTTP method do you use to call the HF Inference API? (Answer: POST, not GET. You’re sending data, not just fetching.)
Exercise 1: Summarize Text with the Inference API

Difficulty: beginner. Use the Hugging Face Inference API (via `requests`) to summarize the given text using the `facebook/bart-large-cnn` model. Print only the summary text from the response.

Starter code:

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
headers = {"Authorization": "Bearer hf_your_token_here"}

text = "Hugging Face is a company that develops tools for building machine learning applications. The company is most notable for its Transformers library, which provides thousands of pretrained models. Founded in 2016, Hugging Face has grown into one of the most important platforms in the AI ecosystem."

# Send a POST request with the text as input
# Print the summary text from the response
```

Hint: the payload format is `{"inputs": text, "parameters": {"max_length": 50}}`.

Solution:

```python
response = requests.post(
    API_URL, headers=headers,
    json={"inputs": text, "parameters": {"max_length": 50}}
)
result = response.json()
print(result[0]["summary_text"])
```

We POST the text to the BART summarization endpoint. The API returns a JSON list. The first element has the summary under the key "summary_text".
Batch Inference — Processing Multiple Inputs
Both the Pipeline API and InferenceClient take lists of inputs. This is faster than one prompt at a time. The model runs them all in one go.
With the local Pipeline, just pass a list of strings. The model returns a list of results in the same order.
```python
prompts = [
    "Summarize: AI is transforming healthcare.",
    "Summarize: Python is popular for data science.",
    "Summarize: Cloud computing reduces costs."
]

results = generator(prompts, max_new_tokens=30, batch_size=3)
for i, result in enumerate(results):
    print(f"Prompt {i+1}: {result[0]['generated_text'][:80]}...")
```
The batch_size sets how many inputs run at once. Bigger batches are faster but use more memory. Start with 4-8 and tweak from there.
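If one big batch blows past your memory, process prompts in smaller groups. A minimal helper (the `batches` name is mine, not a transformers API):

```python
def batches(items, batch_size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

prompts = [f"Summarize document {n}" for n in range(10)]
for group in batches(prompts, 4):
    print(len(group), group[0])
    # results = generator(group, max_new_tokens=30, batch_size=len(group))
```

With 10 prompts and a batch size of 4, you get groups of 4, 4, and 2, each small enough to fit in memory.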
Local Pipeline vs. Inference API — When to Use Which
Should you run models locally or call the API? It depends.
| Factor | Local Pipeline | Inference API |
|---|---|---|
| Setup | Download model (minutes) | API key (seconds) |
| Hardware | CPU + RAM or GPU | None — HF servers |
| Cost | Free (your power bill) | Free tier, then pay-per-use |
| Latency | Fast after loading | +200-500ms network overhead |
| Privacy | Data stays local | Data goes to HF servers |
| Offline | Yes (after first download) | No — needs internet |
| Best for | Production, privacy, batches | Prototyping, web apps |
My rule of thumb: start with the Inference API to experiment. Try models fast. When you find the one you like, switch to local Pipeline for production.
Warning: The free Inference API has rate limits. For real workloads, get the Pro plan ($9/month) or use dedicated Inference Endpoints. Don’t build production on the free tier.
Cost Comparison
Real numbers for 10,000 calls per day with a 7B model:
Local GPU (A10G on AWS): ~$1/hour. Each call takes ~2 seconds. 10K calls = ~5.5 hours. That’s ~$5.50/day.
Inference API (Pro): $9/month + usage charges. Roughly $30-60/month for 10K calls/day.
Free tier: Won’t work. Rate limits stop you well before 10K calls.
For low-volume experiments, use the free tier. For production, do the math for your workload.
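The local-GPU arithmetic above is simple enough to script. A sketch using the numbers from this section (the function is illustrative; plug in your own call volume and instance price):

```python
def gpu_cost_per_day(calls_per_day, secs_per_call, dollars_per_hour):
    """Daily cost of running inference on a rented GPU."""
    gpu_hours = calls_per_day * secs_per_call / 3600
    return gpu_hours * dollars_per_hour

# 10K calls/day, ~2s each, A10G at ~$1/hour
print(f"${gpu_cost_per_day(10_000, 2, 1.0):.2f}/day")
```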
Handling Errors
You’ll hit these three errors when working with Hugging Face. Here’s how to handle each one.
Error 1: 503 — Model Loading (Cold Start)
The Inference API uses cold starts. If a model hasn’t been called recently, it loads first. You get a 503 while it warms up.
The fix: retry with a delay. The response includes an estimated_time field.
```python
import time
import requests

def call_with_retry(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code == 503:
            wait = response.json().get("estimated_time", 30)
            print(f"Loading... wait {wait:.0f}s")
            time.sleep(min(wait, 60))
            continue
        return response.json()
    raise TimeoutError("Model failed to load")

result = call_with_retry(API_URL, headers, payload)
print(result)
```
Error 2: CUDA Out of Memory (Local)
Your GPU can't hold the model. Two options:

```python
# Option 1: Use CPU (slow but works)
gen = pipeline("text-generation", model="google/gemma-2-2b-it", device="cpu")

# Option 2: 8-bit quantization (half the memory)
# pip install bitsandbytes
gen = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    model_kwargs={"load_in_8bit": True}
)
```
Quantization uses smaller numbers to shrink the model. An 8-bit model uses half the memory of 16-bit. The quality trade-off is tiny for most tasks.
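The savings are just arithmetic: bits per parameter divided by eight gives bytes per parameter. A quick sketch (the helper name is mine):

```python
def model_size_gb(params_billions, bits):
    """Approximate weight size of a model at a given precision."""
    return params_billions * bits / 8

print(model_size_gb(2, 16))  # a 2B model in float16 -> 4.0 GB
print(model_size_gb(2, 8))   # the same model in 8-bit -> 2.0 GB
```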
Error 3: 429 — Rate Limited
You’ve sent too many requests. Add a delay between calls.
```python
import time

for prompt in prompts:
    result = client.text_generation(
        prompt=prompt, model="google/gemma-2-2b-it"
    )
    print(result)
    time.sleep(1)  # stay under rate limits
```
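A fixed one-second sleep works for light use. If you still hit 429s, back off exponentially between attempts instead. A sketch of the schedule (pure Python, no API call; the cap value is my choice):

```python
def backoff_delays(max_retries, base=2.0, cap=30.0):
    """Exponential backoff schedule: 1s, 2s, 4s, ... capped."""
    return [min(base ** attempt, cap) for attempt in range(max_retries)]

print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

On each 429 response, sleep the next delay in the list before retrying; the cap keeps a long outage from producing multi-minute waits.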
Exercise 2: Build a Robust API Caller

Difficulty: intermediate. Complete the `safe_inference` function. It should call the HF Inference API and handle 503 errors by retrying up to 3 times with exponential backoff (wait 2^attempt seconds). Return the text on success, or "FAILED" after all retries.

Starter code:

```python
import requests
import time

def safe_inference(api_url, headers, payload, max_retries=3):
    """Call HF API with retry for 503 errors."""
    for attempt in range(max_retries):
        response = requests.post(api_url, headers=headers, json=payload)
        # TODO: Check status code
        # If 503: wait 2^attempt seconds, then retry
        # If 200: return the generated text
        pass
    return "FAILED"

url = "https://api-inference.huggingface.co/models/gpt2"
hdrs = {"Authorization": "Bearer hf_your_token_here"}
data = {"inputs": "Hello world"}
print(safe_inference(url, hdrs, data))
```

Hint: check `response.status_code` — 503 means loading, 200 means success. Use `time.sleep(2 ** attempt)` for backoff.

Solution:

```python
import requests
import time

def safe_inference(api_url, headers, payload, max_retries=3):
    """Call HF API with retry for 503 errors."""
    for attempt in range(max_retries):
        response = requests.post(api_url, headers=headers, json=payload)
        if response.status_code == 200:
            return response.json()[0]["generated_text"]
        elif response.status_code == 503:
            time.sleep(2 ** attempt)
    return "FAILED"

url = "https://api-inference.huggingface.co/models/gpt2"
hdrs = {"Authorization": "Bearer hf_your_token_here"}
data = {"inputs": "Hello world"}
print(safe_inference(url, hdrs, data))
```

On status 200, we return the text. On 503, we wait with exponential backoff (1s, 2s, 4s). After 3 failed retries, we return "FAILED".
When NOT to Use Hugging Face Inference
Hugging Face isn’t always the right tool. Three cases where you should look elsewhere.
Ultra-low latency. Need sub-100ms responses? Self-host with vLLM or TensorRT-LLM. Network overhead alone adds 100-500ms per API call.
Closed-source models. GPT-4, Claude, and Gemini Pro aren’t on the Hub. For those, use their native APIs. Hugging Face is the open-source ecosystem.
Massive batch jobs. Processing millions of documents? Download the model. Run it on a GPU cluster. API pricing adds up fast at that scale.
Note: Hugging Face also offers Inference Endpoints. These are private GPU boxes for your model. Prices start at ~$0.06/hour (CPU) and ~$0.60/hour (GPU). They sit between the free API and full self-hosting.
Common Mistakes and How to Fix Them
Mistake 1: No max_new_tokens
Without this, some models generate until they hit context limits. That’s 2048, 4096, or even 8192 tokens. Long waits. High costs.
```python
# Wrong — runs until max context
result = generator("Tell me about Python.")

# Right — cap the output
result = generator("Tell me about Python.", max_new_tokens=100)
```
Mistake 2: Wrong task name format
Pipeline tasks use hyphens, not underscores. This catches many people.
```python
# Wrong — underscore causes an error
generator = pipeline("text_generation", model="gpt2")

# Right — use hyphens
generator = pipeline("text-generation", model="gpt2")
```
Mistake 3: Input too long for API
The free API has input limits. A 10,000-word doc won’t work. Split it into chunks.
```python
def chunk_text(text, max_words=500):
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = chunk_text(long_document)
summaries = [summarizer(chunk)[0]["summary_text"]
             for chunk in chunks]
full_summary = " ".join(summaries)
```
Summary
Hugging Face gives you two paths to LLM inference. The Pipeline API runs models on your machine — great for privacy, offline use, and production. The Inference API calls models over HTTP — perfect for prototyping and no-GPU setups.
Start with the API to experiment. Switch to local Pipeline for production. That’s the typical workflow.
Frequently Asked Questions
Is the Hugging Face Inference API free?
There’s a free tier with monthly credits. For heavier use, the Pro plan costs $9/month with extra credits and pay-as-you-go pricing. Dedicated Endpoints start at $0.06/hour.
Can I run Hugging Face models without an API key?
For local Pipeline, many models work without auth. Just call pipeline("text-generation", model="gpt2"). But gated models (like Llama) need you to accept their license and use a token. The Inference API always needs a key.
How do I choose between Pipeline and InferenceClient?
Use pipeline() to run models on your hardware. Use InferenceClient for serverless inference on HF’s GPUs. Same model, same results — just different hardware.
What’s the difference between the free API and Inference Endpoints?
The free API runs on shared GPUs with other users. It has rate limits and cold starts. Inference Endpoints give you a dedicated GPU — always warm, no rate limits. It’s shared hosting vs. a private server.
