Hugging Face Inference API Tutorial in Python
Master Hugging Face inference in 20 minutes. Run LLMs locally with Pipeline API or serverless via HTTP — with Python examples you can copy and run.
Run LLMs locally with two lines of code, or call them over HTTP without any GPU — your choice.
This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.
You want to run an LLM. Maybe to generate text. Maybe to summarize a long document. You open the Hugging Face Model Hub and see 800,000+ models staring back at you. Which one do you pick? How do you run it? Do you need a GPU?
This article answers all of that. You’ll learn two ways to run models: the Pipeline API (local) and the Inference API (over HTTP). By the end, you’ll know when to use each one.
What Is Hugging Face Inference?
Hugging Face is the GitHub of machine learning. It hosts over 800,000 pre-trained models that you can download and use for free. Here’s a quick taste of what that looks like in code:
```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Machine learning is", max_new_tokens=20))
```
Two lines to load a model. One line to generate text. That’s the Pipeline API in action.
The platform has three parts you care about:
- Model Hub — where models live. You browse it at huggingface.co/models.
- Transformers library — Python tools to load and run models on your machine.
- Inference API — call models over HTTP. No downloads, no GPU, no setup.
Key Insight: Hugging Face gives you two paths to run any model: download it and run locally (Pipeline API), or call it over the internet (Inference API). Everything else is details.
How to Find Models on the Hugging Face Hub
Before you write any code, you need to find the right model. The Hub can feel overwhelming at first. Here’s how to cut through the noise.
Go to huggingface.co/models. Use the sidebar filters. Start with task — click “Text Generation” and the list narrows to models built for that job.
Each model has a model card. Think of it as a README for the model. Here’s what to check:
| Section | What It Tells You | Why You Care |
|---|---|---|
| Description | What the model does | Does it fit your task? |
| Parameters | 1B, 7B, 13B, 70B | Bigger = smarter but needs more RAM |
| License | MIT, Apache 2.0, etc. | Can you use it in production? |
| Usage example | Code snippet | Copy-paste starting point |
| Downloads | Popularity count | Popular = better community support |
Tip: Sort by “Most downloads” when starting out. Models like `meta-llama/Llama-3.1-8B-Instruct` and `google/gemma-2-2b-it` have large communities. More users means more help when you get stuck.
Understanding Model Sizes
Model size tells you what hardware you need. Here’s the rule of thumb.
One billion parameters needs about 2 GB of RAM in float16. A 7B model needs ~14 GB. A 13B model needs ~26 GB.
Got a laptop with 16 GB RAM? You can run models up to 7B. No GPU? Models under 3B run on CPU. They’re slow, but they work.
For anything bigger, skip the download. Use the Inference API instead.
Quick check: A model has 3 billion parameters. Roughly how much RAM does it need in float16? (Answer: about 6 GB.)
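The rule of thumb is easy to encode. Here's a rough sketch (the function name and byte counts are illustrative; real usage adds overhead for activations and the KV cache, so treat the result as a floor):

```python
def ram_needed_gb(params_billions, bytes_per_param=2):
    """Rough RAM floor: parameter count x bytes per parameter.

    float16 = 2 bytes/param, int8 = 1, float32 = 4.
    Ignores activation and KV-cache overhead.
    """
    return params_billions * bytes_per_param

print(ram_needed_gb(7))                      # 7B in float16 -> 14 GB
print(ram_needed_gb(3))                      # the quick-check answer -> 6 GB
print(ram_needed_gb(13, bytes_per_param=1))  # 13B in 8-bit -> 13 GB
```

Multiply by 2 for float32, or halve for 8-bit quantization, and you can size any model on the Hub before downloading it.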
Setting Up Your Environment
You need Python 3.9+ and a few packages.
```bash
pip install transformers huggingface_hub torch
```
The transformers and torch packages are for local pipeline use. If you only want the Inference API, huggingface_hub alone is enough.
You also need a Hugging Face token. Go to huggingface.co/settings/tokens. Create a token with “Read” access.
```python
import micropip
await micropip.install('requests')  # browser (Pyodide) runs only; skip this locally

import os
from transformers import pipeline
from huggingface_hub import InferenceClient

os.environ["HF_TOKEN"] = "hf_your_token_here"
```
Prerequisites
- Python version: 3.9+
- Required libraries: transformers (4.40+), huggingface_hub (0.23+), torch (2.0+)
- Install: `pip install transformers huggingface_hub torch`
- API token: free account + token with "Read" access (huggingface.co/settings/tokens)
- Time to complete: 20-25 minutes
- Local inference hardware: 8+ GB RAM (CPU) or GPU with 6+ GB VRAM
Hugging Face Pipeline API — Running Models Locally
Runs in browser? No. This section needs local Python with `transformers` and `torch`.
The Pipeline API is the simplest way to run a model on your machine. It needs two things: a task and a model. It downloads the model on first use, caches it, and handles all processing.
Here’s text generation with google/gemma-2-2b-it — a 2B instruction-tuned model that runs on CPU.
```python
from transformers import pipeline

generator = pipeline(
    task="text-generation",
    model="google/gemma-2-2b-it",
    device="cpu"  # Use "cuda" if you have a GPU
)

result = generator(
    "Explain what a neural network is in two sentences.",
    max_new_tokens=100,
    temperature=0.7
)
print(result[0]["generated_text"])
```
Your output will differ from mine — text generation is random by default. But you’ll see a clear explanation of neural networks.
What do these parameters do?
- `task` — tells the pipeline what kind of work to do
- `model` — which model to load from the Hub
- `device` — CPU or GPU
- `max_new_tokens` — caps how long the response can be
- `temperature` — higher = more creative, lower = more predictable
Warning: First run downloads the model. For `gemma-2-2b-it`, that’s ~5 GB. After that, it loads from cache in seconds.
Summarization with Pipeline
The Pipeline API handles dozens of tasks. You just change the task name and the model. Let me show you summarization.
We’ll use facebook/bart-large-cnn. This model takes long text and returns a short summary. The max_length and min_length parameters control how long the summary can be.
```python
summarizer = pipeline(
    task="summarization",
    model="facebook/bart-large-cnn"
)

long_text = """
Hugging Face is a company that develops tools for building
machine learning applications. The company is most notable
for its Transformers library, which provides thousands of
pretrained models. Founded in 2016, Hugging Face has grown
into one of the most important AI platforms, hosting over
800,000 models.
"""

summary = summarizer(long_text, max_length=50, min_length=20)
print(summary[0]["summary_text"])
```
Result:
```text
Hugging Face develops tools for building machine learning applications. The company is most notable for its Transformers library. It hosts over 800,000 models.
```
See the pattern? Same pipeline() call. Different task. Different model. Everything else is the same.
Supported Pipeline Tasks
Here are the most useful tasks for LLM work:
| Task | What It Does | Example Model |
|---|---|---|
| `text-generation` | Generate from a prompt | google/gemma-2-2b-it |
| `summarization` | Condense long text | facebook/bart-large-cnn |
| `text-classification` | Sentiment, topic | distilbert-base-uncased-finetuned-sst-2-english |
| `question-answering` | Answer from context | deepset/roberta-base-squad2 |
| `translation` | Translate text | Helsinki-NLP/opus-mt-en-fr |
Pick a task. Pick a model. Call pipeline(). That’s the whole workflow.
Key Insight: The Pipeline API hides tokenization, model loading, and decoding. You focus on WHAT you want — not HOW it works under the hood.
Hugging Face Inference API — Calling Models Over HTTP
Runs in browser? Yes. HTTP calls in this section work in Pyodide.
What if you don’t want to download anything? Maybe you have no GPU. Maybe you’re building a web app. Maybe you just want fast results.
The Inference API solves this. Send an HTTP request to Hugging Face’s servers. They run the model. You get the result. No downloads. No torch.
Using InferenceClient
The huggingface_hub library gives you InferenceClient. It wraps the HTTP API and handles auth, retries, and errors for you.
Here’s the same text generation task. Same model. But this time it runs on HF’s servers, not yours.
```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_your_token_here")

response = client.text_generation(
    prompt="Explain neural networks in two sentences.",
    model="google/gemma-2-2b-it",
    max_new_tokens=100,
    temperature=0.7
)
print(response)
```
Same model, same prompt — but nothing was downloaded. The GPU lives on Hugging Face’s side.
Chat Completions (OpenAI-Compatible)
Used OpenAI’s API before? The chat_completion method uses the exact same format. You can switch from OpenAI to Hugging Face by changing two lines: the client and the model name.
```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_your_token_here")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a function to reverse a string."}
]

response = client.chat_completion(
    messages=messages,
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_tokens=200
)
print(response.choices[0].message.content)
```
The response looks just like OpenAI’s: choices[0].message.content. Switching between the two is painless.
Streaming Responses
When generating long text, waiting for the full response feels slow. Streaming sends tokens one at a time — just like ChatGPT.
The stream=True parameter turns on streaming. You loop over tokens and print each one as it arrives.
```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_your_token_here")

stream = client.text_generation(
    prompt="Write a haiku about Python.",
    model="google/gemma-2-2b-it",
    max_new_tokens=50,
    stream=True
)

for token in stream:
    print(token, end="", flush=True)
print()
```
The flush=True pushes each token to screen right away. Without it, Python buffers output and you lose the streaming effect.
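You can see the effect without any API call by simulating a stream. The `fake_stream` helper below is a hypothetical stand-in for the token iterator, not an HF API:

```python
import time

def fake_stream(tokens, delay=0.05):
    """Print tokens one at a time, like a streaming LLM response."""
    for token in tokens:
        print(token, end="", flush=True)  # flush pushes each token out immediately
        time.sleep(delay)
    print()

fake_stream(["Code ", "flows ", "like ", "a ", "stream"])
```

Drop `flush=True` and the text appears all at once at the end, which is exactly what buffering does to real streamed responses.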
Raw HTTP Requests (No Library Needed)
Runs in browser? Yes. Swap `requests` for `pyodide.http.pyfetch`.
You don’t need any HF library at all. Plain HTTP works. The endpoint follows this pattern: https://api-inference.huggingface.co/models/{model_id}.
```python
import requests

API_URL = "https://api-inference.huggingface.co/models/google/gemma-2-2b-it"
headers = {"Authorization": "Bearer hf_your_token_here"}

payload = {
    "inputs": "Explain neural networks in two sentences.",
    "parameters": {"max_new_tokens": 100, "temperature": 0.7}
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()[0]["generated_text"])
```
One URL. One POST. One JSON response. I prefer InferenceClient for production because it handles retries. But for quick tests or browser code, raw HTTP is fine.
Quick check: What HTTP method do you use to call the HF Inference API? (Answer: POST, not GET. You’re sending data, not just fetching.)
Exercise 1: Summarize Text with the Inference API

Difficulty: beginner. Use the Hugging Face Inference API (via `requests`) to summarize the given text using the `facebook/bart-large-cnn` model. Print only the summary text from the response.

Starter code:

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
headers = {"Authorization": "Bearer hf_your_token_here"}

text = "Hugging Face is a company that develops tools for building machine learning applications. The company is most notable for its Transformers library, which provides thousands of pretrained models. Founded in 2016, Hugging Face has grown into one of the most important platforms in the AI ecosystem."

# Send a POST request with the text as input
# Print the summary text from the response
```

Hint: the payload format is `{"inputs": text, "parameters": {"max_length": 50}}`.

Solution:

```python
response = requests.post(
    API_URL, headers=headers,
    json={"inputs": text, "parameters": {"max_length": 50}}
)
result = response.json()
print(result[0]["summary_text"])
```

We POST the text to the BART summarization endpoint. The API returns a JSON list. The first element has the summary under the key "summary_text".
Batch Inference — Processing Multiple Inputs
Both the Pipeline API and InferenceClient take lists of inputs. This is faster than one prompt at a time. The model runs them all in one go.
With the local Pipeline, just pass a list of strings. The model returns a list of results in the same order.
```python
prompts = [
    "Summarize: AI is transforming healthcare.",
    "Summarize: Python is popular for data science.",
    "Summarize: Cloud computing reduces costs."
]

results = generator(prompts, max_new_tokens=30, batch_size=3)
for i, result in enumerate(results):
    print(f"Prompt {i+1}: {result[0]['generated_text'][:80]}...")
```
The batch_size sets how many inputs run at once. Bigger batches are faster but use more memory. Start with 4-8 and tweak from there.
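If one big batch blows past your memory, process prompts in smaller groups. A minimal helper (the `batches` name is mine, not a transformers API):

```python
def batches(items, batch_size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

prompts = [f"Summarize document {n}" for n in range(10)]
for group in batches(prompts, 4):
    print(len(group), group[0])
    # results = generator(group, max_new_tokens=30, batch_size=len(group))
```

With 10 prompts and a batch size of 4, you get groups of 4, 4, and 2, each small enough to fit in memory.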
Local Pipeline vs. Inference API — When to Use Which
Should you run models locally or call the API? It depends.
| Factor | Local Pipeline | Inference API |
|---|---|---|
| Setup | Download model (minutes) | API key (seconds) |
| Hardware | CPU + RAM or GPU | None — HF servers |
| Cost | Free (your power bill) | Free tier, then pay-per-use |
| Latency | Fast after loading | +200-500ms network overhead |
| Privacy | Data stays local | Data goes to HF servers |
| Offline | Yes (after first download) | No — needs internet |
| Best for | Production, privacy, batches | Prototyping, web apps |
My rule of thumb: start with the Inference API to experiment. Try models fast. When you find the one you like, switch to local Pipeline for production.
Warning: The free Inference API has rate limits. For real workloads, get the Pro plan ($9/month) or use dedicated Inference Endpoints. Don’t build production on the free tier.
Cost Comparison
Real numbers for 10,000 calls per day with a 7B model:
Local GPU (A10G on AWS): ~$1/hour. Each call takes ~2 seconds. 10K calls = ~5.5 hours. That’s ~$5.50/day.
Inference API (Pro): $9/month + usage charges. Roughly $30-60/month for 10K calls/day.
Free tier: Won’t work. Rate limits stop you well before 10K calls.
For low-volume experiments, use the free tier. For production, do the math for your workload.
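The local-GPU arithmetic above is simple enough to script. A sketch using the numbers from this section (the function is illustrative; plug in your own call volume and instance price):

```python
def gpu_cost_per_day(calls_per_day, secs_per_call, dollars_per_hour):
    """Daily cost of running inference on a rented GPU."""
    gpu_hours = calls_per_day * secs_per_call / 3600
    return gpu_hours * dollars_per_hour

# 10K calls/day, ~2s each, A10G at ~$1/hour
print(f"${gpu_cost_per_day(10_000, 2, 1.0):.2f}/day")
```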
Handling Errors
You’ll hit these three errors when working with Hugging Face. Here’s how to handle each one.
Error 1: 503 — Model Loading (Cold Start)
The Inference API uses cold starts. If a model hasn’t been called recently, it loads first. You get a 503 while it warms up.
The fix: retry with a delay. The response includes an estimated_time field.
```python
import time
import requests

def call_with_retry(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code == 503:
            wait = response.json().get("estimated_time", 30)
            print(f"Loading... wait {wait:.0f}s")
            time.sleep(min(wait, 60))
            continue
        return response.json()
    raise TimeoutError("Model failed to load")

result = call_with_retry(API_URL, headers, payload)
print(result)
```
Error 2: CUDA Out of Memory (Local)
Your GPU can't hold the model. Two options:

```python
# Option 1: Use CPU (slow but works)
gen = pipeline("text-generation", model="google/gemma-2-2b-it", device="cpu")

# Option 2: 8-bit quantization (half the memory)
# pip install bitsandbytes
gen = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    model_kwargs={"load_in_8bit": True}
)
```
Quantization uses smaller numbers to shrink the model. An 8-bit model uses half the memory of 16-bit. The quality trade-off is tiny for most tasks.
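The savings are just arithmetic: bits per parameter divided by eight gives bytes per parameter. A quick sketch (the helper name is mine):

```python
def model_size_gb(params_billions, bits):
    """Approximate weight size of a model at a given precision."""
    return params_billions * bits / 8

print(model_size_gb(2, 16))  # a 2B model in float16 -> 4.0 GB
print(model_size_gb(2, 8))   # the same model in 8-bit -> 2.0 GB
```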
Error 3: 429 — Rate Limited
You’ve sent too many requests. Add a delay between calls.
```python
import time

for prompt in prompts:
    result = client.text_generation(
        prompt=prompt, model="google/gemma-2-2b-it"
    )
    print(result)
    time.sleep(1)  # stay under rate limits
```
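A fixed one-second sleep works for light use. If you still hit 429s, back off exponentially between attempts instead. A sketch of the schedule (pure Python, no API call; the cap value is my choice):

```python
def backoff_delays(max_retries, base=2.0, cap=30.0):
    """Exponential backoff schedule: 1s, 2s, 4s, ... capped."""
    return [min(base ** attempt, cap) for attempt in range(max_retries)]

print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

On each 429 response, sleep the next delay in the list before retrying; the cap keeps a long outage from producing multi-minute waits.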
Exercise 2: Build a Robust API Caller

Difficulty: intermediate. Complete the `safe_inference` function. It should call the HF Inference API and handle 503 errors by retrying up to 3 times with exponential backoff (wait 2^attempt seconds). Return the text on success, or "FAILED" after all retries.

Starter code:

```python
import requests
import time

def safe_inference(api_url, headers, payload, max_retries=3):
    """Call HF API with retry for 503 errors."""
    for attempt in range(max_retries):
        response = requests.post(api_url, headers=headers, json=payload)
        # TODO: Check status code
        # If 503: wait 2^attempt seconds, then retry
        # If 200: return the generated text
        pass
    return "FAILED"

url = "https://api-inference.huggingface.co/models/gpt2"
hdrs = {"Authorization": "Bearer hf_your_token_here"}
data = {"inputs": "Hello world"}
print(safe_inference(url, hdrs, data))
```

Hint: check `response.status_code` — 503 means loading, 200 means success. Use `time.sleep(2 ** attempt)` for backoff.

Solution:

```python
import requests
import time

def safe_inference(api_url, headers, payload, max_retries=3):
    """Call HF API with retry for 503 errors."""
    for attempt in range(max_retries):
        response = requests.post(api_url, headers=headers, json=payload)
        if response.status_code == 200:
            return response.json()[0]["generated_text"]
        elif response.status_code == 503:
            time.sleep(2 ** attempt)
    return "FAILED"

url = "https://api-inference.huggingface.co/models/gpt2"
hdrs = {"Authorization": "Bearer hf_your_token_here"}
data = {"inputs": "Hello world"}
print(safe_inference(url, hdrs, data))
```

On status 200, we return the text. On 503, we wait with exponential backoff (1s, 2s, 4s). After 3 failed retries, we return "FAILED".
When NOT to Use Hugging Face Inference
Hugging Face isn’t always the right tool. Three cases where you should look elsewhere.
Ultra-low latency. Need sub-100ms responses? Self-host with vLLM or TensorRT-LLM. Network overhead alone adds 100-500ms per API call.
Closed-source models. GPT-4, Claude, and Gemini Pro aren’t on the Hub. For those, use their native APIs. Hugging Face is the open-source ecosystem.
Massive batch jobs. Processing millions of documents? Download the model. Run it on a GPU cluster. API pricing adds up fast at that scale.
Note: Hugging Face also offers Inference Endpoints. These are private GPU boxes for your model. Prices start at ~$0.06/hour (CPU) and ~$0.60/hour (GPU). They sit between the free API and full self-hosting.
Common Mistakes and How to Fix Them
Mistake 1: No max_new_tokens
Without this, some models generate until they hit context limits. That’s 2048, 4096, or even 8192 tokens. Long waits. High costs.
```python
# Wrong — runs until max context
result = generator("Tell me about Python.")

# Right — cap the output
result = generator("Tell me about Python.", max_new_tokens=100)
```
Mistake 2: Wrong task name format
Pipeline tasks use hyphens, not underscores. This catches many people.
```python
# Wrong — underscore causes an error
generator = pipeline("text_generation", model="gpt2")

# Right — use hyphens
generator = pipeline("text-generation", model="gpt2")
```
Mistake 3: Input too long for API
The free API has input limits. A 10,000-word doc won’t work. Split it into chunks.
```python
def chunk_text(text, max_words=500):
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = chunk_text(long_document)
summaries = [summarizer(chunk)[0]["summary_text"]
             for chunk in chunks]
full_summary = " ".join(summaries)
```
Summary
Hugging Face gives you two paths to LLM inference. The Pipeline API runs models on your machine — great for privacy, offline use, and production. The Inference API calls models over HTTP — perfect for prototyping and no-GPU setups.
Start with the API to experiment. Switch to local Pipeline for production. That’s the typical workflow.
Frequently Asked Questions
Is the Hugging Face Inference API free?
There’s a free tier with monthly credits. For heavier use, the Pro plan costs $9/month with extra credits and pay-as-you-go pricing. Dedicated Endpoints start at $0.06/hour.
Can I run Hugging Face models without an API key?
For local Pipeline, many models work without auth. Just call pipeline("text-generation", model="gpt2"). But gated models (like Llama) need you to accept their license and use a token. The Inference API always needs a key.
How do I choose between Pipeline and InferenceClient?
Use pipeline() to run models on your hardware. Use InferenceClient for serverless inference on HF’s GPUs. Same model, same results — just different hardware.
What’s the difference between the free API and Inference Endpoints?
The free API runs on shared GPUs with other users. It has rate limits and cold starts. Inference Endpoints give you a dedicated GPU — always warm, no rate limits. It’s shared hosting vs. a private server.
