Introduction to LLMs and the OpenAI API in Python

Start using the OpenAI API Python SDK today. Build your first LLM-powered app with runnable code for chat completions, streaming, and token management.

Written by Selva Prabhakaran | 24 min read


A large language model (LLM) predicts the next word in a sequence based on patterns it learned from massive amounts of text. The OpenAI API lets you tap into these models — like GPT-4o — directly from Python, so you can generate text, hold multi-turn chats, and stream replies in real time.

You’ve typed a question into ChatGPT and watched it write back something that sounds genuinely smart. But how does that actually work behind the scenes? And more to the point — how do you bring that same power into your own Python projects?

That’s what I’ll cover in this post. By the end, you’ll know what LLMs really do under the hood, and you’ll have working Python code that calls the OpenAI API to create text, manage back-and-forth chats, and stream output live.

What Is a Large Language Model?

Strip it down to one sentence: an LLM is a program that guesses what word comes next. Feed it “The capital of France is” and it bets on “Paris.”

What makes it “large”? Sheer size. GPT-4-class models have hundreds of billions of tunable weights (parameters), numbers adjusted during training on a massive sea of text. Books, blog posts, GitHub repos, forums: the model has absorbed patterns from all corners of the internet.

Here’s the part most people miss at first: the model doesn’t “know” facts the way you do. What it has are statistical habits — it learned which words tend to show up after which other words, and in what context. When it gives a correct answer, that’s because the pattern was strong enough in its training data.

KEY INSIGHT: An LLM doesn’t look up facts in a database. It predicts the most likely next token based on learned patterns. Keeping this in mind helps you see where the model is strong (producing fluent text) and where it stumbles (exact factual recall).

How Do LLMs Work: Tokens, Transformers, and Attention?

You don’t need to code a transformer from zero to use the API. But peeking under the hood will help you see why some calls cost more, why token caps exist, and why the model sometimes “forgets” earlier parts of a long chat.

What Are Tokens? The Model’s Alphabet

LLMs don’t process whole words — they work with tokens. A token is a bite-sized piece of text, roughly four characters of English on average. A common word like “cat” is a single token, while longer or rarer words are often split into two or more pieces.

Why does this matter to you? Two reasons: you pay per token, and every call has a cap on how many tokens fit. Longer prompts eat more tokens and run up your bill.
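To make the billing concrete, here’s a small sketch that turns token counts into a dollar estimate. The per-million-token rates below are illustrative placeholders, not quoted pricing:

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_rate=0.15, output_rate=0.60):
    """Estimate a call's cost in dollars. Rates are dollars per
    MILLION tokens -- illustrative numbers, check current pricing."""
    return (prompt_tokens * input_rate
            + completion_tokens * output_rate) / 1_000_000

# A 1,200-token prompt that gets a 300-token reply:
print(f"${estimate_cost(1_200, 300):.6f}")  # a small fraction of a cent
```

Individual calls are cheap; the arithmetic only starts to bite when you multiply by thousands of requests or very long prompts.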

The good news is you can count tokens before you send anything. OpenAI’s tiktoken library encodes text into the same tokens the model sees. Use encoding_for_model() to pick the right encoding, then .encode() to get a list of token IDs.

python
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o")
text = "What is machine learning?"
tokens = encoder.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")

Five tokens for four words plus a question mark — close to one-to-one this time. But that ratio won’t always hold. Unusual or long words often get chopped into two or more tokens.

The Transformer Architecture (30-Second Tour)

The transformer is the neural network behind every modern LLM. Here’s a bird’s-eye view of how it works, step by step:

  1. Embedding — Each token turns into a vector — a list of numbers that captures meaning. “King” and “queen” end up near each other in this number space; “king” and “bicycle” end up far apart.

  2. Attention — The big idea. The model scans all tokens at once and asks: “Which other tokens matter most for guessing the next one?” In “The cat sat on the mat because it was tired,” attention links “it” to “cat,” not “mat.”

  3. Feed-forward layers — Once attention has mapped the key links, small neural nets at each spot extract deeper patterns.

  4. Stack and repeat — Steps 2 and 3 run many times in a row. Each round adds more depth to the model’s grasp of the input.

  5. Pick the next token — A final layer ranks every token in the vocab. The model grabs the top pick (or samples from the best few, based on settings you control).

UNDER THE HOOD: Attention looks at every token at once, so compute scales roughly with the square of input length. A 4,000-token prompt costs much more than a 1,000-token one. Writing shorter prompts is a direct way to cut your bill.
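To make that quadratic cost concrete, here’s a tiny sketch counting the pairwise comparisons attention performs (a simplification that ignores constant factors):

```python
def attention_pairs(n_tokens: int) -> int:
    """Rough count of pairwise comparisons in full self-attention:
    every token attends to every token, so work grows as n^2."""
    return n_tokens * n_tokens

short_prompt = attention_pairs(1_000)   # 1,000,000 comparisons
long_prompt = attention_pairs(4_000)    # 16,000,000 comparisons
print(f"A 4x longer prompt does {long_prompt // short_prompt}x the attention work")
```

Four times the tokens, sixteen times the work — trimming prompts pays off more than linearly.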

Now that you have the mental model, let’s get your setup ready and make a live API call.

How Do You Set Up the OpenAI Python SDK?

You need three things before you write any code: Python on your machine, the OpenAI package, and an API key.

Prerequisites

  • Python version: 3.9+
  • Required library: openai (1.0+)
  • Install: pip install openai tiktoken
  • Time to complete: 15-20 minutes

How Do You Get Your API Key?

Head over to platform.openai.com and create a fresh key. Copy it on the spot — you won’t get a second look at it.

Store it as an environment variable. Hard-coding keys into scripts is a recipe for trouble.

bash
# On macOS/Linux
export OPENAI_API_KEY="sk-your-key-here"

# On Windows (Command Prompt)
set OPENAI_API_KEY=sk-your-key-here

# On Windows (PowerShell)
$env:OPENAI_API_KEY="sk-your-key-here"

Another option: put the key in a .env file inside your project folder and load it with python-dotenv:

python
# .env file (add this to .gitignore!)
# OPENAI_API_KEY=sk-your-key-here

from dotenv import load_dotenv
load_dotenv()  # loads .env into environment variables

WARNING: Never commit API keys to git. Add .env to .gitignore. A leaked key means anyone can run calls on your account — and you pay for every one.

The SDK reads OPENAI_API_KEY from the environment for you. No need to type it into your code:

python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

Done. The client object is your gateway to every OpenAI model.

How Do You Make Your First API Call?

Here’s where it all comes together. You call client.chat.completions.create(), hand it a model name and a list of messages, and the model sends back a completion.

python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "What is Python used for?"}
    ]
)

print(response.choices[0].message.content)

You’ll get a slightly different reply each run — LLMs are not deterministic by default. Here’s what each part means:

  • model="gpt-4o-mini" — A quick, low-cost model. Ideal while you’re learning.
  • messages — A list of dicts, each carrying a role and content.
  • response.choices[0] — The API can return multiple options. We grab the first.
  • .message.content — The raw text the model wrote.

But there’s more inside the response object than just the reply text. Let’s dig in so you see what you get back — and what shows up on your bill.

python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in French."}]
)

print(f"Model used: {response.model}")
print(f"Response text: {response.choices[0].message.content}")
print(f"Finish reason: {response.choices[0].finish_reason}")
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")

| Field | What It Means |
| --- | --- |
| response.model | The exact model version that handled your request |
| choices[0].message.content | The text the model generated |
| choices[0].finish_reason | Why the model stopped: "stop" (natural end) or "length" (hit the token cap) |
| usage.prompt_tokens | Tokens in your input (you pay for these) |
| usage.completion_tokens | Tokens the model produced (you also pay for these) |
| usage.total_tokens | Both added together — this is your billing number |

KEY INSIGHT: You pay for tokens on both sides — input and output. A 500-token prompt that gets a 500-token reply costs the same total as a 900-token prompt that gets a 100-token reply. Keep prompts lean, and set max_tokens when you don’t need long answers.

Quick check: Before you read on, guess what finish_reason would say if you set max_tokens=5 on a question that needs a full paragraph. (Answer: "length" — the model was forced to stop before it could finish.)

Exercise 1: Make Your First API Call (beginner)

Task: Use the OpenAI client to ask the model "Explain what an API is in one sentence." Print just the response text. Use the model "gpt-4o-mini".

Starter code:

python
from openai import OpenAI

client = OpenAI()

# Make your API call here
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # Add your message here
    ]
)

# Print the response content
print(___)

Hints:

  • The message should have role "user" and content "Explain what an API is in one sentence."
  • Print response.choices[0].message.content to get the text.

Solution:

python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Explain what an API is in one sentence."}
    ]
)

print(response.choices[0].message.content)

Why it works: the user message goes inside the messages list, the model returns a ChatCompletion object, and the generated text lives at response.choices[0].message.content.

What Are Messages, Roles, and How Do You Control the Model?

The real power of the API lives in the messages list. Each message has a role — a label that tells the model who said what. Three roles cover nearly every use case:

system — Hidden stage directions. You set the tone, guardrails, and persona here. The end user never sees this message, but the model obeys it throughout the chat.

user — That’s you (or your app). It holds the question, prompt, or task you want a reply to.

assistant — Earlier replies from the model. Include these when you want a multi-turn chat so the model keeps context across turns.

Let me show you what a system message does in practice. We’ll instruct the model to play the role of a Python tutor that keeps answers short.

python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a Python tutor. Give short, clear answers in 2-3 sentences max."
        },
        {
            "role": "user",
            "content": "What's the difference between a list and a tuple?"
        }
    ]
)

print(response.choices[0].message.content)

Take that system message away and you’ll get a longer, less focused reply. With it, the model sticks to the rules you set.

TIP: System messages are the best lever you have for output quality. A sharp prompt like “You are a data analyst. Respond only with pandas code. No prose unless asked.” will crush a vague instruction every time.

What Do temperature, max_tokens, and top_p Control?

The create() call accepts a handful of knobs that change the way the model writes. I’ll focus on the three you’ll use most.

temperature is the randomness dial. It goes from 0 to 2. Low values produce tight, repeatable text. High values unlock more surprise and variety.

python
response_low = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Name a fruit."}],
    temperature=0
)
print(f"temp=0: {response_low.choices[0].message.content}")

response_high = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Name a fruit."}],
    temperature=1.5
)
print(f"temp=1.5: {response_high.choices[0].message.content}")

At zero, you’ll almost always see the same fruit. At 1.5, expect wild cards like “Rambutan” or “Persimmon.”

My rule of thumb: temperature=0 for factual jobs (code, data pulls) and 0.7–1.0 for creative jobs (brainstorming, copy).

max_tokens puts a hard limit on how long the reply can be. If the model runs out of room, it stops mid-thought. The telltale sign is finish_reason = "length" instead of "stop".

top_p (nucleus sampling) is a second knob for randomness. It tells the model to sample only from the smallest set of tokens whose combined probability reaches p — top_p=0.1 restricts sampling to the tokens making up the top 10% of probability mass. Pick one knob: either temperature or top_p. OpenAI’s advice is to tweak one and leave the other at its default.
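Under the hood, temperature divides the model’s raw scores (logits) before they’re converted to probabilities. A toy sketch with a made-up four-token vocabulary shows why low temperature means repeatable picks:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities. Lower temperature sharpens
    the distribution (top token dominates); higher flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 2.0, 1.0]             # imaginary scores for four tokens
for t in (0.2, 1.0, 1.5):
    top = softmax_with_temperature(logits, t)[0]
    print(f"temperature={t}: top token gets {top:.0%} of the probability")
```

At temperature 0.2 the top token takes almost all the probability mass, so sampling nearly always returns it; at 1.5 the alternatives get real odds, which is where the “Rambutan” surprises come from.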

Exercise 2: Craft a System Message (beginner)

Task: Create a system message that instructs the model to act as a SQL expert who responds only with SQL queries (no explanations). Then ask it to write a query that selects all users older than 25 from a "users" table. Print the response.

Starter code:

python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": ___  # Your system message here
        },
        {
            "role": "user",
            "content": ___  # Your question here
        }
    ]
)

print(response.choices[0].message.content)

Hints:

  • Set the system message to something like "You are a SQL expert. Respond only with SQL queries, no explanations."
  • Set the user content to "Write a query to select all users older than 25 from a users table."

Solution:

python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a SQL expert. Respond only with SQL queries, no explanations."
        },
        {
            "role": "user",
            "content": "Write a query to select all users older than 25 from a users table."
        }
    ]
)

print(response.choices[0].message.content)

Why it works: the system message constrains the model to return only queries, so it responds with a SQL SELECT statement filtering by age > 25.

How Do You Build Multi-Turn Conversations?

Here’s something that surprises many people: every API call starts with a blank slate. The model has zero memory of what you asked before. To create a conversation, you must ship the entire chat log in messages on every call.

The recipe: after the model replies, tack its response onto the list, add the user’s next message, and send the whole thing again. Let me show you.

python
conversation = [
    {
        "role": "system",
        "content": "You are a helpful cooking assistant."
    },
    {
        "role": "user",
        "content": "How do I make scrambled eggs?"
    }
]

# First turn
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=conversation
)
assistant_reply = response.choices[0].message.content
print(f"Assistant: {assistant_reply}\n")

# Add the assistant's reply to history
conversation.append({"role": "assistant", "content": assistant_reply})

# Second turn — the model now has context
conversation.append({"role": "user", "content": "What cheese goes best with that?"})

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=conversation
)
print(f"Assistant: {response.choices[0].message.content}")

That second reply works because the model sees the whole thread above it. It can tell “that” refers to scrambled eggs. Strip the history and “that” means nothing. This is one of the trickiest parts of the API — you manage the chat state. Drop a message, and the model loses its place.

Quick thought experiment: What if you deleted the assistant entry before the second call? The model would have no clue which dish you mean, so you’d get a generic cheese tip instead of one matched to eggs.

WARNING: Every message in the history eats tokens. A 50-turn chat ships all 50 messages on each new call. For long chats, you’ll need to prune old entries or condense them into a summary so you don’t blow through the token cap.
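A minimal sketch of that pruning idea: keep the system message plus the newest turns that fit under a budget. It uses a rough 4-characters-per-token estimate; swap in tiktoken for exact counts:

```python
def prune_history(conversation, max_tokens=1_000):
    """Keep system messages plus as many of the most recent turns as fit
    under a rough token budget (about 4 characters per token)."""
    def rough_tokens(msg):
        return len(msg["content"]) // 4 + 4   # +4 for per-message overhead

    system = [m for m in conversation if m["role"] == "system"]
    rest = [m for m in conversation if m["role"] != "system"]

    budget = max_tokens - sum(rough_tokens(m) for m in system)
    kept = []
    for msg in reversed(rest):                # walk newest to oldest
        cost = rough_tokens(msg)
        if cost > budget:
            break                             # oldest turns get dropped
        budget -= cost
        kept.append(msg)
    return system + list(reversed(kept))      # restore chronological order
```

Call it right before each create() call. The trade-off: dropped turns are gone for good, so for very long chats you may prefer summarizing old turns instead of discarding them.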

How Do Streaming and Error Handling Work?

Before you ship anything to users, you need two more skills: streaming (so the UI feels fast) and error handling (so your app doesn’t blow up when the API hiccups).

How Does Streaming Work?

Without streaming, the API holds the full reply until the model is done writing. For long answers, that wait feels slow. Streaming fixes this — it sends tokens to you the moment they’re ready, just like the ChatGPT typing effect.

All you do is set stream=True and loop over the result. Each chunk carries a tiny slice of the output.

python
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a haiku about Python programming."}
    ],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)

print()  # newline at the end

Notice: it’s delta.content here, not message.content. The delta only carries what’s new since the last chunk. Some chunks have None (just metadata), which is why we check before printing.

Does streaming save money? No — same tokens, same price. But the user’s feel is night and day. I turn on streaming for every user-facing feature. Text that flows in word by word feels alive; a three-second blank screen followed by a wall of text feels broken.
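One practical detail: you usually want the complete reply at the end as well as the live printout. The pattern is to accumulate deltas while printing. This sketch simulates the stream with plain strings so it runs without an API key; with the real API you’d feed it chunk.choices[0].delta.content values instead:

```python
def consume_stream(deltas):
    """Print each delta as it arrives and return the assembled reply.
    None entries stand in for metadata-only chunks."""
    parts = []
    for content in deltas:
        if content is not None:
            print(content, end="", flush=True)
            parts.append(content)
    print()                        # final newline
    return "".join(parts)

# Simulated chunks; a real loop would iterate over the stream object
full_text = consume_stream(["Code ", "flows ", None, "like ", "water."])
```

Keeping the assembled text matters when you also need to append the reply to a multi-turn conversation history.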

How Should You Handle API Errors?

The SDK raises typed exceptions, which makes catching errors clean. These are the four you’ll see most often:

python
from openai import (
    OpenAI,
    AuthenticationError,
    RateLimitError,
    APIConnectionError,
    BadRequestError
)

client = OpenAI()

try:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(response.choices[0].message.content)

except AuthenticationError:
    print("Invalid API key. Check your OPENAI_API_KEY.")

except RateLimitError:
    print("Rate limit hit. Wait a moment and retry.")

except APIConnectionError:
    print("Can't reach OpenAI servers. Check your internet.")

except BadRequestError as e:
    print(f"Bad request: {e}")

| Error | Common Cause | Fix |
| --- | --- | --- |
| AuthenticationError | Wrong or expired API key | Regenerate key at platform.openai.com |
| RateLimitError | Too many requests per minute | Add retry logic with exponential backoff |
| APIConnectionError | Network issue or OpenAI outage | Check connectivity, retry after a few seconds |
| BadRequestError | Invalid model name, too many tokens | Check your parameters against the docs |

In production, you want auto-retries. The tenacity library wraps exponential backoff in a single decorator:

python
# pip install tenacity
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(3))
def call_openai(messages):
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )

That decorator makes up to 3 attempts in total, waiting 1s after the first failure and 2s after the second (the wait doubles each round, capped at 60s). Most rate limit blips clear in seconds.
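If you’d rather skip the dependency, the same backoff is a few lines of plain Python. A sketch with a generic except clause so it runs standalone; in real code, catch RateLimitError and APIConnectionError specifically:

```python
import time

def with_backoff(fn, max_attempts=3, base_wait=1.0, max_wait=60.0):
    """Call fn(), retrying failed calls with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise                                    # out of retries
            wait = min(base_wait * 2 ** (attempt - 1), max_wait)
            time.sleep(wait)                             # 1s, 2s, 4s, ...

# Usage: with_backoff(lambda: client.chat.completions.create(...))
```

The lambda wrapper lets you retry any call without decorating each function, at the cost of slightly noisier call sites.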

How Do You Choose the Right Model?

OpenAI has a lineup of models. Which one fits depends on what you’re building, how much you want to spend, and how fast you need a reply.

| Model | Best For | Speed | Cost | Context Window |
| --- | --- | --- | --- | --- |
| gpt-4o-mini | Prototyping, simple tasks, high volume | Fast | Lowest | 128K tokens |
| gpt-4o | Complex reasoning, coding, analysis | Medium | Moderate | 128K tokens |
| gpt-4.1 | Production apps, nuanced tasks | Medium | Higher | 1M tokens |
| gpt-4.1-mini | Balanced performance and cost | Fast | Low | 1M tokens |

My advice for beginners: start on gpt-4o-mini. It’s cheap, it’s quick, and it handles most learning tasks with ease. Move to a bigger model once you need deeper reasoning or a wider context window.

OpenAI also offers embeddings (text → numeric vectors for search) and image generation (DALL-E). Those hit different endpoints but share the same SDK and auth you set up above. We’ll tackle them in future posts.

NOTE: OpenAI rolled out the Responses API in 2025 as a simpler option alongside Chat Completions. Here’s a side-by-side look:

| Feature | Chat Completions | Responses API |
| --- | --- | --- |
| Input format | messages array (roles required) | input string or messages |
| Output access | response.choices[0].message.content | response.output_text |
| Built-in tools | None | Web search, file search, code interpreter |
| Multi-turn state | Manual (pass full history) | Automatic with store=True |
| Best for | Learning, existing integrations | New projects, agentic workflows |

My take: learn Chat Completions first. It’s what most tutorials, LangChain libraries, and live codebases rely on right now. But keep an eye on the Responses API — that’s the direction OpenAI is moving.

WARNING: When NOT to use the OpenAI API. The API isn’t right for every job. Skip it when: (1) you need replies in under 100ms — API round trips add 500ms–2s at minimum; (2) your data can’t leave your network — look at self-hosted models like Llama or Mistral instead; (3) you need guaranteed identical output — even at temperature=0, replies can shift slightly between API versions; (4) cost matters at huge scale — millions of calls per day may make hosting your own model cheaper.

What Are the Most Common Mistakes (and How Do You Fix Them)?

Mistake 1: Forgetting to Set the API Key

python
# ❌ Wrong — no API key configured
from openai import OpenAI
client = OpenAI(api_key="")  # empty string

The call fails with: AuthenticationError: No API key provided.

python
# ✅ Correct — set the environment variable first
# export OPENAI_API_KEY="sk-your-actual-key"
from openai import OpenAI
client = OpenAI()  # reads from environment

Mistake 2: Using the Old SDK Syntax

The SDK got a ground-up rewrite at version 1.0. Any tutorial written before late 2023 uses the old API and won’t run with the current package.

python
# ❌ Old syntax (pre v1.0) — this will error
import openai
openai.api_key = "sk-..."
response = openai.ChatCompletion.create(...)

python
# ✅ Current syntax (v1.0+)
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(...)

If you see AttributeError: module 'openai' has no attribute 'ChatCompletion', you’re mixing old code with the new SDK.

Mistake 3: Ignoring Token Limits

Send a 200,000-token prompt to a model with a 128K window and you’ll get a BadRequestError. Check your input size before hitting send.

python
# ✅ Check token count before sending
import tiktoken

def count_tokens(messages, model="gpt-4o-mini"):
    """Approximate token count for a messages list."""
    encoder = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        # +4 accounts for message formatting overhead
        total += len(encoder.encode(msg["content"])) + 4
    return total
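If tiktoken isn’t installed, or you just need a ballpark before a request, a character-based estimate is enough to catch gross overruns. A sketch using the rough four-characters-per-token rule:

```python
def rough_token_count(messages, chars_per_token=4, per_message_overhead=4):
    """Ballpark token count with no dependencies. Accurate enough to
    catch 'way past the context window' before you send the request."""
    return sum(
        len(msg["content"]) // chars_per_token + per_message_overhead
        for msg in messages
    )

msgs = [{"role": "user", "content": "What is Python used for?"}]
print(rough_token_count(msgs))  # 24 chars // 4 + 4 overhead = 10
```

Use the precise tiktoken version when you’re close to the limit; use this when you just need to know whether you’re wildly over.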

Mistake 4: Not Handling Streaming Correctly

python
# ❌ Wrong — treating stream like a regular response
stream = client.chat.completions.create(..., stream=True)
print(stream.choices[0].message.content)  # AttributeError!

python
# ✅ Correct — iterate over chunks
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

How Do You Put It All Together? A Code Review Assistant

Time to tie it all together in a hands-on project. We’ll build an assistant that takes a Python function, analyzes it, and flags improvements.

Below, review_code() sends a system message that casts the model as a senior Python dev. The user message carries the code. I set temperature=0.3 for sharp, focused feedback and cap output at 500 tokens to keep reviews concise.

python
from openai import OpenAI

client = OpenAI()

def review_code(code_snippet):
    """Send a code snippet to GPT for review."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a senior Python developer. "
                    "Review the code for bugs, style issues, "
                    "and performance. Be specific and concise."
                )
            },
            {
                "role": "user",
                "content": f"Review this Python code:\n\n{code_snippet}"
            }
        ],
        temperature=0.3,
        max_tokens=500
    )
    return response.choices[0].message.content

Now let’s feed it a function that has a few deliberate flaws — range(len()) instead of direct iteration, no empty-list guard, and a missed sum() builtin:

python
sample_code = """
def get_avg(numbers):
    total = 0
    for i in range(len(numbers)):
        total = total + numbers[i]
    avg = total / len(numbers)
    return avg
"""

review = review_code(sample_code)
print(review)

The model catches all three problems — the clunky loop, the missing empty-list check, and the ignored sum() builtin. Twenty lines of code and you have a working review tool. That’s the payoff of pairing a solid system message with the right temperature and a focused prompt.

Summary

Let me recap what you now know:

  • LLMs guess the next token by drawing on patterns from training data — they don’t “understand” the way a human does.
  • Tokens are the unit of both cost and capacity. You pay per token, and each model has a context window cap.
  • The OpenAI Python SDK gives you a single call — client.chat.completions.create() — to interact with any model.
  • Messages use three roles: system (rules), user (input), assistant (past replies).
  • Temperature tunes randomness. Set it to 0 for exact tasks, 0.7–1.0 for creative work.
  • Multi-turn chats work by shipping the full history on each call.
  • Streaming pushes tokens to the user as they arrive, making the app feel fast.
  • Always add error handling with retries before going to production.

Practice Exercise

Build a terminal-based chatbot that keeps context alive across turns. The user types, the bot replies, and the thread grows with each exchange.

Solution:
python
from openai import OpenAI

client = OpenAI()

conversation = [
    {"role": "system", "content": "You are a helpful assistant. Keep answers concise."}
]

print("Chatbot ready! Type 'quit' to exit.\n")

while True:
    user_input = input("You: ")
    if user_input.lower() == "quit":
        break

    conversation.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=conversation
    )

    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    print(f"Bot: {reply}\n")

This exercise puts the multi-turn pattern into practice. The conversation list expands on every exchange, feeding the model the full thread each time.

Complete Code

Full script (copy-paste and run):
python
# Complete code from: Introduction to LLMs and the OpenAI API in Python
# Requires: pip install openai tiktoken
# Python 3.9+

from openai import OpenAI
import tiktoken

# --- Section 1: Token Counting ---
encoder = tiktoken.encoding_for_model("gpt-4o")
text = "What is machine learning?"
tokens = encoder.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")

# --- Section 2: First API Call ---
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "What is Python used for?"}
    ]
)
print(f"\nFirst call: {response.choices[0].message.content}")

# --- Section 3: Response Object ---
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in French."}]
)
print(f"\nModel: {response.model}")
print(f"Text: {response.choices[0].message.content}")
print(f"Finish reason: {response.choices[0].finish_reason}")
print(f"Total tokens: {response.usage.total_tokens}")

# --- Section 4: System Message ---
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a Python tutor. Give short, clear answers in 2-3 sentences max."
        },
        {
            "role": "user",
            "content": "What's the difference between a list and a tuple?"
        }
    ]
)
print(f"\nWith system message: {response.choices[0].message.content}")

# --- Section 5: Temperature ---
response_low = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Name a fruit."}],
    temperature=0
)
print(f"\ntemp=0: {response_low.choices[0].message.content}")

# --- Section 6: Multi-Turn Conversation ---
conversation = [
    {"role": "system", "content": "You are a helpful cooking assistant."},
    {"role": "user", "content": "How do I make scrambled eggs?"}
]

response = client.chat.completions.create(
    model="gpt-4o-mini", messages=conversation
)
assistant_reply = response.choices[0].message.content
print(f"\nAssistant: {assistant_reply}")

conversation.append({"role": "assistant", "content": assistant_reply})
conversation.append({"role": "user", "content": "What cheese goes best with that?"})

response = client.chat.completions.create(
    model="gpt-4o-mini", messages=conversation
)
print(f"Assistant: {response.choices[0].message.content}")

# --- Section 7: Streaming ---
print("\nStreaming response:")
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a haiku about Python programming."}
    ],
    stream=True
)
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
print()

# --- Section 8: Code Review Assistant ---
def review_code(code_snippet):
    """Send a code snippet to GPT for review."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a senior Python developer. "
                    "Review the code for bugs, style issues, "
                    "and performance. Be specific and concise."
                )
            },
            {
                "role": "user",
                "content": f"Review this Python code:\n\n{code_snippet}"
            }
        ],
        temperature=0.3,
        max_tokens=500
    )
    return response.choices[0].message.content

sample_code = """
def get_avg(numbers):
    total = 0
    for i in range(len(numbers)):
        total = total + numbers[i]
    avg = total / len(numbers)
    return avg
"""

print(f"\nCode Review:\n{review_code(sample_code)}")

print("\nScript completed successfully.")

Frequently Asked Questions

How much does the OpenAI API cost?

Pricing varies by model and token volume. In 2025, gpt-4o-mini costs roughly $0.15 per million input tokens and $0.60 per million output tokens. A hands-on learning session with a few dozen calls runs just a few cents. See openai.com/pricing for current numbers.

What’s the difference between ChatGPT and the OpenAI API?

ChatGPT is a consumer app with a web UI. The API opens up the same models to your code. You can set system messages, tweak parameters, manage the chat thread, and wire the model into your own software — things the ChatGPT interface doesn’t allow.

Do I need a GPU to use the OpenAI API?

Nope. The heavy math runs on OpenAI’s servers. Your machine just ships text out and gets text back. A laptop that can run Python and hit the web is all you need.

Can I use other LLM providers with the same code?

Yes — many providers (Anthropic, Google, Mistral, local models via Ollama) expose endpoints that match OpenAI’s format. Most of the time you just swap the base URL and key. Libraries like LiteLLM wrap all of them behind one unified call.
