Python Regular Expressions Tutorial and Examples: A Simplified Guide

Regular expressions, also called regex, are a syntax, or rather a small language, for searching, extracting, and manipulating specific string patterns in a larger text. In Python, they are implemented in the re module. You will first be introduced to the five main features of the re module and then see how to create common regex patterns in Python.

Written by Selva Prabhakaran | 10 min read

Regular expressions (regex) let you search, extract, and manipulate text patterns in Python. They power text validation, NLP projects, and text mining workflows.

Regular Expressions in Python: A Simplified Tutorial. Photo by Sarah Crutchfield.

Contents

  1. Introduction to regular expressions
  2. What is a regex pattern and how to compile one?
  3. How to split a string separated by a regex?
  4. Finding pattern matches using findall, search and match
    4.1. What does re.findall() do?
    4.2. re.search() vs re.match()
  5. How to substitute one text with another using regex?
  6. Regex groups
  7. What is greedy matching in regex?
  8. Most common regular expression syntax and patterns
  9. Regular Expressions Examples
    9.1. Any character except for a new line
    9.2. A period
    9.3. Any digit
    9.4. Anything but a digit
    9.5. Any character, including digits
    9.6. Anything but a character
    9.7. Collection of characters
    9.8. Match something up to ‘n’ times
    9.9. Match 1 or more occurrences
    9.10. Match any number of occurrences (0 or more times)
    9.11. Match exactly zero or one occurrence
    9.12. Match word boundaries
  10. Practice Exercises
  11. Conclusion

1. Introduction to regular expressions

Python implements regex through its built-in re module. You will find regex in NLP, web apps that validate input (like email addresses), and most text mining projects.

This tutorial has two parts. First, you will learn the five main features of the re module. Then you will see how to build common regex patterns in Python.

By the end, you will know how to construct almost any string pattern you need for text mining work.

2. What is a regex pattern and how to compile one?

A regex pattern is a special language that represents generic text, numbers, or symbols. You use it to extract text that fits the pattern.

Take the basic example '\s+'. Here '\s' matches any single whitespace character. The '+' at the end makes it match one or more of them in a row.

Because '\s' covers all whitespace, this pattern also matches tab ('\t') and newline ('\n') characters, not just spaces. A full list of regex patterns appears at the end of this post. First, let’s see how to compile and use regular expressions.

import re
regex = re.compile(r'\s+')  # raw string: the backslash reaches the regex engine unchanged
print("Compiled:", regex)
print("Pattern:", regex.pattern)

The code above imports re and compiles a pattern that matches one or more whitespace characters.
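
A compiled pattern is an object you can reuse across several operations. Here is a minimal sketch of that idea. The names whitespace and sample are illustrative and not part of the original example.

```python
import re

# Raw string (r'...') so Python does not try to interpret '\s' itself.
whitespace = re.compile(r'\s+')

# Illustrative sample text: two spaces, a tab, and a newline.
sample = "one  two\tthree\nfour"

# The same compiled object exposes findall, split and sub.
runs = whitespace.findall(sample)        # every run of whitespace
words = whitespace.split(sample)         # the pieces between the runs
collapsed = whitespace.sub(' ', sample)  # each run replaced by one space

print(runs)
print(words)
print(collapsed)
```

The sections that follow look at split, findall, and sub one at a time.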

3. How to split a string separated by a regex?

Consider this piece of text with three course items:

import re

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

print(text)

Each line follows the format: “[Course Number] [Course Code] [Course Name]”. The spacing between words is uneven. You want to split these into individual words.

You can split in two ways:

1. Use the re.split method.
2. Call the split method on a compiled regex object.

import re

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

regex = re.compile(r'\s+')

# Both methods produce the same result
print(re.split(r'\s+', text))
print(regex.split(text))

Both methods work. Which should you use? If you plan to reuse a pattern, compile it first. That avoids recompiling the same pattern over and over.
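
Splitting the whole text gives one flat list. If you would rather keep each course as its own row, one possible approach is to split line by line with the same compiled pattern. This is a sketch, and the names rows and first_split are illustrative.

```python
import re

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

# Compile once, then reuse the pattern for every line.
regex = re.compile(r'\s+')

# One list of fields per course line.
rows = [regex.split(line) for line in text.splitlines()]
print(rows)

# maxsplit=1 splits only at the first run of whitespace.
first_split = regex.split("101 COM    Computers", maxsplit=1)
print(first_split)
```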

4. Finding pattern matches using findall, search and match

Suppose you want to extract only the course numbers (101, 205, 189) from the text above. How do you do that?

4.1. What does re.findall() do?

import re

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

print(text)
print()

regex_num = re.compile(r'\d+')
print(regex_num.findall(text))

The special character '\d' matches any digit. Adding '+' requires at least one digit to be present.

There is also a '*' symbol. It requires zero or more digits, which makes the digit optional for a match. More on this later.

The findall method extracts all occurrences and returns them as a list.
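
To see why '+' and '*' differ, compare them on the same input. This is a small sketch on an illustrative string that is not part of the course text. Because '\d*' is allowed to match zero digits, findall also reports empty strings at positions where no digit is present.

```python
import re

sample = 'a1b22'  # illustrative input

plus_matches = re.findall(r'\d+', sample)  # at least one digit required
star_matches = re.findall(r'\d*', sample)  # zero digits also counts as a match

print(plus_matches)
print(star_matches)

# Dropping the empty strings leaves the same digits that '\d+' found.
non_empty = [m for m in star_matches if m]
print(non_empty)
```

In practice '\d+' is usually what you want with findall, since the empty matches from '\d*' rarely carry information.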

4.2. re.search() vs re.match()

regex.search() looks for the pattern anywhere in the text. Unlike findall, it returns a match object with the start and end positions of the first match.

regex.match() also returns a match object. The difference? It only checks the beginning of the text.

import re

text2 = """COM Computers 205 MAT Mathematics 189"""

regex_num = re.compile(r'\d+')
s = regex_num.search(text2)
print('Starting Position: ', s.start())
print('Ending Position: ', s.end())
print(text2[s.start():s.end()])

You can also get the matched text using the group() method:

import re

text2 = """COM Computers 205 MAT Mathematics 189"""

regex_num = re.compile(r'\d+')
s = regex_num.search(text2)
print(s.group())

Now see what happens with match() when the text does not start with a digit:

import re

text2 = """COM Computers 205 MAT Mathematics 189"""

regex_num = re.compile(r'\d+')
m = regex_num.match(text2)
print(m)

It returns None because there is no digit at the start of the string.
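
Since both search() and match() return None when nothing matches, calling .group() directly on the result can raise an AttributeError. A common way to guard against that is to test the match object first. The helper names first_number and leading_number below are hypothetical and used only for illustration.

```python
import re

regex_num = re.compile(r'\d+')

def first_number(text):
    """Return the first run of digits anywhere in text, or None."""
    m = regex_num.search(text)
    return m.group() if m else None

def leading_number(text):
    """Return the run of digits at the very start of text, or None."""
    m = regex_num.match(text)
    return m.group() if m else None

found_anywhere = first_number("COM Computers 205")    # search scans the whole string
found_at_start = leading_number("COM Computers 205")  # match only looks at position 0
found_leading = leading_number("205 MAT")

print(found_anywhere, found_at_start, found_leading)
```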

5. How to substitute one text with another using regex?

Use regex.sub() to replace text. Consider this version of the courses text with extra tabs after each course code:

import re

text = """101 COM \t Computers
205 MAT \t Mathematics
189 ENG \t English"""

print(text)

You want to collapse all extra spaces into a single space and put everything on one line. Use regex.sub to replace '\s+' with a single space:

import re

text = """101 COM \t Computers
205 MAT \t Mathematics
189 ENG \t English"""

regex = re.compile(r'\s+')
print(regex.sub(' ', text))
print()
print(re.sub(r'\s+', ' ', text))

What if you want to keep course entries on separate lines but remove extra spaces within each line? Use a negative lookahead (?!\n). It prevents a match from starting at a newline character. In this text every newline directly follows a word, so the line breaks are left alone:

import re

text = """101 COM \t Computers
205 MAT \t Mathematics
189 ENG \t English"""

regex = re.compile(r'((?!\n)\s+)')
print(regex.sub(' ', text))
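
The lookahead only checks the position where a match starts. If a line has trailing spaces before its newline, the whitespace run starts at a space and then swallows the newline as well. One alternative, shown here as a sketch rather than as the method used above, is the character class [^\S\n], which means "whitespace, but not a newline". The sample text with trailing spaces is an assumption made for this illustration.

```python
import re

# Illustrative text: note the two trailing spaces before the newline.
text = "101 COM \t Computers  \n205 MAT \t Mathematics"

lookahead = re.compile(r'(?!\n)\s+')     # match may not START at a newline
no_newline_ws = re.compile(r'[^\S\n]+')  # match may not CONTAIN a newline

with_lookahead = lookahead.sub(' ', text)
with_char_class = no_newline_ws.sub(' ', text)

print(repr(with_lookahead))   # the newline is lost here
print(repr(with_char_class))  # the newline survives here
```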

6. Regex groups

Regex groups let you extract matched parts as separate items. Suppose you want the course number, code, and name as individual pieces.

Without groups, you need three separate patterns:

import re

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

# Extract each part separately
print(re.findall('[0-9]+', text))
print(re.findall('[A-Z]{3}', text))
print(re.findall('[A-Za-z]{4,}', text))

Here is what each pattern does:

[0-9]+ matches one or more digits (0 through 9). If you know the number has exactly 3 digits, use [0-9]{3} instead.

[A-Z]{3} matches exactly three uppercase letters A through Z.

[A-Za-z]{4,} matches four or more letters (upper or lower case). This assumes course names have at least 4 characters.

That took three separate calls. Regex groups offer a better way. Place each part inside parentheses () within a single pattern:

import re

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})'
print(re.findall(course_pattern, text))

Each set of parentheses creates a group. The result is a list of tuples with the matched groups.
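
Tuples work, but you have to remember which position holds which part. Python's re module also supports named groups with the (?P<name>...) syntax. The sketch below uses the same pattern with names added. The group names number, code, and name are chosen here for illustration.

```python
import re

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

# Same pattern as above, with a name attached to each group.
course_pattern = re.compile(
    r'(?P<number>[0-9]+)\s*(?P<code>[A-Z]{3})\s*(?P<name>[A-Za-z]{4,})'
)

# finditer yields match objects; groupdict() maps group names to matched text.
courses = [m.groupdict() for m in course_pattern.finditer(text)]

for course in courses:
    print(course['number'], course['code'], course['name'])
```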

7. What is greedy matching in regex?

By default, regex is greedy. It grabs as much text as possible while still matching the pattern.

Here is an example with HTML tags:

import re

text = "< body>Regex Greedy Matching Example < /body>"
print(re.findall('<.*>', text))

Instead of stopping at the first >, regex grabbed the entire string. That is greedy behavior: “take it all.”

Lazy matching does the opposite: “take as little as possible.” Add ? right after the quantifier (here, after the *) to make it lazy:

import re

text = "< body>Regex Greedy Matching Example < /body>"
print(re.findall('<.*?>', text))

To get only the first match, use search instead of findall:

import re

text = "< body>Regex Greedy Matching Example < /body>"
print(re.search('<.*?>', text).group())
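
Another common way to stop at the first > is a negated character class: [^>]* matches any run of characters that are not >. The sketch below compares the three variants on an illustrative string that is not from the example above.

```python
import re

text = "<b>bold</b> and <i>italic</i>"  # illustrative input

greedy = re.findall(r'<.*>', text)      # runs to the LAST '>' in the string
lazy = re.findall(r'<.*?>', text)       # stops at the first '>' after each '<'
negated = re.findall(r'<[^>]*>', text)  # cannot cross a '>' at all

print(greedy)
print(lazy)
print(negated)
```

On this input the lazy and negated-class patterns return the same tags. Which one reads better is largely a matter of taste.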

8. Most common regular expression syntax and patterns

Now that you understand the re module, here are the most common wildcard patterns:

Basic Syntax
.             One character except new line
\.            A period. \ escapes a special character.
\d            One digit
\D            One non-digit
\w            One word character including digits
\W            One non-word character
\s            One whitespace
\S            One non-whitespace
\b            Word boundary
\n            Newline
\t            Tab

Modifiers
$             End of string
^             Start of string
ab|cd         Matches ab or cd.
[ab-d]        One character of: a, b, c, d
[^ab-d]       One character except: a, b, c, d
()            Items within parentheses are captured as a group
(a(bc))       Nested parentheses are captured as separate groups

Repetitions
[ab]{2}       Exactly 2 consecutive occurrences of a or b
[ab]{2,5}     2 to 5 consecutive occurrences of a or b
[ab]{2,}      2 or more consecutive occurrences of a or b
+             One or more
*             Zero or more
?             0 or 1
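
A few rows of the table are easy to check directly. The snippet below is a small sketch on illustrative input strings, chosen only to exercise those rows.

```python
import re

# Alternation: either 'ab' or 'cd'.
alternation = re.findall(r'ab|cd', 'abcdde')

# Character class with a range: a, or anything from b to d.
in_class = re.findall(r'[ab-d]', 'abcde')

# Negated class: anything except a, b, c, d.
not_in_class = re.findall(r'[^ab-d]', 'abcde')

# Nested groups: the outer group and the inner group are both returned.
nested = re.findall(r'(a(bc))', 'abc')

# Fixed repetition: exactly two characters, each being a or b.
pairs = re.findall(r'[ab]{2}', 'abba')

print(alternation, in_class, not_in_class, nested, pairs)
```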

9. Regular Expressions Examples

9.1. Any character except for a new line

The dot . matches any single character except a newline:

import re

text = 'machinelearningplus.com'
print(re.findall('.', text))
print(re.findall('...', text))

9.2. A period

Use \. to match a literal period:

import re

text = 'machinelearningplus.com'
print(re.findall(r'\.', text))     # the literal period
print(re.findall(r'[^\.]', text))  # every character except a period

9.3. Any digit

\d+ matches one or more digits:

import re

text = '01, Jan 2015'
print(re.findall(r'\d+', text))

9.4. Anything but a digit

\D+ matches one or more non-digit characters:

import re

text = '01, Jan 2015'
print(re.findall(r'\D+', text))

9.5. Any character, including digits

\w+ matches word characters (letters, digits, underscore):

import re

text = '01, Jan 2015'
print(re.findall(r'\w+', text))

9.6. Anything but a character

\W+ matches non-word characters:

import re

text = '01, Jan 2015'
print(re.findall(r'\W+', text))

9.7. Collection of characters

Square brackets [] match any character inside them:

import re

text = '01, Jan 2015'
print(re.findall('[a-zA-Z]+', text))

9.8. Match something up to ‘n’ times

Use {n} to match exactly n repetitions, and {m,n} to match between m and n repetitions:

import re

text = '01, Jan 2015'
print(re.findall(r'\d{4}', text))    # exactly 4 digits
print(re.findall(r'\d{2,4}', text))  # between 2 and 4 digits

9.9. Match 1 or more occurrences

The + symbol matches one or more occurrences:

import re

print(re.findall(r'Co+l', 'So Cooool'))

9.10. Match any number of occurrences (0 or more times)

The * symbol matches zero or more occurrences:

import re

print(re.findall(r'Pi*lani', 'Pilani'))

9.11. Match exactly zero or one occurrence

The ? symbol matches zero or one occurrence:

import re

print(re.findall(r'colou?r', 'color'))
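
The three quantifiers are easier to tell apart when they run against the same inputs. This is a quick sketch, and the test strings are illustrative.

```python
import re

words = 'Cl Col Cool'

one_or_more = re.findall(r'Co+l', words)   # needs at least one 'o'
zero_or_more = re.findall(r'Co*l', words)  # 'o' may be absent entirely

# '?' allows at most one 'u', so 'colouur' is not matched.
zero_or_one = re.findall(r'colou?r', 'color colour colouur')

print(one_or_more)
print(zero_or_more)
print(zero_or_one)
```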

9.12. Match word boundaries

The \b boundary matches a position where one side is a word character and the other is not (a space, punctuation, or the start or end of the string). For example, \btoy matches ‘toy’ in ‘toy cat’ but not in ‘tolstoy’.

To match ‘toy’ in ‘tolstoy’, use toy\b. To match only the standalone word ‘toy’, place \b on both sides.

\B matches non-boundaries. So \Btoy\B matches ‘toy’ only when surrounded by word characters, as in ‘antoynet’.

import re

print(re.findall(r'\btoy\b', 'play toy broke toys'))
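
The boundary cases described above can be checked one by one. The sketch below uses re.search and records whether each pattern finds a match in each string.

```python
import re

# (pattern, text) pairs taken from the cases discussed above.
checks = [
    (r'\btoy', 'toy cat'),     # 'toy' starts a word
    (r'\btoy', 'tolstoy'),     # 'toy' is in the middle of a word
    (r'toy\b', 'tolstoy'),     # 'toy' ends a word
    (r'\Btoy\B', 'antoynet'),  # 'toy' has word characters on both sides
    (r'\Btoy\B', 'toy cat'),   # 'toy' sits at a boundary, so \B fails
]

results = [bool(re.search(pattern, text)) for pattern, text in checks]

for (pattern, text), matched in zip(checks, results):
    print(f'{pattern!r:12} in {text!r:12} -> {matched}')
```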

10. Practice Exercises

Try these exercises. Run the code blocks to check your answers.

Exercise 1: Extract the user id, domain name, and suffix from these email addresses.

import re

emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com"""

desired_output = [('zuck26', 'facebook', 'com'),
 ('page33', 'google', 'com'),
 ('jeff42', 'amazon', 'com')]

print("Desired:", desired_output)

Solution: Use groups with () to capture each part:

import re

emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com"""

pattern = r'(\w+)@([A-Z0-9]+)\.([A-Z]{2,4})'
result = re.findall(pattern, emails, flags=re.IGNORECASE)
print(result)

Exercise 2: Retrieve all words starting with ‘b’ or ‘B’ from this text.

import re

text = """Betty bought a bit of butter, But the butter was so bitter, So she bought some better butter, To make the bitter butter better."""

result = re.findall(r'\bB\w+', text, flags=re.IGNORECASE)
print(result)

The \b before ‘B’ requires a word boundary on the left, so the word must start with ‘B’. The re.IGNORECASE flag makes it case insensitive.

Exercise 3: Split this irregular sentence into clean words.

import re

sentence = """A, very   very; irregular_sentence"""

result = " ".join(re.split(r'[;,\s_]+', sentence))
print(result)

Exercise 4: Clean a tweet by removing URLs, hashtags, mentions, punctuation, RT, and CC.

import re

tweet = 'Good advice! RT @TheNextWeb: What I would do differently if I was learning to code today http://t.co/lbwej0pxOd cc: @garybernhardt #rstats'

def clean_tweet(tweet):
    tweet = re.sub(r'http\S+\s*', '', tweet)  # remove URLs
    tweet = re.sub(r'\b(?:RT|cc)\b', '', tweet)  # remove RT and cc as whole words only
    tweet = re.sub(r'#\S+', '', tweet)  # remove hashtags
    tweet = re.sub(r'@\S+', '', tweet)  # remove mentions
    tweet = re.sub('[%s]' % re.escape('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'), '', tweet)  # remove punctuation
    tweet = re.sub(r'\s+', ' ', tweet)  # collapse extra whitespace
    return tweet.strip()  # drop leading and trailing spaces left by the removals

print(clean_tweet(tweet))

Exercise 5: Extract all text between HTML tags. This exercise requires the requests library, so here is the pattern to study:

import re
import requests

r = requests.get("https://raw.githubusercontent.com/selva86/datasets/master/sample.html")
print(re.findall(r'<.*?>(.*)</.*?>', r.text))

The pattern <.*?>(.*)</.*?> captures the text between an opening tag and a closing tag on the same line.
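
If you want to try the pattern without a network call, you can run it on a small inline string. The HTML below is a made-up stand-in for illustration and is not the content of the sample file.

```python
import re

# Hypothetical inline HTML, so the example needs no requests call.
html = """<h1>Regex Tutorial</h1>
<p>Hello, world</p>"""

# Opening tag, captured text, closing tag. '.' does not cross line breaks.
texts = re.findall(r'<.*?>(.*)</.*?>', html)
print(texts)
```

This works for simple one-tag-per-line markup. For nested or multi-line HTML, a dedicated HTML parser is usually the safer choice.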

11. Conclusion

This tutorial covered regular expressions in Python from the ground up. You learned how to compile patterns, split strings, find matches, substitute text, use groups, and control greedy vs. lazy matching.

Use the reference table and practice exercises to build your regex skills. Keep this guide handy for your next text mining project.
