Python Regular Expressions Tutorial and Examples: A Simplified Guide
Regular expressions, also called regex, are a compact language for searching, extracting, and manipulating specific string patterns in a larger text. In Python, they are implemented in the re module. They power text validation, NLP projects, and text mining workflows.
Regular Expressions in Python: A Simplified Tutorial. Photo by Sarah Crutchfield.
Contents
1. Introduction to regular expressions
2. What is a regex pattern and how to compile one?
3. How to split a string separated by a regex?
4. Finding pattern matches using findall, search and match
4.1 What does re.findall() do?
4.2 re.search() vs re.match()
5. How to substitute one text with another using regex?
6. Regex groups
7. What is greedy matching in regex?
8. Most common regular expression syntax and patterns
9. Regular Expressions Examples
9.1. Any character except for a new line
9.2. A period
9.3. Any digit
9.4. Anything but a digit
9.5. Any character, including digits
9.6. Anything but a character
9.7. Collection of characters
9.8. Match something up to ‘n’ times
9.9. Match 1 or more occurrences
9.10. Match any number of occurrences (0 or more times)
9.11. Match exactly zero or one occurrence
9.12. Match word boundaries
10. Practice Exercises
11. Conclusion
1. Introduction to regular expressions
Python implements regex through its built-in re module. You will find regex in NLP, web apps that validate input (like email addresses), and most text mining projects.
This tutorial has two parts. First, you will learn the five main features of the re module. Then you will see how to build common regex patterns in Python.
By the end, you will know how to construct almost any string pattern you need for text mining work.
2. What is a regex pattern and how to compile one?
A regex pattern is a special language that represents generic text, numbers, or symbols. You use it to extract text that fits the pattern.
Take the basic example '\s+'. Here '\s' matches any whitespace character, and the '+' makes it match one or more of them in a row.
Because tabs ('\t') and newlines ('\n') are whitespace too, this pattern matches them as well. A full list of regex patterns appears at the end of this post. First, let’s see how to compile and use regular expressions.
import re
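regex = re.compile(r'\s+')
print("Compiled:", regex)
print("Pattern:", regex.pattern)
The code above imports re and compiles a pattern that matches one or more whitespace characters. The r prefix makes the pattern a raw string, so Python passes the backslash to the regex engine unchanged.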
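As a rough illustration of what the compiled pattern matches, the sketch below runs it on a short made-up string (the `sample` value is an assumption, not from the course text). Each run of whitespace comes back as one match, whether it is made of spaces, a tab, or a newline.

```python
import re

# A raw string (r'...') hands the backslash to the regex engine unchanged.
regex = re.compile(r'\s+')

# Hypothetical input: two spaces, then a tab, then a newline.
sample = "a  b\tc\nd"

# Each run of whitespace is reported as a single match.
matches = regex.findall(sample)
print(matches)         # ['  ', '\t', '\n']
print(regex.pattern)   # \s+
```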
3. How to split a string separated by a regex?
Consider this piece of text with three course items:
import re
text = """101 COM Computers
205 MAT Mathematics
189 ENG English"""
print(text)
Each line follows the format: “[Course Number] [Course Code] [Course Name]”. The spacing between words is uneven. You want to split these into individual words.
You can split in two ways:
1. Use the re.split method.
2. Call the split method on a compiled regex object.
import re
text = """101 COM Computers
205 MAT Mathematics
189 ENG English"""
regex = re.compile(r'\s+')
# Both methods produce the same result
print(re.split(r'\s+', text))
print(regex.split(text))
Both methods work. Which should you use? If you plan to reuse a pattern, compile it first. That avoids recompiling the same pattern over and over.
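If you want one row per course rather than a flat list of words, one possible approach is to split each line separately and cap the number of splits with the maxsplit argument. This is only a sketch: the `rows` and `splitter` names and the extra "Applied Physics" line are made up for illustration.

```python
import re

text = """101 COM Computers
205 MAT Mathematics
189 ENG English"""

splitter = re.compile(r'\s+')

# Split every line into at most three fields: number, code, name.
rows = [splitter.split(line, maxsplit=2) for line in text.splitlines()]
print(rows)

# maxsplit=2 keeps a multi-word name together in the last field.
# The line below is a hypothetical course, not part of the original text.
extra = splitter.split("300 PHY Applied Physics", maxsplit=2)
print(extra)   # ['300', 'PHY', 'Applied Physics']
```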
4. Finding pattern matches using findall, search and match
Suppose you want to extract only the course numbers (101, 205, 189) from the text above. How do you do that?
4.1 What does re.findall() do?
import re
text = """101 COM Computers
205 MAT Mathematics
189 ENG English"""
print(text)
print()
regex_num = re.compile(r'\d+')
print(regex_num.findall(text))
The special character '\d' matches any digit. Adding '+' requires at least one digit to be present.
There is also a '*' symbol. It matches zero or more digits, which makes the digits optional for a match. More on this later.
The findall method extracts all occurrences and returns them as a list.
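To make the difference between '+' and '*' concrete, here is a small sketch on the same course text. Because '*' accepts zero digits, findall also reports empty strings at positions where no digit is present, so the sketch counts them instead of printing them all.

```python
import re

text = """101 COM Computers
205 MAT Mathematics
189 ENG English"""

plus_matches = re.findall(r'\d+', text)   # at least one digit required
star_matches = re.findall(r'\d*', text)   # zero digits also counts as a match

print(plus_matches)                        # ['101', '205', '189']

# With '*', the real numbers are still found, mixed with empty matches.
non_empty = [m for m in star_matches if m]
print(non_empty)
print("empty matches:", star_matches.count(''))
```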
4.2 re.search() vs re.match()
regex.search() looks for the pattern anywhere in the text. Unlike findall, it returns a match object with the start and end positions of the first match.
regex.match() also returns a match object. The difference? It only checks the beginning of the text.
import re
text2 = """COM Computers 205 MAT Mathematics 189"""
regex_num = re.compile(r'\d+')
s = regex_num.search(text2)
print('Starting Position: ', s.start())
print('Ending Position: ', s.end())
print(text2[s.start():s.end()])
You can also get the matched text using the group() method:
import re
text2 = """COM Computers 205 MAT Mathematics 189"""
regex_num = re.compile(r'\d+')
s = regex_num.search(text2)
print(s.group())
Now see what happens with match() when the text does not start with a digit:
import re
text2 = """COM Computers 205 MAT Mathematics 189"""
regex_num = re.compile(r'\d+')
m = regex_num.match(text2)
print(m)
It returns None because there is no digit at the start of the string.
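Since both search() and match() can return None, calling group() on the result without checking raises an AttributeError. The sketch below shows one way to guard against that. The helper name `first_number` is made up for illustration, and fullmatch() is included only as a related method for comparison.

```python
import re

regex_num = re.compile(r'\d+')

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = regex_num.search(text)
    return m.group() if m is not None else None

print(first_number("COM Computers 205 MAT"))   # 205
print(first_number("no digits here"))          # None

# match() anchors at the start; fullmatch() requires the whole string to match.
starts_with_number = regex_num.match("205 MAT") is not None
is_only_number = regex_num.fullmatch("205 MAT") is not None
print(starts_with_number, is_only_number)      # True False
```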
5. How to substitute one text with another using regex?
Use regex.sub() to replace text. Consider this version of the courses text with extra tabs after each course code:
import re
text = """101 COM \t Computers
205 MAT \t Mathematics
189 ENG \t English"""
print(text)
You want to collapse all extra spaces into a single space and put everything on one line. Use regex.sub to replace '\s+' with a single space:
import re
text = """101 COM \t Computers
205 MAT \t Mathematics
189 ENG \t English"""
regex = re.compile(r'\s+')
print(regex.sub(' ', text))
print()
print(re.sub(r'\s+', ' ', text))
What if you want to keep course entries on separate lines but remove extra spaces within each line? Put a negative lookahead (?!\n) in front of \s and repeat the pair. The lookahead rejects newline characters, so only spaces and tabs are replaced:
import re
text = """101 COM \t Computers
205 MAT \t Mathematics
189 ENG \t English"""
# (?!\n) is checked before every whitespace character, so a newline is never consumed
regex = re.compile(r'(?:(?!\n)\s)+')
print(regex.sub(' ', text))
6. Regex groups
Regex groups let you extract matched parts as separate items. Suppose you want the course number, code, and name as individual pieces.
Without groups, you need three separate patterns:
import re
text = """101 COM Computers
205 MAT Mathematics
189 ENG English"""
# Extract each part separately
print(re.findall('[0-9]+', text))
print(re.findall('[A-Z]{3}', text))
print(re.findall('[A-Za-z]{4,}', text))
Here is what each pattern does:
[0-9]+ matches one or more digits (0 through 9). If you know the number has exactly 3 digits, use [0-9]{3} instead.
[A-Z]{3} matches exactly three uppercase letters A through Z.
[A-Za-z]{4,} matches four or more letters (upper or lower case). This assumes course names have at least 4 characters.
That took three separate calls. Regex groups offer a better way. Place each part inside parentheses () within a single pattern:
import re
text = """101 COM Computers
205 MAT Mathematics
189 ENG English"""
course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})'
print(re.findall(course_pattern, text))
Each set of parentheses creates a group. The result is a list of tuples with the matched groups.
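When you need more than the tuples, for example to convert the number to an int, one option is to iterate over match objects with finditer and read the groups from each match. The sketch below assumes the same course text. The `courses` list and its dictionary keys are illustrative names only.

```python
import re

text = """101 COM Computers
205 MAT Mathematics
189 ENG English"""

course_pattern = re.compile(r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})')

courses = []
for m in course_pattern.finditer(text):
    # groups() returns the three captured parts as a tuple of strings.
    number, code, name = m.groups()
    courses.append({'number': int(number), 'code': code, 'name': name})
    # group(0) is the whole match; group(1) to group(3) are the parts.
    print(m.group(0), '->', m.group(1), m.group(2), m.group(3))

print(courses[0])   # {'number': 101, 'code': 'COM', 'name': 'Computers'}
```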
7. What is greedy matching in regex?
By default, regex is greedy. It grabs as much text as possible while still matching the pattern.
Here is an example with HTML tags:
import re
text = "< body>Regex Greedy Matching Example < /body>"
print(re.findall('<.*>', text))
Instead of stopping at the first >, regex grabbed the entire string. That is greedy behavior: “take it all.”
Lazy matching does the opposite: “take as little as possible.” Add ? after the quantifier (here, after the *) to make it lazy:
import re
text = "< body>Regex Greedy Matching Example < /body>"
print(re.findall('<.*?>', text))
To get only the first match, use search instead of findall:
import re
text = "< body>Regex Greedy Matching Example < /body>"
print(re.search('<.*?>', text).group())
8. Most common regular expression syntax and patterns
Now that you understand the re module, here are the most common wildcard patterns:
Basic Syntax
. One character except new line
\. A period. \ escapes a special character.
\d One digit
\D One non-digit
\w One word character including digits
\W One non-word character
\s One whitespace
\S One non-whitespace
\b Word boundary
\n Newline
\t Tab
Modifiers
$ End of string
^ Start of string
ab|cd Matches ab or cd.
[ab-d] One character of: a, b, c, d
[^ab-d] One character except: a, b, c, d
() Items within parentheses are captured as a group
(a(bc)) Items within the nested parentheses are captured as a group too
Repetitions
[ab]{2} Exactly 2 continuous occurrences of a or b
[ab]{2,5} 2 to 5 continuous occurrences of a or b
[ab]{2,} 2 or more continuous occurrences of a or b
+ One or more
* Zero or more
? 0 or 1
9. Regular Expressions Examples
9.1. Any character except for a new line
The dot . matches any single character except a newline:
import re
text = 'machinelearningplus.com'
print(re.findall('.', text))
print(re.findall('...', text))
9.2. A period
Use \. to match a literal period:
import re
text = 'machinelearningplus.com'
print(re.findall(r'\.', text))
print(re.findall(r'[^\.]', text))
9.3. Any digit
\d+ matches one or more digits:
import re
text = '01, Jan 2015'
print(re.findall(r'\d+', text))
9.4. Anything but a digit
\D+ matches one or more non-digit characters:
import re
text = '01, Jan 2015'
print(re.findall(r'\D+', text))
9.5. Any character, including digits
\w+ matches word characters (letters, digits, underscore):
import re
text = '01, Jan 2015'
print(re.findall(r'\w+', text))
9.6. Anything but a character
\W+ matches non-word characters:
import re
text = '01, Jan 2015'
print(re.findall(r'\W+', text))
9.7. Collection of characters
Square brackets [] match any character inside them:
import re
text = '01, Jan 2015'
print(re.findall('[a-zA-Z]+', text))
9.8. Match something up to ‘n’ times
Use {n} to match exactly n repetitions, and {m,n} to match between m and n repetitions:
import re
text = '01, Jan 2015'
print(re.findall(r'\d{4}', text))
print(re.findall(r'\d{2,4}', text))
9.9. Match 1 or more occurrences
The + symbol matches one or more occurrences:
import re
print(re.findall(r'Co+l', 'So Cooool'))
9.10. Match any number of occurrences (0 or more times)
The * symbol matches zero or more occurrences:
import re
print(re.findall(r'Pi*lani', 'Pilani'))
9.11. Match exactly zero or one occurrence
The ? symbol matches zero or one occurrence:
import re
print(re.findall(r'colou?r', 'color'))
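To compare the three quantifiers from sections 9.9 to 9.11 side by side, the sketch below tests each one against the same small set of strings. The `words` list is made up for illustration, and fullmatch is used so that a string counts only when the whole string fits the pattern.

```python
import re

# Hypothetical test strings with 0, 1, 2 and 4 letter o's.
words = ['Cl', 'Col', 'Cool', 'Cooool']

results = {}
for pattern in (r'Co+l', r'Co*l', r'Co?l'):
    regex = re.compile(pattern)
    # Keep the words that the pattern matches from start to end.
    results[pattern] = [w for w in words if regex.fullmatch(w)]
    print(pattern, results[pattern])

# Co+l -> ['Col', 'Cool', 'Cooool']        (one or more o)
# Co*l -> ['Cl', 'Col', 'Cool', 'Cooool']  (zero or more o)
# Co?l -> ['Cl', 'Col']                    (zero or one o)
```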
9.12. Match word boundaries
The \b boundary matches a position where one side is a word character and the other is not (a non-word character such as whitespace or punctuation, or the start or end of the string). For example, \btoy matches ‘toy’ in ‘toy cat’ but not in ‘tolstoy’.
To match ‘toy’ in ‘tolstoy’, use toy\b. To match only the standalone word ‘toy’, place \b on both sides.
\B matches non-boundaries. So \Btoy\B matches ‘toy’ only when surrounded by word characters, as in ‘antoynet’.
import re
print(re.findall(r'\btoy\b', 'play toy broke toys'))
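The boundary cases described above can be checked in one pass. This sketch runs the four variants against the three example strings from the text and records which strings each pattern finds a match in. The `samples` and `results` names are illustrative.

```python
import re

samples = ['toy cat', 'tolstoy', 'antoynet']

patterns = [
    r'\btoy',     # boundary before 'toy'
    r'toy\b',     # boundary after 'toy'
    r'\btoy\b',   # boundary on both sides
    r'\Btoy\B',   # no boundary on either side
]

# For each pattern, list the sample strings in which it matches somewhere.
results = {p: [s for s in samples if re.search(p, s)] for p in patterns}
for p in patterns:
    print(p, results[p])

# \btoy   -> ['toy cat']
# toy\b   -> ['toy cat', 'tolstoy']
# \btoy\b -> ['toy cat']
# \Btoy\B -> ['antoynet']
```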
10. Practice Exercises
Try these exercises. Run the code blocks to check your answers.
Exercise 1: Extract the user id, domain name, and suffix from these email addresses.
import re
emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com"""
desired_output = [('zuck26', 'facebook', 'com'),
('page33', 'google', 'com'),
('jeff42', 'amazon', 'com')]
print("Desired:", desired_output)
Solution: Use groups with () to capture each part:
import re
emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com"""
pattern = r'(\w+)@([A-Z0-9]+)\.([A-Z]{2,4})'
result = re.findall(pattern, emails, flags=re.IGNORECASE)
print(result)
Exercise 2: Retrieve all words starting with ‘b’ or ‘B’ from this text.
import re
text = """Betty bought a bit of butter, But the butter was so bitter, So she bought some better butter, To make the bitter butter better."""
result = re.findall(r'\bB\w+', text, flags=re.IGNORECASE)
print(result)
The \b before ‘B’ requires a word boundary on the left, so the word must start with ‘B’. The re.IGNORECASE flag makes it case insensitive.
Exercise 3: Split this irregular sentence into clean words.
import re
sentence = """A, very very; irregular_sentence"""
result = " ".join(re.split(r'[;,\s_]+', sentence))
print(result)
Exercise 4: Clean a tweet by removing URLs, hashtags, mentions, punctuation, RT, and CC.
import re
tweet = 'Good advice! RT @TheNextWeb: What I would do differently if I was learning to code today http://t.co/lbwej0pxOd cc: @garybernhardt #rstats'
def clean_tweet(tweet):
    tweet = re.sub(r'http\S+\s*', '', tweet)  # remove URLs
    tweet = re.sub(r'\b(?:RT|cc)\b', '', tweet)  # remove RT and cc as whole words only
    tweet = re.sub(r'#\S+', '', tweet)  # remove hashtags
    tweet = re.sub(r'@\S+', '', tweet)  # remove mentions
    tweet = re.sub('[%s]' % re.escape('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'), '', tweet)  # remove punctuation
    tweet = re.sub(r'\s+', ' ', tweet)  # collapse extra whitespace
    return tweet
print(clean_tweet(tweet))
Exercise 5: Extract all text between HTML tags. This exercise requires the requests library, so here is the pattern to study:
import re
import requests
r = requests.get("https://raw.githubusercontent.com/selva86/datasets/master/sample.html")
print(re.findall(r'<.*?>(.*)</.*?>', r.text))
The pattern <.*?>(.*)</.*?> captures everything between opening and closing tags.
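To try the pattern without a network call, the sketch below applies it to a short inline HTML string. The `html` value is a made-up stand-in for the downloaded page, so the output will differ from what the real sample file produces.

```python
import re

# Hypothetical stand-in for r.text, so the sketch runs offline.
html = """<h1>Regex Tutorial</h1>
<p>Match text between tags</p>"""

# <.*?> lazily matches the opening tag, (.*) captures the content,
# and </.*?> matches the closing tag on the same line.
contents = re.findall(r'<.*?>(.*)</.*?>', html)
print(contents)   # ['Regex Tutorial', 'Match text between tags']
```

Because the dot does not match a newline by default, this pattern only captures content whose opening and closing tags sit on the same line.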
11. Conclusion
This tutorial covered regular expressions in Python from the ground up. You learned how to compile patterns, split strings, find matches, substitute text, use groups, and control greedy vs. lazy matching.
Use the reference table and practice exercises to build your regex skills. Keep this guide handy for your next text mining project.

