How to detect outliers using IQR and Boxplots?

Written by Selva Prabhakaran | 6 min read

Let’s understand what outliers are, how to identify them using IQR and boxplots, and how to treat them if appropriate.

1. What are outliers?

In statistics, outliers are data points that differ significantly from the other data points in the dataset.

Outliers can arise for various reasons, such as an unusual event or an experimental or data entry error. They are usually categorized as either point outliers or pattern outliers.

Point outliers are single abnormal instances (data points), whereas pattern outliers are clusters of abnormal instances.

2. Why should you treat the outliers?

Outliers present in the data can cause various problems:

  1. Outliers might force the algorithm to fit the model away from the true relationship. Many algorithms work by minimizing an error/cost function, and a few extreme points can dominate that function. The image below shows the impact.

  2. They can distort the statistics and significance tests you run on the data. For example, they can change the correlation you calculate between two numeric variables. So, it is good practice to examine and, where appropriate, treat or remove outliers before you calculate correlations.
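
To make both effects concrete, here is a minimal sketch on made-up data (the arrays and the single extreme point are assumptions for illustration, not values from the dataset used later). It fits a straight line with np.polyfit and computes the Pearson correlation with np.corrcoef, once without and once with one extreme point.

```python
import numpy as np

# Made-up data that follows y = 2x + 1 exactly (illustrative only)
x_clean = np.arange(1, 11, dtype=float)
y_clean = 2 * x_clean + 1

# The same data plus one extreme point at (11, -50)
x_out = np.append(x_clean, 11.0)
y_out = np.append(y_clean, -50.0)

# Least-squares slope of a straight-line fit
slope_clean = np.polyfit(x_clean, y_clean, 1)[0]
slope_out = np.polyfit(x_out, y_out, 1)[0]

# Pearson correlation coefficient
r_clean = np.corrcoef(x_clean, y_clean)[0, 1]
r_out = np.corrcoef(x_out, y_out)[0, 1]

print(f"slope without outlier: {slope_clean:.2f}, with outlier: {slope_out:.2f}")
print(f"correlation without outlier: {r_clean:.2f}, with outlier: {r_out:.2f}")
```

In this toy case a single extreme point is enough to flip the sign of both the fitted slope and the correlation.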

Note: Outliers are not necessarily a bad thing to have in the data. Sometimes they are simply observations that do not follow the same pattern as the others.

But an outlier can also be very interesting for science.

For example, if in a vaccination experiment one vaccinated person is infected with COVID-19 whereas all other vaccinated people are immune, then it would be very interesting to understand why. This could lead to new scientific discoveries. So, it is important to detect outliers.

So whenever you identify outliers, don’t simply remove or treat them. If such extreme data points can occur again, consider keeping them in your data and letting the ML model learn from them.

3. Detecting Outliers using Box and Whisker Plot

A box plot is a visual representation of how numerical data is spread. It can also be used to detect outliers.

It captures the summary of the data efficiently with a simple box and whiskers and allows us to compare data distribution easily across groups.

[Figure: Box and whiskers plot]

So how do you spot outliers in a box plot?

Points that lie beyond the whiskers are generally considered outliers. The whiskers reach at most 1.5 times the interquartile range (IQR) from the edge of the box, ending at the most extreme data point within that distance. The IQR is the difference between the 3rd quartile and the 1st quartile.

Usually the outlier data points are marked as dots in the box plot.
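
As a rough sketch of that rule, the snippet below computes the whisker ends and the points beyond them for a small made-up sample (the numbers are assumptions chosen for illustration). It uses NumPy's default percentile interpolation; other tools may compute quartiles slightly differently.

```python
import numpy as np

# Small made-up sample with one obviously extreme value
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything beyond the fences would be drawn as a dot
fliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print("whiskers:", whisker_low, whisker_high)
print("points beyond the whiskers:", fliers)
```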

Import Data

The only packages we need for this are numpy and pandas for data wrangling, and matplotlib and seaborn for visualization.

python
# Import libraries 
import matplotlib.pyplot as plt
import seaborn as sns

# Data Manipulation
import numpy as np 
import pandas as pd

# Set pandas options to show more rows and columns
pd.set_option('display.max_rows', 800)
pd.set_option('display.max_columns', 500)
%matplotlib inline

Load dataset

Let’s define the numeric and categorical columns.

python
# Target class name
input_target_class = "Exited"

# Columns to be removed
input_drop_col = "CustomerId"

# Categorical columns
input_cat_columns = ['Surname', 'Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'Exited']

# Numerical columns
input_num_columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

Now, load the dataset as a pandas DataFrame.

python
# Read data in form of a csv file
df = pd.read_csv("Churn_Modelling.csv")

# First 5 rows of the dataset
df.head()
   RowNumber  CustomerId   Surname  CreditScore  Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0          1    15634602  Hargrave          619     France  Female   42       2       0.00              1          1               1        101348.88       1
1          2    15647311      Hill          608      Spain  Female   41       1   83807.86              1          0               1        112542.58       0
2          3    15619304      Onio          502     France  Female   42       8  159660.80              3          1               0        113931.57       1
3          4    15701354      Boni          699     France  Female   39       1       0.00              2          0               0         93826.63       0
4          5    15737888  Mitchell          850      Spain  Female   43       2  125510.82              1          1               1         79084.10       0

Draw boxplot for all columns one by one

Iterate over the columns and draw a boxplot for each numeric one.

python
# Draw boxplot for each numeric column.
for column in df:
    if column in input_num_columns:
        plt.figure()
        plt.gca().set_title(column)
        df.boxplot([column])

Inference

Outliers are visible for ‘Number of Products’, ‘Age’ and ‘Credit Score’.

Draw boxplot for all columns at once using seaborn

python
df.head()
   RowNumber  CustomerId   Surname  CreditScore  Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0          1    15634602  Hargrave          619     France  Female   42       2       0.00              1          1               1        101348.88       1
1          2    15647311      Hill          608      Spain  Female   41       1   83807.86              1          0               1        112542.58       0
2          3    15619304      Onio          502     France  Female   42       8  159660.80              3          1               0        113931.57       1
3          4    15701354      Boni          699     France  Female   39       1       0.00              2          0               0         93826.63       0
4          5    15737888  Mitchell          850      Spain  Female   43       2  125510.82              1          1               1         79084.10       0
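
One possible way to draw all numeric columns in a single figure with seaborn is a row of subplots, one per column, so that each column keeps its own y-axis scale. The sketch below uses a small synthetic DataFrame (demo_df, with made-up values that only borrow the column names) so that it runs on its own; with the churn data you would loop over input_num_columns and read from df instead. The Agg backend line is an assumption that lets the sketch run without a display; drop it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made up for illustration)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "CreditScore": rng.normal(650, 95, 500).clip(350, 850),
    "Age": rng.normal(39, 10, 500).clip(18, 92),
    "Balance": rng.normal(76000, 62000, 500).clip(0, None),
})
num_cols = list(demo_df.columns)

# One subplot per numeric column, each with its own y-axis
fig, axes = plt.subplots(1, len(num_cols), figsize=(15, 5))
for ax, col in zip(axes, num_cols):
    sns.boxplot(y=demo_df[col], ax=ax, width=0.5, fliersize=3)
    ax.set_title(col)
fig.tight_layout()
```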

4. Compare Boxplots side by side, against each class of the target variable.

We can do this with seaborn using sns.boxplot.

Credit Score

python
fig, ax = plt.subplots(figsize=(15,10))
sns.boxplot(data=df, width= 0.5, ax=ax,  fliersize=3, y="CreditScore", x="Exited");

Number of products

python
fig, ax = plt.subplots(figsize=(15,10))
sns.boxplot(data=df, width= 0.5, ax=ax,  fliersize=3, y="NumOfProducts", x="Exited");

Age

python
fig, ax = plt.subplots(figsize=(15,10))
sns.boxplot(data=df, width= 0.5, ax=ax,  fliersize=3, y="Age", x="Exited");

Inferences:

  • By observing the above boxplots, you can spot the outlier values visually.
  • Example: In the above boxplots, CreditScore shows more outlier values than the other columns.
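
The per-class boxes summarize the quartiles of a column within each value of the target. A numeric version of that summary can be sketched with a pandas groupby. The tiny frame below is made up for illustration; only the column names Exited and CreditScore mirror the dataset.

```python
import pandas as pd

# Tiny made-up frame; values are illustrative only
demo = pd.DataFrame({
    "Exited": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "CreditScore": [500, 600, 650, 700, 800, 400, 550, 600, 650, 850],
})

# Quartiles of CreditScore within each class of Exited
quartiles = demo.groupby("Exited")["CreditScore"].quantile([0.25, 0.75]).unstack()
quartiles.columns = ["q1", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]

print(quartiles)
```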

Let’s now find these points mathematically rather than visually, using the interquartile range (IQR).

5. Outlier Detection using Interquartile Range (IQR)

The interquartile range (IQR) is a measure of statistical dispersion, equal to the difference between the 3rd and 1st quartiles. In other words, it is the first quartile subtracted from the third quartile.

IQR = Q₃ − Q₁

How do you detect outliers using the IQR?

All the values above Q3 + 1.5*IQR and the values below Q1 – 1.5*IQR are outliers. That’s basically all the points outside the whiskers.

Steps to detect outliers by finding the lower and upper bounds of the data:

  1. Arrange your data in ascending order
  2. Calculate Q1 (the first quartile)
  3. Calculate Q3 (the third quartile)
  4. Find IQR = Q3 - Q1
  5. Find the lower bound = Q1 - (1.5 * IQR)
  6. Find the upper bound = Q3 + (1.5 * IQR)
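
The steps above can be sketched as a small helper. The function name iqr_bounds and the sample values are hypothetical, chosen only for illustration, and the quartiles use NumPy's default (linear) interpolation.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits as Q1 - k*IQR and Q3 + k*IQR."""
    # Step 1: sort (shown for clarity; np.percentile does not require it)
    values = np.sort(np.asarray(values, dtype=float))
    q1 = np.percentile(values, 25)      # step 2
    q3 = np.percentile(values, 75)      # step 3
    iqr = q3 - q1                       # step 4
    return q1 - k * iqr, q3 + k * iqr   # steps 5 and 6

# Made-up sample with one extreme value
sample = [10, 12, 13, 14, 15, 16, 18, 19, 20, 60]
lower, upper = iqr_bounds(sample)
outliers = [v for v in sample if v < lower or v > upper]

print("lower bound:", lower)
print("upper bound:", upper)
print("outliers:", outliers)
```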

Let’s find the outliers in the CreditScore feature of df.

python
# Sort the data
data = df.CreditScore
sort_data = np.sort(data)
sort_data
python
array([350, 350, 350, ..., 850, 850, 850], dtype=int64)

Find the 1st and 3rd quartiles.

python
# Find the 1st and 3rd quartiles.
# nanpercentile ignores missing values, just in case there are any.
# Note: the `method` argument needs NumPy 1.22+ (older versions call it `interpolation`).
q1 = np.nanpercentile(data, 25, method='midpoint')
q2 = np.nanpercentile(data, 50, method='midpoint')  # median, shown for reference
q3 = np.nanpercentile(data, 75, method='midpoint')

IQR = q3 - q1 
print('Interquartile range is', IQR) 
python
Interquartile range is 134.0
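
The choice of method can shift the quartiles, and therefore the limits, on small samples. As a small illustration on made-up numbers (not the churn data), the sketch below compares NumPy's default linear interpolation with 'midpoint'. It assumes NumPy 1.22 or later, where the keyword is named method.

```python
import numpy as np

# Made-up sample for illustration
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Default: linear interpolation between the two nearest data points
q1_linear = np.percentile(sample, 25)
q3_linear = np.percentile(sample, 75)

# 'midpoint': average of the two nearest data points
q1_mid = np.percentile(sample, 25, method="midpoint")
q3_mid = np.percentile(sample, 75, method="midpoint")

print("linear  :", q1_linear, q3_linear, "IQR =", q3_linear - q1_linear)
print("midpoint:", q1_mid, q3_mid, "IQR =", q3_mid - q1_mid)
```

On large datasets the two methods usually give nearly identical limits.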

Plot the boxplot

python
sns.boxplot(data=sort_data, width= 0.5, fliersize=3);

Calculate the upper and lower limit for outliers.

python
lower_limit = q1 - 1.5*(q3 - q1)
upper_limit = q3 + 1.5*(q3 - q1)
print(lower_limit)
print(upper_limit)

lower_limitoutliers = sort_data[sort_data < lower_limit]
upper_limitoutliers = sort_data[sort_data > upper_limit]
python
383.0
919.0

Let’s see the upper and lower limit outliers.

python
upper_limitoutliers
python
array([], dtype=int64)
python
lower_limitoutliers
python
array([350, 350, 350, 350, 350, 351, 358, 359, 363, 365, 367, 373, 376,
       376, 382], dtype=int64)

Inference:

So, outliers are found only in the lower tail.
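
Filtering the sorted array gives the outlier values but loses track of which rows they came from. A possible alternative, sketched here on a short made-up Series, is a boolean mask built with pandas between, which keeps the original index. The limits 383.0 and 919.0 are the ones printed above; the scores themselves are assumptions for illustration.

```python
import pandas as pd

# Made-up scores; only the limits come from the analysis above
scores = pd.Series([350, 376, 420, 580, 650, 652, 718, 790, 850],
                   name="CreditScore")
lower_limit, upper_limit = 383.0, 919.0

# True where the value falls outside [lower_limit, upper_limit]
is_outlier = ~scores.between(lower_limit, upper_limit)

print(scores[is_outlier])
print("number of outliers:", int(is_outlier.sum()))
```

With the churn data, the same mask built from df["CreditScore"] could be used as df[is_outlier] to inspect the full rows.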

Treating Outliers

Optionally, you can replace the values outside the limits with the respective limit (capping). In this context it is not needed, so the following code is commented out.

python
# sort_data[sort_data < lower_limit] = lower_limit
# sort_data[sort_data > upper_limit] = upper_limit
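
If capping is wanted, one way to write it is np.clip, which returns a new array and leaves the original untouched. This is a sketch on made-up values (950 does not occur in the dataset and is included only to show the upper cap); the limits are the ones computed above.

```python
import numpy as np

# Made-up values for illustration
values = np.array([350, 376, 420, 650, 850, 950], dtype=float)
lower_limit, upper_limit = 383.0, 919.0

# Values below lower_limit become lower_limit, values above upper_limit become upper_limit
capped = np.clip(values, lower_limit, upper_limit)

print(capped)
```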