
How to reduce the memory size of a pandas DataFrame

Let's see how to reduce the memory size of a pandas dataframe.

Written by Selva Prabhakaran | 4 min read

After importing data with pandas read_csv(), dataframes tend to occupy more memory than needed. This is pandas' default behavior: numeric columns are read in as 64-bit types so that any value in the file can be represented safely. It's worth optimizing this, because the lighter the dataframe, the faster the operations you run on it later.
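
For reference, here is a minimal setup for the examples that follow. The file name 'churn.csv' is only a placeholder; substitute whatever CSV you are working with.

python
import pandas as pd

# Placeholder file name - use your own dataset here
df = pd.read_csv('churn.csv')

# By default, numeric columns come in as 64-bit types (int64 / float64)
print(df.dtypes)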

So, let’s first check how much memory (RAM) this dataframe occupies in MB.

Check the memory usage of the pandas dataframe in MB

python
# Size occupied by the dataframe, in MB
df.memory_usage(deep=True).sum() / 1024**2
python
2.63516902923584
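
As a side note, calling memory_usage() without the sum gives a per-column breakdown, which is handy for spotting the heaviest columns:

python
# Memory used by each column, in bytes (deep=True includes the contents of object/string columns)
df.memory_usage(deep=True)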

So the dataframe occupies about 2.6 MB in total. That's not much for modern computers, and reducing the size of a dataset this small will not make a noticeable difference in processing times.

But when the dataset is large, reducing the dataframe's size without affecting its content can matter a lot.

So how do we go about reducing the size of the dataframe?

How to reduce the size of the dataframe without affecting the content?

The idea is this: a variable like ‘Age’ will practically never have values above 100. So, the ‘int8’ datatype is sufficient for this column instead of the default ‘int64’.

That’s because int8 can hold values between -128 and 127, whereas int64 can hold much larger numbers and therefore requires more memory to store.

The same idea applies to other variables like ‘NumOfProducts’, ‘CreditScore’, etc.

Certain features occupy more memory than is needed to store them. Reducing memory usage by changing the data type can also speed up computations. So, wherever possible, avoid using the largest datatype by default. A quick illustration of the savings is shown below.
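
Here is a minimal sketch with a synthetic ‘Age’-like column (synthetic data, not the dataset used in this lesson):

python
import numpy as np
import pandas as pd

# Synthetic 'Age'-like column: 10,000 values between 18 and 94
ages = pd.Series(np.random.randint(18, 95, size=10_000))

print(ages.dtype, ages.nbytes)    # default integer dtype (int64 on most platforms): ~80 KB
small = ages.astype('int8')
print(small.dtype, small.nbytes)  # int8: ~10 KB, same values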

How many bytes does a given datatype require?

Here’s a quick reference:

  • int8 / uint8 : consumes 1 byte of memory; range -128 to 127 (int8) or 0 to 255 (uint8). The ‘u’ in uint stands for unsigned.
  • bool : consumes 1 byte; True or False.
  • float16 / int16 / uint16 : consumes 2 bytes of memory; int16 ranges from -32768 to 32767, uint16 from 0 to 65535.
  • float32 / int32 / uint32 : consumes 4 bytes of memory; int32 ranges from -2147483648 to 2147483647, uint32 from 0 to 4294967295.
  • float64 / int64 / uint64 : consumes 8 bytes of memory.

You can verify these limits with NumPy:
python
import numpy as np

print('int64 min: ', np.iinfo(np.int64).min)
print('int64 max: ', np.iinfo(np.int64).max)
print('int8 min: ', np.iinfo(np.int8).min)
print('int8 max: ', np.iinfo(np.int8).max)
python
int64 min:  -9223372036854775808
int64 max:  9223372036854775807
int8 min:  -128
int8 max:  127
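
The same kind of check works for float types through np.finfo(). For example:

python
print('float16 max: ', np.finfo(np.float16).max)   # 65504.0 - and limited precision
print('float32 max: ', np.finfo(np.float32).max)   # roughly 3.4e38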

Function to optimize memory usage

The idea is straightforward: for each numeric column, check the min and max values and decide which datatype is sufficient to hold them. Then change the datatype.

Apply this logic to all columns in the dataset. Doing this often reduces the size of the dataframe significantly.

python
import numpy as np

# Reduce memory usage by downcasting each numeric column
# to the smallest dtype that can hold its min and max values.
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':  # for integers
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:  # for floats (note: float16 has limited precision)
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(
            end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
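
As a side note, pandas itself offers a built-in downcast option through pd.to_numeric(), which achieves a similar effect per column. A minimal sketch of that alternative (not the function used in this lesson):

python
import pandas as pd

# Alternative: let pandas pick the smallest safe numeric dtype per column
def downcast_numeric(df):
    for col in df.columns:
        if df[col].dtype.kind == 'i':    # signed integer columns
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif df[col].dtype.kind == 'f':  # float columns
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df

One design difference: to_numeric with downcast='float' stops at float32, which avoids the precision loss that float16 can introduce.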

Apply

python
# Reduce the memory size of the dataframe
df_o = reduce_mem_usage(df)
python
Mem. usage decreased to  2.02 Mb (23.5% reduction)

Now, let's check the new datatypes and memory usage.

python
df_o.info()
python
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int16  
 1   CustomerId       10000 non-null  int32  
 2   Surname          10000 non-null  object 
 3   CreditScore      9999 non-null   float16
 4   Geography        10000 non-null  object 
 5   Gender           9986 non-null   object 
 6   Age              9960 non-null   float16
 7   Tenure           10000 non-null  int8   
 8   Balance          9963 non-null   float32
 9   NumOfProducts    10000 non-null  int8   
 10  HasCrCard        10000 non-null  int8   
 11  IsActiveMember   10000 non-null  int8   
 12  EstimatedSalary  9999 non-null   float32
 13  Exited           10000 non-null  int8   
dtypes: float16(2), float32(2), int16(1), int32(1), int8(5), object(3)
memory usage: 459.1+ KB

Check memory usage again

python
df.memory_usage(deep=True).sum() / 1024**2
python
2.0152807235717773

Nice! The memory usage dropped from about 2.64 MB to about 2.02 MB. (Note that reduce_mem_usage() changes the columns of df in place, which is why df itself, not just df_o, shows the reduced size.)

This effect is even more pronounced, and more impactful, for larger datasets.

[Next] Lesson 5: Exploratory Data Analysis (EDA)
