
101 Pandas Exercises for Data Analysis (Interactive)

Written by Selva Prabhakaran | 38 min read

This is an interactive version of the popular 101 Pandas Exercises. You can edit and run every code block directly in your browser — no installation needed. All code runs locally in your browser and nothing is sent to any server.

Click ‘Run’ or press Ctrl+Enter on any code block to execute it. The first run may take a few seconds to initialize.

1. How to import pandas and check the version?

Import pandas and check the version used.

Input:

# Task: Import pandas and check the version
# Write your code below

Desired Output:

# Your pandas version will be shown here
# e.g., 2.1.0
Show Solution
import numpy as np # optional
import pandas as pd
print(pd.__version__)
pd.show_versions()  # prints build/dependency details itself; no need to wrap in print()

2. How to create a series from a list, numpy array and dict?

Create a pandas series from each of the items below: a list, a numpy array and a dictionary.

Input:

# Task: Create a series from a list, numpy array and dict
import pandas as pd
import numpy as np
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))

# Write your code below

Desired Output:

a    0
b    1
c    2
e    3
d    4
dtype: int64
Show Solution
import pandas as pd
# Inputs
import numpy as np
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))

# Solution
ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)
print(ser3.head())

3. How to convert the index of a series into a column of a dataframe?

Difficulty Level: L1

Convert the series ser into a dataframe with its index as another column on the dataframe.

Input:

# Task: Convert the index of a series into a column of a dataframe
import numpy as np
import pandas as pd
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)

# Write your code below

Desired Output:

  index  0
0     a  0
1     b  1
2     c  2
3     e  3
4     d  4
Show Solution
import numpy as np
import pandas as pd
# Input
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)

# Solution
df = ser.to_frame().reset_index()
print(df.head())
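As an aside (not part of the original solution), `Series.reset_index` gets there in one step, since calling it on a Series already returns a DataFrame:

```python
import numpy as np
import pandas as pd

mylist = list('abcedfghijklmnopqrstuvwxyz')
ser = pd.Series(dict(zip(mylist, np.arange(26))))

# reset_index on an unnamed Series yields columns 'index' and 0
print(ser.reset_index().head())
```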

4. How to combine many series to form a dataframe?

Difficulty Level: L1

Combine ser1 and ser2 to form a dataframe.

Input:

# Task: Combine many series to form a dataframe
import pandas as pd
import numpy as np
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))

# Write your code below

Desired Output:

  col1  col2
0    a     0
1    b     1
2    c     2
3    e     3
4    d     4
Show Solution
import pandas as pd
# Input
import numpy as np
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))

# Solution 1
df = pd.concat([ser1, ser2], axis=1)

# Solution 2
df = pd.DataFrame({'col1': ser1, 'col2': ser2})
print(df.head())

5. How to assign a name to the series’ index?

Difficulty Level: L1

Give a name to the series ser calling it ‘alphabets’.

Input:

# Task: Assign name to the series' index
import numpy as np
import pandas as pd
ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))

# Write your code below

Desired Output:

0    a
1    b
2    c
3    e
4    d
Name: alphabets, dtype: object
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))

# Solution
ser.name = 'alphabets'
print(ser.head())

6. How to get the items of series A not present in series B?

Difficulty Level: L2

From ser1 remove items present in ser2.

Input:

# Task: Get the items of series A not present in series B
import numpy as np
import pandas as pd
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Write your code below

Desired Output:

0    1
1    2
2    3
dtype: int64
Show Solution
import numpy as np
import pandas as pd
# Input
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Solution
print(ser1[~ser1.isin(ser2)])

7. How to get the items not common to both series A and series B?

Difficulty Level: L2

Get all items of ser1 and ser2 not common to both.

Input:

# Task: Get the items not common to both series A and series B
import numpy as np
import pandas as pd
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Write your code below

Desired Output:

0    1
1    2
2    3
5    6
6    7
7    8
dtype: int64
Show Solution
import numpy as np
import pandas as pd
# Input
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Solution
ser_u = pd.Series(np.union1d(ser1, ser2)) # union
ser_i = pd.Series(np.intersect1d(ser1, ser2)) # intersect
print(ser_u[~ser_u.isin(ser_i)])
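A one-liner alternative, assuming sorted output is acceptable, is NumPy's symmetric difference:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# setxor1d returns the sorted values present in exactly one of the inputs
print(pd.Series(np.setxor1d(ser1, ser2)))
```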

8. How to get the minimum, 25th percentile, median, 75th, and max of a numeric series?

Difficulty Level: L2

Compute the minimum, 25th percentile, median, 75th, and maximum of ser.

Input:

# Task: Get the minimum, 25th percentile, median, 75th, and max of a numeric series
import numpy as np
import pandas as pd
ser = pd.Series(np.random.normal(10, 5, 25))

# Write your code below

Desired Output:

array([ -1.39,   6.49,  10.26,  13.07,  25.81])
# (exact values vary — output shows the 5 percentiles)
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(np.random.normal(10, 5, 25))

# Solution
print(np.percentile(ser, q=[0, 25, 50, 75, 100]))
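A pandas-native sketch of the same computation uses `Series.quantile`, which accepts a list of quantiles (0 and 1 correspond to the min and max):

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.random.normal(10, 5, 25))

# Same five summary points, returned as a Series indexed by quantile
print(ser.quantile(q=[0, .25, .5, .75, 1]))
```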

9. How to get frequency counts of unique items of a series?

Difficulty Level: L1

Calculate the frequency counts of each unique value in ser.

Input:

# Task: Get frequency counts of unique items of a series
import numpy as np
import pandas as pd
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))

# Write your code below

Desired Output:

# Output shows value_counts, e.g.:
b    6
a    5
c    4
...
dtype: int64
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))

# Solution
print(ser.value_counts())

10. How to keep only the top 2 most frequent values as they are and replace everything else with ‘Other’?

Difficulty Level: L2

From ser, keep the top 2 most frequent items as they are and replace everything else with ‘Other’.

Input:

# Task: Keep only top 2 most frequent values as it is and replace everything else as 'Other'
import numpy as np
import pandas as pd
np.random.seed(100)
ser = pd.Series(np.random.randint(1, 5, [12]))

# Write your code below

Desired Output:

Top 2 Freq: 4    5
3    3
...

0     Other
1     Other
2         3
...
Show Solution
import numpy as np
import pandas as pd
# Input
np.random.seed(100)
ser = pd.Series(np.random.randint(1, 5, [12]))

# Solution
print("Top 2 Freq:", ser.value_counts())
top2 = ser.value_counts().index[:2]
ser = ser.where(ser.isin(top2), 'Other')  # avoids writing strings into an integer column in place
print(ser)

11. How to bin a numeric series to 10 groups of equal size?

Difficulty Level: L2

Bin the series ser into 10 equal deciles and replace the values with the bin name.

Input:

# Task: Bin a numeric series to 10 groups of equal size
import numpy as np
import pandas as pd
ser = pd.Series(np.random.random(20))

# Write your code below

Desired Output:

0    0.556912
1    0.892955
...
dtype: float64

0    7th
1    9th
...
dtype: category
Categories (10, object): [1st < 2nd < 3rd < ... < 9th < 10th]
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(np.random.random(20))
print(ser.head())

# Solution
print(pd.qcut(ser, q=[0, .10, .20, .30, .40, .50, .60, .70, .80, .90, 1],
              labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']).head())

12. How to convert a numpy array to a dataframe of given shape? (L1)

Difficulty Level: L1

Reshape the series ser into a dataframe with 7 rows and 5 columns

Input:

# Task: Convert a numpy array to a dataframe of given shape? (L1)
import numpy as np
import pandas as pd
ser = pd.Series(np.random.randint(1, 10, 35))

# Write your code below

Desired Output:

    0  1  2  3  4
0   1  2  1  2  5
1   1  2  4  5  2
2   1  3  3  2  8
3   8  6  4  9  6
4   2  1  1  8  5
5   3  2  8  5  6
6   1  5  5  4  6
# (values vary — output is a 7x5 DataFrame)
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(np.random.randint(1, 10, 35))

# Solution
df = pd.DataFrame(ser.values.reshape(7,5))
print(df)

13. How to find the positions of numbers that are multiples of 3 from a series?

Difficulty Level: L2

Find the positions of numbers that are multiples of 3 from ser.

Input:

# Task: Find the positions of numbers that are multiples of 3 from a series
import numpy as np
import pandas as pd
ser = pd.Series(np.random.randint(1, 10, 7))

# Write your code below

Desired Output:

0    6
1    8
...
dtype: int64

array([[0],
       [2],
       [4]])
# (positions where values are multiples of 3)
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(np.random.randint(1, 10, 7))

# Solution
print(ser)
print(np.argwhere(ser % 3 == 0))

14. How to extract items at given positions from a series

Difficulty Level: L1

From ser, extract the items at positions in list pos.

Input:

# Task: Extract items at given positions from a series
import numpy as np
import pandas as pd
ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

# Write your code below

Desired Output:

0     a
4     e
8     i
14    o
20    u
dtype: object
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

# Solution
print(ser.take(pos))
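Equivalently, `iloc` accepts the same list of positions; a quick sketch:

```python
import pandas as pd

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

# iloc with a list of integers behaves like take
print(ser.iloc[pos])
```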

15. How to stack two series vertically and horizontally?

Difficulty Level: L1

Stack ser1 and ser2 vertically and horizontally (to form a dataframe).

Input:

# Task: Stack two series vertically and horizontally 
import numpy as np
import pandas as pd
ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))

# Write your code below

Desired Output:

   0  1
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e
Show Solution
import numpy as np
import pandas as pd
# Input
ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))

# Solution
# Vertical
print(pd.concat([ser1, ser2]))

# Horizontal
df = pd.concat([ser1, ser2], axis=1)
print(df)

16. How to get the positions of items of series A in another series B?

Difficulty Level: L2

Get the positions of items of ser2 in ser1 as a list.

Input:

# Task: Get the positions of items of series A in another series B
import numpy as np
import pandas as pd
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# Write your code below

Desired Output:

[5, 4, 0, 8]
Show Solution
import numpy as np
import pandas as pd
# Input
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# Solution 1
[np.where(i == ser1)[0].tolist()[0] for i in ser2]

# Solution 2
print([pd.Index(ser1).get_loc(i) for i in ser2])
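A vectorized alternative worth knowing (a sketch): `Index.get_indexer` looks up all positions in one call, returning -1 for items not found.

```python
import pandas as pd

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# One lookup for the whole of ser2, instead of a Python-level loop
print(pd.Index(ser1).get_indexer(ser2).tolist())
```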

17. How to compute the mean squared error on a truth and predicted series?

Difficulty Level: L2

Compute the mean squared error of truth and pred series.

Input:

# Task: Compute the mean squared error on a truth and predicted series
import numpy as np
import pandas as pd
truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)

# Solution
print(np.mean((truth-pred)**2))

18. How to convert the first character of each element in a series to uppercase?

Difficulty Level: L2

Change the first character of each word to upper case in each word of ser.

Input:

# Task: Convert the first character of each element in a series to uppercase
import numpy as np
import pandas as pd
ser = pd.Series(['how', 'to', 'kick', 'ass?'])

# Write your code below

Desired Output:

0     How
1      To
2    Kick
3    Ass?
dtype: object
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(['how', 'to', 'kick', 'ass?'])

# Solution 1
ser.map(lambda x: x.title())

# Solution 2
ser.map(lambda x: x[0].upper() + x[1:])

# Solution 3
print(pd.Series([i.title() for i in ser]))
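The same result also falls out of the vectorized `.str` accessor, which avoids the explicit Python-level loop:

```python
import pandas as pd

ser = pd.Series(['how', 'to', 'kick', 'ass?'])

# str.title upper-cases the first letter of each word
print(ser.str.title())
```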

19. How to calculate the number of characters in each word in a series?

Difficulty Level: L2

Input:

# Task: Calculate the number of characters in each word in a series
import numpy as np
import pandas as pd
ser = pd.Series(['how', 'to', 'kick', 'ass?'])

# Write your code below

Desired Output:

0    3
1    2
2    4
3    4
dtype: int64
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(['how', 'to', 'kick', 'ass?'])

# Solution
print(ser.map(lambda x: len(x)))
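Here too the `.str` accessor offers a loop-free spelling:

```python
import pandas as pd

ser = pd.Series(['how', 'to', 'kick', 'ass?'])

# str.len returns the character count of each element
print(ser.str.len())
```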

20. How to compute difference of differences between consecutive numbers of a series?

Difficulty Level: L1

Compute the difference of differences between the consecutive numbers of ser.

Input:

# Task: Compute difference of differences between consecutive numbers of a series
import numpy as np
import pandas as pd
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# Write your code below

Desired Output:

[nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]
[nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# Solution
print(ser.diff().tolist())
print(ser.diff().diff().tolist())

21. How to convert a series of date-strings to a timeseries?

Difficulty Level: L2

Input:

# Task: Convert a series of date-strings to a timeseries
import numpy as np
import pandas as pd
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

# Write your code below

Desired Output:

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

# Solution 1
from dateutil.parser import parse
ser.map(lambda x: parse(x))

# Solution 2
# format='mixed' lets pandas >= 2.0 parse each element independently
print(pd.to_datetime(ser, format='mixed'))

22. How to get the day of month, week number, day of year and day of week from a series of date strings?

Difficulty Level: L2

Get the day of month, week number, day of year and day of week from ser.

Input:

# Task: Get the day of month, week number, day of year and day of week from a series of date strings
import numpy as np
import pandas as pd
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

# Solution
from dateutil.parser import parse
ser_ts = ser.map(lambda x: parse(x))

# day of month
print("Day of month: ", ser_ts.dt.day.tolist())

# week number
print("Week number: ", ser_ts.dt.isocalendar().week.tolist())

# day of year
print("Day num of year: ", ser_ts.dt.dayofyear.tolist())

# day of week
print("Day of week: ", ser_ts.dt.day_name().tolist())

23. How to convert year-month string to dates corresponding to the 4th day of the month?

Difficulty Level: L2

Change ser to dates that start with the 4th of the respective months.

Input:

# Task: Convert year-month string to dates corresponding to the 4th day of the month
import numpy as np
import pandas as pd
ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])

# Write your code below

Desired Output:

0   2010-01-04
1   2011-02-04
2   2012-03-04
dtype: datetime64[ns]
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])

# Solution 1
from dateutil.parser import parse
# Parse the date
ser_ts = ser.map(lambda x: parse(x))

# Construct date string with date as 4
ser_datestr = ser_ts.dt.year.astype('str') + '-' + ser_ts.dt.month.astype('str') + '-' + '04'

# Format it.
[parse(i).strftime('%Y-%m-%d') for i in ser_datestr]

# Solution 2
print(ser.map(lambda x: parse('04 ' + x)))

24. How to filter words that contain at least 2 vowels from a series?

Difficulty Level: L3

From ser, extract words that contain at least 2 vowels.

Input:

# Task: Filter words that contain at least 2 vowels from a series
import numpy as np
import pandas as pd
ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# Write your code below

Desired Output:

0     Apple
1    Orange
4     Money
dtype: object
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# Solution
from collections import Counter
mask = ser.map(lambda x: sum([Counter(x.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
print(ser[mask])

25. How to filter valid emails from a series?

Difficulty Level: L3

Extract the valid emails from the series emails. The regex pattern for valid emails is provided as reference.

Input:

# Task: Filter valid emails from a series
import numpy as np
import pandas as pd
emails = pd.Series(['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])
pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'

# Write your code below

Desired Output:

['rameses@egypt.com', 'matt@t.co', 'narendra@modi.com']
Show Solution
import numpy as np
import pandas as pd
# Input
emails = pd.Series(['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])

# Solution 1 (as series of strings)
import re
pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'
mask = emails.map(lambda x: bool(re.match(pattern, x)))
emails[mask]

# Solution 2 (as series of list)
emails.str.findall(pattern, flags=re.IGNORECASE)

# Solution 3 (as list)
print([x[0] for x in [re.findall(pattern, email) for email in emails] if len(x) > 0])

26. How to get the mean of a series grouped by another series?

Difficulty Level: L2

Compute the mean of weights of each fruit.

Input:

# Task: Get the mean of a series grouped by another series
import numpy as np
import pandas as pd
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
print(weights.tolist())
print(fruit.tolist())
#> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
#> ['banana', 'carrot', 'apple', 'carrot', 'carrot', 'apple', 'banana', 'carrot', 'apple', 'carrot']

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))

# Solution
print(weights.groupby(fruit).mean())

27. How to compute the Euclidean distance between two series?

Difficulty Level: L2

Compute the Euclidean distance between series (points) p and q, without using a packaged formula.

Input:

# Task: Compute the euclidean distance between two series
import numpy as np
import pandas as pd
p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

# Write your code below

Desired Output:

18.165902124584949
Show Solution
import numpy as np
import pandas as pd
# Input
p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

# Solution 
sum((p - q)**2)**.5

# Solution (using func)
print(np.linalg.norm(p-q))

28. How to find all the local maxima (or peaks) in a numeric series?

Difficulty Level: L3

Get the positions of peaks (values surrounded by smaller values on both sides) in ser.

Input:

# Task: Find all the local maxima (or peaks) in a numeric series
import numpy as np
import pandas as pd
ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])

# Write your code below

Desired Output:

array([1, 5, 7])
Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])

# Solution
dd = np.diff(np.sign(np.diff(ser)))
peak_locs = np.where(dd == -2)[0] + 1
print(peak_locs)
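An equivalent formulation that some readers find easier to follow (a sketch, with the same definition of a peak: strictly greater than both neighbours):

```python
import numpy as np
import pandas as pd

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])
v = ser.to_numpy()

# Compare every interior value with its left and right neighbours
mask = (v[1:-1] > v[:-2]) & (v[1:-1] > v[2:])
print(np.where(mask)[0] + 1)  # shift by 1 to index into the original series
```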

29. How to replace missing spaces in a string with the least frequent character?

Difficulty Level: L2

Replace the spaces in my_str with the least frequent character.

Input:

# Task: Replace missing spaces in a string with the least frequent character
import numpy as np
import pandas as pd
my_str = 'dbc deb abed gade'

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
my_str = 'dbc deb abed gade'

# Solution
ser = pd.Series(list(my_str))
freq = ser[ser != ' '].value_counts()  # exclude the space itself from the counts
print(freq)
least_freq = freq.index[-1]
print("".join(ser.replace(' ', least_freq)))

30. How to create a TimeSeries starting ‘2000-01-01’ and 10 weekends (saturdays) after that having random numbers as values?

Difficulty Level: L2

Input:

# Task: Create a TimeSeries starting '2000-01-01' and 10 weekends (saturdays) after that having random numbers as values
import numpy as np
import pandas as pd
# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Solution
ser = pd.Series(np.random.randint(1,10,10), pd.date_range('2000-01-01', periods=10, freq='W-SAT'))
print(ser)

31. How to fill an intermittent time series so all missing dates show up with values of previous non-missing date?

Difficulty Level: L2

ser has missing dates and values. Make all missing dates appear and fill them with the value from the previous date.

Input:

# Task: Fill an intermittent time series so all missing dates show up with values of previous non-missing date
import numpy as np
import pandas as pd
ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))
print(ser)
#> 2000-01-01 1.0
#> 2000-01-03 10.0
#> 2000-01-06 3.0
#> 2000-01-08 NaN
#> dtype: float64

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series([1,10,3, np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))

# Solution
print(ser.resample('D').ffill())
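For comparison, if missing dates should instead take the value of the next non-missing date, the same resample chain supports bfill (a sketch):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 10, 3, np.nan],
                index=pd.to_datetime(['2000-01-01', '2000-01-03',
                                      '2000-01-06', '2000-01-08']))

# Backward fill: each inserted date takes the next observed value
print(ser.resample('D').bfill())
```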

32. How to compute the autocorrelations of a numeric series?

Difficulty Level: L3

Compute autocorrelations for the first 10 lags of ser. Find out which lag has the largest correlation.

Input:

# Task: Compute the autocorrelations of a numeric series
import numpy as np
import pandas as pd
ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))

# Solution
autocorrelations = [ser.autocorr(i).round(2) for i in range(11)]
print(autocorrelations[1:])
print('Lag having highest correlation: ', np.argmax(np.abs(autocorrelations[1:]))+1)

33. How to import only every nth row from a csv file to create a dataframe?

Difficulty Level: L2

Import every 50th row of the BostonHousing dataset as a dataframe.

Show Solution
import numpy as np
import pandas as pd
# Solution: Use chunks and list comprehension
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', chunksize=50)
df2 = pd.concat([chunk.iloc[0] for chunk in df], axis=1).T
print(df2.head())

34. How to change column values when importing csv to a dataframe?

Difficulty Level: L2

Import the Boston housing dataset, but while importing change the ‘medv’ (median house value) column so that values greater than 25 become ‘High’ and the rest become ‘Low’.

Show Solution
import numpy as np
import pandas as pd
# Solution 1: Using converter parameter
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv',
                 converters={'medv': lambda x: 'High' if float(x) > 25 else 'Low'})

# Solution 2: Using csv reader (assumes BostonHousing.csv has been downloaded locally)
import csv
with open('BostonHousing.csv', 'r') as f:
    reader = csv.reader(f)
    out = []
    for i, row in enumerate(reader):
        if i > 0:
            row[13] = 'High' if float(row[13]) > 25 else 'Low'
        out.append(row)

df = pd.DataFrame(out[1:], columns=out[0])
print(df.head())

35. How to create a dataframe with rows as strides from a given series?

Difficulty Level: L3

# Task: Create a dataframe with rows as strides from a given series
import numpy as np
import pandas as pd
L = pd.Series(range(15))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
L = pd.Series(range(15))

def gen_strides(a, stride_len=5, window_len=5):
    n_strides = ((a.size - window_len) // stride_len) + 1
    return np.array([a[s:(s + window_len)] for s in np.arange(0, a.size, stride_len)[:n_strides]])

print(gen_strides(L, stride_len=2, window_len=4))
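On NumPy 1.20+, the same strides can be produced without a hand-written function via sliding_window_view: take every window, then keep every stride_len-th one (a sketch):

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

L = pd.Series(range(15))

# All windows of length 4, then every 2nd window (a stride of 2)
strides = sliding_window_view(L.to_numpy(), 4)[::2]
print(strides)
```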

36. How to import only specified columns from a csv file?

Difficulty Level: L1

Import ‘crim’ and ‘medv’ columns of the BostonHousing dataset as a dataframe.

Show Solution
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', usecols=['crim', 'medv'])
print(df.head())

37. How to get the nrows, ncolumns, datatype, summary stats of each column of a dataframe? Also get the array and list equivalent.

Difficulty Level: L2

Get the number of rows, columns, datatype and summary statistics of each column of the Cars93 dataset. Also get the numpy array and list equivalent of the dataframe.

Show Solution
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# number of rows and columns
print(df.shape)

# datatypes
print(df.dtypes)

# how many columns under each dtype
print(df.dtypes.value_counts())

# summary statistics
df_stats = df.describe()

# numpy array 
df_arr = df.values

# list
df_list = df.values.tolist()

38. How to extract the row and column number of a particular cell with given criterion?

Difficulty Level: L1

Which manufacturer, model and type has the highest Price? What is the row and column number of the cell with the highest Price value?

# Task: Extract the row and column number of a particular cell with given criterion
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
# Get Manufacturer with highest price
df.loc[df.Price == np.max(df.Price), ['Manufacturer', 'Model', 'Type']]

# Get Row and Column number
row, col = np.where(df.values == np.max(df.Price))

# Get the value
df.iat[row[0], col[0]]
df.iloc[row[0], col[0]]

# Alternates
print(df.at[row[0], 'Price'])

# The difference between `iat`/`iloc` and `at`/`loc` is:
# `iat` and `iloc` accept row and column numbers,
# whereas `at` and `loc` accept index labels and column names.
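A tiny illustration of that difference, using a made-up frame with a labelled index:

```python
import pandas as pd

d = pd.DataFrame({'x': [10, 20]}, index=['r1', 'r2'])

print(d.iat[0, 0])      # positional: row 0, column 0
print(d.at['r1', 'x'])  # label-based: index 'r1', column 'x'
```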

39. How to rename specific columns in a dataframe?

Difficulty Level: L2

Rename the column Type as CarType in df and replace the ‘.’ in column names with ‘_’.

Input:

# Task: Rename a specific columns in a dataframe
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
print(df.columns)
#> Index(['Manufacturer', 'Model', 'Type', 'Min.Price', 'Price', 'Max.Price',
#> 'MPG.city', 'MPG.highway', 'AirBags', 'DriveTrain', 'Cylinders',
#> 'EngineSize', 'Horsepower', 'RPM', 'Rev.per.mile', 'Man.trans.avail',
#> 'Fuel.tank.capacity', 'Passengers', 'Length', 'Wheelbase', 'Width',
#> 'Turn.circle', 'Rear.seat.room', 'Luggage.room', 'Weight', 'Origin',
#> 'Make'],
#> dtype='object')

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
# Step 1:
df = df.rename(columns={'Type': 'CarType'})
# (mutating df.columns.values in place is sometimes shown as an alternative,
# but Index objects are meant to be immutable, so rename is the safer route)

# Step 2:
df.columns = df.columns.map(lambda x: x.replace('.', '_'))
print(df.columns)

40. How to check if a dataframe has any missing values?

Difficulty Level: L1

Check if df has any missing values.

Input:

# Task: Check if a dataframe has any missing values
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Write your code below

Desired Output:

True
Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
print(df.isnull().values.any())

41. How to count the number of missing values in each column?

Difficulty Level: L2

Count the number of missing values in each column of df. Which column has the maximum number of missing values?

Input:

# Task: Count the number of missing values in each column
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Write your code below

Desired Output:

'Luggage.room'
Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
n_missings_each_col = df.apply(lambda x: x.isnull().sum())
print(n_missings_each_col.idxmax())

42. How to replace missing values of multiple numeric columns with the mean?

Difficulty Level: L2

Replace missing values in Min.Price and Max.Price columns with their respective mean.

Input:

# Task: Replace missing values of multiple numeric columns with the mean
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
df[['Min.Price', 'Max.Price']] = df[['Min.Price', 'Max.Price']].apply(lambda x: x.fillna(x.mean()))
print(df[['Min.Price', 'Max.Price']].head())

43. How to use apply function on existing columns with global variables as additional arguments?

Difficulty Level: L3

In df, use the apply method to replace the missing values in Min.Price with the column’s mean and those in Max.Price with the column’s median.

Input:

# Task: Use apply function on existing columns with global variables as additional arguments
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
d = {'Min.Price': np.nanmean, 'Max.Price': np.nanmedian}
df[['Min.Price', 'Max.Price']] = df[['Min.Price', 'Max.Price']].apply(lambda x, d: x.fillna(d[x.name](x)), args=(d, ))
print(df[['Min.Price', 'Max.Price']].head())

44. How to select a specific column from a dataframe as a dataframe instead of a series?

Difficulty Level: L2

Get the first column (a) in df as a dataframe (rather than as a Series).

Input:

# Task: Select a specific column from a dataframe as a dataframe instead of a series
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

# Write your code below

Desired Output:

pandas.core.frame.DataFrame
Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

# Solution
type(df[['a']])
type(df.loc[:, ['a']])
type(df.iloc[:, [0]])

# Alternately the following returns a Series
type(df.a)
type(df['a'])
type(df.loc[:, 'a'])
print(type(df.iloc[:, 0]))

45. How to change the order of columns of a dataframe?

Difficulty Level: L3

This is actually three questions: interchange columns ‘a’ and ‘c’, create a generic function to interchange two columns without hardcoding column names, and sort the columns in reverse alphabetical order, that is, column ‘e’ first through column ‘a’ last.

Input:

# Task: Change the order of columns of a dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))

# Solution Q1
df[list('cbade')]

# Solution Q2 - No hard coding
def switch_columns(df, col1=None, col2=None):
 colnames = df.columns.tolist()
 i1, i2 = colnames.index(col1), colnames.index(col2)
 colnames[i2], colnames[i1] = colnames[i1], colnames[i2]
 return df[colnames]

df1 = switch_columns(df, 'a', 'c')

# Solution Q3
df[sorted(df.columns, reverse=True)]
# or
df.sort_index(axis=1, ascending=False, inplace=True)
print(df)

46. How to set the number of rows and columns displayed in the output?

Difficulty Level: L2

Change the pandas display settings so that on printing the dataframe df, it shows a maximum of 10 rows and 10 columns.

Input:

# Task: Set the number of rows and columns displayed in the output
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)
print(df)
# df

# Show all available options
# pd.describe_option()

47. How to format or suppress scientific notations in a pandas dataframe?

Difficulty Level: L2

Suppress scientific notations like ‘e-03’ in df and print up to 4 digits after the decimal.

Input:

# Task: Format or suppress scientific notations in a pandas dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random(4)**10, columns=['random'])
df
#> random
#> 0 3.474280e-03
#> 1 3.951517e-05
#> 2 7.469702e-02
#> 3 5.541282e-28

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.random(4)**10, columns=['random'])

# Solution 1: Rounding
df.round(4)

# Solution 2: Use apply to change format
df['random'].apply(lambda x: '%.4f' % x)
# or
df.map(lambda x: '%.4f' % x) # use df.applymap on pandas < 2.1

# Solution 3: Use set_option
pd.set_option('display.float_format', lambda x: '%.4f' % x)

# Solution 4: Assign display.float_format
pd.options.display.float_format = '{:.4f}'.format
print(df)

# Reset/undo float formatting
pd.options.display.float_format = None
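If you only need the format for one printout, `pd.option_context` applies it temporarily and restores the global setting on exit, so the reset step above becomes unnecessary. A small sketch (not from the original exercise) using two of the sample values:

```python
import pandas as pd

df = pd.DataFrame({'random': [3.474280e-03, 3.951517e-05]})

# The float format applies only inside the with-block;
# global display options are untouched afterwards
with pd.option_context('display.float_format', '{:.4f}'.format):
    inside = repr(df)
print(inside)
```

Outside the block, `repr(df)` falls back to the default float formatting.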

48. How to format all the values in a dataframe as percentages?

Difficulty Level: L2

Format the values in column ‘random’ of df as percentages.

Input:

# Task: Format all the values in a dataframe as percentages
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random(4), columns=['random'])
df
#> random
#> 0 .689723
#> 1 .957224
#> 2 .159157
#> 3 .21082

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.random(4), columns=['random'])

# Solution
out = df.style.format({
 'random': '{0:.2%}'.format,
})

out  # renders as a formatted table in a notebook; print() would only show the Styler object
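Since a Styler only renders nicely in a notebook, a plain-text alternative (not in the original exercise) is to map each float through the percent format string directly:

```python
import pandas as pd

# The sample values from the exercise's printed input
df = pd.DataFrame({'random': [0.689723, 0.957224, 0.159157, 0.21082]})

# Format each float as a percentage string; works in any console
df['random_pct'] = df['random'].apply('{:.2%}'.format)
print(df)
```

Note the result is a column of strings, so it is for display only, not further arithmetic.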

49. How to filter every nth row in a dataframe?

Difficulty Level: L1

From df, filter the ‘Manufacturer’, ‘Model’ and ‘Type’ for every 20th row starting from 1st (row 0).

Input:

# Task: Filter every nth row in a dataframe
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Write your code below

Desired Output:

  Manufacturer    Model     Type
0         Acura  Integra    Small
20     Chrysler  LeBaron  Compact
40        Honda  Prelude   Sporty
60      Mercury   Cougar  Midsize
80       Subaru   Loyale    Small
Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
print(df.iloc[::20, :][['Manufacturer', 'Model', 'Type']])

50. How to create a primary key index by combining relevant columns?

Difficulty Level: L2

In df, replace NaNs with ‘missing’ in columns ‘Manufacturer’, ‘Model’ and ‘Type’, create an index as a combination of these three columns, and check if the index is a primary key (that is, unique).

Input:

# Task: Create a primary key index by combining relevant columns
import numpy as np
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', usecols=[0,1,2,3,5])

# Write your code below

Desired Output:

True
Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', usecols=[0,1,2,3,5])

# Solution
df[['Manufacturer', 'Model', 'Type']] = df[['Manufacturer', 'Model', 'Type']].fillna('missing')
df.index = df.Manufacturer + '_' + df.Model + '_' + df.Type
print(df.index.is_unique)

51. How to get the row number of the nth largest value in a column?

Difficulty Level: L2

Find the row position of the 5th largest value of column ‘a’ in df.

Input:

# Task: Get the row number of the nth largest value in a column
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1, 30, 30).reshape(10,-1), columns=list('abc'))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(1, 30, 30).reshape(10,-1), columns=list('abc'))

# Solution
n = 5
# argsort gives row positions in ascending order of value; reverse it and take the nth entry
print(df['a'].argsort()[::-1].iloc[n - 1])
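Because df has a default RangeIndex, the same row number can be read straight off `nlargest`, whose index labels coincide with row positions here. A sketch with fixed values (not the exercise's random data) so the answer is checkable; with a non-default index you would map the label back via `Index.get_loc`:

```python
import pandas as pd

# Fixed sample column (assumption, for a reproducible answer)
df = pd.DataFrame({'a': [10, 50, 20, 40, 30, 60, 5, 25, 35, 45]})

n = 5
# nlargest(n) is sorted descending, so its last label belongs to the nth largest
label = df['a'].nlargest(n).index[-1]
row_pos = df.index.get_loc(label)  # identical to the label for a RangeIndex
print(row_pos)
```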

52. How to find the position of the nth largest value greater than a given value?

Difficulty Level: L2

In ser, find the position of the 2nd largest value greater than the mean.

Input:

# Task: Find the position of the nth largest value greater than a given value
import numpy as np
import pandas as pd
ser = pd.Series(np.random.randint(1, 100, 15))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(np.random.randint(1, 100, 15))

# Solution
print('ser: ', ser.tolist(), 'mean: ', round(ser.mean()))
# positions of values greater than the mean, ordered largest first; take the 2nd
print(ser[ser > ser.mean()].sort_values(ascending=False).index[1])

53. How to get the last n rows of a dataframe with row sum > 100?

Difficulty Level: L2

Get the last two rows of df whose row sum is greater than 100.

Input:

# Task: Get the last n rows of a dataframe with row sum > 100
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))

# Solution
# print row sums
rowsums = df.apply(np.sum, axis=1)

# last two rows with row sum greater than 100
last_two_rows = df.iloc[np.where(rowsums > 100)[0][-2:], :]
print(last_two_rows)
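The same result can be had without `apply` or `np.where`: sum the rows directly, filter with a boolean mask, and take the tail. A sketch (not the original solution) with a seeded generator so it is reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded for reproducibility (assumption)
df = pd.DataFrame(rng.integers(10, 40, 60).reshape(-1, 4))

# Vectorised row sums, boolean filter, then the last two qualifying rows
last_two = df[df.sum(axis=1) > 100].tail(2)
print(last_two)
```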

54. How to find and cap outliers from a series or dataframe column?

Difficulty Level: L2

Replace all values of ser below the 5th percentile and above the 95th percentile with the respective 5th and 95th percentile values.

Input:

# Task: Find and cap outliers from a series or dataframe column
import numpy as np
import pandas as pd
ser = pd.Series(np.logspace(-2, 2, 30))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
ser = pd.Series(np.logspace(-2, 2, 30))

# Solution
def cap_outliers(ser, low_perc, high_perc):
 low, high = ser.quantile([low_perc, high_perc])
 print(low_perc, '%ile: ', low, '|', high_perc, '%ile: ', high)
 ser[ser < low] = low
 ser[ser > high] = high
 return ser

capped_ser = cap_outliers(ser, .05, .95)
print(capped_ser.head())

55. How to reshape a dataframe to the largest possible square after removing the negative values?

Difficulty Level: L3

Reshape df to the largest possible square with negative values removed. Drop the smallest values if need be. The order of the positive numbers in the result should remain the same as the original.

Input:

# Task: Reshape a dataframe to the largest possible square after removing the negative values
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(-20, 50, 100).reshape(10,-1))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(-20, 50, 100).reshape(10,-1))
print(df)

# Solution
# Step 1: remove negative values from arr
arr = df[df > 0].values.flatten()
arr_qualified = arr[~np.isnan(arr)]

# Step 2: find side-length of largest possible square
n = int(np.floor(arr_qualified.shape[0]**.5))

# Step 3: Take top n^2 items without changing positions
top_indexes = np.argsort(arr_qualified)[::-1]
output = np.take(arr_qualified, sorted(top_indexes[:n**2])).reshape(n, -1)
print(output)

56. How to swap two rows of a dataframe?

Difficulty Level: L2

Swap rows 1 and 2 in df.

Input:

# Task: Swap two rows of a dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(25).reshape(5, -1))

# Write your code below

Desired Output:

    0   1   2   3   4
0   0   1   2   3   4
1  10  11  12  13  14
2   5   6   7   8   9
3  15  16  17  18  19
4  20  21  22  23  24
Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.arange(25).reshape(5, -1))

# Solution
def swap_rows(df, i1, i2):
 a, b = df.iloc[i1, :].copy(), df.iloc[i2, :].copy()
 df.iloc[i1, :], df.iloc[i2, :] = b, a
 return df

print(swap_rows(df, 1, 2))
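For a one-off swap, an alternative (not the original solution) is to pass a reordered position list to `iloc`, which avoids the row copies:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(5, -1))

# Build the row order 0..4, swap positions 1 and 2, then reindex positionally
order = list(range(len(df)))
order[1], order[2] = order[2], order[1]
swapped = df.iloc[order].reset_index(drop=True)
print(swapped)
```

`reset_index(drop=True)` keeps the output index 0..4 as in the desired output; drop it if you want the swapped labels to travel with the rows.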

57. How to reverse the rows of a dataframe?

Difficulty Level: L2

Reverse all the rows of dataframe df.

Input:

# Task: Reverse the rows of a dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(25).reshape(5, -1))

# Write your code below

Desired Output:

    0   1   2   3   4
4  20  21  22  23  24
3  15  16  17  18  19
2  10  11  12  13  14
1   5   6   7   8   9
0   0   1   2   3   4
Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.arange(25).reshape(5, -1))

# Solution 1
df.iloc[::-1, :]

# Solution 2
print(df.loc[df.index[::-1], :])

58. How to create one-hot encodings of a categorical variable (dummy variables)?

Difficulty Level: L2

Get one-hot encodings for column ‘a’ in the dataframe df and append them as columns.

Input:

# Task: Create one-hot encodings of a categorical variable (dummy variables)
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(25).reshape(5,-1), columns=list('abcde'))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.arange(25).reshape(5,-1), columns=list('abcde'))

# Solution
df_onehot = pd.concat([pd.get_dummies(df['a']), df[list('bcde')]], axis=1)
print(df_onehot)

59. Which column contains the highest number of row-wise maximum values?

Difficulty Level: L2

Obtain the column name with the highest number of row-wise maxima in df.

Input:

# Task: Which column contains the highest number of row-wise maximum values
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1))

# Solution
print('Column with highest row maxes: ', df.idxmax(axis=1).value_counts().index[0])

60. How to create a new column that contains the row number of the nearest row by euclidean distance?

Difficulty Level: L3

Create a new column such that each row contains the row label of the nearest other row by euclidean distance.

Input:

# Task: Create a new column that contains the row number of the nearest row by euclidean distance
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1), columns=list('pqrs'), index=list('abcdefghij'))
df
# p q r s
# a 57 77 13 62
# b 68 5 92 24
# c 74 40 18 37
# d 80 17 39 60
# e 93 48 85 33
# f 69 55 8 11
# g 39 23 88 53
# h 63 28 25 61
# i 18 4 73 7
# j 79 12 45 34

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1), columns=list('pqrs'), index=list('abcdefghij'))

# Solution
# init outputs
nearest_rows = []
nearest_distance = []

# iterate rows.
for i, row in df.iterrows():
 curr = row
 rest = df.drop(i)
 e_dists = {} # init dict to store euclidean dists for current row.
 # iterate rest of rows for current row
 for j, contestant in rest.iterrows():
  # compute euclidean dist and update e_dists
  e_dists.update({j: round(np.linalg.norm(curr.values - contestant.values))})
 # update nearest row to current row and the distance value
 nearest_rows.append(min(e_dists, key=e_dists.get))
 nearest_distance.append(min(e_dists.values()))

df['nearest_row'] = nearest_rows
df['dist'] = nearest_distance
print(df)
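The double `iterrows` loop can be replaced by a single broadcasting computation of all pairwise distances (a sketch, not the original solution; `scipy.spatial.distance.cdist` would do the same job). Seeded data keeps it reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded for reproducibility (assumption)
df = pd.DataFrame(rng.integers(1, 100, 40).reshape(10, -1),
                  columns=list('pqrs'), index=list('abcdefghij'))

vals = df.to_numpy(dtype=float)
# (10, 1, 4) minus (1, 10, 4) broadcasts to a (10, 10, 4) array of differences
dists = np.sqrt(((vals[:, None, :] - vals[None, :, :]) ** 2).sum(axis=-1))
np.fill_diagonal(dists, np.inf)  # a row is not its own nearest neighbour

nearest_pos = dists.argmin(axis=1)
df['nearest_row'] = df.index[nearest_pos]
df['dist'] = np.round(dists.min(axis=1))
print(df[['nearest_row', 'dist']])
```

For 10 rows the loop is fine; the broadcasting version pays off as the row count grows, at the cost of an n×n distance matrix in memory.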

61. How to know the maximum possible correlation value of each column against other columns?

Difficulty Level: L2

Compute the maximum absolute correlation value of each column against the other columns in df.

Input:

# Task: Know the maximum possible correlation value of each column against other columns
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1), columns=list('pqrstuvwxy'), index=list('abcdefgh'))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1), columns=list('pqrstuvwxy'), index=list('abcdefgh'))
df

# Solution
abs_corrmat = np.abs(df.corr())
max_corr = abs_corrmat.apply(lambda x: sorted(x)[-2])
print('Maximum Correlation possible for each column: ', np.round(max_corr.tolist(), 2))

62. How to create a column containing the minimum by maximum of each row?

Difficulty Level: L2

Compute the minimum-by-maximum for every row of df.

Input:

# Task: Create a column containing the minimum by maximum of each row
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution 1
min_by_max = df.apply(lambda x: np.min(x)/np.max(x), axis=1)

# Solution 2
min_by_max = np.min(df, axis=1)/np.max(df, axis=1)
print(min_by_max)

63. How to create a column that contains the penultimate value in each row?

Difficulty Level: L2

Create a new column ‘penultimate’ which has the second largest value of each row of df.

Input:

# Task: Create a column that contains the penultimate value in each row
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)

64. How to normalize all columns in a dataframe?

Difficulty Level: L2

Normalize all columns of df by subtracting the column mean and dividing by the standard deviation. Then range all columns of df such that the minimum value in each column is 0 and the maximum is 1. Don’t use external packages like sklearn.

Input:

# Task: Normalize all columns in a dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution Q1
out1 = df.apply(lambda x: ((x - x.mean())/x.std()).round(2))
print('Solution Q1\n',out1)

# Solution Q2
out2 = df.apply(lambda x: ((x - x.min())/(x.max() - x.min())).round(2))
print('Solution Q2\n', out2)

65. How to compute the correlation of each row with the succeeding row?

Difficulty Level: L2

Compute the correlation of each row of df with its succeeding row.

Input:

# Task: Compute the correlation of each row with the succeeding row
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution
print([df.iloc[i].corr(df.iloc[i+1]).round(2) for i in range(df.shape[0]-1)])
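pandas can also do this without a Python loop: `corrwith` with `axis=1` correlates each row of df with the matching row of a shifted copy, since row i of `df.shift(-1)` holds row i+1 of df (the last row has no successor and comes out NaN). A sketch with a seeded generator:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seeded for reproducibility (assumption)
df = pd.DataFrame(rng.integers(1, 100, 80).reshape(8, -1))

# corr(row i, row i+1) for every row; the final entry is NaN by construction
row_corrs = df.corrwith(df.shift(-1), axis=1).round(2)
print(row_corrs.tolist())
```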

66. How to replace both the diagonals of dataframe with 0?

Difficulty Level: L2

Replace the values in both diagonals of df with 0.

Input:

# Task: Replace both the diagonals of dataframe with 0
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,100, 100).reshape(10, -1))
df
# 0 1 2 3 4 5 6 7 8 9
# 0 11 46 26 44 11 62 18 70 68 26
# 1 87 71 52 50 81 43 83 39 3 59
# 2 47 76 93 77 73 2 2 16 14 26
# 3 64 18 74 22 16 37 60 8 66 39
# 4 10 18 39 98 25 8 32 6 3 29
# 5 29 91 27 86 23 84 28 31 97 10
# 6 37 71 70 65 4 72 82 89 12 97
# 7 65 22 97 75 17 10 43 78 12 77
# 8 47 57 96 55 17 83 61 85 26 86
# 9 76 80 28 45 77 12 67 80 7 63

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(1,100, 100).reshape(10, -1))

# Solution
for i in range(df.shape[0]):
 df.iat[i, i] = 0
 df.iat[df.shape[0]-i-1, i] = 0
print(df)
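The loop can also be replaced with a boolean mask built from identity matrices (a sketch, not the original solution). `DataFrame.mask` replaces values wherever the condition is True, which sidesteps any write-through concerns with `.values`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # seeded for reproducibility (assumption)
df = pd.DataFrame(rng.integers(1, 100, 100).reshape(10, -1))

n = len(df)
# True on the main diagonal and on the anti-diagonal
diag_mask = np.eye(n, dtype=bool) | np.fliplr(np.eye(n, dtype=bool))
out = df.mask(diag_mask, 0)
print(out)
```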

67. How to get the particular group of a groupby dataframe by key?

Difficulty Level: L2

This is a question related to understanding of grouped dataframes. From df_grouped, get the group belonging to ‘apple’ as a dataframe.

Input:

# Task: Get the particular group of a groupby dataframe by key
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': ['apple', 'banana', 'orange'] * 3,
 'col2': np.random.rand(9),
 'col3': np.random.randint(0, 15, 9)})

df_grouped = df.groupby(['col1'])

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame({'col1': ['apple', 'banana', 'orange'] * 3,
 'col2': np.random.rand(9),
 'col3': np.random.randint(0, 15, 9)})

df_grouped = df.groupby(['col1'])

# Solution 1
df_grouped.get_group('apple')

# Solution 2
for i, dff in df_grouped:
 if i == 'apple':
  print(dff)

68. How to get the n’th largest value of a column when grouped by another column?

Difficulty Level: L2

In df, find the second largest value of ‘taste’ for ‘banana’.

Input:

# Task: Get the n'th largest value of a column when grouped by another column
import numpy as np
import pandas as pd
df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
 'taste': np.random.rand(9),
 'price': np.random.randint(0, 15, 9)})

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
 'taste': np.random.rand(9),
 'price': np.random.randint(0, 15, 9)})

print(df)

# Solution
df_grpd = df['taste'].groupby(df.fruit)
print(df_grpd.get_group('banana').sort_values().iloc[-2])
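`groupby` plus `nlargest` gets there in one chain (an alternative sketch, not the original solution). With a seeded generator the result is checkable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # seeded for reproducibility (assumption)
df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'taste': rng.random(9),
                   'price': rng.integers(0, 15, 9)})

# Two largest taste values per fruit (MultiIndex: fruit, original row);
# the last of the 'banana' pair is its 2nd largest
second_largest = df.groupby('fruit')['taste'].nlargest(2).loc['banana'].iloc[-1]
print(second_largest)
```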

69. How to compute grouped mean on pandas dataframe and keep the grouped column as another column (not index)?

Difficulty Level: L1

In df, compute the mean price of every fruit, while keeping the fruit as another column instead of an index.

Input:

# Task: Compute grouped mean on pandas dataframe and keep the grouped column as another column (not index)
import numpy as np
import pandas as pd
df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
 'rating': np.random.rand(9),
 'price': np.random.randint(0, 15, 9)})

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
 'rating': np.random.rand(9),
 'price': np.random.randint(0, 15, 9)})

# Solution
out = df.groupby('fruit', as_index=False)['price'].mean()
print(out)

70. How to join two dataframes by 2 columns so they have only the common rows?

Difficulty Level: L2

Join dataframes df1 and df2 by ‘fruit-pazham’ and ‘weight-kilo’.

Input:

# Task: Join two dataframes by 2 columns so they have only the common rows
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
 'weight': ['high', 'medium', 'low'] * 3,
 'price': np.random.randint(0, 15, 9)})

df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
 'kilo': ['high', 'low'] * 3,
 'price': np.random.randint(0, 15, 6)})

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
 'weight': ['high', 'medium', 'low'] * 3,
 'price': np.random.randint(0, 15, 9)})

df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
 'kilo': ['high', 'low'] * 3,
 'price': np.random.randint(0, 15, 6)})

# Solution
print(pd.merge(df1, df2, how='inner', left_on=['fruit', 'weight'], right_on=['pazham', 'kilo'], suffixes=['_left', '_right']))

71. How to remove rows from a dataframe that are present in another dataframe?

Difficulty Level: L3

From df1, remove the rows whose ‘fruit’ and ‘weight’ combination is present in df2.

Input:

# Task: Remove rows from a dataframe that are present in another dataframe
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
 'weight': ['high', 'medium', 'low'] * 3,
 'price': np.random.randint(0, 15, 9)})

df2 = pd.DataFrame({'fruit': ['apple', 'orange', 'pine'] * 2,
 'weight': ['high', 'low'] * 3,
 'price': np.random.randint(0, 15, 6)})

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
 'weight': ['high', 'medium', 'low'] * 3,
 'price': np.random.randint(0, 15, 9)})

df2 = pd.DataFrame({'fruit': ['apple', 'orange', 'pine'] * 2,
 'weight': ['high', 'low'] * 3,
 'price': np.random.randint(0, 15, 6)})

# Solution: rows in df1 whose fruit-weight combo is NOT in df2
common = pd.merge(df1, df2, on=['fruit', 'weight'], how='left', indicator=True)
print(df1[common['_merge'] == 'left_only'])
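When literally identical rows should be dropped (every shared column equal), a merge over all common columns with `indicator=True` works the same way — a sketch with small fixed frames (assumed values) so the outcome is unambiguous:

```python
import pandas as pd

df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'],
                    'weight': ['high', 'medium', 'low'],
                    'price': [5, 7, 9]})
df2 = pd.DataFrame({'fruit': ['apple', 'pine'],
                    'weight': ['high', 'low'],
                    'price': [5, 2]})

# With no `on` given, merge uses every common column, so only fully
# identical rows are marked 'both' and dropped
marked = df1.merge(df2, how='left', indicator=True)
result = df1[marked['_merge'] == 'left_only']
print(result)
```

A column-wise `df1.isin(df2)` check is tempting but can mix values from different rows, so the merge route is the robust one.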

72. How to get the positions where values of two columns match?

Difficulty Level: L2

Get the positions where the values of columns fruit1 and fruit2 of df match.

Input:

# Task: Get the positions where values of two columns match
import numpy as np
import pandas as pd
df = pd.DataFrame({'fruit1': np.random.choice(['apple', 'orange', 'banana'], 10),
 'fruit2': np.random.choice(['apple', 'orange', 'banana'], 10)})

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame({'fruit1': np.random.choice(['apple', 'orange', 'banana'], 10),
 'fruit2': np.random.choice(['apple', 'orange', 'banana'], 10)})

# Solution
print(np.where(df.fruit1 == df.fruit2))

73. How to create lags and leads of a column in a dataframe?

Difficulty Level: L2

Create two new columns in df: one a lag1 of column ‘a’ (column ‘a’ shifted down by 1 row) and the other a lead1 of column ‘b’ (column ‘b’ shifted up by 1 row).

Input:

# Task: Create lags and leads of a column in a dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns = list('abcd'))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns = list('abcd'))

# Solution
df['a_lag1'] = df['a'].shift(1)
df['b_lead1'] = df['b'].shift(-1)
print(df)

74. How to get the frequency of unique values in the entire dataframe?

Difficulty Level: L2

Get the frequency of unique values in the entire dataframe df.

Input:

# Task: Get the frequency of unique values in the entire dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1, 10, 20).reshape(-1, 4), columns = list('abcd'))

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(np.random.randint(1, 10, 20).reshape(-1, 4), columns = list('abcd'))

# Solution
print(pd.Series(df.values.ravel()).value_counts())

75. How to split a text column into two separate columns?

Difficulty Level: L2

Split the string column in df to form a dataframe with 3 columns as shown.

Input:

# Task: Split a text column into two separate columns
import numpy as np
import pandas as pd
df = pd.DataFrame(["STD, City\tState",
"33, Kolkata\tWest Bengal",
"44, Chennai\tTamil Nadu",
"40, Hyderabad\tTelengana",
"80, Bangalore\tKarnataka"], columns=['row'])

print(df)
#> row
#> 0 STD, City\tState
#> 1 33, Kolkata\tWest Bengal
#> 2 44, Chennai\tTamil Nadu
#> 3 40, Hyderabad\tTelengana
#> 4 80, Bangalore\tKarnataka

# Write your code below

Show Solution
import numpy as np
import pandas as pd
# Input
df = pd.DataFrame(["STD, City\tState",
"33, Kolkata\tWest Bengal",
"44, Chennai\tTamil Nadu",
"40, Hyderabad\tTelengana",
"80, Bangalore\tKarnataka"], columns=['row'])

# Solution: split on the comma (plus any spaces) or on the tab
df_out = df.row.str.split(r',\s*|\t', expand=True, regex=True)

# Make first row as header
new_header = df_out.iloc[0]
df_out = df_out[1:]
df_out.columns = new_header
print(df_out)