101 Polars Exercises for Data Analysis (with Solutions)
Master Polars with 101 hands-on exercises and solutions — covering DataFrames, groupby, joins, window functions, lazy eval, and more.
Practice Polars — the blazing-fast DataFrame library for Python — with these 101 exercises ranging from beginner to advanced.
This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.
Polars is a lightning-fast DataFrame library written in Rust with a Python API. It is designed for performance and ergonomics, offering lazy evaluation, expressive syntax, and first-class support for parallel execution. These 101 exercises will help you master Polars through hands-on practice.
The exercises are organized by increasing difficulty across topics like Series, DataFrames, filtering, groupby, joins, string operations, datetime handling, reshaping, and more.
Before you begin: Run the code block below to install Polars. This only needs to be done once per session.
import polars as pl
print("Polars", pl.__version__, "ready!")
Difficulty Levels:
- L1 — Beginner
- L2 — Intermediate
- L3 — Advanced
1. How to import polars and check the version?
Difficulty Level: L1
Import polars and print the version installed.
Solve:
# Task: Import polars and check the version
# Write your code below
Desired Output:
1.39.2
2. How to create a Series from a list, numpy array, and dict?
Difficulty Level: L1
Create a polars Series from each of the following: a list, a numpy array, and a dictionary (keys as name, values as data).
Solve:
import polars as pl
import numpy as np
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
# Write your code below
Desired Output:
shape: (10,)
Series: 'values' [i32]
[
0
1
2
3
4
5
6
7
8
9
]
3. How to convert a Series into a DataFrame with the index as a column?
Difficulty Level: L1
Polars doesn’t have an index. Given a dictionary, create a two-column DataFrame with keys in one column and values in another.
Solve:
import polars as pl
import numpy as np
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
# Write your code below
Desired Output:
shape: (5, 2)
┌─────┬───────┐
│ key ┆ value │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═══════╡
│ a ┆ 0 │
│ b ┆ 1 │
│ c ┆ 2 │
│ e ┆ 3 │
│ d ┆ 4 │
└─────┴───────┘
4. How to combine many Series to form a DataFrame?
Difficulty Level: L1
Combine ser1 and ser2 to form a DataFrame.
Solve:
import polars as pl
import numpy as np
ser1 = pl.Series("col1", list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pl.Series("col2", np.arange(26).tolist())
# Write your code below
Desired Output:
shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ str ┆ i32 │
╞══════╪══════╡
│ a ┆ 0 │
│ b ┆ 1 │
│ c ┆ 2 │
│ e ┆ 3 │
│ d ┆ 4 │
└──────┴──────┘
5. How to assign a name to a Series?
Difficulty Level: L1
Give a name 'alphabets' to the series ser.
Solve:
import polars as pl
ser = pl.Series(list('abcedfghijklmnopqrstuvwxyz'))
# Write your code below
Desired Output:
shape: (10,)
Series: 'alphabets' [str]
[
"a"
"b"
"c"
"e"
"d"
"f"
"g"
"h"
"i"
"j"
]
6. How to get the items of Series A not present in Series B?
Difficulty Level: L2
From ser1, remove items present in ser2.
Solve:
import polars as pl
ser1 = pl.Series("a", [1, 2, 3, 4, 5])
ser2 = pl.Series("b", [4, 5, 6, 7, 8])
# Write your code below
Desired Output:
shape: (3,)
Series: 'a' [i64]
[
1
2
3
]
7. How to get the items not common to both Series A and Series B?
Difficulty Level: L2
Get all items of ser1 and ser2 not common to both.
Solve:
import polars as pl
ser1 = pl.Series("a", [1, 2, 3, 4, 5])
ser2 = pl.Series("b", [4, 5, 6, 7, 8])
# Write your code below
Desired Output:
shape: (6,)
Series: 'union' [i64]
[
1
2
3
6
7
8
]
8. How to get the minimum, 25th percentile, median, 75th, and max of a numeric Series?
Difficulty Level: L1
Compute the min, 25th percentile, median, 75th percentile, and max of ser.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("data", np.random.normal(10, 5, 25).tolist())
# Write your code below
Desired Output:
[0.43, 7.19, 8.83, 12.48, 17.9]
9. How to get frequency counts of unique items of a Series?
Difficulty Level: L1
Calculate the frequency counts of each unique value in ser.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("chars", np.take(list('abcdefgh'), np.random.randint(8, size=30)))
# Write your code below
Desired Output:
shape: (7, 2)
┌───────┬───────┐
│ chars ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════╪═══════╡
│ h ┆ 6 │
│ c ┆ 5 │
│ e ┆ 5 │
│ d ┆ 4 │
│ g ┆ 4 │
│ f ┆ 3 │
│ b ┆ 3 │
└───────┴───────┘
10. How to keep only the top 2 most frequent values and replace everything else as ‘Other’?
Difficulty Level: L2
In ser, keep the top 2 most frequent values as-is. Replace all other values with 'Other'.
Solve:
import polars as pl
import numpy as np
np.random.seed(100)
ser = pl.Series("data", np.random.randint(1, 5, [12]).tolist())
# Write your code below
Desired Output:
shape: (12, 1)
┌───────┐
│ data │
│ --- │
│ str │
╞═══════╡
│ 1 │
│ 1 │
│ 4 │
│ 4 │
│ 4 │
│ … │
│ Other │
│ Other │
│ 1 │
│ Other │
│ Other │
└───────┘
11. How to bin a numeric Series to 10 groups of equal size?
Difficulty Level: L2
Bin the series ser into 10 equal-sized decile groups and label them from 1st to 10th.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("data", np.random.random(20).tolist())
# Write your code below
Desired Output:
shape: (10,)
Series: 'data' [cat]
[
"4th"
"10th"
"8th"
"6th"
"2nd"
"2nd"
"1st"
"9th"
"7th"
"8th"
]
12. How to convert a numpy array to a DataFrame of given shape?
Difficulty Level: L1
Reshape the series ser into a DataFrame with 7 rows and 5 columns.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("data", np.random.randint(1, 10, 35).tolist())
# Write your code below
Desired Output:
shape: (7, 5)
┌──────┬──────┬──────┬──────┬──────┐
│ col0 ┆ col1 ┆ col2 ┆ col3 ┆ col4 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞══════╪══════╪══════╪══════╪══════╡
│ 7 ┆ 4 ┆ 8 ┆ 5 ┆ 7 │
│ 3 ┆ 7 ┆ 8 ┆ 5 ┆ 4 │
│ 8 ┆ 8 ┆ 3 ┆ 6 ┆ 5 │
│ 2 ┆ 8 ┆ 6 ┆ 2 ┆ 5 │
│ 1 ┆ 6 ┆ 9 ┆ 1 ┆ 3 │
│ 7 ┆ 4 ┆ 9 ┆ 3 ┆ 5 │
│ 3 ┆ 7 ┆ 5 ┆ 9 ┆ 7 │
└──────┴──────┴──────┴──────┴──────┘
13. How to find the positions of numbers that are multiples of 3 from a Series?
Difficulty Level: L2
Find the positions of numbers that are multiples of 3 from ser.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("data", np.random.randint(1, 10, 7).tolist())
# Write your code below
Desired Output:
shape: (7,)
Series: 'data' [i32]
[
7
4
8
5
7
3
7
]
[5]
14. How to extract items at given positions from a Series?
Difficulty Level: L1
From ser, extract the items at positions 0, 4, 8, and 14.
Solve:
import polars as pl
ser = pl.Series("letters", list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14]
# Write your code below
Desired Output:
shape: (4,)
Series: 'letters' [str]
[
"a"
"e"
"i"
"o"
]
15. How to stack two Series vertically and horizontally?
Difficulty Level: L1
Stack ser1 and ser2 vertically and horizontally (to form a DataFrame).
Solve:
import polars as pl
ser1 = pl.Series("col1", [0, 1, 2, 3, 4])
ser2 = pl.Series("col2", [5, 6, 7, 8, 9])
# Write your code below
Desired Output:
shape: (10,)
Series: 'col1' [i64]
[
0
1
2
3
4
5
6
7
8
9
]
shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 0 ┆ 5 │
│ 1 ┆ 6 │
│ 2 ┆ 7 │
│ 3 ┆ 8 │
│ 4 ┆ 9 │
└──────┴──────┘
16. How to get the positions of items of Series A in Series B?
Difficulty Level: L2
Get the positions of items of ser2 in ser1 as a list.
Solve:
import polars as pl
ser1 = pl.Series("a", [10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pl.Series("b", [1, 3, 10, 13])
# Write your code below
Desired Output:
[5, 4, 0, 8]
17. How to compute the mean squared error between a truth and predicted Series?
Difficulty Level: L1
Compute the mean squared error of truth and pred Series.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
truth = pl.Series("truth", np.arange(10).tolist())
pred = pl.Series("pred", (np.arange(10) + np.random.random(10)).tolist())
# Write your code below
Desired Output:
0.3603
18. How to convert the first character of each element in a Series to uppercase?
Difficulty Level: L2
Change the first character of each word in ser to upper case.
Solve:
import polars as pl
ser = pl.Series("words", ['how', 'to', 'kick', 'ass?'])
# Write your code below
Desired Output:
shape: (4, 1)
┌───────┐
│ words │
│ --- │
│ str │
╞═══════╡
│ How │
│ To │
│ Kick │
│ Ass? │
└───────┘
19. How to calculate the number of characters in each word in a Series?
Difficulty Level: L1
Calculate the number of characters in each word in ser.
Solve:
import polars as pl
ser = pl.Series("words", ['how', 'to', 'kick', 'ass?'])
# Write your code below
Desired Output:
shape: (4,)
Series: 'words' [u32]
[
3
2
4
4
]
20. How to compute the difference of differences between consecutive numbers of a Series?
Difficulty Level: L1
Compute the difference of differences between consecutive numbers of ser.
Solve:
import polars as pl
ser = pl.Series("data", [1, 3, 6, 10, 15, 21, 27, 35])
# Write your code below
Desired Output:
shape: (6,)
Series: 'data' [i64]
[
1
1
1
1
0
2
]
21. How to convert a Series of date strings to a datetime type?
Difficulty Level: L2
Convert the ser to a datetime Series.
Solve:
import polars as pl
ser = pl.Series("dates", ['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
# Write your code below
Desired Output:
shape: (6,)
Series: 'dates' [datetime[μs]]
[
2010-01-01 00:00:00
2011-02-02 00:00:00
2012-03-03 00:00:00
2013-04-04 00:00:00
2014-05-05 00:00:00
2015-06-06 12:20:00
]
22. How to get the day of month, week number, day of year, and day of week from a datetime Series?
Difficulty Level: L2
Get the day of month, week number, day of year, and day of week from ser.
Solve:
import polars as pl
from dateutil.parser import parse
ser = pl.Series("dates", [parse(d) for d in ['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20']])
# Write your code below
Desired Output:
shape: (6, 5)
┌─────────────────────┬──────────────┬─────────────┬─────────────┬─────────────┐
│ dates ┆ day_of_month ┆ week_number ┆ day_of_year ┆ day_of_week │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i8 ┆ i8 ┆ i16 ┆ i8 │
╞═════════════════════╪══════════════╪═════════════╪═════════════╪═════════════╡
│ 2010-01-01 00:00:00 ┆ 1 ┆ 53 ┆ 1 ┆ 5 │
│ 2011-02-02 00:00:00 ┆ 2 ┆ 5 ┆ 33 ┆ 3 │
│ 2012-03-03 00:00:00 ┆ 3 ┆ 9 ┆ 63 ┆ 6 │
│ 2013-04-04 00:00:00 ┆ 4 ┆ 14 ┆ 94 ┆ 4 │
│ 2014-05-05 00:00:00 ┆ 5 ┆ 19 ┆ 125 ┆ 1 │
│ 2015-06-06 12:20:00 ┆ 6 ┆ 23 ┆ 157 ┆ 6 │
└─────────────────────┴──────────────┴─────────────┴─────────────┴─────────────┘
23. How to convert year-month string to dates corresponding to the 4th day of the month?
Difficulty Level: L2
Change ser to dates that start with the 4th of the respective months.
Solve:
import polars as pl
ser = pl.Series("dates", ['Jan 2010', 'Feb 2011', 'Mar 2012'])
# Write your code below
Desired Output:
shape: (3,)
Series: 'dates' [datetime[μs]]
[
2010-01-04 00:00:00
2011-02-04 00:00:00
2012-03-04 00:00:00
]
24. How to filter words that contain at least 2 vowels from a Series?
Difficulty Level: L3
From ser, extract words that contain at least 2 vowels.
Solve:
import polars as pl
ser = pl.Series("words", ['Apple', 'Orange', 'Plan', 'Python', 'Money'])
# Write your code below
Desired Output:
shape: (3, 1)
┌────────┐
│ words │
│ --- │
│ str │
╞════════╡
│ Apple │
│ Orange │
│ Money │
└────────┘
25. How to filter valid emails from a Series?
Difficulty Level: L3
Extract valid email addresses from ser.
Solve:
import polars as pl
emails = pl.Series("emails", ['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])
# Write your code below
Desired Output:
shape: (3, 1)
┌───────────────────┐
│ emails │
│ --- │
│ str │
╞═══════════════════╡
│ rameses@egypt.com │
│ matt@t.co │
│ narendra@modi.com │
└───────────────────┘
26. How to get the mean of a Series grouped by another Series?
Difficulty Level: L2
Compute the mean of weights grouped by fruit.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
fruit = pl.Series("fruit", np.random.choice(['apple', 'banana', 'carrot'], 10).tolist())
weights = pl.Series("weights", np.linspace(1, 10, 10).tolist())
# Write your code below
Desired Output:
shape: (3, 2)
┌────────┬──────────┐
│ fruit ┆ weights │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪══════════╡
│ apple ┆ 4.333333 │
│ banana ┆ 8.0 │
│ carrot ┆ 5.666667 │
└────────┴──────────┘
27. How to compute the euclidean distance between two Series?
Difficulty Level: L1
Compute the euclidean distance between p and q.
Solve:
import polars as pl
p = pl.Series("p", list(range(1, 11)))
q = pl.Series("q", list(range(10, 0, -1)))
# Write your code below
Desired Output:
18.17
28. How to find all the local maxima (peaks) in a numeric Series?
Difficulty Level: L3
Get the positions of peaks (values surrounded by smaller values on both sides) in ser.
Solve:
import polars as pl
ser = pl.Series("data", [2, 10, 3, 4, 9, 10, 2, 7, 3])
# Write your code below
Desired Output:
[1, 5, 7]
29. How to replace missing spaces in a string with the least frequent character?
Difficulty Level: L2
Replace the spaces in my_str with whichever character is the least frequent, excluding spaces.
Solve:
my_str = 'dbc deb abed gade'
# Write your code below
Desired Output:
Least frequent char: g
dbcgdebgabedggade
30. How to create a TimeSeries starting ‘2000-01-01’ and 10 weekends (Saturdays)?
Difficulty Level: L2
Create a Polars DataFrame with 10 Saturday dates starting from 2000-01-01 and random integer values.
Solve:
import polars as pl
import numpy as np
# Write your code below
Desired Output:
shape: (10, 2)
┌────────────┬───────┐
│ date ┆ value │
│ --- ┆ --- │
│ date ┆ i32 │
╞════════════╪═══════╡
│ 2000-01-01 ┆ 7 │
│ 2000-01-08 ┆ 4 │
│ 2000-01-15 ┆ 8 │
│ 2000-01-22 ┆ 5 │
│ 2000-01-29 ┆ 7 │
│ 2000-02-05 ┆ 3 │
│ 2000-02-12 ┆ 7 │
│ 2000-02-19 ┆ 8 │
│ 2000-02-26 ┆ 5 │
│ 2000-03-04 ┆ 4 │
└────────────┴───────┘
31. How to fill missing dates and forward-fill values?
Difficulty Level: L2
ser has missing dates. Fill in the missing dates and forward-fill the corresponding values.
Solve:
import polars as pl
from datetime import date
df = pl.DataFrame({
"date": [date(2000,1,1), date(2000,1,3), date(2000,1,6), date(2000,1,8)],
"value": [1, 10, 3, None]
})
# Write your code below
Desired Output:
shape: (8, 2)
┌────────────┬───────┐
│ date ┆ value │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═══════╡
│ 2000-01-01 ┆ 1 │
│ 2000-01-02 ┆ 1 │
│ 2000-01-03 ┆ 10 │
│ 2000-01-04 ┆ 10 │
│ 2000-01-05 ┆ 10 │
│ 2000-01-06 ┆ 3 │
│ 2000-01-07 ┆ 3 │
│ 2000-01-08 ┆ 3 │
└────────────┴───────┘
32. How to find the autocorrelation of a numeric Series?
Difficulty Level: L3
Compute autocorrelation for lags 1 through 10 of ser, and find the lag with the highest correlation.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("data", (np.arange(20) + np.random.normal(1, 10, 20)).tolist())
# Write your code below
Desired Output:
[-0.04, -0.36, 0.24, -0.23, -0.06, 0.1, -0.59, -0.13, 0.33, -0.03]
Lag with highest correlation: 7
33. How to import only specified columns from a CSV file?
Difficulty Level: L2
Import ‘crim’ and ‘medv’ columns from the BostonHousing dataset CSV.
Solve:
import polars as pl
url = 'https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv'
# Write your code below
Desired Output:
shape: (5, 2)
┌─────────┬──────┐
│ crim ┆ medv │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════════╪══════╡
│ 0.00632 ┆ 24.0 │
│ 0.02731 ┆ 21.6 │
│ 0.02729 ┆ 34.7 │
│ 0.03237 ┆ 33.4 │
│ 0.06905 ┆ 36.2 │
└─────────┴──────┘
34. How to get the nrows, ncolumns, datatype, summary stats of each column of a DataFrame?
Difficulty Level: L2
Get the number of rows, columns, datatypes, and summary stats of the Cars93 DataFrame.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
Shape: (93, 27)
Column dtypes:
Manufacturer: String
Model: String
Type: String
Min.Price: Float64
Price: Float64
...
35. How to extract the row and column number of a particular cell with given criterion?
Difficulty Level: L1
Which manufacturer, model, and type has the highest Price? What is the row and column number of the cell with the highest Price value?
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
Row: 58
Column: 4
Mercedes-Benz 300E Midsize Price: 61.9
36. How to rename a specific column in a DataFrame?
Difficulty Level: L2
Rename the column Type to CarType in df.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
['Manufacturer', 'Model', 'CarType', 'Min.Price', 'Price']
37. How to check if a DataFrame has any missing values?
Difficulty Level: L1
Check if df has any missing values.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
True
38. How to count the number of missing values in each column?
Difficulty Level: L1
Count the number of missing values in each column of df.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
Manufacturer: 4
Price: 2
Type: 3
Min.Price: 7
Max.Price: 5
39. How to replace missing values of multiple numeric columns with the mean?
Difficulty Level: L2
Replace NaNs/nulls with the column mean for all numeric columns in df.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
Numeric nulls before: 99, after: 0
40. How to use apply function on existing columns with global variables as additional arguments?
Difficulty Level: L2
In df, use polars expressions to compute a new column 'avg' that is the row-mean of columns 'a', 'b', and 'c', then add a column 'avg_mf' = avg × d (where d is an external variable).
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 10, 15).reshape(5, 3).tolist(), schema=['a', 'b', 'c'])
d = 5
# Write your code below
Desired Output:
shape: (5, 5)
┌─────┬─────┬─────┬──────────┬───────────┐
│ a ┆ b ┆ c ┆ avg ┆ avg_mf │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ f64 ┆ f64 │
╞═════╪═════╪═════╪══════════╪═══════════╡
│ 7 ┆ 4 ┆ 8 ┆ 6.333333 ┆ 31.666667 │
│ 5 ┆ 7 ┆ 3 ┆ 5.0 ┆ 25.0 │
│ 7 ┆ 8 ┆ 5 ┆ 6.666667 ┆ 33.333333 │
│ 4 ┆ 8 ┆ 8 ┆ 6.666667 ┆ 33.333333 │
│ 3 ┆ 6 ┆ 5 ┆ 4.666667 ┆ 23.333333 │
└─────┴─────┴─────┴──────────┴───────────┘
41. How to swap two columns in a DataFrame?
Difficulty Level: L2
In df, swap columns 'a' and 'c'.
Solve:
import polars as pl
import numpy as np
df = pl.DataFrame(np.arange(20).reshape(-1, 5).tolist(), schema=list('abcde'))
# Write your code below
Desired Output:
shape: (4, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ c ┆ b ┆ a ┆ d ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╪═════╪═════╡
│ 2 ┆ 1 ┆ 0 ┆ 3 ┆ 4 │
│ 7 ┆ 6 ┆ 5 ┆ 8 ┆ 9 │
│ 12 ┆ 11 ┆ 10 ┆ 13 ┆ 14 │
│ 17 ┆ 16 ┆ 15 ┆ 18 ┆ 19 │
└─────┴─────┴─────┴─────┴─────┘
42. How to sort columns in reverse alphabetical order?
Difficulty Level: L2
Sort the columns of df in reverse alphabetical order.
Solve:
import polars as pl
import numpy as np
df = pl.DataFrame(np.arange(20).reshape(-1, 5).tolist(), schema=list('abcde'))
# Write your code below
Desired Output:
shape: (4, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ e ┆ d ┆ c ┆ b ┆ a │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╪═════╪═════╡
│ 4 ┆ 3 ┆ 2 ┆ 1 ┆ 0 │
│ 9 ┆ 8 ┆ 7 ┆ 6 ┆ 5 │
│ 14 ┆ 13 ┆ 12 ┆ 11 ┆ 10 │
│ 19 ┆ 18 ┆ 17 ┆ 16 ┆ 15 │
└─────┴─────┴─────┴─────┴─────┘
43. How to format or suppress scientific notations in a Polars DataFrame?
Difficulty Level: L2
When displaying a DataFrame with very small numbers, format them as fixed-point with 4 decimal places.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame((np.random.random([5, 3]) / 1e3).tolist(), schema=['a', 'b', 'c'])
# Write your code below
Desired Output:
shape: (5, 3)
┌────────┬────────┬────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞════════╪════════╪════════╡
│ 0.0004 ┆ 0.0010 ┆ 0.0007 │
│ 0.0006 ┆ 0.0002 ┆ 0.0002 │
│ 0.0001 ┆ 0.0009 ┆ 0.0006 │
│ 0.0007 ┆ 0.0000 ┆ 0.0010 │
│ 0.0008 ┆ 0.0002 ┆ 0.0002 │
└────────┴────────┴────────┘
44. How to format all values in a DataFrame to show only 4 decimal places?
Difficulty Level: L2
Show all float values in df rounded to 4 decimal places.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.random([5, 3]).tolist(), schema=['a', 'b', 'c'])
# Write your code below
Desired Output:
shape: (5, 3)
┌────────┬────────┬────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞════════╪════════╪════════╡
│ 0.3745 ┆ 0.9507 ┆ 0.732 │
│ 0.5987 ┆ 0.156 ┆ 0.156 │
│ 0.0581 ┆ 0.8662 ┆ 0.6011 │
│ 0.7081 ┆ 0.0206 ┆ 0.9699 │
│ 0.8324 ┆ 0.2123 ┆ 0.1818 │
└────────┴────────┴────────┘
45. How to filter rows of a DataFrame by row number?
Difficulty Level: L1
Select every 20th row starting from the 1st row (row 0).
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (5, 3)
┌──────────────┬─────────┬─────────┐
│ Manufacturer ┆ Model ┆ Type │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════════╪═════════╪═════════╡
│ Acura ┆ Integra ┆ Small │
│ Chrysler ┆ LeBaron ┆ Compact │
│ Honda ┆ Prelude ┆ Sporty │
│ Mercury ┆ Cougar ┆ Midsize │
│ Subaru ┆ Loyale ┆ Small │
└──────────────┴─────────┴─────────┘
46. How to create a primary key index by combining relevant columns?
Difficulty Level: L2
In df, replace nulls with 'missing' in columns 'Manufacturer', 'Model', and 'Type', then create a new column 'primary_key' as a combination of these three columns. Check if it is unique.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
True
47. How to get the row number of the n-th largest value in a column?
Difficulty Level: L2
Find the row position of the 5th largest value of column 'a' in df.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 30, 30).reshape(10, -1).tolist(), schema=list('abc'))
# Write your code below
Desired Output:
DataFrame:
shape: (10, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 7 ┆ 20 ┆ 29 │
│ 15 ┆ 11 ┆ 8 │
│ 29 ┆ 21 ┆ 7 │
│ 26 ┆ 19 ┆ 23 │
│ 11 ┆ 11 ┆ 24 │
│ 21 ┆ 4 ┆ 8 │
│ 24 ┆ 3 ┆ 22 │
│ 21 ┆ 2 ┆ 24 │
│ 12 ┆ 6 ┆ 2 │
│ 28 ┆ 21 ┆ 1 │
└─────┴─────┴─────┘
Row index of 5th largest value in 'a': 5
48. How to find the position of the n-th largest value greater than the mean?
Difficulty Level: L2
Find the positions of values in ser that are greater than the mean. Report the 2nd position.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("data", np.random.randint(1, 100, 15).tolist())
# Write your code below
Desired Output:
Series: [52, 93, 15, 72, 61, 21, 83, 87, 75, 75, 88, 24, 3, 22, 53]
Mean: 55
2nd position where value > mean: 3
49. How to get the last two rows of a DataFrame whose row sum > 100?
Difficulty Level: L2
Get the last two rows of df where the sum of the row values exceeds 100.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4).tolist(), schema=[f"c{i}" for i in range(4)])
# Write your code below
Desired Output:
shape: (2, 4)
┌─────┬─────┬─────┬─────┐
│ c0 ┆ c1 ┆ c2 ┆ c3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╪═════╡
│ 24 ┆ 39 ┆ 39 ┆ 24 │
│ 39 ┆ 28 ┆ 21 ┆ 32 │
└─────┴─────┴─────┴─────┘
50. How to find and cap outliers from a Series or DataFrame column?
Difficulty Level: L2
Replace all values in ser that are above the 95th percentile or below the 5th percentile with the respective percentile value.
Solve:
import polars as pl
import numpy as np
np.random.seed(100)
ser = pl.Series("data", np.random.normal(0, 1, 50).tolist())
# Write your code below
Desired Output:
Low: -1.6906, High: 1.4707
shape: (5, 1)
┌───────────┐
│ data │
│ --- │
│ f64 │
╞═══════════╡
│ -1.690617 │
│ 0.34268 │
│ 1.153036 │
│ -0.252436 │
│ 0.981321 │
└───────────┘
51. How to reshape a DataFrame from long to wide format?
Difficulty Level: L3
Pivot df so each unique 'car' becomes a row and the columns are the cities with corresponding 'price' values.
Solve:
import polars as pl
df = pl.DataFrame({
"car": ["Audi", "Audi", "BMW", "BMW"],
"city": ["SF", "NYC", "SF", "NYC"],
"price": [45000, 42000, 55000, 52000]
})
# Write your code below
Desired Output:
shape: (2, 3)
┌──────┬───────┬───────┐
│ car ┆ SF ┆ NYC │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════╪═══════╪═══════╡
│ Audi ┆ 45000 ┆ 42000 │
│ BMW ┆ 55000 ┆ 52000 │
└──────┴───────┴───────┘
52. How to reshape a DataFrame from wide to long format?
Difficulty Level: L2
Melt df so each car-city pair becomes a row.
Solve:
import polars as pl
df = pl.DataFrame({
"car": ["Audi", "BMW"],
"SF": [45000, 55000],
"NYC": [42000, 52000]
})
# Write your code below
Desired Output:
shape: (4, 3)
┌──────┬──────┬───────┐
│ car ┆ city ┆ price │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞══════╪══════╪═══════╡
│ Audi ┆ SF ┆ 45000 │
│ BMW ┆ SF ┆ 55000 │
│ Audi ┆ NYC ┆ 42000 │
│ BMW ┆ NYC ┆ 52000 │
└──────┴──────┴───────┘
53. How to create a DataFrame with rows as stacked columns?
Difficulty Level: L3
Create a DataFrame where each row is a column name – column value pair for the first row of df.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (10, 2)
┌──────────────┬─────────┐
│ column ┆ value │
│ --- ┆ --- │
│ str ┆ str │
╞══════════════╪═════════╡
│ Manufacturer ┆ Acura │
│ Model ┆ Integra │
│ Type ┆ Small │
│ Min.Price ┆ 12.9 │
│ Price ┆ 15.9 │
│ Max.Price ┆ 18.8 │
│ MPG.city ┆ 25 │
│ MPG.highway ┆ 31 │
│ AirBags ┆ None │
│ DriveTrain ┆ Front │
└──────────────┴─────────┘
54. How to find which columns of a DataFrame have missing values?
Difficulty Level: L1
Check which columns in df have any null values.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
['Manufacturer', 'Model', 'Type', 'Min.Price', 'Price', 'Max.Price', 'MPG.city', 'MPG.highway', 'AirBags', 'DriveTrain', 'Cylinders', 'EngineSize', 'Horsepower', 'RPM', 'Rev.per.mile', 'Man.trans.avail', 'Fuel.tank.capacity', 'Passengers', 'Length', 'Wheelbase', 'Width', 'Turn.circle', 'Rear.seat.room', 'Luggage.room', 'Weight', 'Origin', 'Make']
55. How to get the minimum value in each column grouped by another column?
Difficulty Level: L2
In df, for each 'Type', get the minimum 'Price'.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (7, 2)
┌─────────┬───────┐
│ Type ┆ Price │
│ --- ┆ --- │
│ str ┆ f64 │
╞═════════╪═══════╡
│ null ┆ 8.6 │
│ Compact ┆ 11.1 │
│ Large ┆ 18.4 │
│ Midsize ┆ 13.9 │
│ Small ┆ 7.4 │
│ Sporty ┆ 12.5 │
│ Van ┆ 16.3 │
└─────────┴───────┘
56. How to get the top n rows of each group in a DataFrame?
Difficulty Level: L2
For each 'Type', get the top 2 rows with the highest 'Price'.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (10, 4)
┌─────────┬───────────────┬──────────┬───────┐
│ Type ┆ Manufacturer ┆ Model ┆ Price │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 │
╞═════════╪═══════════════╪══════════╪═══════╡
│ null ┆ Pontiac ┆ Firebird ┆ 17.7 │
│ null ┆ Hyundai ┆ Scoupe ┆ 10.0 │
│ Compact ┆ Mercedes-Benz ┆ 190E ┆ 31.9 │
│ Compact ┆ Audi ┆ 90 ┆ 29.1 │
│ Large ┆ Lincoln ┆ Town_Car ┆ 36.1 │
│ Large ┆ Cadillac ┆ DeVille ┆ 34.7 │
│ Midsize ┆ Toyota ┆ Camry ┆ null │
│ Midsize ┆ Mercedes-Benz ┆ 300E ┆ 61.9 │
│ Small ┆ Saturn ┆ SL ┆ null │
│ Small ┆ Acura ┆ Integra ┆ 15.9 │
└─────────┴───────────────┴──────────┴───────┘
57. How to replace missing values with the mode of a column?
Difficulty Level: L2
Replace the missing values in 'DriveTrain' column with its mode (most frequent value).
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
Mode: Front
Nulls after fill: 0
58. How to create a new column from existing columns using a condition?
Difficulty Level: L2
Create a new column 'price_category' that says 'high' if Price > 30 else 'low'.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (10, 3)
┌──────────────┬───────┬────────────────┐
│ Manufacturer ┆ Price ┆ price_category │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ str │
╞══════════════╪═══════╪════════════════╡
│ Acura ┆ 15.9 ┆ low │
│ null ┆ 33.9 ┆ high │
│ Audi ┆ 29.1 ┆ low │
│ Audi ┆ 37.7 ┆ high │
│ BMW ┆ 30.0 ┆ low │
│ Buick ┆ 15.7 ┆ low │
│ Buick ┆ 20.8 ┆ low │
│ Buick ┆ 23.7 ┆ low │
│ Buick ┆ 26.3 ┆ low │
│ Cadillac ┆ 34.7 ┆ high │
└──────────────┴───────┴────────────────┘
59. How to get the column-wise maximum of two DataFrames?
Difficulty Level: L2
Get the element-wise maximum of two DataFrames df1 and df2.
Solve:
import polars as pl
import numpy as np
np.random.seed(100)
df1 = pl.DataFrame(np.random.randint(1, 25, [5, 3]), schema=list('abc'))
df2 = pl.DataFrame(np.random.randint(1, 25, [5, 3]), schema=list('abc'))
# Write your code below
Desired Output:
shape: (5, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 17 ┆ 16 ┆ 8 │
│ 24 ┆ 17 ┆ 17 │
│ 23 ┆ 21 ┆ 13 │
│ 22 ┆ 3 ┆ 14 │
│ 22 ┆ 20 ┆ 18 │
└─────┴─────┴─────┘
60. How to get the correlation between two columns of a DataFrame?
Difficulty Level: L2
Compute the correlation between all numeric columns in df and find the two columns with the highest absolute correlation.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
Highest correlation: (c1, c8) = 0.8447
61. How to create a column containing the minimum-by-maximum of each row?
Difficulty Level: L2
Compute the minimum / maximum for every row of df.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
shape: (8,)
Series: 'min_by_max' [f64]
[
0.16129
0.022727
0.230769
0.163043
0.041096
0.021739
0.098765
0.021505
]
62. How to create a column that contains the penultimate (second largest) value in each row?
Difficulty Level: L2
Create a new column 'penultimate' which has the second largest value of each row.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
shape: (8, 1)
┌─────────────┐
│ penultimate │
│ --- │
│ i32 │
╞═════════════╡
│ 87 │
│ 88 │
│ 89 │
│ 80 │
│ 64 │
│ 90 │
│ 78 │
│ 90 │
└─────────────┘
63. How to normalize all columns in a DataFrame?
Difficulty Level: L2
Normalize all columns of df so that the values in each column range from 0 to 1 (min-max scaling).
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
shape: (5, 10)
┌──────────┬──────────┬──────────┬──────────┬───┬──────────┬──────────┬──────────┬──────────┐
│ c0 ┆ c1 ┆ c2 ┆ c3 ┆ … ┆ c6 ┆ c7 ┆ c8 ┆ c9 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════════╪══════════╪══════════╪══════════╪═══╪══════════╪══════════╪══════════╪══════════╡
│ 0.571429 ┆ 1.0 ┆ 0.134831 ┆ 1.0 ┆ … ┆ 0.861111 ┆ 0.977011 ┆ 0.863636 ┆ 0.811111 │
│ 1.0 ┆ 0.241758 ┆ 0.0 ┆ 0.275362 ┆ … ┆ 0.930556 ┆ 0.321839 ┆ 0.30303 ┆ 0.0 │
│ 0.714286 ┆ 0.637363 ┆ 0.202247 ┆ 0.434783 ┆ … ┆ 0.013889 ┆ 1.0 ┆ 0.469697 ┆ 0.988889 │
│ 0.654762 ┆ 0.43956 ┆ 1.0 ┆ 0.826087 ┆ … ┆ 0.569444 ┆ 0.689655 ┆ 0.439394 ┆ 0.666667 │
│ 0.559524 ┆ 0.582418 ┆ 0.685393 ┆ 0.0 ┆ … ┆ 0.0 ┆ 0.816092 ┆ 0.318182 ┆ 0.177778 │
└──────────┴──────────┴──────────┴──────────┴───┴──────────┴──────────┴──────────┴──────────┘
64. How to compute the row-wise softmax of a DataFrame?
Difficulty Level: L3
Compute the softmax of each row: e^x_i / sum(e^x) for each row.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
shape: (3, 10)
┌────────┬────────┬────────┬────────┬───┬────────┬────────┬────────┬────────┐
│ c0 ┆ c1 ┆ c2 ┆ c3 ┆ … ┆ c6 ┆ c7 ┆ c8 ┆ c9 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════╪════════╪════════╪════════╪═══╪════════╪════════╪════════╪════════╡
│ 0.0000 ┆ 0.9975 ┆ 0.0000 ┆ 0.0000 ┆ … ┆ 0.0000 ┆ 0.0025 ┆ 0.0000 ┆ 0.0000 │
│ 0.5000 ┆ 0.0000 ┆ 0.0000 ┆ 0.0000 ┆ … ┆ 0.5000 ┆ 0.0000 ┆ 0.0000 ┆ 0.0000 │
│ 0.0000 ┆ 0.0000 ┆ 0.0000 ┆ 0.0000 ┆ … ┆ 0.0000 ┆ 0.1192 ┆ 0.0000 ┆ 0.8808 │
└────────┴────────┴────────┴────────┴───┴────────┴────────┴────────┴────────┘
65. How to find the maximum range (max – min) column in a DataFrame?
Difficulty Level: L2
Find the column with the maximum range (max – min) in df.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
Ranges: {'c1': 91, 'c9': 90, 'c2': 89}
Column with max range: c1
66. How to replace both diagonals of a DataFrame with 0?
Difficulty Level: L3
Replace both the main and anti-diagonal of df with 0.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 100).reshape(10, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
shape: (5, 10)
┌─────┬─────┬─────┬─────┬───┬─────┬─────┬─────┬─────┐
│ c0 ┆ c1 ┆ c2 ┆ c3 ┆ … ┆ c6 ┆ c7 ┆ c8 ┆ c9 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ i32 ┆ ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╪═════╪═══╪═════╪═════╪═════╪═════╡
│ 0 ┆ 93 ┆ 15 ┆ 72 ┆ … ┆ 83 ┆ 87 ┆ 75 ┆ 0 │
│ 88 ┆ 0 ┆ 3 ┆ 22 ┆ … ┆ 88 ┆ 30 ┆ 0 ┆ 2 │
│ 64 ┆ 60 ┆ 0 ┆ 33 ┆ … ┆ 22 ┆ 0 ┆ 49 ┆ 91 │
│ 59 ┆ 42 ┆ 92 ┆ 0 ┆ … ┆ 0 ┆ 62 ┆ 47 ┆ 62 │
│ 51 ┆ 55 ┆ 64 ┆ 3 ┆ … ┆ 21 ┆ 73 ┆ 39 ┆ 18 │
└─────┴─────┴─────┴─────┴───┴─────┴─────┴─────┴─────┘
67. How to get a particular group of a group_by DataFrame by key?
Difficulty Level: L2
From df grouped by 'col1', get the group belonging to 'apple' as a DataFrame.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'col1': ['apple', 'banana', 'orange'] * 3,
'col2': np.random.rand(9).tolist(),
'col3': np.random.randint(0, 15, 9).tolist()
})
# Write your code below
Desired Output:
shape: (3, 3)
┌───────┬──────────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i32 │
╞═══════╪══════════╪══════╡
│ apple ┆ 0.37454 ┆ 7 │
│ apple ┆ 0.598658 ┆ 4 │
│ apple ┆ 0.058084 ┆ 11 │
└───────┴──────────┴──────┘
68. How to get the n-th largest value of a column when grouped by another column?
Difficulty Level: L2
In df, find the second largest value of 'taste' for 'banana'.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'fruit': ['apple', 'banana', 'orange', 'apple', 'banana', 'orange', 'apple', 'banana', 'orange'],
'taste': np.random.rand(9).tolist(),
'price': np.random.randint(1, 15, 9).tolist()
})
# Write your code below
Desired Output:
2nd largest taste for banana: 0.8662
69. How to compute grouped mean and keep the grouped column as another column (not index)?
Difficulty Level: L1
Compute the grouped mean of 'price' by 'fruit' and keep 'fruit' as a regular column.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'fruit': ['apple', 'banana', 'orange'] * 3,
'taste': np.random.rand(9).tolist(),
'price': np.random.randint(1, 15, 9).tolist()
})
# Write your code below
Desired Output:
shape: (3, 2)
┌────────┬──────────┐
│ fruit ┆ price │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪══════════╡
│ apple ┆ 8.333333 │
│ banana ┆ 6.333333 │
│ orange ┆ 6.666667 │
└────────┴──────────┘
70. How to join two DataFrames by 2 columns so they have only the common rows?
Difficulty Level: L2
Join df1 and df2 on 'fruit' and 'weight' so only matching rows remain.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df1 = pl.DataFrame({
'fruit': ['apple', 'banana', 'orange'],
'weight': ['high', 'medium', 'low'],
'price': np.random.randint(0, 15, 3).tolist()
})
df2 = pl.DataFrame({
'fruit': ['apple', 'banana', 'melon'],
'weight': ['high', 'medium', 'high'],
'taste': np.random.randint(0, 15, 3).tolist()
})
# Write your code below
Desired Output:
shape: (2, 4)
┌────────┬────────┬───────┬───────┐
│ fruit ┆ weight ┆ price ┆ taste │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 ┆ i32 │
╞════════╪════════╪═══════╪═══════╡
│ apple ┆ high ┆ 6 ┆ 14 │
│ banana ┆ medium ┆ 3 ┆ 10 │
└────────┴────────┴───────┴───────┘
71. How to remove rows from a DataFrame that are present in another DataFrame?
Difficulty Level: L3
Remove rows from df1 that are present in df2, based on the 'fruit' column.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df1 = pl.DataFrame({
'fruit': ['apple', 'banana', 'orange'],
'weight': ['high', 'medium', 'low'],
'price': np.random.randint(0, 15, 3).tolist()
})
df2 = pl.DataFrame({
'fruit': ['apple', 'melon', 'banana'],
'weight': ['high', 'high', 'low'],
'taste': np.random.randint(0, 15, 3).tolist()
})
# Write your code below
Desired Output:
shape: (1, 3)
┌────────┬────────┬───────┐
│ fruit ┆ weight ┆ price │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 │
╞════════╪════════╪═══════╡
│ orange ┆ low ┆ 12 │
└────────┴────────┴───────┘
72. How to get the positions where values of two columns match?
Difficulty Level: L1
Get the row positions where the values of columns 'a' and 'b' are equal.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'a': np.random.choice([1, 2, 3, 4], 10).tolist(),
'b': np.random.choice([1, 2, 3, 4], 10).tolist()
})
# Write your code below
Desired Output:
shape: (10, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 3 ┆ 3 │
│ 4 ┆ 3 │
│ 1 ┆ 3 │
│ 3 ┆ 3 │
│ 3 ┆ 4 │
│ 4 ┆ 1 │
│ 1 ┆ 4 │
│ 1 ┆ 4 │
│ 3 ┆ 4 │
│ 2 ┆ 3 │
└─────┴─────┘
Positions where a == b: [0, 3]
73. How to create lags and leads of a column in a DataFrame?
Difficulty Level: L2
Create columns for lag1 (shifted down by 1) and lead1 (shifted up by 1) of column 'a'.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'a': np.arange(1, 11).tolist(),
'b': np.random.randint(10, 30, 10).tolist()
})
# Write your code below
Desired Output:
shape: (10, 4)
┌─────┬─────┬──────┬───────┐
│ a ┆ b ┆ lag1 ┆ lead1 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪══════╪═══════╡
│ 1 ┆ 16 ┆ null ┆ 2 │
│ 2 ┆ 29 ┆ 1 ┆ 3 │
│ 3 ┆ 24 ┆ 2 ┆ 4 │
│ 4 ┆ 20 ┆ 3 ┆ 5 │
│ 5 ┆ 17 ┆ 4 ┆ 6 │
│ 6 ┆ 16 ┆ 5 ┆ 7 │
│ 7 ┆ 28 ┆ 6 ┆ 8 │
│ 8 ┆ 20 ┆ 7 ┆ 9 │
│ 9 ┆ 20 ┆ 8 ┆ 10 │
│ 10 ┆ 13 ┆ 9 ┆ null │
└─────┴─────┴──────┴───────┘
74. How to get the frequency of unique values in the entire DataFrame?
Difficulty Level: L2
Get the frequency of unique values across the entire DataFrame df.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 10, 20).reshape(4, 5).tolist(), schema=list('abcde'))
# Write your code below
Desired Output:
shape: (7, 2)
┌─────┬───────┐
│ a ┆ count │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═════╪═══════╡
│ 8 ┆ 5 │
│ 5 ┆ 4 │
│ 7 ┆ 3 │
│ 4 ┆ 2 │
│ 6 ┆ 2 │
│ 2 ┆ 2 │
│ 3 ┆ 2 │
└─────┴───────┘
75. How to split a text column into two separate columns?
Difficulty Level: L2
Split the string column in df to form a DataFrame with 3 columns.
Solve:
import polars as pl
df = pl.DataFrame({
"row": [
"STD, City\tState",
"33, Kolkata\tWest Bengal",
"44, Chennai\tTamil Nadu",
"40, Hyderabad\tTelengana",
"80, Bangalore\tKarnataka"
]
})
# Write your code below
Desired Output:
shape: (4, 3)
┌─────┬───────────┬─────────────┐
│ STD ┆ City ┆ State │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═══════════╪═════════════╡
│ 33 ┆ Kolkata ┆ West Bengal │
│ 44 ┆ Chennai ┆ Tamil Nadu │
│ 40 ┆ Hyderabad ┆ Telengana │
│ 80 ┆ Bangalore ┆ Karnataka │
└─────┴───────────┴─────────────┘
76. How to rank items within each group?
Difficulty Level: L2
For each store, rank the months by revenue (highest = rank 1). Use a window function.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'store': ['A','A','A','B','B','B','C','C','C'],
'month': ['Jan','Feb','Mar','Jan','Feb','Mar','Jan','Feb','Mar'],
'revenue': np.random.randint(100, 500, 9).tolist()
})
# Write your code below
Desired Output:
shape: (9, 4)
┌───────┬───────┬─────────┬───────────────┐
│ store ┆ month ┆ revenue ┆ rank_in_store │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 ┆ i32 │
╞═══════╪═══════╪═════════╪═══════════════╡
│ A ┆ Jan ┆ 202 ┆ 3 │
│ A ┆ Feb ┆ 448 ┆ 1 │
│ A ┆ Mar ┆ 370 ┆ 2 │
│ B ┆ Jan ┆ 206 ┆ 2 │
│ B ┆ Feb ┆ 171 ┆ 3 │
│ B ┆ Mar ┆ 288 ┆ 1 │
│ C ┆ Jan ┆ 120 ┆ 3 │
│ C ┆ Feb ┆ 202 ┆ 2 │
│ C ┆ Mar ┆ 221 ┆ 1 │
└───────┴───────┴─────────┴───────────────┘
77. How to compute the running difference within groups?
Difficulty Level: L2
For each user, compute the day-over-day change in logins using diff() within groups.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'user': ['A','A','A','A','B','B','B','B'],
'day': [1,2,3,4,1,2,3,4],
'logins': np.random.randint(1, 20, 8).tolist()
})
# Write your code below
Desired Output:
shape: (8, 4)
┌──────┬─────┬────────┬──────────────┐
│ user ┆ day ┆ logins ┆ daily_change │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i32 ┆ i32 │
╞══════╪═════╪════════╪══════════════╡
│ A ┆ 1 ┆ 7 ┆ null │
│ A ┆ 2 ┆ 15 ┆ 8 │
│ A ┆ 3 ┆ 11 ┆ -4 │
│ A ┆ 4 ┆ 8 ┆ -3 │
│ B ┆ 1 ┆ 7 ┆ null │
│ B ┆ 2 ┆ 19 ┆ 12 │
│ B ┆ 3 ┆ 11 ┆ -8 │
│ B ┆ 4 ┆ 11 ┆ 0 │
└──────┴─────┴────────┴──────────────┘
78. How to compute each employee’s salary as a percentage of their department total?
Difficulty Level: L2
Add a column showing what percentage of the department salary each employee represents.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'dept': ['Sales','Sales','Sales','Eng','Eng','Eng'],
'employee': ['Alice','Bob','Carol','Dave','Eve','Frank'],
'salary': (np.random.randint(50, 150, 6) * 1000).tolist()
})
# Write your code below
Desired Output:
shape: (6, 4)
┌───────┬──────────┬────────┬─────────────┐
│ dept ┆ employee ┆ salary ┆ pct_of_dept │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 ┆ f64 │
╞═══════╪══════════╪════════╪═════════════╡
│ Sales ┆ Alice ┆ 101000 ┆ 32.9 │
│ Sales ┆ Bob ┆ 142000 ┆ 46.3 │
│ Sales ┆ Carol ┆ 64000 ┆ 20.8 │
│ Eng ┆ Dave ┆ 121000 ┆ 40.2 │
│ Eng ┆ Eve ┆ 110000 ┆ 36.5 │
│ Eng ┆ Frank ┆ 70000 ┆ 23.3 │
└───────┴──────────┴────────┴─────────────┘
79. How to detect the start of a new streak in a sequence?
Difficulty Level: L3
Given a Series of status values, flag each row where a new streak begins (i.e., the value changes from the previous row).
Solve:
import polars as pl
ser = pl.Series("status", ['ok','ok','fail','fail','fail','ok','fail','ok','ok'])
# Write your code below
Desired Output:
shape: (9, 3)
┌─────┬────────┬───────────────┐
│ idx ┆ status ┆ is_new_streak │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ bool │
╞═════╪════════╪═══════════════╡
│ 0 ┆ ok ┆ null │
│ 1 ┆ ok ┆ false │
│ 2 ┆ fail ┆ true │
│ 3 ┆ fail ┆ false │
│ 4 ┆ fail ┆ false │
│ 5 ┆ ok ┆ true │
│ 6 ┆ fail ┆ true │
│ 7 ┆ ok ┆ true │
│ 8 ┆ ok ┆ false │
└─────┴────────┴───────────────┘
80. How to compute the row-wise coefficient of variation?
Difficulty Level: L3
Compute the coefficient of variation (std / mean) across the columns for each row.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(10, 100, 30).reshape(6, 5).tolist(), schema=[f"s{i}" for i in range(5)])
# Write your code below
Desired Output:
shape: (6, 1)
┌────────┐
│ cv │
│ --- │
│ f64 │
╞════════╡
│ 0.4706 │
│ 0.0696 │
│ 0.6956 │
│ 0.6162 │
│ 0.3786 │
│ 0.4025 │
└────────┘
81. How to build a pivot table with multiple aggregations?
Difficulty Level: L2
Group by region and product, then compute total sales, average sales, and total quantity.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'region': ['East','East','West','West','East','West'] * 2,
'product': ['A','B','A','B','A','B'] * 2,
'sales': np.random.randint(100, 1000, 12).tolist(),
'qty': np.random.randint(1, 50, 12).tolist()
})
# Write your code below
Desired Output:
shape: (4, 5)
┌────────┬─────────┬─────────────┬───────────┬───────────┐
│ region ┆ product ┆ total_sales ┆ avg_sales ┆ total_qty │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 ┆ f64 ┆ i32 │
╞════════╪═════════╪═════════════╪═══════════╪═══════════╡
│ East ┆ A ┆ 1774 ┆ 444.0 ┆ 98 │
│ East ┆ B ┆ 655 ┆ 328.0 ┆ 33 │
│ West ┆ A ┆ 1674 ┆ 837.0 ┆ 26 │
│ West ┆ B ┆ 1076 ┆ 269.0 ┆ 114 │
└────────┴─────────┴─────────────┴───────────┴───────────┘
82. How to create a rolling mean column?
Difficulty Level: L2
Create a 5-period rolling mean of column 'medv'.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
# Write your code below
Desired Output:
shape: (7, 2)
┌──────┬──────────────┐
│ medv ┆ rolling_medv │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞══════╪══════════════╡
│ 24.0 ┆ null │
│ 21.6 ┆ null │
│ 34.7 ┆ null │
│ 33.4 ┆ null │
│ 36.2 ┆ 29.98 │
│ 28.7 ┆ 30.92 │
│ 22.9 ┆ 31.18 │
└──────┴──────────────┘
83. How to find the first occurrence of each unique value?
Difficulty Level: L2
For each unique category, find the row index and value of its first appearance.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'category': ['B','A','C','A','B','C','A','B'],
'value': np.random.randint(10, 99, 8).tolist()
})
# Write your code below
Desired Output:
shape: (3, 3)
┌──────────┬───────────────┬─────────────┐
│ category ┆ first_seen_at ┆ first_value │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i32 │
╞══════════╪═══════════════╪═════════════╡
│ B ┆ 0 ┆ 61 │
│ A ┆ 1 ┆ 24 │
│ C ┆ 2 ┆ 81 │
└──────────┴───────────────┴─────────────┘
84. How to find duplicate rows in a DataFrame?
Difficulty Level: L1
Find duplicate rows based on all columns.
Solve:
import polars as pl
df = pl.DataFrame({
'a': [1, 2, 2, 3, 3],
'b': ['x', 'y', 'y', 'z', 'z'],
})
# Write your code below
Desired Output:
shape: (4, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 2 ┆ y │
│ 2 ┆ y │
│ 3 ┆ z │
│ 3 ┆ z │
└─────┴─────┘
85. How to identify the top performer in each group?
Difficulty Level: L2
From df, select the player with the highest score in each team — using a window function, not group_by.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'team': ['Red','Red','Red','Blue','Blue','Blue','Green','Green','Green'],
'player': ['A','B','C','D','E','F','G','H','I'],
'score': np.random.randint(50, 100, 9).tolist()
})
# Write your code below
Desired Output:
shape: (3, 3)
┌───────┬────────┬───────┐
│ team ┆ player ┆ score │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 │
╞═══════╪════════╪═══════╡
│ Red ┆ A ┆ 88 │
│ Blue ┆ D ┆ 92 │
│ Green ┆ G ┆ 88 │
└───────┴────────┴───────┘
86. How to compute z-scores per group?
Difficulty Level: L2
Compute the z-score of value within each group using window functions.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'group': ['A','A','A','A','B','B','B','B'],
'value': np.random.normal(50, 10, 8).round(1).tolist()
})
# Write your code below
Desired Output:
shape: (8, 3)
┌───────┬───────┬─────────┐
│ group ┆ value ┆ z_score │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═══════╪═══════╪═════════╡
│ A ┆ 55.0 ┆ -0.19 │
│ A ┆ 48.6 ┆ -1.13 │
│ A ┆ 56.5 ┆ 0.03 │
│ A ┆ 65.2 ┆ 1.3 │
│ B ┆ 47.7 ┆ -0.8 │
│ B ┆ 47.7 ┆ -0.8 │
│ B ┆ 65.8 ┆ 1.26 │
│ B ┆ 57.7 ┆ 0.34 │
└───────┴───────┴─────────┘
87. How to compute expanding (cumulative) window aggregations?
Difficulty Level: L2
Add columns for cumulative sum, running max, and running min of sales.
Solve:
import polars as pl
df = pl.DataFrame({
'day': list(range(1, 8)),
'sales': [100, 150, 130, 200, 180, 220, 210]
})
# Write your code below
Desired Output:
shape: (7, 5)
┌─────┬───────┬───────────┬─────────────┬─────────────┐
│ day ┆ sales ┆ cum_sales ┆ running_max ┆ running_min │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪═══════════╪═════════════╪═════════════╡
│ 1 ┆ 100 ┆ 100 ┆ 100 ┆ 100 │
│ 2 ┆ 150 ┆ 250 ┆ 150 ┆ 100 │
│ 3 ┆ 130 ┆ 380 ┆ 150 ┆ 100 │
│ 4 ┆ 200 ┆ 580 ┆ 200 ┆ 100 │
│ 5 ┆ 180 ┆ 760 ┆ 200 ┆ 100 │
│ 6 ┆ 220 ┆ 980 ┆ 220 ┆ 100 │
│ 7 ┆ 210 ┆ 1190 ┆ 220 ┆ 100 │
└─────┴───────┴───────────┴─────────────┴─────────────┘
88. How to compute a conditional cumulative sum?
Difficulty Level: L3
Compute a running total of amount, but only accumulate rows where event == 'purchase'.
Solve:
import polars as pl
df = pl.DataFrame({
'event': ['login','purchase','login','purchase','login','purchase','login','purchase'],
'amount': [0, 50, 0, 30, 0, 80, 0, 20]
})
# Write your code below
Desired Output:
shape: (8, 3)
┌──────────┬────────┬────────────────────────┐
│ event ┆ amount ┆ running_purchase_total │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════════╪════════╪════════════════════════╡
│ login ┆ 0 ┆ 0 │
│ purchase ┆ 50 ┆ 50 │
│ login ┆ 0 ┆ 50 │
│ purchase ┆ 30 ┆ 80 │
│ login ┆ 0 ┆ 80 │
│ purchase ┆ 80 ┆ 160 │
│ login ┆ 0 ┆ 160 │
│ purchase ┆ 20 ┆ 180 │
└──────────┴────────┴────────────────────────┘
89. How to compute quarter-over-quarter growth rate within groups?
Difficulty Level: L2
For each company, compute the percentage growth in revenue from the previous quarter.
Solve:
import polars as pl
df = pl.DataFrame({
'company': ['AAPL','AAPL','AAPL','GOOG','GOOG','GOOG'],
'quarter': ['Q1','Q2','Q3','Q1','Q2','Q3'],
'revenue': [100, 120, 115, 200, 230, 250]
})
# Write your code below
Desired Output:
shape: (6, 5)
┌─────────┬─────────┬─────────┬──────────────┬────────────┐
│ company ┆ quarter ┆ revenue ┆ prev_revenue ┆ growth_pct │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪═════════╪═════════╪══════════════╪════════════╡
│ AAPL ┆ Q1 ┆ 100 ┆ null ┆ null │
│ AAPL ┆ Q2 ┆ 120 ┆ 100 ┆ 20.0 │
│ AAPL ┆ Q3 ┆ 115 ┆ 120 ┆ -4.2 │
│ GOOG ┆ Q1 ┆ 200 ┆ null ┆ null │
│ GOOG ┆ Q2 ┆ 230 ┆ 200 ┆ 15.0 │
│ GOOG ┆ Q3 ┆ 250 ┆ 230 ┆ 8.7 │
└─────────┴─────────┴─────────┴──────────────┴────────────┘
90. How to detect outliers using the IQR method?
Difficulty Level: L2
Find values in ser that fall outside 1.5 × IQR from the quartiles.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
data = list(np.random.normal(50, 10, 20).round(1)) + [150.0, -30.0] # inject outliers
ser = pl.Series("data", data)
# Write your code below
Desired Output:
Q1=40.9, Q3=55.4, IQR=14.5
Bounds: [19.1, 77.2]
Outliers: [150.0, -30.0]
91. How to use an anti-join to find missing records?
Difficulty Level: L2
Given a list of expected IDs and a DataFrame of received records, find which IDs are missing.
Solve:
import polars as pl
expected = pl.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
received = pl.DataFrame({"id": [1, 2, 4, 5, 7, 9], "value": [10, 20, 40, 50, 70, 90]})
# Write your code below
Desired Output:
shape: (4, 1)
┌─────┐
│ id │
│ --- │
│ i64 │
╞═════╡
│ 3 │
│ 6 │
│ 8 │
│ 10 │
└─────┘
92. How to select columns by dtype?
Difficulty Level: L2
Select only the float columns from df.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
['Min.Price', 'Price', 'Max.Price', 'EngineSize', 'Fuel.tank.capacity', 'Rear.seat.room']
93. How to categorize a numeric column using when/then/otherwise?
Difficulty Level: L2
Categorize 'medv' into 'low' (< 20), 'medium' (20-35), and 'high' (> 35).
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
# Write your code below
Desired Output:
shape: (10, 2)
┌──────┬──────────┐
│ medv ┆ category │
│ --- ┆ --- │
│ f64 ┆ str │
╞══════╪══════════╡
│ 24.0 ┆ medium │
│ 21.6 ┆ medium │
│ 34.7 ┆ medium │
│ 33.4 ┆ medium │
│ 36.2 ┆ high │
│ 28.7 ┆ medium │
│ 22.9 ┆ medium │
│ 27.1 ┆ medium │
│ 16.5 ┆ low │
│ 18.9 ┆ low │
└──────┴──────────┘
94. How to compute the mode of each column in a DataFrame?
Difficulty Level: L2
Find the most frequent value in each column of df.
Solve:
import polars as pl
df = pl.DataFrame({
'color': ['red','blue','red','green','blue','red','blue','green'],
'size': ['S','M','L','M','M','S','L','M'],
'rating': [5, 3, 5, 4, 3, 5, 3, 4]
})
# Write your code below
Desired Output:
Mode of 'color': red
Mode of 'size': M
Mode of 'rating': 5
95. How to use lazy evaluation in Polars?
Difficulty Level: L2
Use lazy evaluation to filter rows where 'Price' > 30 and select 'Manufacturer', 'Model', and 'Price'.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (12, 3)
┌───────────────┬─────────────┬───────┐
│ Manufacturer ┆ Model ┆ Price │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════════════╪═════════════╪═══════╡
│ null ┆ Legend ┆ 33.9 │
│ Audi ┆ 100 ┆ 37.7 │
│ Cadillac ┆ DeVille ┆ 34.7 │
│ Cadillac ┆ Seville ┆ 40.1 │
│ Chevrolet ┆ Corvette ┆ 38.0 │
│ … ┆ … ┆ … │
│ Lincoln ┆ Continental ┆ 34.3 │
│ Lincoln ┆ Town_Car ┆ 36.1 │
│ Mazda ┆ RX-7 ┆ 32.5 │
│ Mercedes-Benz ┆ 190E ┆ 31.9 │
│ Mercedes-Benz ┆ 300E ┆ 61.9 │
└───────────────┴─────────────┴───────┘
96. How to use window functions to compute group-level statistics alongside row-level data?
Difficulty Level: L2
Add a column showing the mean 'Price' per 'Type' alongside every row, without collapsing the DataFrame.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (10, 4)
┌──────────────┬─────────┬───────┬────────────────────┐
│ Manufacturer ┆ Type ┆ Price ┆ mean_price_by_type │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 │
╞══════════════╪═════════╪═══════╪════════════════════╡
│ Acura ┆ Small ┆ 15.90 ┆ 10.20 │
│ null ┆ Midsize ┆ 33.90 ┆ 27.65 │
│ Audi ┆ Compact ┆ 29.10 ┆ 18.21 │
│ Audi ┆ Midsize ┆ 37.70 ┆ 27.65 │
│ BMW ┆ Midsize ┆ 30.00 ┆ 27.65 │
│ Buick ┆ Midsize ┆ 15.70 ┆ 27.65 │
│ Buick ┆ Large ┆ 20.80 ┆ 24.30 │
│ Buick ┆ Large ┆ 23.70 ┆ 24.30 │
│ Buick ┆ Midsize ┆ 26.30 ┆ 27.65 │
│ Cadillac ┆ Large ┆ 34.70 ┆ 24.30 │
└──────────────┴─────────┴───────┴────────────────────┘
97. How to understand the difference between rank(method='min') and rank(method='dense')?
Difficulty Level: L2
Rank students by score using both min and dense methods and observe the difference when there are ties.
Solve:
import polars as pl
df = pl.DataFrame({
'student': ['Alice','Bob','Carol','Dave','Eve'],
'score': [88, 92, 88, 95, 92]
})
# Write your code below
Desired Output:
shape: (5, 4)
┌─────────┬───────┬──────────┬────────────┐
│ student ┆ score ┆ rank_min ┆ rank_dense │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i32 ┆ i32 │
╞═════════╪═══════╪══════════╪════════════╡
│ Alice ┆ 88 ┆ 4 ┆ 3 │
│ Bob ┆ 92 ┆ 2 ┆ 2 │
│ Carol ┆ 88 ┆ 4 ┆ 3 │
│ Dave ┆ 95 ┆ 1 ┆ 1 │
│ Eve ┆ 92 ┆ 2 ┆ 2 │
└─────────┴───────┴──────────┴────────────┘
98. How to clean and standardize messy string columns?
Difficulty Level: L2
Clean first_name and last_name (strip whitespace, title-case), combine into full_name, and normalize email_raw to lowercase.
Solve:
import polars as pl
df = pl.DataFrame({
'first_name': [' John ', 'ALICE', 'bob ', ' Carol'],
'last_name': ['DOE ', ' Smith', 'JONES', ' Lee '],
'email_raw': ['John.Doe@GMAIL.COM', 'alice@Yahoo.com', 'BOB@hotmail.COM', 'carol@outlook.COM']
})
# Write your code below
Desired Output:
shape: (4, 2)
┌─────────────┬────────────────────┐
│ full_name ┆ email_clean │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════╪════════════════════╡
│ John Doe ┆ john.doe@gmail.com │
│ Alice Smith ┆ alice@yahoo.com │
│ Bob Jones ┆ bob@hotmail.com │
│ Carol Lee ┆ carol@outlook.com │
└─────────────┴────────────────────┘
99. How to extract date features for machine learning?
Difficulty Level: L2
From a date column, extract month, weekday, quarter, and create a boolean is_holiday_season column (Nov, Dec, Jan).
Solve:
import polars as pl
from datetime import date
df = pl.DataFrame({
'order_date': pl.date_range(date(2024, 1, 1), date(2024, 12, 31), eager=True)
}).sample(8, seed=42).sort("order_date")
# Write your code below
Desired Output:
shape: (8, 5)
┌────────────┬───────┬─────────┬─────────┬───────────────────┐
│ order_date ┆ month ┆ weekday ┆ quarter ┆ is_holiday_season │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ i8 ┆ i8 ┆ i8 ┆ bool │
╞════════════╪═══════╪═════════╪═════════╪═══════════════════╡
│ 2024-02-15 ┆ 2 ┆ 4 ┆ 1 ┆ false │
│ 2024-04-24 ┆ 4 ┆ 3 ┆ 2 ┆ false │
│ 2024-08-02 ┆ 8 ┆ 5 ┆ 3 ┆ false │
│ 2024-08-09 ┆ 8 ┆ 5 ┆ 3 ┆ false │
│ 2024-09-10 ┆ 9 ┆ 2 ┆ 3 ┆ false │
│ 2024-10-15 ┆ 10 ┆ 2 ┆ 4 ┆ false │
│ 2024-10-19 ┆ 10 ┆ 6 ┆ 4 ┆ false │
│ 2024-12-21 ┆ 12 ┆ 6 ┆ 4 ┆ true │
└────────────┴───────┴─────────┴─────────┴───────────────────┘
100. How to explode a list column and compute aggregations?
Difficulty Level: L3
Each user has a list of tags. Explode the tags, then count how many users have each tag and list who they are.
Solve:
import polars as pl
df = pl.DataFrame({
'user': ['Alice', 'Bob', 'Carol'],
'tags': [['python', 'polars', 'ML'], ['python', 'rust'], ['polars', 'ML', 'DL', 'python']]
})
# Write your code below
Desired Output:
shape: (5, 3)
┌────────┬───────────┬───────────────────────────┐
│ tags ┆ num_users ┆ users │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ list[str] │
╞════════╪═══════════╪═══════════════════════════╡
│ python ┆ 3 ┆ ["Alice", "Bob", "Carol"] │
│ ML ┆ 2 ┆ ["Alice", "Carol"] │
│ polars ┆ 2 ┆ ["Alice", "Carol"] │
│ DL ┆ 1 ┆ ["Carol"] │
│ rust ┆ 1 ┆ ["Bob"] │
└────────┴───────────┴───────────────────────────┘
101. How to use struct and unnest to work with nested data?
Difficulty Level: L3
Create a struct column from 'Manufacturer' and 'Model', then unnest it back.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (3, 2)
┌─────────────────────┬───────┐
│ car_info ┆ Price │
│ --- ┆ --- │
│ struct[2] ┆ f64 │
╞═════════════════════╪═══════╡
│ {"Acura","Integra"} ┆ 15.9 │
│ {null,"Legend"} ┆ 33.9 │
│ {"Audi","90"} ┆ 29.1 │
└─────────────────────┴───────┘
shape: (3, 3)
┌──────────────┬─────────┬───────┐
│ Manufacturer ┆ Model ┆ Price │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════════════╪═════════╪═══════╡
│ Acura ┆ Integra ┆ 15.9 │
│ null ┆ Legend ┆ 33.9 │
│ Audi ┆ 90 ┆ 29.1 │
└──────────────┴─────────┴───────┘