101 Polars Exercises for Data Analysis (with Solutions)
Master Polars with 101 hands-on exercises and solutions — covering DataFrames, groupby, joins, window functions, lazy eval, and more.
Practice Polars — the blazing-fast DataFrame library for Python — with these 101 exercises ranging from beginner to advanced.
This post has interactive code — click ‘Run’ or press Ctrl+Enter on any code block to execute it directly in your browser. The first run may take a few seconds to initialize.
Polars is a lightning-fast DataFrame library written in Rust with a Python API. It is designed for performance and ergonomics, offering lazy evaluation, expressive syntax, and first-class support for parallel execution. These 101 exercises will help you master Polars through hands-on practice.
The exercises are organized by increasing difficulty across topics like Series, DataFrames, filtering, groupby, joins, string operations, datetime handling, reshaping, and more.
Before you begin: Run the code block below to install Polars. This only needs to be done once per session.
import polars as pl
print("Polars", pl.__version__, "ready!")
Difficulty Levels:
- L1 — Beginner
- L2 — Intermediate
- L3 — Advanced
1. How to import polars and check the version?
Difficulty Level: L1
Import polars and print the version installed.
Solve:
# Task: Import polars and check the version
# Write your code below
Desired Output:
1.39.2
2. How to create a Series from a list, numpy array, and dict?
Difficulty Level: L1
Create a polars Series from each of the following: a list, a numpy array, and a dictionary (keys as name, values as data).
Solve:
import polars as pl
import numpy as np
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
# Write your code below
Desired Output:
shape: (10,)
Series: 'values' [i32]
[
0
1
2
3
4
5
6
7
8
9
]
3. How to convert a Series into a DataFrame with the index as a column?
Difficulty Level: L1
Polars doesn’t have an index. Given a dictionary, create a two-column DataFrame with keys in one column and values in another.
Solve:
import polars as pl
import numpy as np
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
# Write your code below
Desired Output:
shape: (5, 2)
┌─────┬───────┐
│ key ┆ value │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═══════╡
│ a ┆ 0 │
│ b ┆ 1 │
│ c ┆ 2 │
│ e ┆ 3 │
│ d ┆ 4 │
└─────┴───────┘
4. How to combine many Series to form a DataFrame?
Difficulty Level: L1
Combine ser1 and ser2 to form a DataFrame.
Solve:
import polars as pl
import numpy as np
ser1 = pl.Series("col1", list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pl.Series("col2", np.arange(26).tolist())
# Write your code below
Desired Output:
shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ str ┆ i32 │
╞══════╪══════╡
│ a ┆ 0 │
│ b ┆ 1 │
│ c ┆ 2 │
│ e ┆ 3 │
│ d ┆ 4 │
└──────┴──────┘
5. How to assign a name to a Series?
Difficulty Level: L1
Give a name 'alphabets' to the series ser.
Solve:
import polars as pl
ser = pl.Series(list('abcedfghijklmnopqrstuvwxyz'))
# Write your code below
Desired Output:
shape: (10,)
Series: 'alphabets' [str]
[
"a"
"b"
"c"
"e"
"d"
"f"
"g"
"h"
"i"
"j"
]
6. How to get the items of Series A not present in Series B?
Difficulty Level: L2
From ser1, remove items present in ser2.
Solve:
import polars as pl
ser1 = pl.Series("a", [1, 2, 3, 4, 5])
ser2 = pl.Series("b", [4, 5, 6, 7, 8])
# Write your code below
Desired Output:
shape: (3,)
Series: 'a' [i64]
[
1
2
3
]
7. How to get the items not common to both Series A and Series B?
Difficulty Level: L2
Get all items of ser1 and ser2 not common to both.
Solve:
import polars as pl
ser1 = pl.Series("a", [1, 2, 3, 4, 5])
ser2 = pl.Series("b", [4, 5, 6, 7, 8])
# Write your code below
Desired Output:
shape: (6,)
Series: 'union' [i64]
[
1
2
3
6
7
8
]
8. How to get the minimum, 25th percentile, median, 75th, and max of a numeric Series?
Difficulty Level: L1
Compute the min, 25th percentile, median, 75th percentile, and max of ser.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("data", np.random.normal(10, 5, 25).tolist())
# Write your code below
Desired Output:
[0.43, 7.19, 8.83, 12.48, 17.9]
9. How to get frequency counts of unique items of a Series?
Difficulty Level: L1
Calculate the frequency counts of each unique value in ser.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("chars", np.take(list('abcdefgh'), np.random.randint(8, size=30)))
# Write your code below
Desired Output:
shape: (7, 2)
┌───────┬───────┐
│ chars ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════╪═══════╡
│ h ┆ 6 │
│ c ┆ 5 │
│ e ┆ 5 │
│ d ┆ 4 │
│ g ┆ 4 │
│ f ┆ 3 │
│ b ┆ 3 │
└───────┴───────┘
10. How to keep only the top 2 most frequent values and replace everything else as ‘Other’?
Difficulty Level: L2
In ser, keep the top 2 most frequent values as-is. Replace all other values with 'Other'.
Solve:
import polars as pl
import numpy as np
np.random.seed(100)
ser = pl.Series("data", np.random.randint(1, 5, [12]).tolist())
# Write your code below
Desired Output:
shape: (12, 1)
┌───────┐
│ data │
│ --- │
│ str │
╞═══════╡
│ 1 │
│ 1 │
│ 4 │
│ 4 │
│ 4 │
│ … │
│ Other │
│ Other │
│ 1 │
│ Other │
│ Other │
└───────┘
11. How to bin a numeric Series to 10 groups of equal size?
Difficulty Level: L2
Bin the series ser into 10 equal-sized decile groups and label them from 1st to 10th.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("data", np.random.random(20).tolist())
# Write your code below
Desired Output:
shape: (10,)
Series: 'data' [cat]
[
"4th"
"10th"
"8th"
"6th"
"2nd"
"2nd"
"1st"
"9th"
"7th"
"8th"
]
12. How to convert a numpy array to a DataFrame of given shape?
Difficulty Level: L1
Reshape the series ser into a DataFrame with 7 rows and 5 columns.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("data", np.random.randint(1, 10, 35).tolist())
# Write your code below
Desired Output:
shape: (7, 5)
┌──────┬──────┬──────┬──────┬──────┐
│ col0 ┆ col1 ┆ col2 ┆ col3 ┆ col4 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞══════╪══════╪══════╪══════╪══════╡
│ 7 ┆ 4 ┆ 8 ┆ 5 ┆ 7 │
│ 3 ┆ 7 ┆ 8 ┆ 5 ┆ 4 │
│ 8 ┆ 8 ┆ 3 ┆ 6 ┆ 5 │
│ 2 ┆ 8 ┆ 6 ┆ 2 ┆ 5 │
│ 1 ┆ 6 ┆ 9 ┆ 1 ┆ 3 │
│ 7 ┆ 4 ┆ 9 ┆ 3 ┆ 5 │
│ 3 ┆ 7 ┆ 5 ┆ 9 ┆ 7 │
└──────┴──────┴──────┴──────┴──────┘
13. How to find the positions of numbers that are multiples of 3 from a Series?
Difficulty Level: L2
Find the positions of numbers that are multiples of 3 from ser.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("data", np.random.randint(1, 10, 7).tolist())
# Write your code below
Desired Output:
shape: (7,)
Series: 'data' [i32]
[
7
4
8
5
7
3
7
]
[5]
14. How to extract items at given positions from a Series?
Difficulty Level: L1
From ser, extract the items at positions 0, 4, 8, and 14.
Solve:
import polars as pl
ser = pl.Series("letters", list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14]
# Write your code below
Desired Output:
shape: (4,)
Series: 'letters' [str]
[
"a"
"e"
"i"
"o"
]
15. How to stack two Series vertically and horizontally?
Difficulty Level: L1
Stack ser1 and ser2 vertically and horizontally (to form a DataFrame).
Solve:
import polars as pl
ser1 = pl.Series("col1", [0, 1, 2, 3, 4])
ser2 = pl.Series("col2", [5, 6, 7, 8, 9])
# Write your code below
Desired Output:
shape: (10,)
Series: 'col1' [i64]
[
0
1
2
3
4
5
6
7
8
9
]
shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 0 ┆ 5 │
│ 1 ┆ 6 │
│ 2 ┆ 7 │
│ 3 ┆ 8 │
│ 4 ┆ 9 │
└──────┴──────┘
16. How to get the positions of items of Series A in Series B?
Difficulty Level: L2
Get the positions of items of ser2 in ser1 as a list.
Solve:
import polars as pl
ser1 = pl.Series("a", [10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pl.Series("b", [1, 3, 10, 13])
# Write your code below
Desired Output:
[5, 4, 0, 8]
17. How to compute the mean squared error between a truth and predicted Series?
Difficulty Level: L1
Compute the mean squared error of truth and pred Series.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
truth = pl.Series("truth", np.arange(10).tolist())
pred = pl.Series("pred", (np.arange(10) + np.random.random(10)).tolist())
# Write your code below
Desired Output:
0.3603
18. How to convert the first character of each element in a Series to uppercase?
Difficulty Level: L2
Change the first character of each word in ser to upper case.
Solve:
import polars as pl
ser = pl.Series("words", ['how', 'to', 'kick', 'ass?'])
# Write your code below
Desired Output:
shape: (4, 1)
┌───────┐
│ words │
│ --- │
│ str │
╞═══════╡
│ How │
│ To │
│ Kick │
│ Ass? │
└───────┘
19. How to calculate the number of characters in each word in a Series?
Difficulty Level: L1
Calculate the number of characters in each word in ser.
Solve:
import polars as pl
ser = pl.Series("words", ['how', 'to', 'kick', 'ass?'])
# Write your code below
Desired Output:
shape: (4,)
Series: 'words' [u32]
[
3
2
4
4
]
20. How to compute the difference of differences between consecutive numbers of a Series?
Difficulty Level: L1
Compute the difference of differences between consecutive numbers of ser.
Solve:
import polars as pl
ser = pl.Series("data", [1, 3, 6, 10, 15, 21, 27, 35])
# Write your code below
Desired Output:
shape: (6,)
Series: 'data' [i64]
[
1
1
1
1
0
2
]
21. How to convert a Series of date strings to a datetime type?
Difficulty Level: L2
Convert the ser to a datetime Series.
Solve:
import polars as pl
ser = pl.Series("dates", ['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
# Write your code below
Desired Output:
shape: (6,)
Series: 'dates' [datetime[μs]]
[
2010-01-01 00:00:00
2011-02-02 00:00:00
2012-03-03 00:00:00
2013-04-04 00:00:00
2014-05-05 00:00:00
2015-06-06 12:20:00
]
22. How to get the day of month, week number, day of year, and day of week from a datetime Series?
Difficulty Level: L2
Get the day of month, week number, day of year, and day of week from ser.
Solve:
import polars as pl
from dateutil.parser import parse
ser = pl.Series("dates", [parse(d) for d in ['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20']])
# Write your code below
Desired Output:
shape: (6, 5)
┌─────────────────────┬──────────────┬─────────────┬─────────────┬─────────────┐
│ dates ┆ day_of_month ┆ week_number ┆ day_of_year ┆ day_of_week │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i8 ┆ i8 ┆ i16 ┆ i8 │
╞═════════════════════╪══════════════╪═════════════╪═════════════╪═════════════╡
│ 2010-01-01 00:00:00 ┆ 1 ┆ 53 ┆ 1 ┆ 5 │
│ 2011-02-02 00:00:00 ┆ 2 ┆ 5 ┆ 33 ┆ 3 │
│ 2012-03-03 00:00:00 ┆ 3 ┆ 9 ┆ 63 ┆ 6 │
│ 2013-04-04 00:00:00 ┆ 4 ┆ 14 ┆ 94 ┆ 4 │
│ 2014-05-05 00:00:00 ┆ 5 ┆ 19 ┆ 125 ┆ 1 │
│ 2015-06-06 12:20:00 ┆ 6 ┆ 23 ┆ 157 ┆ 6 │
└─────────────────────┴──────────────┴─────────────┴─────────────┴─────────────┘
23. How to convert year-month string to dates corresponding to the 4th day of the month?
Difficulty Level: L2
Change ser to dates that start with the 4th of the respective months.
Solve:
import polars as pl
ser = pl.Series("dates", ['Jan 2010', 'Feb 2011', 'Mar 2012'])
# Write your code below
Desired Output:
shape: (3,)
Series: 'dates' [datetime[μs]]
[
2010-01-04 00:00:00
2011-02-04 00:00:00
2012-03-04 00:00:00
]
24. How to filter words that contain at least 2 vowels from a Series?
Difficulty Level: L3
From ser, extract words that contain at least 2 vowels.
Solve:
import polars as pl
ser = pl.Series("words", ['Apple', 'Orange', 'Plan', 'Python', 'Money'])
# Write your code below
Desired Output:
shape: (3, 1)
┌────────┐
│ words │
│ --- │
│ str │
╞════════╡
│ Apple │
│ Orange │
│ Money │
└────────┘
25. How to filter valid emails from a Series?
Difficulty Level: L3
Extract valid email addresses from ser.
Solve:
import polars as pl
emails = pl.Series("emails", ['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])
# Write your code below
Desired Output:
shape: (3, 1)
┌───────────────────┐
│ emails │
│ --- │
│ str │
╞═══════════════════╡
│ rameses@egypt.com │
│ matt@t.co │
│ narendra@modi.com │
└───────────────────┘
26. How to get the mean of a Series grouped by another Series?
Difficulty Level: L2
Compute the mean of weights grouped by fruit.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
fruit = pl.Series("fruit", np.random.choice(['apple', 'banana', 'carrot'], 10).tolist())
weights = pl.Series("weights", np.linspace(1, 10, 10).tolist())
# Write your code below
Desired Output:
shape: (3, 2)
┌────────┬──────────┐
│ fruit ┆ weights │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪══════════╡
│ apple ┆ 4.333333 │
│ banana ┆ 8.0 │
│ carrot ┆ 5.666667 │
└────────┴──────────┘
27. How to compute the euclidean distance between two Series?
Difficulty Level: L1
Compute the euclidean distance between p and q.
Solve:
import polars as pl
p = pl.Series("p", list(range(1, 11)))
q = pl.Series("q", list(range(10, 0, -1)))
# Write your code below
Desired Output:
18.17
28. How to find all the local maxima (peaks) in a numeric Series?
Difficulty Level: L3
Get the positions of peaks (values surrounded by smaller values on both sides) in ser.
Solve:
import polars as pl
ser = pl.Series("data", [2, 10, 3, 4, 9, 10, 2, 7, 3])
# Write your code below
Desired Output:
[1, 5, 7]
29. How to replace missing spaces in a string with the least frequent character?
Difficulty Level: L2
Replace the spaces in my_str with whichever character is the least frequent, excluding spaces.
Solve:
my_str = 'dbc deb abed gade'
# Write your code below
Desired Output:
Least frequent char: g
dbcgdebgabedggade
30. How to create a TimeSeries starting ‘2000-01-01’ and 10 weekends (Saturdays)?
Difficulty Level: L2
Create a Polars DataFrame with 10 Saturday dates starting from 2000-01-01 and random integer values.
Solve:
import polars as pl
import numpy as np
# Write your code below
Desired Output:
shape: (10, 2)
┌────────────┬───────┐
│ date ┆ value │
│ --- ┆ --- │
│ date ┆ i32 │
╞════════════╪═══════╡
│ 2000-01-01 ┆ 7 │
│ 2000-01-08 ┆ 4 │
│ 2000-01-15 ┆ 8 │
│ 2000-01-22 ┆ 5 │
│ 2000-01-29 ┆ 7 │
│ 2000-02-05 ┆ 3 │
│ 2000-02-12 ┆ 7 │
│ 2000-02-19 ┆ 8 │
│ 2000-02-26 ┆ 5 │
│ 2000-03-04 ┆ 4 │
└────────────┴───────┘
31. How to fill missing dates and forward-fill values?
Difficulty Level: L2
ser has missing dates. Fill in the missing dates and forward-fill the corresponding values.
Solve:
import polars as pl
from datetime import date
df = pl.DataFrame({
"date": [date(2000,1,1), date(2000,1,3), date(2000,1,6), date(2000,1,8)],
"value": [1, 10, 3, None]
})
# Write your code below
Desired Output:
shape: (8, 2)
┌────────────┬───────┐
│ date ┆ value │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═══════╡
│ 2000-01-01 ┆ 1 │
│ 2000-01-02 ┆ 1 │
│ 2000-01-03 ┆ 10 │
│ 2000-01-04 ┆ 10 │
│ 2000-01-05 ┆ 10 │
│ 2000-01-06 ┆ 3 │
│ 2000-01-07 ┆ 3 │
│ 2000-01-08 ┆ 3 │
└────────────┴───────┘
32. How to find the autocorrelation of a numeric Series?
Difficulty Level: L3
Compute autocorrelation for lags 1 through 10 of ser, and find the lag with the highest correlation.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("data", (np.arange(20) + np.random.normal(1, 10, 20)).tolist())
# Write your code below
Desired Output:
[-0.04, -0.36, 0.24, -0.23, -0.06, 0.1, -0.59, -0.13, 0.33, -0.03]
Lag with highest correlation: 7
33. How to import only specified columns from a CSV file?
Difficulty Level: L2
Import ‘crim’ and ‘medv’ columns from the BostonHousing dataset CSV.
Solve:
import polars as pl
url = 'https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv'
# Write your code below
Desired Output:
shape: (5, 2)
┌─────────┬──────┐
│ crim ┆ medv │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════════╪══════╡
│ 0.00632 ┆ 24.0 │
│ 0.02731 ┆ 21.6 │
│ 0.02729 ┆ 34.7 │
│ 0.03237 ┆ 33.4 │
│ 0.06905 ┆ 36.2 │
└─────────┴──────┘
34. How to get the nrows, ncolumns, datatype, summary stats of each column of a DataFrame?
Difficulty Level: L2
Get the number of rows, columns, datatypes, and summary stats of the Cars93 DataFrame.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
Shape: (93, 27)
Column dtypes:
Manufacturer: String
Model: String
Type: String
Min.Price: Float64
Price: Float64
...
35. How to extract the row and column number of a particular cell with given criterion?
Difficulty Level: L1
Which manufacturer, model, and type has the highest Price? What is the row and column number of the cell with the highest Price value?
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
Row: 58
Column: 4
Mercedes-Benz 300E Midsize Price: 61.9
36. How to rename a specific column in a DataFrame?
Difficulty Level: L2
Rename the column Type to CarType in df.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
['Manufacturer', 'Model', 'CarType', 'Min.Price', 'Price']
37. How to check if a DataFrame has any missing values?
Difficulty Level: L1
Check if df has any missing values.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
True
38. How to count the number of missing values in each column?
Difficulty Level: L1
Count the number of missing values in each column of df.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
Manufacturer: 4
Price: 2
Type: 3
Min.Price: 7
Max.Price: 5
39. How to replace missing values of multiple numeric columns with the mean?
Difficulty Level: L2
Replace NaNs/nulls with the column mean for all numeric columns in df.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
Numeric nulls before: 99, after: 0
40. How to use apply function on existing columns with global variables as additional arguments?
Difficulty Level: L2
In df, use polars expressions to compute a new column 'avg' that is the row-mean of columns 'a', 'b', and 'c', then add a column 'avg_mf' = avg × d (where d is an external variable).
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 10, 15).reshape(5, 3).tolist(), schema=['a', 'b', 'c'])
d = 5
# Write your code below
Desired Output:
shape: (5, 5)
┌─────┬─────┬─────┬──────────┬───────────┐
│ a ┆ b ┆ c ┆ avg ┆ avg_mf │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ f64 ┆ f64 │
╞═════╪═════╪═════╪══════════╪═══════════╡
│ 7 ┆ 4 ┆ 8 ┆ 6.333333 ┆ 31.666667 │
│ 5 ┆ 7 ┆ 3 ┆ 5.0 ┆ 25.0 │
│ 7 ┆ 8 ┆ 5 ┆ 6.666667 ┆ 33.333333 │
│ 4 ┆ 8 ┆ 8 ┆ 6.666667 ┆ 33.333333 │
│ 3 ┆ 6 ┆ 5 ┆ 4.666667 ┆ 23.333333 │
└─────┴─────┴─────┴──────────┴───────────┘
41. How to swap two columns in a DataFrame?
Difficulty Level: L2
In df, swap columns 'a' and 'c'.
Solve:
import polars as pl
import numpy as np
df = pl.DataFrame(np.arange(20).reshape(-1, 5).tolist(), schema=list('abcde'))
# Write your code below
Desired Output:
shape: (4, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ c ┆ b ┆ a ┆ d ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╪═════╪═════╡
│ 2 ┆ 1 ┆ 0 ┆ 3 ┆ 4 │
│ 7 ┆ 6 ┆ 5 ┆ 8 ┆ 9 │
│ 12 ┆ 11 ┆ 10 ┆ 13 ┆ 14 │
│ 17 ┆ 16 ┆ 15 ┆ 18 ┆ 19 │
└─────┴─────┴─────┴─────┴─────┘
42. How to sort columns in reverse alphabetical order?
Difficulty Level: L2
Sort the columns of df in reverse alphabetical order.
Solve:
import polars as pl
import numpy as np
df = pl.DataFrame(np.arange(20).reshape(-1, 5).tolist(), schema=list('abcde'))
# Write your code below
Desired Output:
shape: (4, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ e ┆ d ┆ c ┆ b ┆ a │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╪═════╪═════╡
│ 4 ┆ 3 ┆ 2 ┆ 1 ┆ 0 │
│ 9 ┆ 8 ┆ 7 ┆ 6 ┆ 5 │
│ 14 ┆ 13 ┆ 12 ┆ 11 ┆ 10 │
│ 19 ┆ 18 ┆ 17 ┆ 16 ┆ 15 │
└─────┴─────┴─────┴─────┴─────┘
43. How to format or suppress scientific notations in a Polars DataFrame?
Difficulty Level: L2
When displaying a DataFrame with very small numbers, format them as fixed-point with 4 decimal places.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame((np.random.random([5, 3]) / 1e3).tolist(), schema=['a', 'b', 'c'])
# Write your code below
Desired Output:
shape: (5, 3)
┌────────┬────────┬────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞════════╪════════╪════════╡
│ 0.0004 ┆ 0.0010 ┆ 0.0007 │
│ 0.0006 ┆ 0.0002 ┆ 0.0002 │
│ 0.0001 ┆ 0.0009 ┆ 0.0006 │
│ 0.0007 ┆ 0.0000 ┆ 0.0010 │
│ 0.0008 ┆ 0.0002 ┆ 0.0002 │
└────────┴────────┴────────┘
44. How to format all values in a DataFrame to show only 4 decimal places?
Difficulty Level: L2
Show all float values in df rounded to 4 decimal places.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.random([5, 3]).tolist(), schema=['a', 'b', 'c'])
# Write your code below
Desired Output:
shape: (5, 3)
┌────────┬────────┬────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞════════╪════════╪════════╡
│ 0.3745 ┆ 0.9507 ┆ 0.732 │
│ 0.5987 ┆ 0.156 ┆ 0.156 │
│ 0.0581 ┆ 0.8662 ┆ 0.6011 │
│ 0.7081 ┆ 0.0206 ┆ 0.9699 │
│ 0.8324 ┆ 0.2123 ┆ 0.1818 │
└────────┴────────┴────────┘
45. How to filter rows of a DataFrame by row number?
Difficulty Level: L1
Select every 20th row starting from the 1st row (row 0).
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (5, 3)
┌──────────────┬─────────┬─────────┐
│ Manufacturer ┆ Model ┆ Type │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════════╪═════════╪═════════╡
│ Acura ┆ Integra ┆ Small │
│ Chrysler ┆ LeBaron ┆ Compact │
│ Honda ┆ Prelude ┆ Sporty │
│ Mercury ┆ Cougar ┆ Midsize │
│ Subaru ┆ Loyale ┆ Small │
└──────────────┴─────────┴─────────┘
46. How to create a primary key index by combining relevant columns?
Difficulty Level: L2
In df, replace nulls with 'missing' in columns 'Manufacturer', 'Model', and 'Type', then create a new column 'primary_key' as a combination of these three columns. Check if it is unique.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
True
47. How to get the row number of the n-th largest value in a column?
Difficulty Level: L2
Find the row position of the 5th largest value of column 'a' in df.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 30, 30).reshape(10, -1).tolist(), schema=list('abc'))
# Write your code below
Desired Output:
DataFrame:
shape: (10, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 7 ┆ 20 ┆ 29 │
│ 15 ┆ 11 ┆ 8 │
│ 29 ┆ 21 ┆ 7 │
│ 26 ┆ 19 ┆ 23 │
│ 11 ┆ 11 ┆ 24 │
│ 21 ┆ 4 ┆ 8 │
│ 24 ┆ 3 ┆ 22 │
│ 21 ┆ 2 ┆ 24 │
│ 12 ┆ 6 ┆ 2 │
│ 28 ┆ 21 ┆ 1 │
└─────┴─────┴─────┘
Row index of 5th largest value in 'a': 5
48. How to find the position of the n-th largest value greater than the mean?
Difficulty Level: L2
Find the positions of values in ser that are greater than the mean. Report the 2nd position.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
ser = pl.Series("data", np.random.randint(1, 100, 15).tolist())
# Write your code below
Desired Output:
Series: [52, 93, 15, 72, 61, 21, 83, 87, 75, 75, 88, 24, 3, 22, 53]
Mean: 55
2nd position where value > mean: 3
49. How to get the last two rows of a DataFrame whose row sum > 100?
Difficulty Level: L2
Get the last two rows of df where the sum of the row values exceeds 100.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4).tolist(), schema=[f"c{i}" for i in range(4)])
# Write your code below
Desired Output:
shape: (2, 4)
┌─────┬─────┬─────┬─────┐
│ c0 ┆ c1 ┆ c2 ┆ c3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╪═════╡
│ 24 ┆ 39 ┆ 39 ┆ 24 │
│ 39 ┆ 28 ┆ 21 ┆ 32 │
└─────┴─────┴─────┴─────┘
50. How to find and cap outliers from a Series or DataFrame column?
Difficulty Level: L2
Replace all values in ser that are above the 95th percentile or below the 5th percentile with the respective percentile value.
Solve:
import polars as pl
import numpy as np
np.random.seed(100)
ser = pl.Series("data", np.random.normal(0, 1, 50).tolist())
# Write your code below
Desired Output:
Low: -1.6906, High: 1.4707
shape: (5, 1)
┌───────────┐
│ data │
│ --- │
│ f64 │
╞═══════════╡
│ -1.690617 │
│ 0.34268 │
│ 1.153036 │
│ -0.252436 │
│ 0.981321 │
└───────────┘
51. How to reshape a DataFrame from long to wide format?
Difficulty Level: L3
Pivot df so each unique 'car' becomes a row and the columns are the cities with corresponding 'price' values.
Solve:
import polars as pl
df = pl.DataFrame({
"car": ["Audi", "Audi", "BMW", "BMW"],
"city": ["SF", "NYC", "SF", "NYC"],
"price": [45000, 42000, 55000, 52000]
})
# Write your code below
Desired Output:
shape: (2, 3)
┌──────┬───────┬───────┐
│ car ┆ SF ┆ NYC │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════╪═══════╪═══════╡
│ Audi ┆ 45000 ┆ 42000 │
│ BMW ┆ 55000 ┆ 52000 │
└──────┴───────┴───────┘
52. How to reshape a DataFrame from wide to long format?
Difficulty Level: L2
Melt df so each car-city pair becomes a row.
Solve:
import polars as pl
df = pl.DataFrame({
"car": ["Audi", "BMW"],
"SF": [45000, 55000],
"NYC": [42000, 52000]
})
# Write your code below
Desired Output:
shape: (4, 3)
┌──────┬──────┬───────┐
│ car ┆ city ┆ price │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞══════╪══════╪═══════╡
│ Audi ┆ SF ┆ 45000 │
│ BMW ┆ SF ┆ 55000 │
│ Audi ┆ NYC ┆ 42000 │
│ BMW ┆ NYC ┆ 52000 │
└──────┴──────┴───────┘
53. How to create a DataFrame with rows as stacked columns?
Difficulty Level: L3
Create a DataFrame where each row is a column name – column value pair for the first row of df.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (10, 2)
┌──────────────┬─────────┐
│ column ┆ value │
│ --- ┆ --- │
│ str ┆ str │
╞══════════════╪═════════╡
│ Manufacturer ┆ Acura │
│ Model ┆ Integra │
│ Type ┆ Small │
│ Min.Price ┆ 12.9 │
│ Price ┆ 15.9 │
│ Max.Price ┆ 18.8 │
│ MPG.city ┆ 25 │
│ MPG.highway ┆ 31 │
│ AirBags ┆ None │
│ DriveTrain ┆ Front │
└──────────────┴─────────┘
54. How to find which columns of a DataFrame have missing values?
Difficulty Level: L1
Check which columns in df have any null values.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
['Manufacturer', 'Model', 'Type', 'Min.Price', 'Price', 'Max.Price', 'MPG.city', 'MPG.highway', 'AirBags', 'DriveTrain', 'Cylinders', 'EngineSize', 'Horsepower', 'RPM', 'Rev.per.mile', 'Man.trans.avail', 'Fuel.tank.capacity', 'Passengers', 'Length', 'Wheelbase', 'Width', 'Turn.circle', 'Rear.seat.room', 'Luggage.room', 'Weight', 'Origin', 'Make']
55. How to get the minimum value in each column grouped by another column?
Difficulty Level: L2
In df, for each 'Type', get the minimum 'Price'.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (7, 2)
┌─────────┬───────┐
│ Type ┆ Price │
│ --- ┆ --- │
│ str ┆ f64 │
╞═════════╪═══════╡
│ null ┆ 8.6 │
│ Compact ┆ 11.1 │
│ Large ┆ 18.4 │
│ Midsize ┆ 13.9 │
│ Small ┆ 7.4 │
│ Sporty ┆ 12.5 │
│ Van ┆ 16.3 │
└─────────┴───────┘
56. How to get the top n rows of each group in a DataFrame?
Difficulty Level: L2
For each 'Type', get the top 2 rows with the highest 'Price'.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (10, 4)
┌─────────┬───────────────┬──────────┬───────┐
│ Type ┆ Manufacturer ┆ Model ┆ Price │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 │
╞═════════╪═══════════════╪══════════╪═══════╡
│ null ┆ Pontiac ┆ Firebird ┆ 17.7 │
│ null ┆ Hyundai ┆ Scoupe ┆ 10.0 │
│ Compact ┆ Mercedes-Benz ┆ 190E ┆ 31.9 │
│ Compact ┆ Audi ┆ 90 ┆ 29.1 │
│ Large ┆ Lincoln ┆ Town_Car ┆ 36.1 │
│ Large ┆ Cadillac ┆ DeVille ┆ 34.7 │
│ Midsize ┆ Toyota ┆ Camry ┆ null │
│ Midsize ┆ Mercedes-Benz ┆ 300E ┆ 61.9 │
│ Small ┆ Saturn ┆ SL ┆ null │
│ Small ┆ Acura ┆ Integra ┆ 15.9 │
└─────────┴───────────────┴──────────┴───────┘
57. How to replace missing values with the mode of a column?
Difficulty Level: L2
Replace the missing values in 'DriveTrain' column with its mode (most frequent value).
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
Mode: Front
Nulls after fill: 0
58. How to create a new column from existing columns using a condition?
Difficulty Level: L2
Create a new column 'price_category' that says 'high' if Price > 30 else 'low'.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (10, 3)
┌──────────────┬───────┬────────────────┐
│ Manufacturer ┆ Price ┆ price_category │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ str │
╞══════════════╪═══════╪════════════════╡
│ Acura ┆ 15.9 ┆ low │
│ null ┆ 33.9 ┆ high │
│ Audi ┆ 29.1 ┆ low │
│ Audi ┆ 37.7 ┆ high │
│ BMW ┆ 30.0 ┆ low │
│ Buick ┆ 15.7 ┆ low │
│ Buick ┆ 20.8 ┆ low │
│ Buick ┆ 23.7 ┆ low │
│ Buick ┆ 26.3 ┆ low │
│ Cadillac ┆ 34.7 ┆ high │
└──────────────┴───────┴────────────────┘
59. How to get the column-wise maximum of two DataFrames?
Difficulty Level: L2
Get the element-wise maximum of two DataFrames df1 and df2.
Solve:
import polars as pl
import numpy as np
np.random.seed(100)
df1 = pl.DataFrame(np.random.randint(1, 25, [5, 3]), schema=list('abc'))
df2 = pl.DataFrame(np.random.randint(1, 25, [5, 3]), schema=list('abc'))
# Write your code below
Desired Output:
shape: (5, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 17 ┆ 16 ┆ 8 │
│ 24 ┆ 17 ┆ 17 │
│ 23 ┆ 21 ┆ 13 │
│ 22 ┆ 3 ┆ 14 │
│ 22 ┆ 20 ┆ 18 │
└─────┴─────┴─────┘
60. How to get the correlation between two columns of a DataFrame?
Difficulty Level: L2
Compute the correlation between all numeric columns in df and find the two columns with the highest absolute correlation.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
Highest correlation: (c1, c8) = 0.8447
61. How to create a column containing the minimum-by-maximum of each row?
Difficulty Level: L2
Compute the minimum / maximum for every row of df.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
shape: (8,)
Series: 'min_by_max' [f64]
[
0.16129
0.022727
0.230769
0.163043
0.041096
0.021739
0.098765
0.021505
]
62. How to create a column that contains the penultimate (second largest) value in each row?
Difficulty Level: L2
Create a new column 'penultimate' which has the second largest value of each row.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
shape: (8, 1)
┌─────────────┐
│ penultimate │
│ --- │
│ i32 │
╞═════════════╡
│ 87 │
│ 88 │
│ 89 │
│ 80 │
│ 64 │
│ 90 │
│ 78 │
│ 90 │
└─────────────┘
63. How to normalize all columns in a DataFrame?
Difficulty Level: L2
Normalize all columns of df so that the values in each column range from 0 to 1 (min-max scaling).
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
shape: (5, 10)
┌──────────┬──────────┬──────────┬──────────┬───┬──────────┬──────────┬──────────┬──────────┐
│ c0 ┆ c1 ┆ c2 ┆ c3 ┆ … ┆ c6 ┆ c7 ┆ c8 ┆ c9 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════════╪══════════╪══════════╪══════════╪═══╪══════════╪══════════╪══════════╪══════════╡
│ 0.571429 ┆ 1.0 ┆ 0.134831 ┆ 1.0 ┆ … ┆ 0.861111 ┆ 0.977011 ┆ 0.863636 ┆ 0.811111 │
│ 1.0 ┆ 0.241758 ┆ 0.0 ┆ 0.275362 ┆ … ┆ 0.930556 ┆ 0.321839 ┆ 0.30303 ┆ 0.0 │
│ 0.714286 ┆ 0.637363 ┆ 0.202247 ┆ 0.434783 ┆ … ┆ 0.013889 ┆ 1.0 ┆ 0.469697 ┆ 0.988889 │
│ 0.654762 ┆ 0.43956 ┆ 1.0 ┆ 0.826087 ┆ … ┆ 0.569444 ┆ 0.689655 ┆ 0.439394 ┆ 0.666667 │
│ 0.559524 ┆ 0.582418 ┆ 0.685393 ┆ 0.0 ┆ … ┆ 0.0 ┆ 0.816092 ┆ 0.318182 ┆ 0.177778 │
└──────────┴──────────┴──────────┴──────────┴───┴──────────┴──────────┴──────────┴──────────┘
64. How to compute the row-wise softmax of a DataFrame?
Difficulty Level: L3
Compute the softmax of each row: e^x_i / sum(e^x) for each row.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
shape: (3, 10)
┌────────┬────────┬────────┬────────┬───┬────────┬────────┬────────┬────────┐
│ c0 ┆ c1 ┆ c2 ┆ c3 ┆ … ┆ c6 ┆ c7 ┆ c8 ┆ c9 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════╪════════╪════════╪════════╪═══╪════════╪════════╪════════╪════════╡
│ 0.0000 ┆ 0.9975 ┆ 0.0000 ┆ 0.0000 ┆ … ┆ 0.0000 ┆ 0.0025 ┆ 0.0000 ┆ 0.0000 │
│ 0.5000 ┆ 0.0000 ┆ 0.0000 ┆ 0.0000 ┆ … ┆ 0.5000 ┆ 0.0000 ┆ 0.0000 ┆ 0.0000 │
│ 0.0000 ┆ 0.0000 ┆ 0.0000 ┆ 0.0000 ┆ … ┆ 0.0000 ┆ 0.1192 ┆ 0.0000 ┆ 0.8808 │
└────────┴────────┴────────┴────────┴───┴────────┴────────┴────────┴────────┘
65. How to find the maximum range (max – min) column in a DataFrame?
Difficulty Level: L2
Find the column with the maximum range (max – min) in df.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
Ranges: {'c1': 91, 'c9': 90, 'c2': 89}
Column with max range: c1
66. How to replace both diagonals of a DataFrame with 0?
Difficulty Level: L3
Replace both the main and anti-diagonal of df with 0.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 100, 100).reshape(10, -1).tolist(), schema=[f'c{i}' for i in range(10)])
# Write your code below
Desired Output:
shape: (5, 10)
┌─────┬─────┬─────┬─────┬───┬─────┬─────┬─────┬─────┐
│ c0 ┆ c1 ┆ c2 ┆ c3 ┆ … ┆ c6 ┆ c7 ┆ c8 ┆ c9 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ i32 ┆ ┆ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╪═════╪═══╪═════╪═════╪═════╪═════╡
│ 0 ┆ 93 ┆ 15 ┆ 72 ┆ … ┆ 83 ┆ 87 ┆ 75 ┆ 0 │
│ 88 ┆ 0 ┆ 3 ┆ 22 ┆ … ┆ 88 ┆ 30 ┆ 0 ┆ 2 │
│ 64 ┆ 60 ┆ 0 ┆ 33 ┆ … ┆ 22 ┆ 0 ┆ 49 ┆ 91 │
│ 59 ┆ 42 ┆ 92 ┆ 0 ┆ … ┆ 0 ┆ 62 ┆ 47 ┆ 62 │
│ 51 ┆ 55 ┆ 64 ┆ 3 ┆ … ┆ 21 ┆ 73 ┆ 39 ┆ 18 │
└─────┴─────┴─────┴─────┴───┴─────┴─────┴─────┴─────┘
67. How to get a particular group of a group_by DataFrame by key?
Difficulty Level: L2
From df grouped by 'col1', get the group belonging to 'apple' as a DataFrame.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'col1': ['apple', 'banana', 'orange'] * 3,
'col2': np.random.rand(9).tolist(),
'col3': np.random.randint(0, 15, 9).tolist()
})
# Write your code below
Desired Output:
shape: (3, 3)
┌───────┬──────────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i32 │
╞═══════╪══════════╪══════╡
│ apple ┆ 0.37454 ┆ 7 │
│ apple ┆ 0.598658 ┆ 4 │
│ apple ┆ 0.058084 ┆ 11 │
└───────┴──────────┴──────┘
68. How to get the n-th largest value of a column when grouped by another column?
Difficulty Level: L2
In df, find the second largest value of 'taste' for 'banana'.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'fruit': ['apple', 'banana', 'orange', 'apple', 'banana', 'orange', 'apple', 'banana', 'orange'],
'taste': np.random.rand(9).tolist(),
'price': np.random.randint(1, 15, 9).tolist()
})
# Write your code below
Desired Output:
2nd largest taste for banana: 0.8662
69. How to compute grouped mean and keep the grouped column as another column (not index)?
Difficulty Level: L1
Compute the grouped mean of 'price' by 'fruit' and keep 'fruit' as a regular column.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'fruit': ['apple', 'banana', 'orange'] * 3,
'taste': np.random.rand(9).tolist(),
'price': np.random.randint(1, 15, 9).tolist()
})
# Write your code below
Desired Output:
shape: (3, 2)
┌────────┬──────────┐
│ fruit ┆ price │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪══════════╡
│ apple ┆ 8.333333 │
│ banana ┆ 6.333333 │
│ orange ┆ 6.666667 │
└────────┴──────────┘
70. How to join two DataFrames by 2 columns so they have only the common rows?
Difficulty Level: L2
Join df1 and df2 on 'fruit' and 'weight' so only matching rows remain.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df1 = pl.DataFrame({
'fruit': ['apple', 'banana', 'orange'],
'weight': ['high', 'medium', 'low'],
'price': np.random.randint(0, 15, 3).tolist()
})
df2 = pl.DataFrame({
'fruit': ['apple', 'banana', 'melon'],
'weight': ['high', 'medium', 'high'],
'taste': np.random.randint(0, 15, 3).tolist()
})
# Write your code below
Desired Output:
shape: (2, 4)
┌────────┬────────┬───────┬───────┐
│ fruit ┆ weight ┆ price ┆ taste │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 ┆ i32 │
╞════════╪════════╪═══════╪═══════╡
│ apple ┆ high ┆ 6 ┆ 14 │
│ banana ┆ medium ┆ 3 ┆ 10 │
└────────┴────────┴───────┴───────┘
71. How to remove rows from a DataFrame that are present in another DataFrame?
Difficulty Level: L3
Remove rows from df1 that are present in df2, based on the 'fruit' column.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df1 = pl.DataFrame({
'fruit': ['apple', 'banana', 'orange'],
'weight': ['high', 'medium', 'low'],
'price': np.random.randint(0, 15, 3).tolist()
})
df2 = pl.DataFrame({
'fruit': ['apple', 'melon', 'banana'],
'weight': ['high', 'high', 'low'],
'taste': np.random.randint(0, 15, 3).tolist()
})
# Write your code below
Desired Output:
shape: (1, 3)
┌────────┬────────┬───────┐
│ fruit ┆ weight ┆ price │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 │
╞════════╪════════╪═══════╡
│ orange ┆ low ┆ 12 │
└────────┴────────┴───────┘
72. How to get the positions where values of two columns match?
Difficulty Level: L1
Get the row positions where the values of columns 'a' and 'b' are equal.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'a': np.random.choice([1, 2, 3, 4], 10).tolist(),
'b': np.random.choice([1, 2, 3, 4], 10).tolist()
})
# Write your code below
Desired Output:
shape: (10, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 3 ┆ 3 │
│ 4 ┆ 3 │
│ 1 ┆ 3 │
│ 3 ┆ 3 │
│ 3 ┆ 4 │
│ 4 ┆ 1 │
│ 1 ┆ 4 │
│ 1 ┆ 4 │
│ 3 ┆ 4 │
│ 2 ┆ 3 │
└─────┴─────┘
Positions where a == b: [0, 3]
73. How to create lags and leads of a column in a DataFrame?
Difficulty Level: L2
Create columns for lag1 (shifted down by 1) and lead1 (shifted up by 1) of column 'a'.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'a': np.arange(1, 11).tolist(),
'b': np.random.randint(10, 30, 10).tolist()
})
# Write your code below
Desired Output:
shape: (10, 4)
┌─────┬─────┬──────┬───────┐
│ a ┆ b ┆ lag1 ┆ lead1 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪══════╪═══════╡
│ 1 ┆ 16 ┆ null ┆ 2 │
│ 2 ┆ 29 ┆ 1 ┆ 3 │
│ 3 ┆ 24 ┆ 2 ┆ 4 │
│ 4 ┆ 20 ┆ 3 ┆ 5 │
│ 5 ┆ 17 ┆ 4 ┆ 6 │
│ 6 ┆ 16 ┆ 5 ┆ 7 │
│ 7 ┆ 28 ┆ 6 ┆ 8 │
│ 8 ┆ 20 ┆ 7 ┆ 9 │
│ 9 ┆ 20 ┆ 8 ┆ 10 │
│ 10 ┆ 13 ┆ 9 ┆ null │
└─────┴─────┴──────┴───────┘
74. How to get the frequency of unique values in the entire DataFrame?
Difficulty Level: L2
Get the frequency of unique values across the entire DataFrame df.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(1, 10, 20).reshape(4, 5).tolist(), schema=list('abcde'))
# Write your code below
Desired Output:
shape: (7, 2)
┌─────┬───────┐
│ a ┆ count │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═════╪═══════╡
│ 8 ┆ 5 │
│ 5 ┆ 4 │
│ 7 ┆ 3 │
│ 4 ┆ 2 │
│ 6 ┆ 2 │
│ 2 ┆ 2 │
│ 3 ┆ 2 │
└─────┴───────┘
75. How to split a text column into two separate columns?
Difficulty Level: L2
Split the string column in df to form a DataFrame with 3 columns.
Solve:
import polars as pl
df = pl.DataFrame({
"row": [
"STD, City\tState",
"33, Kolkata\tWest Bengal",
"44, Chennai\tTamil Nadu",
"40, Hyderabad\tTelengana",
"80, Bangalore\tKarnataka"
]
})
# Write your code below
Desired Output:
shape: (4, 3)
┌─────┬───────────┬─────────────┐
│ STD ┆ City ┆ State │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═══════════╪═════════════╡
│ 33 ┆ Kolkata ┆ West Bengal │
│ 44 ┆ Chennai ┆ Tamil Nadu │
│ 40 ┆ Hyderabad ┆ Telengana │
│ 80 ┆ Bangalore ┆ Karnataka │
└─────┴───────────┴─────────────┘
76. How to rank items within each group?
Difficulty Level: L2
For each store, rank the months by revenue (highest = rank 1). Use a window function.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'store': ['A','A','A','B','B','B','C','C','C'],
'month': ['Jan','Feb','Mar','Jan','Feb','Mar','Jan','Feb','Mar'],
'revenue': np.random.randint(100, 500, 9).tolist()
})
# Write your code below
Desired Output:
shape: (9, 4)
┌───────┬───────┬─────────┬───────────────┐
│ store ┆ month ┆ revenue ┆ rank_in_store │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 ┆ i32 │
╞═══════╪═══════╪═════════╪═══════════════╡
│ A ┆ Jan ┆ 202 ┆ 3 │
│ A ┆ Feb ┆ 448 ┆ 1 │
│ A ┆ Mar ┆ 370 ┆ 2 │
│ B ┆ Jan ┆ 206 ┆ 2 │
│ B ┆ Feb ┆ 171 ┆ 3 │
│ B ┆ Mar ┆ 288 ┆ 1 │
│ C ┆ Jan ┆ 120 ┆ 3 │
│ C ┆ Feb ┆ 202 ┆ 2 │
│ C ┆ Mar ┆ 221 ┆ 1 │
└───────┴───────┴─────────┴───────────────┘
77. How to compute the running difference within groups?
Difficulty Level: L2
For each user, compute the day-over-day change in logins using diff() within groups.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'user': ['A','A','A','A','B','B','B','B'],
'day': [1,2,3,4,1,2,3,4],
'logins': np.random.randint(1, 20, 8).tolist()
})
# Write your code below
Desired Output:
shape: (8, 4)
┌──────┬─────┬────────┬──────────────┐
│ user ┆ day ┆ logins ┆ daily_change │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i32 ┆ i32 │
╞══════╪═════╪════════╪══════════════╡
│ A ┆ 1 ┆ 7 ┆ null │
│ A ┆ 2 ┆ 15 ┆ 8 │
│ A ┆ 3 ┆ 11 ┆ -4 │
│ A ┆ 4 ┆ 8 ┆ -3 │
│ B ┆ 1 ┆ 7 ┆ null │
│ B ┆ 2 ┆ 19 ┆ 12 │
│ B ┆ 3 ┆ 11 ┆ -8 │
│ B ┆ 4 ┆ 11 ┆ 0 │
└──────┴─────┴────────┴──────────────┘
78. How to compute each employee’s salary as a percentage of their department total?
Difficulty Level: L2
Add a column showing what percentage of the department salary each employee represents.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'dept': ['Sales','Sales','Sales','Eng','Eng','Eng'],
'employee': ['Alice','Bob','Carol','Dave','Eve','Frank'],
'salary': (np.random.randint(50, 150, 6) * 1000).tolist()
})
# Write your code below
Desired Output:
shape: (6, 4)
┌───────┬──────────┬────────┬─────────────┐
│ dept ┆ employee ┆ salary ┆ pct_of_dept │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 ┆ f64 │
╞═══════╪══════════╪════════╪═════════════╡
│ Sales ┆ Alice ┆ 101000 ┆ 32.9 │
│ Sales ┆ Bob ┆ 142000 ┆ 46.3 │
│ Sales ┆ Carol ┆ 64000 ┆ 20.8 │
│ Eng ┆ Dave ┆ 121000 ┆ 40.2 │
│ Eng ┆ Eve ┆ 110000 ┆ 36.5 │
│ Eng ┆ Frank ┆ 70000 ┆ 23.3 │
└───────┴──────────┴────────┴─────────────┘
79. How to detect the start of a new streak in a sequence?
Difficulty Level: L3
Given a Series of status values, flag each row where a new streak begins (i.e., the value changes from the previous row).
Solve:
import polars as pl
ser = pl.Series("status", ['ok','ok','fail','fail','fail','ok','fail','ok','ok'])
# Write your code below
Desired Output:
shape: (9, 3)
┌─────┬────────┬───────────────┐
│ idx ┆ status ┆ is_new_streak │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ bool │
╞═════╪════════╪═══════════════╡
│ 0 ┆ ok ┆ null │
│ 1 ┆ ok ┆ false │
│ 2 ┆ fail ┆ true │
│ 3 ┆ fail ┆ false │
│ 4 ┆ fail ┆ false │
│ 5 ┆ ok ┆ true │
│ 6 ┆ fail ┆ true │
│ 7 ┆ ok ┆ true │
│ 8 ┆ ok ┆ false │
└─────┴────────┴───────────────┘
80. How to compute the row-wise coefficient of variation?
Difficulty Level: L3
Compute the coefficient of variation (std / mean) across the columns for each row.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame(np.random.randint(10, 100, 30).reshape(6, 5).tolist(), schema=[f"s{i}" for i in range(5)])
# Write your code below
Desired Output:
shape: (6, 1)
┌────────┐
│ cv │
│ --- │
│ f64 │
╞════════╡
│ 0.4706 │
│ 0.0696 │
│ 0.6956 │
│ 0.6162 │
│ 0.3786 │
│ 0.4025 │
└────────┘
81. How to build a pivot table with multiple aggregations?
Difficulty Level: L2
Group by region and product, then compute total sales, average sales, and total quantity.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'region': ['East','East','West','West','East','West'] * 2,
'product': ['A','B','A','B','A','B'] * 2,
'sales': np.random.randint(100, 1000, 12).tolist(),
'qty': np.random.randint(1, 50, 12).tolist()
})
# Write your code below
Desired Output:
shape: (4, 5)
┌────────┬─────────┬─────────────┬───────────┬───────────┐
│ region ┆ product ┆ total_sales ┆ avg_sales ┆ total_qty │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 ┆ f64 ┆ i32 │
╞════════╪═════════╪═════════════╪═══════════╪═══════════╡
│ East ┆ A ┆ 1774 ┆ 444.0 ┆ 98 │
│ East ┆ B ┆ 655 ┆ 328.0 ┆ 33 │
│ West ┆ A ┆ 1674 ┆ 837.0 ┆ 26 │
│ West ┆ B ┆ 1076 ┆ 269.0 ┆ 114 │
└────────┴─────────┴─────────────┴───────────┴───────────┘
82. How to create a rolling mean column?
Difficulty Level: L2
Create a 5-period rolling mean of column 'medv'.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
# Write your code below
Desired Output:
shape: (7, 2)
┌──────┬──────────────┐
│ medv ┆ rolling_medv │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞══════╪══════════════╡
│ 24.0 ┆ null │
│ 21.6 ┆ null │
│ 34.7 ┆ null │
│ 33.4 ┆ null │
│ 36.2 ┆ 29.98 │
│ 28.7 ┆ 30.92 │
│ 22.9 ┆ 31.18 │
└──────┴──────────────┘
83. How to find the first occurrence of each unique value?
Difficulty Level: L2
For each unique category, find the row index and value of its first appearance.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'category': ['B','A','C','A','B','C','A','B'],
'value': np.random.randint(10, 99, 8).tolist()
})
# Write your code below
Desired Output:
shape: (3, 3)
┌──────────┬───────────────┬─────────────┐
│ category ┆ first_seen_at ┆ first_value │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i32 │
╞══════════╪═══════════════╪═════════════╡
│ B ┆ 0 ┆ 61 │
│ A ┆ 1 ┆ 24 │
│ C ┆ 2 ┆ 81 │
└──────────┴───────────────┴─────────────┘
84. How to find duplicate rows in a DataFrame?
Difficulty Level: L1
Find duplicate rows based on all columns.
Solve:
import polars as pl
df = pl.DataFrame({
'a': [1, 2, 2, 3, 3],
'b': ['x', 'y', 'y', 'z', 'z'],
})
# Write your code below
Desired Output:
shape: (4, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 2 ┆ y │
│ 2 ┆ y │
│ 3 ┆ z │
│ 3 ┆ z │
└─────┴─────┘
85. How to identify the top performer in each group?
Difficulty Level: L2
From df, select the player with the highest score in each team — using a window function, not group_by.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'team': ['Red','Red','Red','Blue','Blue','Blue','Green','Green','Green'],
'player': ['A','B','C','D','E','F','G','H','I'],
'score': np.random.randint(50, 100, 9).tolist()
})
# Write your code below
Desired Output:
shape: (3, 3)
┌───────┬────────┬───────┐
│ team ┆ player ┆ score │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 │
╞═══════╪════════╪═══════╡
│ Red ┆ A ┆ 88 │
│ Blue ┆ D ┆ 92 │
│ Green ┆ G ┆ 88 │
└───────┴────────┴───────┘
86. How to compute z-scores per group?
Difficulty Level: L2
Compute the z-score of value within each group using window functions.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
df = pl.DataFrame({
'group': ['A','A','A','A','B','B','B','B'],
'value': np.random.normal(50, 10, 8).round(1).tolist()
})
# Write your code below
Desired Output:
shape: (8, 3)
┌───────┬───────┬─────────┐
│ group ┆ value ┆ z_score │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═══════╪═══════╪═════════╡
│ A ┆ 55.0 ┆ -0.19 │
│ A ┆ 48.6 ┆ -1.13 │
│ A ┆ 56.5 ┆ 0.03 │
│ A ┆ 65.2 ┆ 1.3 │
│ B ┆ 47.7 ┆ -0.8 │
│ B ┆ 47.7 ┆ -0.8 │
│ B ┆ 65.8 ┆ 1.26 │
│ B ┆ 57.7 ┆ 0.34 │
└───────┴───────┴─────────┘
87. How to compute expanding (cumulative) window aggregations?
Difficulty Level: L2
Add columns for cumulative sum, running max, and running min of sales.
Solve:
import polars as pl
df = pl.DataFrame({
'day': list(range(1, 8)),
'sales': [100, 150, 130, 200, 180, 220, 210]
})
# Write your code below
Desired Output:
shape: (7, 5)
┌─────┬───────┬───────────┬─────────────┬─────────────┐
│ day ┆ sales ┆ cum_sales ┆ running_max ┆ running_min │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪═══════════╪═════════════╪═════════════╡
│ 1 ┆ 100 ┆ 100 ┆ 100 ┆ 100 │
│ 2 ┆ 150 ┆ 250 ┆ 150 ┆ 100 │
│ 3 ┆ 130 ┆ 380 ┆ 150 ┆ 100 │
│ 4 ┆ 200 ┆ 580 ┆ 200 ┆ 100 │
│ 5 ┆ 180 ┆ 760 ┆ 200 ┆ 100 │
│ 6 ┆ 220 ┆ 980 ┆ 220 ┆ 100 │
│ 7 ┆ 210 ┆ 1190 ┆ 220 ┆ 100 │
└─────┴───────┴───────────┴─────────────┴─────────────┘
88. How to compute a conditional cumulative sum?
Difficulty Level: L3
Compute a running total of amount, but only accumulate rows where event == 'purchase'.
Solve:
import polars as pl
df = pl.DataFrame({
'event': ['login','purchase','login','purchase','login','purchase','login','purchase'],
'amount': [0, 50, 0, 30, 0, 80, 0, 20]
})
# Write your code below
Desired Output:
shape: (8, 3)
┌──────────┬────────┬────────────────────────┐
│ event ┆ amount ┆ running_purchase_total │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════════╪════════╪════════════════════════╡
│ login ┆ 0 ┆ 0 │
│ purchase ┆ 50 ┆ 50 │
│ login ┆ 0 ┆ 50 │
│ purchase ┆ 30 ┆ 80 │
│ login ┆ 0 ┆ 80 │
│ purchase ┆ 80 ┆ 160 │
│ login ┆ 0 ┆ 160 │
│ purchase ┆ 20 ┆ 180 │
└──────────┴────────┴────────────────────────┘
89. How to compute quarter-over-quarter growth rate within groups?
Difficulty Level: L2
For each company, compute the percentage growth in revenue from the previous quarter.
Solve:
import polars as pl
df = pl.DataFrame({
'company': ['AAPL','AAPL','AAPL','GOOG','GOOG','GOOG'],
'quarter': ['Q1','Q2','Q3','Q1','Q2','Q3'],
'revenue': [100, 120, 115, 200, 230, 250]
})
# Write your code below
Desired Output:
shape: (6, 5)
┌─────────┬─────────┬─────────┬──────────────┬────────────┐
│ company ┆ quarter ┆ revenue ┆ prev_revenue ┆ growth_pct │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪═════════╪═════════╪══════════════╪════════════╡
│ AAPL ┆ Q1 ┆ 100 ┆ null ┆ null │
│ AAPL ┆ Q2 ┆ 120 ┆ 100 ┆ 20.0 │
│ AAPL ┆ Q3 ┆ 115 ┆ 120 ┆ -4.2 │
│ GOOG ┆ Q1 ┆ 200 ┆ null ┆ null │
│ GOOG ┆ Q2 ┆ 230 ┆ 200 ┆ 15.0 │
│ GOOG ┆ Q3 ┆ 250 ┆ 230 ┆ 8.7 │
└─────────┴─────────┴─────────┴──────────────┴────────────┘
90. How to detect outliers using the IQR method?
Difficulty Level: L2
Find values in ser that fall outside 1.5 × IQR from the quartiles.
Solve:
import polars as pl
import numpy as np
np.random.seed(42)
data = list(np.random.normal(50, 10, 20).round(1)) + [150.0, -30.0] # inject outliers
ser = pl.Series("data", data)
# Write your code below
Desired Output:
Q1=40.9, Q3=55.4, IQR=14.5
Bounds: [19.1, 77.2]
Outliers: [150.0, -30.0]
91. How to use an anti-join to find missing records?
Difficulty Level: L2
Given a list of expected IDs and a DataFrame of received records, find which IDs are missing.
Solve:
import polars as pl
expected = pl.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
received = pl.DataFrame({"id": [1, 2, 4, 5, 7, 9], "value": [10, 20, 40, 50, 70, 90]})
# Write your code below
Desired Output:
shape: (4, 1)
┌─────┐
│ id │
│ --- │
│ i64 │
╞═════╡
│ 3 │
│ 6 │
│ 8 │
│ 10 │
└─────┘
92. How to select columns by dtype?
Difficulty Level: L2
Select only the float columns from df.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
['Min.Price', 'Price', 'Max.Price', 'EngineSize', 'Fuel.tank.capacity', 'Rear.seat.room']
93. How to categorize a numeric column using when/then/otherwise?
Difficulty Level: L2
Categorize 'medv' into 'low' (< 20), 'medium' (20-35), and 'high' (> 35).
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
# Write your code below
Desired Output:
shape: (10, 2)
┌──────┬──────────┐
│ medv ┆ category │
│ --- ┆ --- │
│ f64 ┆ str │
╞══════╪══════════╡
│ 24.0 ┆ medium │
│ 21.6 ┆ medium │
│ 34.7 ┆ medium │
│ 33.4 ┆ medium │
│ 36.2 ┆ high │
│ 28.7 ┆ medium │
│ 22.9 ┆ medium │
│ 27.1 ┆ medium │
│ 16.5 ┆ low │
│ 18.9 ┆ low │
└──────┴──────────┘
94. How to compute the mode of each column in a DataFrame?
Difficulty Level: L2
Find the most frequent value in each column of df.
Solve:
import polars as pl
df = pl.DataFrame({
'color': ['red','blue','red','green','blue','red','blue','green'],
'size': ['S','M','L','M','M','S','L','M'],
'rating': [5, 3, 5, 4, 3, 5, 3, 4]
})
# Write your code below
Desired Output:
Mode of 'color': red
Mode of 'size': M
Mode of 'rating': 5
95. How to use lazy evaluation in Polars?
Difficulty Level: L2
Use lazy evaluation to filter rows where 'Price' > 30 and select 'Manufacturer', 'Model', and 'Price'.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (12, 3)
┌───────────────┬─────────────┬───────┐
│ Manufacturer ┆ Model ┆ Price │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════════════╪═════════════╪═══════╡
│ null ┆ Legend ┆ 33.9 │
│ Audi ┆ 100 ┆ 37.7 │
│ Cadillac ┆ DeVille ┆ 34.7 │
│ Cadillac ┆ Seville ┆ 40.1 │
│ Chevrolet ┆ Corvette ┆ 38.0 │
│ … ┆ … ┆ … │
│ Lincoln ┆ Continental ┆ 34.3 │
│ Lincoln ┆ Town_Car ┆ 36.1 │
│ Mazda ┆ RX-7 ┆ 32.5 │
│ Mercedes-Benz ┆ 190E ┆ 31.9 │
│ Mercedes-Benz ┆ 300E ┆ 61.9 │
└───────────────┴─────────────┴───────┘
96. How to use window functions to compute group-level statistics alongside row-level data?
Difficulty Level: L2
Add a column showing the mean 'Price' per 'Type' alongside every row, without collapsing the DataFrame.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (10, 4)
┌──────────────┬─────────┬───────┬────────────────────┐
│ Manufacturer ┆ Type ┆ Price ┆ mean_price_by_type │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 │
╞══════════════╪═════════╪═══════╪════════════════════╡
│ Acura ┆ Small ┆ 15.90 ┆ 10.20 │
│ null ┆ Midsize ┆ 33.90 ┆ 27.65 │
│ Audi ┆ Compact ┆ 29.10 ┆ 18.21 │
│ Audi ┆ Midsize ┆ 37.70 ┆ 27.65 │
│ BMW ┆ Midsize ┆ 30.00 ┆ 27.65 │
│ Buick ┆ Midsize ┆ 15.70 ┆ 27.65 │
│ Buick ┆ Large ┆ 20.80 ┆ 24.30 │
│ Buick ┆ Large ┆ 23.70 ┆ 24.30 │
│ Buick ┆ Midsize ┆ 26.30 ┆ 27.65 │
│ Cadillac ┆ Large ┆ 34.70 ┆ 24.30 │
└──────────────┴─────────┴───────┴────────────────────┘
97. How to understand the difference between rank(method='min') and rank(method='dense')?
Difficulty Level: L2
Rank students by score using both min and dense methods and observe the difference when there are ties.
Solve:
import polars as pl
df = pl.DataFrame({
'student': ['Alice','Bob','Carol','Dave','Eve'],
'score': [88, 92, 88, 95, 92]
})
# Write your code below
Desired Output:
shape: (5, 4)
┌─────────┬───────┬──────────┬────────────┐
│ student ┆ score ┆ rank_min ┆ rank_dense │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i32 ┆ i32 │
╞═════════╪═══════╪══════════╪════════════╡
│ Alice ┆ 88 ┆ 4 ┆ 3 │
│ Bob ┆ 92 ┆ 2 ┆ 2 │
│ Carol ┆ 88 ┆ 4 ┆ 3 │
│ Dave ┆ 95 ┆ 1 ┆ 1 │
│ Eve ┆ 92 ┆ 2 ┆ 2 │
└─────────┴───────┴──────────┴────────────┘
98. How to clean and standardize messy string columns?
Difficulty Level: L2
Clean first_name and last_name (strip whitespace, title-case), combine into full_name, and normalize email_raw to lowercase.
Solve:
import polars as pl
df = pl.DataFrame({
'first_name': [' John ', 'ALICE', 'bob ', ' Carol'],
'last_name': ['DOE ', ' Smith', 'JONES', ' Lee '],
'email_raw': ['John.Doe@GMAIL.COM', 'alice@Yahoo.com', 'BOB@hotmail.COM', 'carol@outlook.COM']
})
# Write your code below
Desired Output:
shape: (4, 2)
┌─────────────┬────────────────────┐
│ full_name ┆ email_clean │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════╪════════════════════╡
│ John Doe ┆ john.doe@gmail.com │
│ Alice Smith ┆ alice@yahoo.com │
│ Bob Jones ┆ bob@hotmail.com │
│ Carol Lee ┆ carol@outlook.com │
└─────────────┴────────────────────┘
99. How to extract date features for machine learning?
Difficulty Level: L2
From a date column, extract month, weekday, quarter, and create a boolean is_holiday_season column (Nov, Dec, Jan).
Solve:
import polars as pl
from datetime import date
df = pl.DataFrame({
'order_date': pl.date_range(date(2024, 1, 1), date(2024, 12, 31), eager=True)
}).sample(8, seed=42).sort("order_date")
# Write your code below
Desired Output:
shape: (8, 5)
┌────────────┬───────┬─────────┬─────────┬───────────────────┐
│ order_date ┆ month ┆ weekday ┆ quarter ┆ is_holiday_season │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ i8 ┆ i8 ┆ i8 ┆ bool │
╞════════════╪═══════╪═════════╪═════════╪═══════════════════╡
│ 2024-02-15 ┆ 2 ┆ 4 ┆ 1 ┆ false │
│ 2024-04-24 ┆ 4 ┆ 3 ┆ 2 ┆ false │
│ 2024-08-02 ┆ 8 ┆ 5 ┆ 3 ┆ false │
│ 2024-08-09 ┆ 8 ┆ 5 ┆ 3 ┆ false │
│ 2024-09-10 ┆ 9 ┆ 2 ┆ 3 ┆ false │
│ 2024-10-15 ┆ 10 ┆ 2 ┆ 4 ┆ false │
│ 2024-10-19 ┆ 10 ┆ 6 ┆ 4 ┆ false │
│ 2024-12-21 ┆ 12 ┆ 6 ┆ 4 ┆ true │
└────────────┴───────┴─────────┴─────────┴───────────────────┘
100. How to explode a list column and compute aggregations?
Difficulty Level: L3
Each user has a list of tags. Explode the tags, then count how many users have each tag and list who they are.
Solve:
import polars as pl
df = pl.DataFrame({
'user': ['Alice', 'Bob', 'Carol'],
'tags': [['python', 'polars', 'ML'], ['python', 'rust'], ['polars', 'ML', 'DL', 'python']]
})
# Write your code below
Desired Output:
shape: (5, 3)
┌────────┬───────────┬───────────────────────────┐
│ tags ┆ num_users ┆ users │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ list[str] │
╞════════╪═══════════╪═══════════════════════════╡
│ python ┆ 3 ┆ ["Alice", "Bob", "Carol"] │
│ ML ┆ 2 ┆ ["Alice", "Carol"] │
│ polars ┆ 2 ┆ ["Alice", "Carol"] │
│ DL ┆ 1 ┆ ["Carol"] │
│ rust ┆ 1 ┆ ["Bob"] │
└────────┴───────────┴───────────────────────────┘
101. How to use struct and unnest to work with nested data?
Difficulty Level: L3
Create a struct column from 'Manufacturer' and 'Model', then unnest it back.
Solve:
import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', null_values="NA")
# Write your code below
Desired Output:
shape: (3, 2)
┌─────────────────────┬───────┐
│ car_info ┆ Price │
│ --- ┆ --- │
│ struct[2] ┆ f64 │
╞═════════════════════╪═══════╡
│ {"Acura","Integra"} ┆ 15.9 │
│ {null,"Legend"} ┆ 33.9 │
│ {"Audi","90"} ┆ 29.1 │
└─────────────────────┴───────┘
shape: (3, 3)
┌──────────────┬─────────┬───────┐
│ Manufacturer ┆ Model ┆ Price │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════════════╪═════════╪═══════╡
│ Acura ┆ Integra ┆ 15.9 │
│ null ┆ Legend ┆ 33.9 │
│ Audi ┆ 90 ┆ 29.1 │
└──────────────┴─────────┴───────┘