Numpy Tutorial Part 2 – Vital Functions for Data Analysis
Numpy is the core package for data analysis and scientific computing in Python. This is part 2 of a mega numpy tutorial. In this part, I cover the advanced features of numpy that are essential for data analysis and manipulation.
Introduction
In part 1 of the numpy tutorial, we learned why numpy matters for working with datasets in Python. We covered how to create arrays, explore them, do indexing, reshaping, flattening, and generate random numbers.
In part 2, I pick up where we left off. We will tackle more advanced but essential topics for data analysis.
I will assume you know basic Python, some math, and have read part 1. The best approach is to read the full article once, then come back and run the examples yourself.
Let’s begin.
1. How to get index locations that satisfy a given condition using np.where?
In part 1, you saw how to extract items from an array using boolean indexing. But sometimes you need the index positions of items that match a condition, not the items themselves.
np.where finds exactly those positions.
import numpy as np
# Create an array
arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])
print("Array: ", arr_rand)
# Positions where value > 5
index_gt5 = np.where(arr_rand > 5)
print("Positions where value > 5: ", index_gt5)
Array: [8 8 3 7 7 0 4 2 5 2]
Positions where value > 5: (array([0, 1, 3, 4]),)
Once you have the positions, extract the values using the array’s take method.
import numpy as np
arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])
index_gt5 = np.where(arr_rand > 5)
# Take items at given index
print(arr_rand.take(index_gt5))
[[8 8 7 7]]
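Note the double brackets in that output: np.where returns a tuple, and take turns it into a 2D index. As a small sketch of the alternatives (the variable names via_take, via_index and via_mask are only illustrative), indexing with the tuple or with the boolean mask gives a flat 1D result:

```python
import numpy as np

arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])
index_gt5 = np.where(arr_rand > 5)    # a tuple holding one index array

via_take = arr_rand.take(index_gt5)   # the tuple becomes a (1, 4) index array
via_index = arr_rand[index_gt5]       # plain indexing keeps the result 1D
via_mask = arr_rand[arr_rand > 5]     # boolean indexing, as in part 1

print(via_take.shape, via_index.shape)
print(np.array_equal(via_index, via_mask))
```

If you only need the values, the boolean mask is usually the shortest route. The index positions matter when you want to reuse them on another array.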
np.where also accepts two optional arguments: x and y. When the condition is true, it yields x. Otherwise, it yields y.
Below, I create an array that labels each element as 'gt5' or 'le5' based on whether it exceeds 5.
import numpy as np
arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])
# If value > 5, then yield 'gt5' else 'le5'
print(np.where(arr_rand > 5, 'gt5', 'le5'))
['gt5' 'gt5' 'le5' 'gt5' 'gt5' 'le5' 'le5' 'le5' 'le5' 'le5']
You can also find the positions of maximum and minimum values.
import numpy as np
arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])
# Location of the max
print('Position of max value: ', np.argmax(arr_rand))
# Location of the min
print('Position of min value: ', np.argmin(arr_rand))
Position of max value: 0
Position of min value: 5
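The array above contains the value 8 twice, yet argmax reports only position 0. That is because argmax returns the first occurrence. If you need every position of the maximum, one possible approach (a sketch, not the only way) is to compare against the max and collect the matching indices:

```python
import numpy as np

arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])

# argmax stops at the first occurrence of the maximum
first_max = np.argmax(arr_rand)

# flatnonzero returns the positions of every True in the comparison
all_max = np.flatnonzero(arr_rand == arr_rand.max())

print('First max position: ', first_max)
print('All max positions: ', all_max)
```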
2. How to import and export data as a csv file?
A standard way to import datasets is np.genfromtxt. It handles web URLs, missing values, multiple delimiters, and irregular columns.
A simpler alternative is np.loadtxt, which assumes no missing values.
Let’s read a .csv file from a URL. Since every element in a numpy array must share the same data type, text columns get imported as nan by default. The filling_values argument lets you replace those missing entries.
# Turn off scientific notation
np.set_printoptions(suppress=True)
# Import data from csv file url
path = 'https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv'
data = np.genfromtxt(path, delimiter=',', skip_header=1, filling_values=-999, dtype='float')
data[:3] # see first 3 rows
array([[ 18. , 8. , 307. , 130. , 3504. , 12. , 70. ,
1. , -999. ],
[ 15. , 8. , 350. , 165. , 3693. , 11.5, 70. ,
1. , -999. ],
[ 18. , 8. , 318. , 150. , 3436. , 11. , 70. ,
1. , -999. ]])
Notice that the last column shows -999 everywhere. That happened because we set dtype='float'. The last column in the file contains text values, and numpy could not convert them to floats.
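To see this behaviour without a network connection, here is a small sketch that reads a made-up three-column CSV from an in-memory string. The column names and values in csv_text are invented for illustration. Both the text column and the empty cell should come back as the fill value:

```python
import io
import numpy as np

# Hypothetical miniature dataset: one text column and one empty cell
csv_text = "mpg,cylinders,name\n18.0,8,chevrolet\n15.0,,buick\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1,
                     filling_values=-999, dtype='float')
print(data)
```

Anything that cannot be parsed as a float, including the empty field, is replaced by -999 here.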
2.1 How to handle datasets that have both numbers and text columns?
If you need the text column intact, set the dtype to 'object' or None.
# data2 = np.genfromtxt(path, delimiter=',', skip_header=1, dtype='object')
data2 = np.genfromtxt(path, delimiter=',', skip_header=1, dtype=None)
data2[:3] # see first 3 rows
array([( 18., 8, 307., 130, 3504, 12. , 70, 1, b'"chevrolet chevelle malibu"'),
( 15., 8, 350., 165, 3693, 11.5, 70, 1, b'"buick skylark 320"'),
( 18., 8, 318., 150, 3436, 11. , 70, 1, b'"plymouth satellite"')],
dtype=[('f0', '<f8'), ('f1', '<i8'), ('f2', '<f8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<f8'), ('f6', '<i8'), ('f7', '<i8'), ('f8', 'S38')])
Now the text column is preserved. To export an array as a csv file, use np.savetxt.
# Save the array as a csv file
np.savetxt("out.csv", data, delimiter=",")
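By default np.savetxt writes numbers in scientific notation and no header row. The sketch below, which uses a small made-up array and a temporary folder so nothing is left on disk, shows the fmt, header and comments arguments, then reads the file back with np.loadtxt to confirm the round trip:

```python
import os
import tempfile
import numpy as np

# Small made-up array standing in for the imported dataset
data = np.array([[18.0, 8.0, 307.0],
                 [15.0, 8.0, 350.0]])

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.csv")
    # fmt controls number formatting; comments="" stops numpy prefixing the header with '# '
    np.savetxt(out_path, data, delimiter=",", fmt="%.1f",
               header="mpg,cylinders,displacement", comments="")
    with open(out_path) as fh:
        first_line = fh.readline().strip()
    # Skip the header row when reading back
    restored = np.loadtxt(out_path, delimiter=",", skiprows=1)

print(first_line)
print(restored)
```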
3. How to save and load numpy objects?
Sometimes you want to save large numpy arrays to disk. This avoids re-running data transformation code every time.
Numpy provides .npy and .npz file formats for this. Use np.save to store a single ndarray as a .npy file. Use np.savez to store multiple arrays in one .npz file. Both can be loaded back with np.load.
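# Example arrays (assumed stand-ins for the arrays built in part 1)
arr2d = np.arange(9).reshape(3, 3)
arr2d_f = arr2d.astype('float')
arr2d_b = arr2d.astype('bool')
# Save single numpy array object as .npy file
np.save('myarray.npy', arr2d)
# Save multiple numpy arrays as a .npz file
np.savez('array.npz', arr2d_f, arr2d_b)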
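Arrays passed to np.savez positionally are stored under automatic names (arr_0, arr_1, and so on). If you would rather pick the names yourself, pass the arrays as keyword arguments. A minimal sketch, using made-up arrays and a temporary folder:

```python
import os
import tempfile
import numpy as np

ints = np.arange(9).reshape(3, 3)
floats = ints.astype(float)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'named_arrays.npz')
    # Keyword names become the keys inside the .npz archive
    np.savez(path, ints=ints, floats=floats)
    # The loaded archive works as a context manager, which closes the file
    with np.load(path) as archive:
        names = sorted(archive.files)
        loaded_floats = archive['floats']

print(names)
print(loaded_floats)
```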
Load back the .npy file:
# Load a .npy file
a = np.load('myarray.npy')
print(a)
[[0 1 2]
 [3 4 5]
 [6 7 8]]
Load back the .npz file:
# Load a .npz file
b = np.load('array.npz')
print(b.files)
b['arr_0']
['arr_0', 'arr_1']
array([[ 0., 1., 2.],
[ 3., 4., 5.],
[ 6., 7., 8.]])
4. How to concatenate two numpy arrays column-wise and row-wise?
There are three ways to concatenate numpy arrays:
- Method 1: np.concatenate with axis=0 (rows) or axis=1 (columns)
- Method 2: np.vstack and np.hstack
- Method 3: np.r_ and np.c_
All three produce the same result. One key difference: np.r_ and np.c_ use square brackets instead of parentheses.
Let me create the arrays first.
import numpy as np
a = np.zeros([4, 4])
b = np.ones([4, 4])
print("Array a:")
print(a)
print("\nArray b:")
print(b)
Array a:
[[ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]]

Array b:
[[ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]]
Now let’s stack them vertically (row-wise).
import numpy as np
a = np.zeros([4, 4])
b = np.ones([4, 4])
# Vertical Stack Equivalents (Row wise)
print("np.concatenate axis=0:")
print(np.concatenate([a, b], axis=0))
print("\nnp.vstack:")
print(np.vstack([a, b]))
print("\nnp.r_:")
print(np.r_[a, b])
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]])
Now let’s stack them horizontally (column-wise).
import numpy as np
a = np.zeros([4, 4])
b = np.ones([4, 4])
# Horizontal Stack Equivalents (Column wise)
print("np.concatenate axis=1:")
print(np.concatenate([a, b], axis=1))
print("\nnp.hstack:")
print(np.hstack([a, b]))
print("\nnp.c_:")
print(np.c_[a, b])
array([[ 0., 0., 0., 0., 1., 1., 1., 1.],
[ 0., 0., 0., 0., 1., 1., 1., 1.],
[ 0., 0., 0., 0., 1., 1., 1., 1.],
[ 0., 0., 0., 0., 1., 1., 1., 1.]])
You can also use np.r_ to build complex 1D sequences.
import numpy as np
print(np.r_[[1,2,3], 0, 0, [4,5,6]])
array([1, 2, 3, 0, 0, 4, 5, 6])
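Because np.r_ and np.c_ are indexed with square brackets, they also accept slice notation. The sketch below (variable names are illustrative) mixes slices with scalars in np.r_, and uses np.c_ to place two 1D arrays side by side as columns:

```python
import numpy as np

# Slices expand like np.arange(start, stop); scalars are inserted as-is
seq = np.r_[0:5, 10, 20:23]
print(seq)

# np.c_ treats each 1D input as a column
cols = np.c_[np.array([1, 2, 3]), np.array([4, 5, 6])]
print(cols)
```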
5. How to sort a numpy array based on one or more columns?
Let’s sort a 2D array based on the first column.
import numpy as np
np.random.seed(42)
arr = np.random.randint(1, 6, size=[8, 4])
print(arr)
array([[4, 1, 4, 3],
[5, 1, 1, 1],
[1, 2, 4, 2],
[5, 1, 2, 4],
[2, 1, 1, 3],
[4, 2, 5, 1],
[2, 4, 5, 1],
[3, 2, 2, 5]])
If you use np.sort with axis=0, each column gets sorted independently. This breaks the integrity of rows. Values from different rows get mixed together.
import numpy as np
np.random.seed(42)
arr = np.random.randint(1, 6, size=[8, 4])
# Sort each column of arr independently
print(np.sort(arr, axis=0))
array([[1, 1, 1, 1],
[2, 1, 1, 1],
[2, 1, 2, 2],
[3, 1, 2, 3],
[4, 2, 4, 3],
[4, 2, 4, 4],
[5, 2, 5, 5],
[5, 4, 5, 5]])
To keep rows intact, use the indirect approach with np.argsort.
5.1 How to sort a numpy array based on 1 column using argsort?
np.argsort returns the index positions that would sort a 1D array. It does not sort the array itself.
import numpy as np
# Get the index positions that would sort the array
x = np.array([1, 10, 5, 2, 8, 9])
sort_index = np.argsort(x)
print("Sort indices:", sort_index)
Sort indices: [0 3 2 4 5 1]
How do you read this? The 0th item is smallest, the 3rd item is second smallest, and so on. Use these indices to reorder the array.
import numpy as np
x = np.array([1, 10, 5, 2, 8, 9])
sort_index = np.argsort(x)
print(x[sort_index])
array([ 1, 2, 5, 8, 9, 10])
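One detail worth knowing before sorting real data: when values repeat, the default sort algorithm does not promise to keep tied items in their original order. Passing kind='stable' does. A small sketch with a made-up array containing ties:

```python
import numpy as np

# Made-up array with repeated values
x = np.array([3, 1, 3, 1])

# A stable sort keeps tied items in their original left-to-right order
stable_index = np.argsort(x, kind='stable')
print("Stable sort indices:", stable_index)
print("Sorted values:", x[stable_index])
```

The two 1s come out as positions 1 then 3, and the two 3s as positions 0 then 2, matching their order in the input.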
Now apply this idea to sort a 2D array by its first column. Run argsort on that column, then use the result to reorder the full array.
import numpy as np
np.random.seed(42)
arr = np.random.randint(1, 6, size=[8, 4])
# Argsort the first column
sorted_index_1stcol = arr[:, 0].argsort()
# Sort 'arr' by first column without disturbing rows
print(arr[sorted_index_1stcol])
array([[1, 2, 4, 2],
[2, 1, 1, 3],
[2, 4, 5, 1],
[3, 2, 2, 5],
[4, 1, 4, 3],
[4, 2, 5, 1],
[5, 1, 1, 1],
[5, 1, 2, 4]])
For descending order, reverse the argsorted index.
import numpy as np
np.random.seed(42)
arr = np.random.randint(1, 6, size=[8, 4])
sorted_index_1stcol = arr[:, 0].argsort()
# Descending sort
print(arr[sorted_index_1stcol[::-1]])
array([[5, 1, 2, 4],
[5, 1, 1, 1],
[4, 2, 5, 1],
[4, 1, 4, 3],
[3, 2, 2, 5],
[2, 4, 5, 1],
[2, 1, 1, 3],
[1, 2, 4, 2]])
5.2 How to sort a numpy array based on 2 or more columns?
Use np.lexsort for multi-column sorting. Pass a tuple of columns. Place the primary sort column at the rightmost position in the tuple.
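The reversed key order is easy to get wrong, so here is a tiny case you can check by hand. The rows array is made up for illustration. Column 0 is the primary key, so it goes last in the tuple, and column 1 breaks ties:

```python
import numpy as np

# Made-up 4x2 array: column 0 has ties that column 1 must break
rows = np.array([[2, 9],
                 [1, 5],
                 [2, 3],
                 [1, 7]])

# Last key in the tuple is the primary sort key
order = np.lexsort((rows[:, 1], rows[:, 0]))
print(rows[order])
```

The same pattern on a larger array: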
import numpy as np
np.random.seed(42)
arr = np.random.randint(1, 6, size=[8, 4])
# Sort by column 0, then by column 1
lexsorted_index = np.lexsort((arr[:, 1], arr[:, 0]))
print(arr[lexsorted_index])
array([[1, 2, 4, 2],
[2, 1, 1, 3],
[2, 4, 5, 1],
[3, 2, 2, 5],
[4, 1, 4, 3],
[4, 2, 5, 1],
[5, 1, 1, 1],
[5, 1, 2, 4]])
6. Working with dates
Numpy handles dates through np.datetime64. It supports precision down to nanoseconds. Create one using a standard YYYY-MM-DD string.
import numpy as np
# Create a datetime64 object
date64 = np.datetime64('2018-02-04 23:10:10')
print(date64)
numpy.datetime64('2018-02-04T23:10:10')
You can include hours, minutes, and seconds down to nanoseconds. Let’s strip the time component from date64.
import numpy as np
date64 = np.datetime64('2018-02-04 23:10:10')
# Drop the time part from the datetime64 object
dt64 = np.datetime64(date64, 'D')
print(dt64)
2018-02-04
Adding a plain integer to a day-precision datetime64 moves the date forward by that many days. For finer units such as hours, minutes, or seconds, use np.timedelta64.
import numpy as np
dt64 = np.datetime64('2018-02-04')
# Create the timedeltas (individual units of time)
tenminutes = np.timedelta64(10, 'm') # 10 minutes
tenseconds = np.timedelta64(10, 's') # 10 seconds
tennanoseconds = np.timedelta64(10, 'ns') # 10 nanoseconds
print('Add 10 days: ', dt64 + 10)
print('Add 10 minutes: ', dt64 + tenminutes)
print('Add 10 seconds: ', dt64 + tenseconds)
print('Add 10 nanoseconds: ', dt64 + tennanoseconds)
Add 10 days: 2018-02-14
Add 10 minutes: 2018-02-04T00:10
Add 10 seconds: 2018-02-04T00:00:10
Add 10 nanoseconds: 2018-02-04T00:00:00.000000010
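The arithmetic also works in the other direction: subtracting two datetime64 values gives a timedelta64. As a short sketch (the dates are arbitrary), you can then express that gap in another unit by dividing by a one-unit timedelta:

```python
import numpy as np

start = np.datetime64('2018-02-04')
end = np.datetime64('2018-02-14')

# Subtracting two dates yields a timedelta64 in days
gap = end - start
print('Gap: ', gap)

# Dividing by a one-hour timedelta expresses the gap in hours
gap_hours = gap / np.timedelta64(1, 'h')
print('Gap in hours: ', gap_hours)
```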
To convert dt64 back to a string, use np.datetime_as_string.
import numpy as np
dt64 = np.datetime64('2018-02-04')
# Convert np.datetime64 back to a string
print(np.datetime_as_string(dt64))
'2018-02-04'
When working with dates, you often need to filter business days. Use np.is_busday() to check, and np.busday_offset() to shift dates by business days.
import numpy as np
dt64 = np.datetime64('2018-02-04')
print('Date: ', dt64)
print("Is it a business day?: ", np.is_busday(dt64))
print("Add 2 business days (roll forward): ", np.busday_offset(dt64, 2, roll='forward'))
print("Add 2 business days (roll backward): ", np.busday_offset(dt64, 2, roll='backward'))
Date: 2018-02-04
Is it a business day?: False
Add 2 business days (roll forward): 2018-02-07
Add 2 business days (roll backward): 2018-02-06
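The business-day functions also accept a holidays argument, and np.busday_count counts the business days in a date range (the end date is excluded). In this sketch, treating 2018-02-05 as a holiday is purely a made-up assumption for illustration:

```python
import numpy as np

start = np.datetime64('2018-02-01')
end = np.datetime64('2018-02-10')
holidays = ['2018-02-05']  # hypothetical holiday, for illustration only

# Business days in [start, end), with and without the holiday
plain_count = np.busday_count(start, end)
holiday_count = np.busday_count(start, end, holidays=holidays)
print('Business days: ', plain_count)
print('Business days excluding the holiday: ', holiday_count)

# Next business day after Friday 2018-02-02, skipping the weekend and the holiday
next_open = np.busday_offset(np.datetime64('2018-02-02'), 1, holidays=holidays)
print('Next business day: ', next_open)
```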
6.1 How to create a sequence of dates?
Use np.arange with datetime64 objects. It works just like creating numeric ranges.
import numpy as np
# Create date sequence
dates = np.arange(np.datetime64('2018-02-01'), np.datetime64('2018-02-10'))
print(dates)
# Check if each date is a business day
print(np.is_busday(dates))
['2018-02-01' '2018-02-02' '2018-02-03' '2018-02-04' '2018-02-05' '2018-02-06' '2018-02-07' '2018-02-08' '2018-02-09']
array([ True, True, False, False, True, True, True, True, True])
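Two small extensions of the same idea, shown as a sketch: the boolean result of np.is_busday can be used as a mask to keep only the business days, and np.arange accepts a timedelta64 step for sequences that are not daily.

```python
import numpy as np

dates = np.arange(np.datetime64('2018-02-01'), np.datetime64('2018-02-10'))

# Keep only the business days using the boolean mask
weekdays_only = dates[np.is_busday(dates)]
print(weekdays_only)

# A weekly sequence: pass a timedelta64 as the step
weekly = np.arange(np.datetime64('2018-02-01'), np.datetime64('2018-03-01'),
                   np.timedelta64(7, 'D'))
print(weekly)
```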
6.2 How to convert numpy.datetime64 to datetime.datetime object?
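Use the .tolist() method. For a day-precision datetime64 like the one below, it returns a Python datetime.date object. If the value carries a time component (down to microseconds), it returns a datetime.datetime instead.
import numpy as np
import datetime
dt64 = np.datetime64('2018-02-04')
# Convert np.datetime64 to a Python date object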
dt = dt64.tolist()
print(dt)
datetime.date(2018, 2, 4)
Once you have a datetime.date object, you can extract the day, month, year, and more.
import numpy as np
dt64 = np.datetime64('2018-02-04')
dt = dt64.tolist()
print('Year: ', dt.year)
print('Day of month: ', dt.day)
print('Month of year: ', dt.month)
print('Day of Week: ', dt.weekday()) # 6 = Sunday
Year: 2018
Day of month: 4
Month of year: 2
Day of Week: 6
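What .tolist() returns depends on the precision of the datetime64 value. The sketch below illustrates the three cases I am aware of: day precision gives a date, second precision gives a datetime, and nanosecond precision gives a plain integer, because Python's datetime cannot represent nanoseconds.

```python
import datetime
import numpy as np

day_value = np.datetime64('2018-02-04').tolist()
second_value = np.datetime64('2018-02-04T23:10:10').tolist()
nano_value = np.datetime64('2018-02-04T23:10:10', 'ns').tolist()

print(type(day_value))     # datetime.date
print(type(second_value))  # datetime.datetime
print(type(nano_value))    # int: nanoseconds since the epoch
```

If you need a datetime.datetime from a nanosecond value, convert it to a coarser unit first, for example with .astype('datetime64[us]').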
7. Advanced numpy functions
7.1 vectorize – Make a scalar function work on vectors
np.vectorize() lets a function designed for single numbers work on entire arrays.
Here is a simple example. The function foo squares odd numbers and halves even numbers. It works fine on individual values but fails on arrays.
import numpy as np
# Define a scalar function
def foo(x):
    if x % 2 == 1:
        return x**2
    else:
        return x/2
# On individual scalars
print('x = 10 returns ', foo(10))
print('x = 11 returns ', foo(11))
# On a vector, this would fail:
# foo([10, 11, 12]) # Error!
x = 10 returns 5.0
x = 11 returns 121
Now let’s vectorize foo() so it works on arrays.
import numpy as np
def foo(x):
    if x % 2 == 1:
        return x**2
    else:
        return x/2
# Vectorize foo(). Make it work on vectors.
foo_v = np.vectorize(foo, otypes=[float])
print('x = [10, 11, 12] returns ', foo_v([10, 11, 12]))
print('x = [[10, 11, 12], [1, 2, 3]] returns ', foo_v([[10, 11, 12], [1, 2, 3]]))
x = [10, 11, 12] returns [ 5. 121. 6.]
x = [[10, 11, 12], [1, 2, 3]] returns [[ 5. 121. 6.]
 [ 1. 1. 9.]]
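The optional otypes parameter specifies the output data type. Setting it spares numpy the extra trial call it otherwise makes on the first element to infer that type. Keep in mind that np.vectorize is a convenience wrapper around a Python loop, so it does not give the speed of true array operations.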
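For a simple two-branch rule like foo, the same result can often be written with np.where, which operates on the whole array at once. This is a sketch of that alternative, not a replacement for vectorize in general. Note that np.where computes both branches for every element before choosing.

```python
import numpy as np

def foo(x):
    if x % 2 == 1:
        return x**2
    else:
        return x/2

x = np.array([10, 11, 12])

# Loop-based version via np.vectorize
foo_v = np.vectorize(foo, otypes=[float])
looped = foo_v(x)

# Array-based version: square where odd, halve where even
direct = np.where(x % 2 == 1, x**2, x/2)

print(looped)
print(direct)
```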
7.2 apply_along_axis – Apply a function column-wise or row-wise
np.apply_along_axis applies a 1D function across rows or columns of a 2D array. It takes three arguments:
- A function that operates on a 1D vector
- The axis (1 = row-wise, 0 = column-wise)
- The array
Let’s find the difference between the max and min value in each row.
import numpy as np
# Create a 4x10 random array
np.random.seed(100)
arr_x = np.random.randint(1, 10, size=[4, 10])
print("Array:")
print(arr_x)
# Define func1d
def max_minus_min(x):
    return np.max(x) - np.min(x)
# Apply along the rows
print('\nRow wise: ', np.apply_along_axis(max_minus_min, 1, arr=arr_x))
# Apply along the columns
print('Column wise: ', np.apply_along_axis(max_minus_min, 0, arr=arr_x))
Array:
[[9 9 4 8 8 1 5 3 6 3]
 [3 3 2 1 9 5 1 7 3 5]
 [2 6 4 5 5 4 8 2 2 8]
 [8 1 3 4 3 6 9 2 1 8]]

Row wise: [8 8 6 8]
Column wise: [7 8 2 7 6 5 8 5 5 5]
Without apply_along_axis, you would need a for-loop. This approach is cleaner and works for any custom function.
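When the function is built from numpy reductions, as max_minus_min is, you can often pass axis to the reductions directly and skip apply_along_axis. The sketch below types in the same array shown in the output above, so it does not depend on the random seed:

```python
import numpy as np

# Same values as the array printed above, entered by hand
arr_x = np.array([[9, 9, 4, 8, 8, 1, 5, 3, 6, 3],
                  [3, 3, 2, 1, 9, 5, 1, 7, 3, 5],
                  [2, 6, 4, 5, 5, 4, 8, 2, 2, 8],
                  [8, 1, 3, 4, 3, 6, 9, 2, 1, 8]])

# axis=1 reduces across each row, axis=0 down each column
row_range = arr_x.max(axis=1) - arr_x.min(axis=1)
col_range = arr_x.max(axis=0) - arr_x.min(axis=0)

print('Row wise: ', row_range)
print('Column wise: ', col_range)
```

apply_along_axis remains the more general tool when the function has no axis argument of its own.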
7.3 searchsorted – Find the location to insert so the array will remain sorted
np.searchsorted tells you where to insert a value to keep the array sorted.
import numpy as np
x = np.arange(10)
print('Where should 5 be inserted?: ', np.searchsorted(x, 5))
print('Where should 5 be inserted (right)?: ', np.searchsorted(x, 5, side='right'))
Where should 5 be inserted?: 5
Where should 5 be inserted (right)?: 6
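With a smart hack by Radim, you can use searchsorted for weighted random sampling. In the timings below, it runs far faster than np.random.choice for a single draw. The %timeit lines are IPython magics, so run them in IPython or Jupyter.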
# Randomly choose an item from a list based on a predefined probability
lst = range(10000) # the list
probs = np.random.random(10000); probs /= probs.sum() # probabilities
%timeit lst[np.searchsorted(probs.cumsum(), np.random.random())]
%timeit np.random.choice(lst, p=probs)
36.6 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.02 ms ± 7.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
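To see why this works, here is a small runnable sketch of the idea outside IPython. The items, probabilities and seed are made up for illustration. The cumulative sum turns the probabilities into interval edges on [0, 1], and searchsorted finds which interval each uniform random number falls into:

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded only to make the sketch repeatable
items = np.array(['a', 'b', 'c'])    # hypothetical items
probs = np.array([0.2, 0.5, 0.3])    # hypothetical weights, summing to 1

# Cumulative probabilities act as the right edges of each item's interval
cdf = np.cumsum(probs)

# Draw many uniform numbers at once and locate each one's interval
draws = rng.random(10000)
idx = np.searchsorted(cdf, draws)
idx = np.minimum(idx, len(items) - 1)  # guard against float rounding in cdf[-1]
samples = items[idx]

# Observed frequencies should sit close to probs
freqs = np.bincount(idx, minlength=len(items)) / len(idx)
print(samples[:10])
print(freqs)
```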
7.4 How to add a new axis to a numpy array?
Sometimes you need to convert a 1D array into a 2D array without adding data. This is useful for saving a 1D array as a single column in a CSV, or for concatenating with another 2D array.
Use np.newaxis to insert a new axis. This raises the array’s dimension by one.
import numpy as np
# Create a 1D array
x = np.arange(5)
print('Original array: ', x)
# Introduce a new column axis
x_col = x[:, np.newaxis]
print('x_col shape: ', x_col.shape)
print(x_col)
# Introduce a new row axis
x_row = x[np.newaxis, :]
print('x_row shape: ', x_row.shape)
print(x_row)
Original array: [0 1 2 3 4]
x_col shape: (5, 1)
[[0]
 [1]
 [2]
 [3]
 [4]]
x_row shape: (1, 5)
[[0 1 2 3 4]]
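As a sketch of the concatenation use case mentioned above, the column version of x can be attached to an existing 2D array with np.hstack. The array named mat is a made-up placeholder. The last lines show two equivalent spellings you may come across:

```python
import numpy as np

x = np.arange(5)
mat = np.zeros((5, 2))   # placeholder 2D array with 5 rows

# A 1D array cannot be hstacked onto a 2D one directly; add the column axis first
combined = np.hstack([mat, x[:, np.newaxis]])
print(combined.shape)
print(combined)

# Equivalent ways to get the same (5, 1) column
same_as_none = np.array_equal(x[:, np.newaxis], x[:, None])
same_as_reshape = np.array_equal(x[:, np.newaxis], x.reshape(-1, 1))
print(same_as_none, same_as_reshape)
```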
7.5 More Useful Functions
Digitize
np.digitize returns the bin index for each element in an array.
import numpy as np
# Create the array and bins
x = np.arange(10)
bins = np.array([0, 3, 6, 9])
# Get bin allotments
print(np.digitize(x, bins))
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
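A common use of these bin indices is turning numbers into categories. In this sketch, the two edges and the three labels are made up for illustration. Values below the first edge get index 0, so the indices can be used directly to look up a label:

```python
import numpy as np

x = np.arange(10)
edges = np.array([3, 6])                   # hypothetical cut points
labels = np.array(['low', 'mid', 'high'])  # one label per resulting bin

# Index 0: x < 3, index 1: 3 <= x < 6, index 2: x >= 6
idx = np.digitize(x, edges)
print(idx)
print(labels[idx])
```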
Clip
np.clip caps values within a range. Values below the lower limit get replaced by that limit. Same for the upper limit.
import numpy as np
x = np.arange(10)
# Cap all elements of x to lie between 3 and 8
print(np.clip(x, 3, 8))
array([3, 3, 3, 3, 4, 5, 6, 7, 8, 8])
Histogram and Bincount
Both np.histogram and np.bincount count frequencies, but in different ways.
bincount counts how often each integer appears, from 0 to the max value. It includes zeros for missing values. histogram groups values into bins you define and counts how many fall in each bin.
import numpy as np
x = np.array([1, 1, 2, 2, 2, 4, 4, 5, 6, 6, 6]) # doesn't need to be sorted
# Bincount: 0 occurs 0 times, 1 occurs 2 times, 2 occurs 3 times, ...
print("Bincount:", np.bincount(x))
# Histogram
counts, bins = np.histogram(x, [0, 2, 4, 6, 8])
print('Counts: ', counts)
print('Bins: ', bins)
Bincount: [0 2 3 0 2 1 3]
Counts: [2 3 3 3]
Bins: [0 2 4 6 8]
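If you only want counts for the values that actually occur, without the zero entries that bincount adds for absent integers, np.unique with return_counts=True is a related option. A short sketch on the same array:

```python
import numpy as np

x = np.array([1, 1, 2, 2, 2, 4, 4, 5, 6, 6, 6])

# Distinct values and how often each appears
values, counts = np.unique(x, return_counts=True)
print('Values: ', values)
print('Counts: ', counts)
```

Unlike bincount, this also works for values that are not non-negative integers, such as floats or strings.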
8. What is missing in numpy?
We have covered many techniques for data manipulation with numpy. But there are things numpy cannot do directly:
- Merge two 2D arrays based on a common column
- Create pivot tables
- Do 2D cross tabulations
- Compute grouped statistics (like mean by category)
- And more…
These gaps are filled by the pandas library, which I will cover in the upcoming pandas tutorial. Meanwhile, test your skills with these numpy practice exercises.

