Understand PCA — the math, concept, and Python implementation. Learn how Principal Component Analysis reduces dimensions while preserving maximum variance in your data.
Principal Components Analysis (PCA) – Better Explained

How do you implement PCA using the scikit-learn package? How do you compute it from scratch, without using any packages? And, importantly, how do you understand PCA and what is the intuition behind it? I will try to answer all of these questions in this post using the MNIST dataset.

Structure of the post:

Part 1: Implementing PCA using the scikit-learn package
Part 2: Understanding the concepts behind PCA
Part 3: PCA from scratch without the scikit-learn package

Let's first understand the data at hand.

Part 1: Implementing PCA using the scikit-learn package

With the scikit-learn package, the implementation of PCA is quite straightforward. The sklearn.decomposition module provides the PCA class, which can simply fit and transform the data into principal components.

# Load Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
np.set_printoptions(suppress=True)
%matplotlib inline
The data is the MNIST dataset. For ease of learning, I am importing a smaller version containing records for digits 0, 1 and 2 only.

# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/mnist_012.csv')
# Prepare X and Y
Y = df.loc[:, '0']
X = df.drop(['0'], axis=1)
print(df.shape)
df.head()
This dataset has 784 columns as explanatory variables and one Y variable named '0' which tells which digit the row represents. We won't use the Y when creating the principal components, because I don't want the PCA algorithm to know which class (digit) a particular row belongs to. Rather, I create the PCs using only the X.

Later, we will draw a scatter plot using the first two PCs and color the points based on the actual Y. Typically, if the X's are informative enough, you should see clear clusters of points belonging to the same category.

Next, I initialize the PCA() class and call fit_transform() on X to simultaneously compute the weights of the principal components and transform X to produce the new set of principal components of X. I store this in the df_pca object, which is converted to a pandas DataFrame.

# PCA
pca = PCA()
df_pca = pca.fit_transform(X=X)
# Store as dataframe and print
df_pca = pd.DataFrame(df_pca)
print(df_pca.shape) #> (3147, 784)
df_pca.round(2).head()
The first column is the first PC and so on. This dataframe (df_pca) has the same dimensions as the original data X.

So far, the pca object has been built. The pca.components_ attribute contains the weights (also called 'loadings') of each principal component. It is using these weights that the final principal components are formed.

But what exactly are these weights? How are they related to the principal components we just formed, and how are they calculated?

Well, in Part 2 of this post, you will learn that these weights are nothing but the eigenvectors of the covariance matrix of X. More details on this when I show how to implement PCA from scratch without using sklearn's built-in PCA module.

The key thing to understand is that each principal component is the dot product of its weights (in pca.components_) and the mean-centered data (X). By 'mean-centered' I mean that the mean of each column of X is subtracted from that column, so that the mean of each column becomes zero.
Let's actually compute this, so it's very clear.

Step 1: Get the weights (aka loadings or eigenvectors).

# Principal Components Weights (Eigenvectors)
df_pca_loadings = pd.DataFrame(pca.components_)
df_pca_loadings.head()
Each row contains the weights of one principal component; for example, row 1 contains the 784 weights of PC1. No need to pay attention to the values at this point; I know the picture is not that clear anyway.

Step 2: Compute the mean-centered data.

X_mean = X - X.mean()
X_mean.head()
Step 3: If you go by the formula, take the dot product of the weights in the first row of pca.components_ and the first row of the mean-centered X to get the value -134.27. This equals the value in position (0,0) of df_pca. Do refer back to the pic in Section 2 to confirm this.

# Compute PC1 for the first row.
np.dot(df_pca_loadings.loc[0, :], X_mean.loc[0, :])
#> -134.272
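The same rule holds for every cell, not just position (0,0): the full principal components matrix equals the mean-centered data times the transposed loadings. A minimal, self-contained sketch on synthetic data (the random matrix simply stands in for X; the shapes are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X; any numeric matrix behaves the same way
rng = np.random.RandomState(0)
X = rng.rand(20, 5)

pca = PCA()
pcs = pca.fit_transform(X)

# Reproduce every principal component with the dot-product rule:
# mean-centered data times the transposed weights (loadings)
X_centered = X - X.mean(axis=0)
pcs_manual = X_centered @ pca.components_.T

print(np.allclose(pcs, pcs_manual))  # True
```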
Each cell of the principal components dataframe (df_pca) is computed this way internally.

Next, the percentage of variance explained by each principal component is available in the explained_variance_ratio_ attribute.

print(pca.explained_variance_ratio_.round(2)[:10])
#> [0.22 0.1 0.06 0.06 0.04 0.04 0.03 0.03 0.02 0.02]
variance_exp_cumsum = pca.explained_variance_ratio_.cumsum().round(2)
fig, axes = plt.subplots(1,1,figsize=(16,7), dpi=100)
plt.plot(variance_exp_cumsum, color='firebrick')
plt.title('Screeplot of Variance Explained %', fontsize=22)
plt.xlabel('# of PCs', fontsize=16)
plt.show()
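A common use of this cumulative curve is deciding how many PCs to keep. A short sketch on synthetic data (the 95% threshold is an illustrative choice, not from the post):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = rng.rand(100, 10)

pca = PCA().fit(X)
cumvar = pca.explained_variance_ratio_.cumsum()

# Smallest number of PCs whose cumulative variance share reaches 95%
n_components = int(np.argmax(cumvar >= 0.95)) + 1
print(n_components, cumvar[n_components - 1])
```

You could then refit with PCA(n_components=n_components) to keep only that many directions.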

Next, let's visualize the observations on a scatter plot of the first two PCs, coloring the points by the actual digit. The encircle() function below draws a convex hull boundary around each group of points.

from scipy.spatial import ConvexHull

def encircle(x, y, ax=None, **kw):
    if not ax:
        ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)
# Scatterplot against PC1 and PC2
fig, ax = plt.subplots(1,1, figsize=(16,12))
# Row masks for each category
rows_0 = Y == 0
rows_1 = Y == 1
rows_2 = Y == 2
# Plot
ax.scatter(df_pca.loc[rows_0.tolist(), 0], df_pca.loc[rows_0.tolist(), 1], c='blue', edgecolor='k', s=120, label='Zero')
ax.scatter(df_pca.loc[rows_1.tolist(), 0], df_pca.loc[rows_1.tolist(), 1], c='red', edgecolor='k', s=120, label='One')
ax.scatter(df_pca.loc[rows_2.tolist(), 0], df_pca.loc[rows_2.tolist(), 1], c='green', edgecolor='k', s=120, label='Two')
# Encircle the boundaries
encircle(df_pca.loc[rows_0.tolist(), 0], df_pca.loc[rows_0.tolist(), 1], ec="blue", fc="none", linewidth=2.5)
encircle(df_pca.loc[rows_1.tolist(), 0], df_pca.loc[rows_1.tolist(), 1], ec="firebrick", fc="none", linewidth=2.5)
encircle(df_pca.loc[rows_2.tolist(), 0], df_pca.loc[rows_2.tolist(), 1], ec="green", fc="none", linewidth=2.5)
# Shading
encircle(df_pca.loc[rows_1.tolist(), 0], df_pca.loc[rows_1.tolist(), 1], ec="k", fc="firebrick", alpha=0.05)
encircle(df_pca.loc[rows_0.tolist(), 0], df_pca.loc[rows_0.tolist(), 1], ec="k", fc="blue", alpha=0.05)
encircle(df_pca.loc[rows_2.tolist(), 0], df_pca.loc[rows_2.tolist(), 1], ec="k", fc="green", alpha=0.05)
# Labels
ax.set_title("MNIST Data for 0, 1 & 2: Scatterplot of First Two PCA directions", fontsize=22)
ax.set_xlabel("1st Principal Component", fontsize=22)
ax.set_ylabel("2nd Principal Component", fontsize=22)
ax.legend(loc='best', title='Digit', fontsize=16)
plt.show();
In the plot, though there is a certain degree of overlap, the points belonging to the same category are distinctly clustered and region-bound. This shows that the data captured in the first two PCs is informative enough to discriminate the categories from each other. In other words, we now have evidence that the data is not completely random, but can be used to discriminate or explain the Y (the digit a given row belongs to). Refer to the 50 Masterplots with Python for more visualization ideas.

The pca object has an inverse_transform() method that gives back the original data when you feed it the principal components.

df_orig = pca.inverse_transform(df_pca)
pd.DataFrame(df_orig).round().head()
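When no components are dropped, inverse_transform() recovers the original data exactly (up to floating point). A small self-contained check on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(2)
X = rng.rand(30, 6)

pca = PCA()                    # all components kept -> lossless round trip
pcs = pca.fit_transform(X)
X_back = pca.inverse_transform(pcs)

print(np.allclose(X, X_back))  # True: nothing was discarded
```

If you had fit with fewer components (say n_components=2), the reconstruction would instead be a lossy approximation of X.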
Part 2: Understanding the concepts behind PCA

To build the intuition, imagine fitting a single line through the cloud of data points. Such a line should be in a direction that minimizes the perpendicular distance of each point from the line. It may look something like this:

But how do we determine this line? I am only interested in determining the direction (u1) of this line, because, by knowing the direction u1, I can compute the projection of any point onto it. This u1 is of length 1 unit and is called a unit vector. This unit vector eventually becomes the weights of the principal components, also called loadings, which we accessed through pca.components_ earlier.

Remember the PCA weights you calculated in Part 1? They are the same as the u1 I am talking about here. The PCA weights (ui) are unit vectors of length 1, because they are meant to represent only a direction.

# PCA weights have length = 1
np.sum(df_pca_loadings.loc[0, :]**2)
#> 0.9999999999999993
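The same check extends to all the weights at once: every row of pca.components_ has unit length, and the rows are mutually orthogonal. A quick sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(3)
X = rng.rand(50, 4)

W = PCA().fit(X).components_   # rows are the weight (loading) vectors

# Each row has squared length 1 ...
print(np.allclose((W ** 2).sum(axis=1), 1.0))    # True
# ... and the rows are mutually orthogonal: W @ W.T is the identity
print(np.allclose(W @ W.T, np.eye(W.shape[0])))  # True
```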
So how do we find this direction (u1)? Remember, we wanted to minimize the distances of the points from PC1's direction? u1 is the unit vector of that direction. To determine u1, we use the Pythagorean theorem to arrive at the objective function, as shown in the pic below. In the pic, u1 is the unit vector of the direction of PC1 and Xi is the coordinates of a blue dot in the 2D space. Remember, Xi is nothing but the row corresponding to that datapoint in the original dataset. The lengths of the lines can be computed using the Pythagorean theorem.

The goal is to find u1 so that the mean perpendicular distance from the line, over all points, is minimized. Here is the objective function:
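In symbols, the minimization can be written out as follows (my reconstruction from the description above: the xi are the mean-centered rows of X, u1 is the unit vector):

```latex
\min_{\lVert u_1 \rVert = 1} \; \frac{1}{n} \sum_{i=1}^{n}
\left\lVert x_i - (x_i^{\top} u_1)\, u_1 \right\rVert^2
```

By the Pythagorean theorem, \(\lVert x_i \rVert^2 = (x_i^{\top} u_1)^2 + \lVert x_i - (x_i^{\top} u_1)\, u_1 \rVert^2\), so minimizing the perpendicular distances is equivalent to maximizing \(\sum_i (x_i^{\top} u_1)^2\), i.e. the variance of the data projected along u1.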
It can be proved that the above expression reaches its minimum when u1 equals an eigenvector of the covariance matrix of X. This eigenvector is the same as the PCA weights we got earlier inside the pca.components_ attribute. We'll see what eigenvectors are shortly.

But there can be a second PC to this data. The next best direction to explain the remaining variance is perpendicular to the first PC. In fact, there can be as many eigenvectors as there are columns in the dataset, and they are all orthogonal to each other.
Thanks to this excellent discussion on stackexchange that provided these dynamic graphs.
v is an eigenvector of matrix A if A(v) is a scalar multiple of v.
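This definition is easy to verify numerically. For the small symmetric matrix below (an arbitrary example), every eigenpair returned by NumPy satisfies A · v = λ · v:

```python
import numpy as np

# An arbitrary symmetric example matrix
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvalues, eigvectors = np.linalg.eig(A)

# Each column v of `eigvectors` satisfies A @ v == lambda * v
for lam, v in zip(eigvalues, eigvectors.T):
    print(np.allclose(A @ v, lam * v))  # True (for both eigenpairs)
```

For this matrix the eigenvalues work out to 3 and 1.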
The actual computation of eigenvectors and eigenvalues is quite straightforward using the eig() function in the numpy.linalg module. Refer to this guide if you want to learn more about the math behind computing eigenvectors.

Part 3: PCA from scratch without the scikit-learn package

First, re-import the data and mean-center the columns of X.

# Import
url = 'https://raw.githubusercontent.com/selva86/datasets/master/mnist_012.csv'
df = pd.read_csv(url)
# Prepare X and Y
Y = df.loc[:, '0']
X = df.drop(['0'], axis=1)
df.head()
X_standard = X - X.mean()
X_standard
The covariance matrix contains the covariance of every possible pair of columns. As a result, it is a square matrix with as many rows and columns as there are columns in X. But how do we actually compute the covariance matrix in Python? With a pandas DataFrame, it is computed by calling the cov() method.

df_cov = X_standard.cov()
print(df_cov.shape)
df_cov.head()
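To make the definition concrete, the pandas cov() result can be reproduced by hand: for mean-centered data Xc with n rows, the covariance matrix is XcᵀXc / (n − 1). A small sketch on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(4)
X = pd.DataFrame(rng.rand(40, 3), columns=list('abc'))

X_centered = X - X.mean()

# pandas cov() (with its default ddof=1) equals Xc' Xc / (n - 1)
cov_manual = (X_centered.T @ X_centered) / (len(X) - 1)
print(np.allclose(X.cov(), cov_manual))  # True
```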
from numpy.linalg import eig
eigvalues, eigvectors = eig(df_cov)
print(eigvalues[:10])
print(eigvectors.shape)
#> [764074.70509694+0.j 333409.88084323+0.j 206248.30545139+0.j
#> 190073.11430869+0.j 149373.67054257+0.j 140893.79592355+0.j
#> 103478.87643359+0.j 87549.48702637+0.j 82347.51266608+0.j
#> 66776.33611636+0.j]
#> (784, 784)
The +0.j in the above output indicates that the eigenvalues (and eigenvectors) are returned as complex numbers, here with zero imaginary part, so the imaginary components can safely be ignored. Note also that eig() does not guarantee any particular ordering of the eigenvalues; in this output they happen to come out in decreasing order.

Finally, the principal components are the dot product of the mean-centered data and the eigenvectors.

X_pca = np.dot(X_standard, eigvectors)
df_pca_calc = pd.DataFrame(X_pca)
df_pca_calc.round(2).head()
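As a sanity check, the from-scratch pipeline can be compared against scikit-learn on synthetic data. Two caveats are folded into the sketch: the eigenpairs must be sorted by decreasing eigenvalue, and corresponding columns may differ by sign, since an eigenvector's sign is arbitrary. I use eigh() here rather than eig(): for a symmetric matrix such as a covariance matrix it is the numerically preferred routine and always returns real output.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(5)
X = rng.rand(60, 4)

# PCA by hand: eigendecomposition of the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvalues, eigvectors = np.linalg.eigh(cov)

# eigh() returns eigenvalues in ascending order; re-sort descending
order = np.argsort(eigvalues)[::-1]
pcs_manual = X_centered @ eigvectors[:, order]

# Compare with scikit-learn; columns agree up to an arbitrary sign
pcs_sklearn = PCA().fit_transform(X)
print(np.allclose(np.abs(pcs_manual), np.abs(pcs_sklearn)))  # True
```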
These values should match the principal components computed earlier with the scikit-learn package, possibly with flipped signs, since an eigenvector's direction is determined only up to sign.

Hope you now understand how to implement PCA using the scikit-learn package, the concept behind it, and how to code it out algorithmically as well.