# To get iris
using RDatasets: dataset;
# To find eigen values and vectors
using LinearAlgebra;
# For computations
using Statistics;
# For plotting
using PlotlyJS;
In this blog post I walk through an implementation of Principal Component Analysis (PCA) from scratch in the programming language Julia. The Julia package MultivariateStats.jl provides a ready-made implementation of PCA, similar to sklearn.decomposition.PCA in Python. However, implementing it ourselves will help build intuition for what PCA actually does.
The maximal variance approach to PCA
Before implementing PCA, let’s review why it works, using the explanation in the book Mathematics for Machine Learning by Deisenroth, Faisal, and Ong (2020). The goal of Principal Component Analysis is to project a dataset \(X\) onto a lower-dimensional subspace \(B\) in a way that retains as much information as possible. It is well known that this means taking the eigenvectors of the covariance matrix of \(X\), but why is this the case?
For now, consider just a single vector \(\vec{b}\). One way to think about this problem is to ask:
“When \(X\) is projected onto \(\vec{b}\), what choice of \(\vec{b}\) maximizes the spread between the projected data points?”
or, equivalently,
“What vector \(\vec{b}\) maximizes the variance in my data?”
One way to think about PCA is as maximizing the variance of the data in \(X\) after it is projected onto \(\vec{b}\). Recall that the formula for the variance of a mean-centered data vector \(x\) is \(Var = \frac{1}{n}\sum{x^2}\). The projection of a single observation \(x_i\) onto \(\vec{b}\) is \(\vec{b}^\intercal x_i\). So, the quantity to maximize is \(\frac{1}{n}\sum{(\vec{b}^\intercal x_i)^2}\).
There are two useful observations here. Because \(\vec{b}\) is a vector of an orthonormal basis, it must have unit length, so \(\vec{b}^\intercal \vec{b} = 1\) (and it is orthogonal to every other basis vector). It also helps to observe that \(\frac{1}{n}\sum{(\vec{b}^\intercal x_i)^2}\) is equivalent to \(\vec{b}^\intercal S \vec{b}\), where \(S = \frac{1}{n}\sum{x_i x_i^\intercal}\) is the covariance matrix of the mean-centered data.
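To see this equivalence, expand the square and pull \(\vec{b}\) out of the sum:
\[ \frac{1}{n}\sum{(\vec{b}^\intercal x_i)^2} = \frac{1}{n}\sum{\vec{b}^\intercal x_i x_i^\intercal \vec{b}} = \vec{b}^\intercal \left( \frac{1}{n}\sum{x_i x_i^\intercal} \right) \vec{b} = \vec{b}^\intercal S \vec{b} \]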
We can solve this constrained maximization using Lagrange multipliers, where the function to maximize is \(f(\vec{b}) = \vec{b}^\intercal S \vec{b}\) subject to the constraint \(g(\vec{b}) = \vec{b}^\intercal \vec{b} - 1 = 0\). The Lagrangian is thus
\(\begin{aligned} L = \vec{b}^\intercal S \vec{b} - \lambda (\vec{b}^\intercal \vec{b} - 1)\end{aligned}\)
Setting the gradient of this Lagrangian with respect to \(\vec{b}\) equal to zero gives
\[ \begin{aligned} \frac{\partial L}{\partial \vec{b}} = 2\vec{b}^\intercal S - 2 \lambda \vec{b}^\intercal &= 0 \\ 2\vec{b}^\intercal S &= 2 \lambda \vec{b}^\intercal \\ S \vec{b} &= \lambda \vec{b} \end{aligned} \]
This is exactly the definition of an eigenvector \(\vec{b}\) and eigenvalue \(\lambda\) of \(S\). The vectors \(\vec{b}\) that maximize the variance of the projected data are thus the eigenvectors of the covariance matrix \(S\), and since \(\vec{b}^\intercal S \vec{b} = \lambda \vec{b}^\intercal \vec{b} = \lambda\), the eigenvalue \(\lambda\) is the variance captured along that direction.
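Before moving on, here is a minimal numeric sanity check of this result (a sketch on a small random toy matrix, not part of the original post; it reuses the LinearAlgebra and Statistics packages imported at the top, and the helper name random_unit_vector is just illustrative). The top eigenvector of the covariance matrix should capture at least as much variance as any randomly chosen unit vector.
# Toy data: 200 observations of 3 features (illustrative only)
toy = randn(200, 3)
S = Symmetric(cov(toy))
# Eigenvector associated with the largest eigenvalue
b_best = eigvecs(S)[:, argmax(eigvals(S))]
# Helper: a random direction scaled to unit length
random_unit_vector() = (b = randn(3); b ./ norm(b))
# Variance along the top eigenvector vs. the best of 1,000 random directions
var(toy * b_best) ≥ maximum(var(toy * random_unit_vector()) for _ in 1:1000)   # expect true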
Preparation and scaling
To start off, we need to import our Julia packages; these imports are shown at the top of this post. Next, we write a function to normalize our data by subtracting the mean and dividing by the standard deviation of each column, also known as Z-score standardization. This is done iteratively over each column.
function normalize(x_unscaled)
    ### Z-score normalization, applied column by column
    # Copy so we do not mutate the caller's array
    X_bar = copy(x_unscaled)
    for i in axes(x_unscaled, 2)
        # Subtract the column mean, then divide by the column standard deviation
        X_bar[:, i] .= X_bar[:, i] .- mean(x_unscaled[:, i])
        X_bar[:, i] .= X_bar[:, i] ./ std(x_unscaled[:, i])
    end
    return X_bar
end;
iris = dataset("datasets", "iris");

# X data is every column except for the Species
X = Array(iris[!, [column for column in names(iris) if column != "Species"]]);
X = normalize(X);
Y = iris[!, "Species"];
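As a quick sanity check (a sketch that is not part of the original post), each column of the normalized matrix should now have a mean of approximately 0 and a standard deviation of approximately 1:
# Each normalized column should have mean ≈ 0 and standard deviation ≈ 1
all(abs.(mean(X, dims=1)) .< 1e-8)    # expect true
all(isapprox.(std(X, dims=1), 1.0))   # expect true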
Eigenvalues and eigenvectors
Each eigenvector of the covariance matrix represents a principal component of our dataset, and the magnitude of its eigenvalue represents the relative variance explained by that component. So, we need a function that gets the eigenvalues and eigenvectors of a matrix and sorts the vectors according to their eigenvalues.
function sorted_eig_val_vec(covariance_matrix)
    # Get eigenvalues and vectors from LinearAlgebra
    x_cov_vals = eigvals(covariance_matrix)
    x_cov_vects = eigvecs(covariance_matrix)

    # Index that orders the eigenvalues from largest to smallest
    sorted_index = sortperm(x_cov_vals, rev = true)
    # Sort values
    x_cov_vals_sorted = x_cov_vals[sorted_index]
    # Sort vectors using the same index
    x_cov_vects_sorted = x_cov_vects[:, sorted_index]

    return x_cov_vals_sorted, x_cov_vects_sorted
end;
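As a quick illustration of how the sorted eigenvalues relate to explained variance (a usage sketch, not from the original post; the _demo names are purely illustrative), we can compute the proportion of total variance captured by each component:
# Proportion of total variance explained by each principal component
vals_demo, vects_demo = sorted_eig_val_vec(cov(X));
explained_variance_ratio = vals_demo ./ sum(vals_demo)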
Putting it all together
Next up, we need a function that finds our lower-dimensional basis and projects our data onto it. Confusingly, we cannot use the Julia function … to find the dot product.
function manual_pca(high_dimension_array, n_components)
    # Covariance matrix of the (already normalized) data
    x_cov = cov(high_dimension_array)

    # Eigenvalues and eigenvectors, sorted by descending eigenvalue
    vals, vects = sorted_eig_val_vec(x_cov)

    # Keep the first n_components eigenvectors as the new basis
    principal_components = vects[:, 1:n_components]

    # Project the data onto the lower-dimensional basis
    return high_dimension_array * principal_components
end;
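As a sanity check (a sketch, not part of the original post; the _check names are just illustrative), the variance along each projected column should match the corresponding sorted eigenvalue of the covariance matrix:
# Variance of each projected column should equal the matching (sorted) eigenvalue
vals_check, _ = sorted_eig_val_vec(cov(X))
reduced_check = manual_pca(X, 2)
var(reduced_check[:, 1]) ≈ vals_check[1]   # expect true
var(reduced_check[:, 2]) ≈ vals_check[2]   # expect true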
function plot_pca(reduced_matrix, y)
    Plot(x=reduced_matrix[:,1],
         y=reduced_matrix[:,2],
         color=y, mode="markers",
         Layout(width=600, height=400))
end;

reduced = manual_pca(X, 2);
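The reduced matrix has one row per observation and one column per principal component; with the standard 150-row iris dataset we expect:
size(reduced)   # expect (150, 2)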
We can plot the results using the Gadfly package.
using Gadfly

x1 = reduced[:,1];
x2 = reduced[:,2];

p = Gadfly.plot(x = x1, y = x2, color = Y, Geom.point);

draw(SVG("myplot.svg", 6inch, 6inch), p);
In the final plot, the three iris species are fairly well separated along the first two principal components.