PCA from scratch in Julia

An application with the iris dataset
Statistics
Author

Slater Dixon

Published

April 23, 2024

In this blog post I walk through an implementation of Principal Component Analysis (PCA) from scratch in the programming language Julia. The Julia package MultivariateStats.jl provides a ready-made implementation of PCA, similar to sklearn.decomposition.PCA in Python. However, implementing it ourselves will help build intuition for what PCA actually does.

The maximal variance approach to PCA

Before implementing PCA, let’s review why it works, following the explanation in the book Mathematics for Machine Learning by Deisenroth, Faisal, and Ong (2020). The goal of Principal Component Analysis is to project a dataset \(X\) onto a lower-dimensional subspace \(B\) in a way that retains as much information as possible. It is well known that this means taking the eigenvectors of the covariance matrix of \(X\), but why is this the case?

For now, consider just a single vector \(\vec{b}\). One way to think about this problem is to ask

“When \(X\) is projected onto \(\vec{b}\), what choice of \(\vec{b}\) maximizes the spread between the projected data points?”

or, equivalently,

“What vector \(\vec{b}\) maximizes the variance in my data?”

One way to think about PCA is as maximizing the variance of the data in \(X\) after it is projected onto \(\vec{b}\). Recall that for mean-centered data, the variance of a data vector \(x\) is \(\mathrm{Var} = \frac{1}{n}\sum_i{x_i^2}\). The projection of a single observation \(x_i\) onto \(\vec{b}\) is \(\vec{b}^\intercal x_i\). So, the quantity to maximize is \(\frac{1}{n}\sum_i{(\vec{b}^\intercal x_i)^2}\).

There are two useful facts here. Because \(\vec{b}\) is a basis vector of unit length, \(\vec{b}^\intercal \vec{b} = 1\) (and it must be orthogonal to the other basis vectors of \(B\)). It also helps to observe that \(\frac{1}{n}\sum_i{(\vec{b}^\intercal x_i)^2}\) is equivalent to \(\vec{b}^\intercal S \vec{b}\), where \(S = \frac{1}{n}\sum_i{x_i x_i^\intercal}\) is the covariance matrix of the mean-centered data.
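To see why these are equal, expand the square and pull \(\vec{b}\) outside the sum:

\[ \begin{aligned} \frac{1}{n}\sum_i{(\vec{b}^\intercal x_i)^2} = \frac{1}{n}\sum_i{\vec{b}^\intercal x_i x_i^\intercal \vec{b}} = \vec{b}^\intercal \left(\frac{1}{n}\sum_i{x_i x_i^\intercal}\right) \vec{b} = \vec{b}^\intercal S \vec{b} \end{aligned} \]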

We can solve this constrained maximization using Lagrange multipliers, where the function to maximize is \(f(\vec{b}) = \vec{b}^\intercal S \vec{b}\) subject to the constraint \(g(\vec{b}) = \vec{b}^\intercal \vec{b} - 1 = 0\). The Lagrangian is thus

\(\begin{aligned} L = \vec{b}^\intercal S \vec{b} - \lambda (\vec{b}^\intercal \vec{b} - 1)\end{aligned}\)

The gradient of this Lagrangian set equal to zero is

\[ \begin{aligned} \frac{\partial L}{\partial \vec{b}} = 2\vec{b}^\intercal S - 2 \lambda \vec{b}^\intercal = 0 \\ \vec{b}^\intercal S = \lambda \vec{b}^\intercal \\ S \vec{b} = \lambda \vec{b} \end{aligned} \]

where the last line comes from transposing both sides and using the fact that \(S\) is symmetric.

This is precisely the eigenvector equation: the stationary points \(\vec{b}\) are eigenvectors of \(S\) with eigenvalues \(\lambda\). The vectors \(\vec{b}\) that maximize the variance of the projected data are thus the eigenvectors of the covariance matrix \(S\), and each eigenvalue \(\lambda = \vec{b}^\intercal S \vec{b}\) is the variance captured along its eigenvector.
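As a quick numerical sanity check (my own illustration, not part of the book's derivation), we can confirm in Julia that projecting onto the top eigenvector of a covariance matrix captures more variance than projecting onto an arbitrary unit vector:

using LinearAlgebra, Statistics, Random

Random.seed!(42);                           # arbitrary seed for reproducibility
Z = randn(200, 3) .* [3.0 1.0 0.2];         # toy data with very different column spreads
S = Symmetric(cov(Z));                      # sample covariance matrix
b_top = eigvecs(S)[:, argmax(eigvals(S))];  # eigenvector with the largest eigenvalue
b_rand = randn(3); b_rand ./= norm(b_rand); # a random unit vector for comparison

var(Z * b_top), var(Z * b_rand)             # the projected variance is larger along b_top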

Preparation and scaling

To start off, we need to import our Julia packages.


# To get iris
using RDatasets: dataset;
# To find eigen values and vectors
using LinearAlgebra;
# For computations
using Statistics;
# For plotting
using PlotlyJS;

We first need a function to normalize our data by subtracting each column's mean and dividing by its standard deviation, also known as z-score standardization. This is done iteratively over each column.


function normalize(x_unscaled)
    ### Z-score normalization: subtract each column's mean, divide by its standard deviation
    X_bar = float.(x_unscaled)   # work on a copy so the caller's array is not mutated
    for i in axes(X_bar, 2)
        X_bar[:, i] .= (X_bar[:, i] .- mean(X_bar[:, i])) ./ std(X_bar[:, i])
    end
    return X_bar
end;

iris = dataset("datasets", "iris");


# X data is every column except for the Species
X = Array(iris[!, [column for column in names(iris) if column != "Species"]]);
X = normalize(X);
Y = iris[!, "Species"];
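As a quick check (an addition of mine, not a step the rest of the post depends on), every column of the standardized matrix should now have a mean of approximately zero and a standard deviation of one:

# Column means should be ≈ 0 and column standard deviations ≈ 1
mean(X, dims=1), std(X, dims=1)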

Eigenvalues and eigenvectors

Each eigenvector of the covariance matrix represents a principal component of our dataset, and its eigenvalue represents the relative amount of variance explained by that component. So, we need a function that gets the eigenvalues and eigenvectors of a matrix and sorts the eigenvectors in descending order of their eigenvalues.



function sorted_eig_val_vec(covariance_matrix)
    # Get eigenvalues and eigenvectors from LinearAlgebra
    x_cov_vals = eigvals(covariance_matrix)
    x_cov_vects = eigvecs(covariance_matrix)

    # Permutation that puts the eigenvalues in descending order
    sorted_index = sortperm(x_cov_vals, rev = true)
    # Sort the eigenvalues
    x_cov_vals_sorted = x_cov_vals[sorted_index]
    # Sort the eigenvectors (columns) with the same permutation
    x_cov_vects_sorted = x_cov_vects[:, sorted_index]

    return x_cov_vals_sorted, x_cov_vects_sorted

end;
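To get a feel for what this returns on the iris data, we can look at the share of total variance explained by each component (again, an illustrative aside rather than part of the pipeline):

vals, vects = sorted_eig_val_vec(cov(X));
# Fraction of total variance explained by each principal component
explained_variance_ratio = vals ./ sum(vals)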

Putting it all together

Next up, we need a function that finds our lower-dimensional basis and projects our data onto it. Confusingly, we cannot use the Julia function … for this projection; instead, it is done with ordinary matrix multiplication.


function manual_pca(high_dimension_array, n_components)

    # Covariance matrix of the (already standardized) data
    x_cov = cov(high_dimension_array)

    # Eigenvalues and eigenvectors, sorted by decreasing eigenvalue
    vals, vects = sorted_eig_val_vec(x_cov)

    # Keep the first n_components eigenvectors as the new basis
    principal_components = vects[:, 1:n_components]

    # Project the data onto the lower-dimensional basis
    return high_dimension_array * principal_components

end;

function plot_pca(reduced_matrix, y)
    # One scatter trace per class so each species gets its own color
    traces = [scatter(x = reduced_matrix[y .== species, 1],
                      y = reduced_matrix[y .== species, 2],
                      mode = "markers", name = string(species))
              for species in unique(y)]
    plot(traces, Layout(width = 600, height = 400))
end;
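With the function above, calling plot_pca(reduced, Y) once the projection below has been computed should produce an interactive PlotlyJS scatter of the same two components, colored by species.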

reduced = manual_pca(X, 2);
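As a sanity check on the implementation (my own addition), the variance of each projected coordinate should match the corresponding eigenvalue, since the variance of \(X\vec{b}\) is \(\vec{b}^\intercal S \vec{b} = \lambda\):

vals, _ = sorted_eig_val_vec(cov(X));
# Both comparisons should return true
var(reduced[:, 1]) ≈ vals[1], var(reduced[:, 2]) ≈ vals[2]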

We can plot the results using the Gadfly package.


using Gadfly

x1 = reduced[:,1];
x2 = reduced[:,2];


p = Gadfly.plot(x = x1, y = x2, color = Y, Geom.point);


draw(SVG("myplot.svg", 6inch, 6inch), p);

In our final plot, each species forms a fairly distinct cluster, showing that most of the class structure survives the projection down to two dimensions.

References

Deisenroth, Marc Peter, A. Aldo Faisal, and Cheng Soon Ong. 2020. Mathematics for Machine Learning. Cambridge, UK; New York, NY: Cambridge University Press.