Principal Component Analysis for a five year old

Posted on Thu 15 October 2015 in blog

I went to a talk a couple of weeks ago at Stanford on using machine learning to understand complex biological data. At one point in the talk the speaker made an offhand comment about data so simple "that a five year old could cluster it". Wow, were you that smart at five?

Anyway, I don't have kids, but I know enough about them to know that young children always ask, "why?" Maybe that five-year-old mindset is instructive in machine learning. I think I've had a decent intuitive grasp of Principal Component Analysis (PCA) as a black box. That is, I know it is a way to reduce multidimensional datasets down to a few potentially interesting axes of variation.

But my eyes gloss over when I start hearing about reconstituting the data matrix, etc. So I want to take that position of the five year old, and ask "Why?" To do this, I want to understand the actual operations of PCA, rather than the pre-packaged R function.

FYI, this is excellent background reading.

We'll use a pre-loaded R dataset of crime incidence by state, USArrests. First, let's look at the data.

In [1]:
library(cluster)
head(USArrests)
Out[1]:
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

1) The first step we need to take is to center the data.

"Why?"

If we don't, the direction of greatest variance would largely be driven by how far the data sit from the origin, rather than by how the data vary around their mean.

An optional step is to scale the data: if one variable has a much larger magnitude than the others, it will tend to dominate the variance captured by PCA. There is no firm convention on whether scaling should always be done. Here, I'll skip it.
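(For reference, here's a minimal sketch of what the scaled version would look like; I won't use it in the rest of this post.)

    # Center AND scale: each column gets its mean subtracted and is divided by its standard deviation
    scaledData = scale(USArrests, center=TRUE, scale=TRUE)
    apply(scaledData, 2, sd)   # every column now has standard deviation 1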

In [2]:
# Subtract each column's mean; scale=F means we don't divide by the standard deviation
centerData = scale(USArrests, scale=F)
head(centerData)
Out[2]:
           Murder Assault UrbanPop   Rape
Alabama     5.412  65.240   -7.540 -0.032
Alaska      2.212  92.240  -17.540 23.268
Arizona     0.312 123.240   14.460  9.768
Arkansas    1.012  19.240  -15.540 -1.732
California  1.212 105.240   25.460 19.368
Colorado    0.112  33.240   12.460 17.468

2) Next up, calculate the covariance matrix.

"Why?"

Because covariance tells us whether two variables tend to increase or decrease together, summarized by a positive or negative sign. "What is covariance?" Well, consider the relationship between two variables, x and y. For each point $i$, look at the product

$$(y_{mean} - y_{i}) * (x_{mean} - x_{i})$$

The covariance is just the average of these products over all $n$ points:

$$cov(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(y_{mean} - y_{i})(x_{mean} - x_{i})$$

What does that mean?

Okay, let's look at some data that we know co-vary, and thus should have a positive covariance.

In [3]:
options(repr.plot.height=3.5, repr.plot.width=3)
set.seed(1)                              # for a reproducible illustration
x = -10:10
y = x + rnorm(length(x), sd=3)           # y roughly tracks x, plus noise
cols = rep('black', length(x)); cols[2] = 'blue'; cols[18] = 'red'
cexs = rep(1, length(x)); cexs[c(2,18)] = 1.2

plot(y, x, col=cols, cex=cexs, pch=16, main='Covariance illustration')

So, let's consider the two highlighted points, shown in red and blue. First, the red point: if you calculate $(y_{mean} - y_{i})$, you get a negative value, and similarly $(x_{mean} - x_{i})$ is negative. Two negatives multiplied give a positive. For the blue point, both $(x_{mean} - x_{i})$ and $(y_{mean} - y_{i})$ are positive, so their product is also positive.

So, the covariance quantifies this relationship, which here we can see by eye. If the data are anticorrelated, then the covariance will be negative.
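Here's a quick sketch of that formula in action, using the toy x and y from the plot above: compute the covariance by hand and compare it to R's built-in cov().

    # Covariance "by hand": average the product of deviations from the means
    covByHand = sum((y - mean(y)) * (x - mean(x))) / (length(x) - 1)
    covByHand
    cov(x, y)   # the built-in function gives the same (positive) value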

The covariance matrix collects this quantity for every pair of variables. Thus if you have $n$ samples and $p$ variables, the covariance matrix will be a square $p \times p$ matrix.

Okay, let's compute it.

In [52]:
cvMat = cov(centerData)
print(paste('total variance = ',sum(diag(cvMat))))
cvMat
[1] "total variance =  7261.38411428571"
Out[52]:
            Murder      Assault    UrbanPop    Rape
Murder      18.970465   291.062367   4.386204  22.991412
Assault    291.0624    6945.1657   312.2751   519.2691
UrbanPop     4.386204   312.275102 209.518776  55.768082
Rape        22.99141    519.26906   55.76808   87.72916

A couple of things to note. First, the diagonal elements of the covariance matrix give the variance of each variable. Second, the sum of these diagonal elements (the trace) is the total variance in the data.
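That claim is easy to check with a one-liner (just comparing the diagonal to the per-column variances):

    # The diagonal of the covariance matrix is just the variance of each column
    all.equal(diag(cvMat), apply(centerData, 2, var))   # TRUE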

Now, we will compare this to the covariance matrix of the PC scores (jumping ahead and using the pca object from the built-in princomp call further below). We'll see that the variance now lies only along the diagonal, and that the off-diagonal entries are essentially zero (up to rounding error), i.e. the components are uncorrelated.

In [55]:
# Off-diagonal entries are ~0, so summing the whole matrix is effectively summing the diagonal
print(paste('total variance = ',sum(cov(pca$scores))))
cov(pca$scores)
[1] "total variance =  7261.38411428572"
Out[55]:
              Comp.1        Comp.2        Comp.3        Comp.4
Comp.1  7.011115e+03  4.392473e-13  4.619880e-14  6.197096e-15
Comp.2  4.392473e-13  2.019924e+02  1.348832e-13 -2.092928e-14
Comp.3  4.619880e-14  1.348832e-13  4.211265e+01 -2.377396e-15
Comp.4  6.197096e-15 -2.092928e-14 -2.377396e-15  6.164246e+00

A rationale for dimensionality reduction

This is helpful to see. Notice that

  1. the total variance is the same (~7261), and
  2. the diagonal elements decrease steadily from Comp.1 to Comp.4.

So, PCA enables us to effectively drop dimensions from our data while losing minimal variance. We can quantify the fraction of variance explained by each PC.

In [72]:
options(repr.plot.height=3.5 , repr.plot.width=3)
barplot(diag(cov(pca$scores))/sum(diag(cov(pca$scores))), las=2,ylab='Var explained',font.lab=2)
#diag(cov(pca$scores))/sum(diag(cov(pca$scores)))

Okay, moving on...

3) Calculate the eigenvectors/eigenvalues of the covariance matrix.

The covariance matrix, because it is real and symmetric, has $p$ eigenvalues and a full set of $p$ mutually orthogonal eigenvectors. The first eigenvector, the first principal component, describes the axis of greatest variance in the data.

"Why?"

Okay, you got me. The sum of the diagonal elements (the trace) is the total variance, and some matrix algebra can show that projecting onto the eigenvectors does indeed maximize the captured variance, but that's more effort than I want to invest now.

Anyway, the second, third, etc., capture the subsequently largest axes of variance. Additionally, the eigenvectors are mutually orthogonal, so the data can be re-expressed in terms of the principal components. (We'll verify the orthogonality, and that the eigenvalues account for all of the variance, right after computing them below.)

Let's take a look.

In [5]:
eigData = eigen(cvMat)
eigVec = eigData$vectors
eigVal = eigData$values

cat('Here are the eigenvectors. The left-most column is the first eigenvector,\nwhich captures the greatest axis of variance in the data.\nEigenvectors 2-4 are sequentially right-ward columns. \n\n')
df = data.frame(eigVec); colnames(df)=c('eigVec-1','eigVec-2','eigVec-3','eigVec-4')
df
Here are the eigenvectors. The left-most column is the first eigenvector,
which captures the greatest axis of variance in the data.
Eigenvectors 2-4 are sequentially right-ward columns. 

Out[5]:
   eigVec-1     eigVec-2     eigVec-3     eigVec-4
1  -0.04170432   0.04482166   0.07989066   0.9949217
2  -0.9952213    0.05876003  -0.06756974  -0.0389383
3  -0.04633575  -0.9768575   -0.2005463    0.05816914
4  -0.0751555   -0.2007181    0.9740806   -0.07232502
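Two of the claims above are easy to verify with the objects we just computed (a quick sketch, as promised): the eigenvectors are orthonormal, and the eigenvalues account for all of the original variance.

    # Orthonormal eigenvectors: the cross-products give the identity matrix
    round(t(eigVec) %*% eigVec, 10)

    # The eigenvalues sum to the trace of the covariance matrix, i.e. the total variance
    all.equal(sum(eigVal), sum(diag(cvMat)))   # TRUE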

4) Compute new data from feature vector

Now that we have the eigenvectors, we will choose a feature vector, composed of a subset of the PCs. Let's choose the first two, which capture the top two axes of variation.

By multiplying the transposed feature vector with the transposed data matrix (and then centering), we can project the original data onto the principal components.

"Why is this?"

What actually happens during the matrix multiplication is essentially a series of dot products, which, if you remember from linear algebra, measure how much of one vector lies along another. Taking dot products with the eigenvectors re-expresses each observation in the new coordinate system, and that is how PCA actually does dimensionality reduction: keeping only the PCs with the most variance preserves most of the structure while preferentially discarding redundancy and noise.

In [6]:
# Define feature vector
featureVec=eigVec[,1:2]

# Project the data along the PCs
dataInPC = t(featureVec) %*% t(USArrests)
dataInPC = scale(t(dataInPC),scale=F)

# Look at the new data in PC space
df=data.frame(dataInPC); colnames(df)=c('PC1','PC2')
head(df)
Out[6]:
                 PC1        PC2
Alabama     -64.80216   11.44801
Alaska      -92.82745   17.98294
Arizona    -124.0682    -8.830403
Arkansas    -18.34004   16.70391
California -107.423    -22.52007
Colorado    -34.97599  -13.71958
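To make the dot-product picture concrete, here's a one-line sketch that computes Alabama's PC1 value as a single dot product of its centered row with the first eigenvector; it should reproduce the -64.80216 above.

    # Alabama's PC1 score = centered Alabama row dotted with the first eigenvector
    sum(centerData["Alabama", ] * eigVec[, 1])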

Sanity check

Now that we have calculated the new data in terms of the top two PCs, let's see how our by-hand calculation compares to the result from the built-in R function, princomp.

First, let's run PCA using the built-in R method.

In [7]:
pca = princomp(USArrests)
In [59]:
summary(pca)
Out[59]:
Importance of components:
                           Comp.1      Comp.2      Comp.3       Comp.4
Standard deviation     82.8908472 14.06956001 6.424204055 2.4578367034
Proportion of Variance  0.9655342  0.02781734 0.005799535 0.0008489079
Cumulative Proportion   0.9655342  0.99335156 0.999151092 1.0000000000
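Before plotting, a quick numeric sanity check (a sketch; the sign of each principal component is arbitrary, so I compare absolute values and ignore attribute names):

    # Hand-rolled scores vs princomp's scores for the first two components
    all.equal(abs(dataInPC), abs(pca$scores[, 1:2]), check.attributes=FALSE)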

And now, let's plot the new data along the top two PCs

In [8]:
options(repr.plot.height=6 , repr.plot.width=10)
# PCA calculated by hand
par(mfrow=c(1,2),font.lab=2,cex.lab=1.5,cex.main=2)
#plot(dataInPC[,1:2],type='n',xlab='PC1',ylab='PC2',main='PCA by hand')
plot(dataInPC,type='n',xlab='PC1',ylab='PC2',main='PCA by hand')
text(dataInPC, labels=row.names(dataInPC), cex=1.2)

# PCA calculated by the `princomp` function
plot(pca$scores[,1:2],type='n',xlab='PC1',ylab='PC2',main='Built-in PCA in R')
text(pca$scores[,1:2], labels=row.names(USArrests), cex=1.2)

Success!

The data look pretty much exactly the same (possibly up to a sign flip along an axis, since the sign of each principal component is arbitrary). So there you have it. We explored some fundamentals of PCA and went through the process of reducing the dimensionality of the data from 4 to 2. Maybe now you can cluster as well as a five-year-old :)