Car data and its correlations are a good way of showing how PCA treats each variable. For example, acceleration and weight are negatively correlated, while displacement and horsepower are highly correlated.
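A minimal sketch of what that looks like, assuming seaborn's bundled `mpg` dataset (which needs seaborn installed and network access to fetch the data):

    import seaborn as sns
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    cols = ["acceleration", "weight", "displacement", "horsepower"]
    df = sns.load_dataset("mpg").dropna(subset=cols)

    # Pairwise correlations: acceleration vs weight is negative,
    # displacement vs horsepower is strongly positive.
    print(df[cols].corr().round(2))

    # PCA on the standardized columns; the component loadings show how
    # each original variable contributes to each principal component.
    X = StandardScaler().fit_transform(df[cols])
    pca = PCA().fit(X)
    print(pca.explained_variance_ratio_.round(2))
    print(pca.components_.round(2))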
I'm sure it's still very popular in engineering. It's still very much the lingua franca for numerical methods applied to engineering problems - there are decades of material on solving engineering problems written in MATLAB.
Matlab is great. I won’t hire anyone whose only experience is in it, though, whereas I might hire someone whose only experience is with Python’s numerical analysis tools. It’s brittle and difficult to put into production in my experience.
It's used heavily in school still. I think statistics and data science in general have dropped it along with other relics like SAS (used today only by those people who haven't bothered learning anything other than SAS), and are sticking with R and python for functionality previously done in matlab.
Yes. We made a proof-of-concept thing on top of a closed-source 2.4 GHz radio wave propagation simulator built in matlab. Interfacing to compiled matlab through a hacky command line and .mat files is a pretty horrifying experience :)
Let’s say you eat a piece of cake. You say, “hmm, salty, sweet, fruity, nice texture.” Let’s call those attributes the “principal components” of the cake from your point of view. Are there others? Maybe you could have also said “moist, or fluffy,” but you didn’t because those weren’t as obvious, so not as important.
When the cake was made, according to the recipe, there were no instructions on how to add “sweet” or “fruity.” Instead, there was a list of ingredients: sugar, vanilla, lemon juice, flour, water, baking powder, etc. The mixture of these ingredients in the quantities dictated (plus the baking) resulted in the cake having the characteristics that you described. Some of the characteristics have a strong reliance on just one or two ingredients, e.g. “sweet” with “sugar,” and some characteristics are the result of subtle combinations of many ingredients, e.g. “texture.”
The list of characteristics (principal components) definitely describes the cake, but in a more convenient and relevant way. You don’t need the whole list of ingredients to describe the cake. This is what makes principal component analysis useful.
A more recent approach to visualizing high-dimensional data is the t-SNE algorithm, which I normally use together with PCA when exploring big data sets. If you're interested in the differences between both methods, here's a really good answer: https://stats.stackexchange.com/a/249520.
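For what it's worth, the pattern I mean is PCA first to knock the dimensionality down, then t-SNE on the reduced data. A rough sketch with sklearn's digits dataset:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    X, y = load_digits(return_X_y=True)        # 1797 samples, 64 dimensions

    # PCA first: denoises the data and speeds up t-SNE considerably.
    X_pca = PCA(n_components=30).fit_transform(X)

    # t-SNE for the final 2-D embedding used for plotting.
    X_2d = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(X_pca)
    print(X_2d.shape)                          # (1797, 2)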
I think PCA is a good reason to learn enough linear algebra to understand PCA. It means learning about basis, rank, low-rank approximation, orthogonality, eigenvectors, spectral decomposition, etc. There's a whole iceberg of concepts that goes into actually understanding PCA, without which PCA is not really understood.
> a transformation no different than finding a camera angle
I’ve used PCA a bit in the past and it’s so abstract that one forgets how to conceptualize it shortly after finishing the task. This is an interesting and memorable way to put it, I like that.
PCA is a cool technique mathematically, but in my many years of building models, I've never seen it result in a more accurate model. I could see it potentially being useful in situations where you're forced to use a linear/logistic model since you're going to have to do a lot of feature preprocessing, but tree ensembles, NNs, etc. are all able to tease out pretty complicated relationships among features on their own. Considering that PCA also complicates things from a model interpretability point of view, it feels to me like a method whose time has largely passed.
> Considering that PCA also complicates things from a model interpretability point of view
This is a strange comment, since my primary use of PCA/SVD is as a first step in understanding the latent factors that are driving the data. Latent factors typically cover all of the important things that anyone running a business or deciding policy cares about: customer engagement, patient well-being, employee happiness, etc. all represent latent factors.
If you have ever wanted to perform data analysis and gain some exciting insight into explaining user behavior, PCA/SVD will get you there pretty quickly. It is one of the most powerful tools in my arsenal when I'm working on a project that requires interpretability.
The "loadings" in PC and the V matrix in SVD both contain information about how the original feature space correlates with the new projection. This can easily show thing things like "User's who do X,Y and NOT Z are more likely to purchase".
Likewise, running LSA (Latent Semantic Analysis/Indexing) on a term-frequency matrix gives you a first pass at semantic embeddings. You'll notice, for example, that "dog" and "cat" project onto a common component in the new space, which can be interpreted as "pets".
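A toy LSA sketch, assuming sklearn and a tiny made-up corpus (real corpora behave much better, of course):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "my dog chased the ball",
        "the cat and the dog sleep all day",
        "my cat ignores the ball",
        "interest rates moved the stock market",
        "the stock market fell on rate fears",
    ]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)                  # sparse term-frequency matrix
    terms = vec.get_feature_names_out()

    svd = TruncatedSVD(n_components=2, random_state=0).fit(X)

    # Each row of components_ is a latent "topic"; dog/cat/ball should load
    # on one component, stock/market/rates on the other.
    for i, comp in enumerate(svd.components_):
        top = comp.argsort()[::-1][:4]
        print(f"component {i}:", [terms[j] for j in top])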
> I've never seen it result in a more accurate model. I could see it potentially being useful in situations where you're forced to use a linear/logistic model
PCA/SVD are a linear transformation of the data and shouldn't give you any performance increase on a linear model. However they can be very helpful in transforming extremely high dimensional, sparse vectors into lower dimensional, dense representations. This can provide a lot of storage/performance benefits.
> NNs, etc. are all able to tease out pretty complicated relationships among features on their own.
PCA is essentially a linear autoencoder minimizing MSE: with no non-linear layers, the autoencoder recovers the same subspace. It is a very good first step towards understanding what your NN will eventually do. After all, an NN is just a sequence of non-linear matrix transformations arranged so that your final vector space is ultimately linearly separable.
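A quick sanity check of that claim, assuming PyTorch is available (the linear autoencoder recovers the same subspace as PCA, not the same basis, so compare reconstruction error):

    import torch
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X = load_iris().data
    X = X - X.mean(axis=0)
    X_t = torch.tensor(X, dtype=torch.float32)

    k = 2
    enc = torch.nn.Linear(4, k, bias=False)      # no non-linearities anywhere
    dec = torch.nn.Linear(k, 4, bias=False)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
    for _ in range(3000):
        opt.zero_grad()
        loss = ((dec(enc(X_t)) - X_t) ** 2).mean()
        loss.backward()
        opt.step()

    pca = PCA(n_components=k).fit(X)
    X_rec = pca.inverse_transform(pca.transform(X))

    print("linear autoencoder MSE:", loss.item())
    print("rank-2 PCA MSE:        ", ((X_rec - X) ** 2).mean())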
Sure, everyone wants to get to the latent factors that really drive the outcome of interest, but I've never seen a situation in which principal components _really_ represent latent factors unless you squint hard at them and want to believe. As for gaining insight and explaining user behavior, I'd much rather just fit a decent model and share some SHAP plots for understanding how your features relate to the target and to each other.
If you like PCA and find it works in your particular domains, all the more power to you. I just don't find it practically useful for fitting better models and am generally suspicious of the insights drawn from that and other unsupervised techniques, especially given how much of the meaning of the results gets imparted by the observer who often has a particular story they'd like to tell.
I've used PCA with good results in the past. My problem essentially simplified down to trying to find nearest neighbours in high dimensional spaces. Distance metrics in high dimensional spaces don't behave nicely. Using PCA to reduce the number of dimensions to something more manageable made the problem much more tractable.
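The pattern is roughly this (sketch, assuming sklearn; the number of components is whatever keeps enough variance for the distances to stay meaningful):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 512))           # high-dimensional points

    # Reduce to a more manageable dimensionality first.
    X_low = PCA(n_components=20).fit_transform(X)

    nn = NearestNeighbors(n_neighbors=5).fit(X_low)
    dist, idx = nn.kneighbors(X_low[:3])         # neighbours of the first 3 points
    print(idx)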
By definition there are more accurate models; PCA is more like a general lossy compression algorithm. Any model you come up with can be superseded by a more accurate model, right up until you have a perfect description of the phenomenon. But PCA is a well understood technique, it can be computed very fast using optimized algorithms and GPUs, pretty much anyone can easily understand it and apply it to a wide variety of problems, and for a given number of retained dimensions it preserves the maximum amount of variance in the data.
We use PCA quite a lot at my quant firm to do something similar to clustering in high dimensional spaces. A simple use case would be to arrange stocks so that stocks that move similarly to one another are grouped close together.
Another use case for PCA is breaking stocks down into constituent components, for example being able to express the price of a stock as a linear combination of factors: MSFT = 5% oil + 10% interest rates + 40% tech sector + ...
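A rough sketch of the statistical-factor version of that, on synthetic returns (the factors here are unnamed principal components rather than oil/rates/tech, which you'd get by regressing on named series instead):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n_days, n_stocks = 750, 50
    returns = rng.normal(scale=0.01, size=(n_days, n_stocks))
    returns += 0.02 * rng.normal(size=(n_days, 1))      # shared "market" factor

    pca = PCA(n_components=5).fit(returns)
    factors = pca.transform(returns)                    # daily factor returns

    # Loadings for one stock: how much of its move each factor explains,
    # i.e. stock_0 ~ mean + sum_k factors[:, k] * loadings[k].
    stock_0_loadings = pca.components_[:, 0]
    print(stock_0_loadings.round(3))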
You can also do this for things like ETFs, where in principle an ETF might be made up of 100 stocks, but in practice only 10 of those stocks really determine the price, so if you're engaged in ETF market making you can hold a neutral portfolio by carrying the ETF long and a small handful of stocks short.
By definition, it's going to result in a less accurate model, unless you keep all of the dimensions or your data is very weird, right? And NNs are going to complicate your interpretability more?
When/if used properly, no. The idea behind PCA is to find a set of features with far lower dimensionality than the original data. The hope/intent with this sort of approach is that any additional features would mostly just be fitting noise.
For people who are curious, the GP is correct when it comes to fitting the training data. Recall, with enough parameters, we can get 100% on training. The parent’s comment is about testing/validation where we want to avoid overfitting so removing the least important parameters can be helpful.
PCA is good enough for a lot of things. For example, it is used in genetics to measure relatedness between populations reasonably well. A perfect model doesn't really exist when the data you are able to realistically collect is only a subset of the population anyway, perhaps biased toward how it was collected.
if you know that your data comes from a stationary distribution, you can use it as a compression technique which reduces the computational demands on your model. sure, computing the initial svd or covariance matrix is expensive, but once you have it, the projection is just a matrix multiply and a vector subtraction. (with the reverse being the same)
if you have some high dimensional data and you just want to look at it, it's a pretty good start. not only does it give you a sense for whether higher dimensions are just noise (by looking at the eigenspectrums) it also makes low dimensional plots possible.
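both points in a sketch, assuming sklearn: once fitted, the projection really is just a subtraction and a matrix multiply, and the eigenspectrum tells you how many dimensions carry signal.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 100)) @ rng.normal(size=(100, 100))

    pca = PCA(n_components=10).fit(X)          # the expensive part, done once

    # The projection itself is cheap: subtract the mean, multiply by the components.
    x_new = rng.normal(size=(1, 100))
    z = (x_new - pca.mean_) @ pca.components_.T
    x_back = z @ pca.components_ + pca.mean_   # the reverse is the same in reverse

    assert np.allclose(z, pca.transform(x_new))

    # Eigenspectrum: how much variance each retained component explains.
    print(pca.explained_variance_ratio_.round(3))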
pca, cca and ica have been around for a very long time. i doubt "their time has passed."
It is still a nice tool for projecting things (at least to visualize) where you expect the data to be on a lower dimensional hyperplane. I do agree in most cases t-SNE or UMAP are better (esp if you don’t care about distances).
I put the four dots on the corners of a square and the fifth in the center. This results in the same square in the PCA pane but rotated about 45 degrees. Then, if you take one of the dots on the square corner and move it ever so slightly in and out, you see the PCA square wildly rotating. Pretty cool to demonstrate sensitivity to small changes in the inputs.
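You can reproduce that without the interactive page; a numpy sketch of the same square-plus-center setup:

    import numpy as np
    from sklearn.decomposition import PCA

    pts = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1], [0, 0]], dtype=float)

    for eps in (0.05, 0.0, -0.05):
        p = pts.copy()
        p[3] += eps                  # nudge one corner slightly in/out along its diagonal
        pc1 = PCA(n_components=2).fit(p).components_[0]
        angle = np.degrees(np.arctan2(pc1[1], pc1[0]))
        print(f"eps={eps:+.2f}  first PC at {angle:6.1f} degrees")

    # With eps = 0 the two eigenvalues are equal and the direction is arbitrary;
    # an arbitrarily small nudge flips the first PC between the two diagonals.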
I was thinking the other night: PCA can be used on images for compression, so what would it look like if you took two images, paired up their principal components, and then lerped between them as a transition effect?
Not really, I was thinking that by gradually shifting between the principal components of the two images you could subtly morph from one to the other, but it might just look like visual garbage instead :) maybe start with the lowest-variance components and then gradually move to the strongest ones.
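If anyone wants to try it, a rough sketch of one way to do it (assuming two same-sized grayscale images as 2-D numpy arrays, and lerping scores in a shared PCA basis rather than pairing components directly):

    import numpy as np
    from sklearn.decomposition import PCA

    def pca_lerp(img_a, img_b, t, k=32):
        """Blend two images by interpolating their rows in a shared PCA basis."""
        # k must be at most min(2 * image height, image width).
        pca = PCA(n_components=k).fit(np.vstack([img_a, img_b]))
        za, zb = pca.transform(img_a), pca.transform(img_b)
        z = (1 - t) * za + t * zb               # lerp in component space
        return pca.inverse_transform(z)

    # frames = [pca_lerp(img_a, img_b, t) for t in np.linspace(0, 1, 30)]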
if I recall correctly, yeah, there probably will be: linear regression minimises the vertical distance of a point to the regression line, whereas PCA minimises the orthogonal distance of the point to the line.
Linear regression uses a measure of "error" for every data point. Visually, the error is the vertical distance between a data point and the regression line/plane. In contrast, PCA measures the distance from the data point to the PCA axis along the direction perpendicular to that axis; the point where that perpendicular meets the axis is the "projection" of the data point.
There is something known as orthogonal regression (total least squares) which uses the same measure as PCA. Unfortunately it doesn't work well when the variables are in incompatible units, since orthogonal distances then mix different scales.
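A sketch of the difference on synthetic data (OLS via polyfit, TLS via the SVD of the centered data, which is the same direction as the first principal component):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 2 * x + rng.normal(scale=0.5, size=200)

    # Ordinary least squares: minimises vertical residuals.
    slope_ols = np.polyfit(x, y, 1)[0]

    # Total least squares: minimises orthogonal residuals, i.e. the first PC direction.
    X = np.column_stack([x, y])
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    slope_tls = Vt[0, 1] / Vt[0, 0]

    print(slope_ols, slope_tls)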
If you know a bit of linear algebra the transformation is surprisingly intuitive.
Your goal is to create a set of orthogonal vectors, each capturing as much of the variance in the original data as possible (the assumption being that variance is where most of the information is).
This is achieved by performing an eigendecomposition of the covariance matrix of the original data. Essentially you are finding the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue.
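A minimal numpy version of exactly that:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))   # correlated toy data
    Xc = X - X.mean(axis=0)

    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]               # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    scores = Xc @ eigvecs                           # project onto the principal axes
    print(eigvals)                                  # variance captured by each PC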