“Genes mirror geography within Europe” is a recent paper looking at genetic data on some 3,000 people from across Europe, with >500,000 variables per person. It’s fairly common to use dimension reduction techniques in genetics, and principal component analysis (PCA) is the oldest and fastest of these.
The paper is a representative example of this sort of technique, so I thought I would just explain it and its visualisations this week. PCA is sometimes included in a “machine learning” toolbox, and its capacity to crunch through many variables makes it appealing to fans of contemporary predictive modelling in the “data science” school.
You can think of the data as a cloud of 3,000 dots. If you know a horizontal and a vertical location for each dot, you can draw a scatterplot; if you also had depth, height and width, you could place each dot in 3-dimensional space. Here, each dot has 500,000 co-ordinates (genetic features measured at certain points in the genome), so each has a location in 500,000-dimensional space. That’s hard to visualise, so some compromise will be needed.
We want to show as much information as possible. But what do we mean by “information”? If we can quantify that concept, we can find a way of looking at the cloud in just two dimensions — projecting it onto a piece of paper, or photographing it from a certain angle — that maximises the “information”. Well, one obvious candidate is the variance, a statistical measure of how spread out data are along one axis. There are other such measures, but variance has a nice property and relates directly to PCA.
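A minimal numpy sketch of that idea (a toy stand-in, not the paper’s code or data): make a stretched 3-D cloud, find its principal components via the singular value decomposition, and project onto the first two — the flat view that keeps the most variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "cloud": 500 points in 3-D, stretched so the three axes have
# very different spreads -- a stand-in for real data.
X = rng.normal(size=(500, 3)) * np.array([10.0, 3.0, 0.5])
X = X - X.mean(axis=0)  # PCA works on mean-centred data

# Principal components come from the SVD of the centred matrix;
# squared singular values give the variance along each PC (descending).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
var_per_pc = s**2 / (len(X) - 1)

# Project onto the first two PCs: the 2-D view with maximum variance.
X2 = X @ Vt[:2].T

print(var_per_pc / var_per_pc.sum())  # share of total variance per PC
```

The same projection is what `PC1` and `PC2` mean in the paper’s figures, just with 500,000 columns instead of 3.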
First, though, imagine a 3-dimensional cloud of data shaped like a pitta bread. The three axes have different variances. If you wanted to take a photograph of the pitta to show people what it is like, it would be odd to take it end-on, so that it looked like a long pencil-shaped finger of bread (right, below). You’d be losing a lot of information. However, there is of course no ideal way of taking this photograph: even taking it so that the two highest-variance dimensions are visible (left, below) loses the little bit contained in the depth of the pitta.
So, if you have to reduce dimensionality, you want to be sure to include dimensions that have high variance without much distortion. The standard deviation is the root-mean-square distance from the mean, and the variance is its square, so it is related to what matrix algebra calls the L2 norm. I bet you’re thinking here about Pythagoras, and how the squared distances in the x and y directions add up to the squared distance straight across them (the hypotenuse). This is what’s neat about using the variance as a measure of information in dimension reduction: the variances along each dimension add together to give the total variance, related to the (squared) L2 norm of the whole centred data matrix. You might recall some tedious stuff like this from Analysis Of Variance if you studied Old-Fashioned Statistics 101. By showing some dimensions, you are partitioning the total variance into seen and unseen.
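That partitioning claim is easy to check numerically. A short sketch (simulated data, not the paper’s): the per-column variances and the per-principal-component variances add up to the same total, because rotating the axes doesn’t change the Frobenius/L2 norm of the centred matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 4-D data: mix independent columns through a random matrix.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)

# Total variance: sum of per-column variances, i.e. ||X||_F^2 / (n - 1).
total_var = X.var(axis=0, ddof=1).sum()

# Variances along the principal components partition the same total.
s = np.linalg.svd(X, compute_uv=False)
pc_var = s**2 / (len(X) - 1)

print(total_var, pc_var.sum())  # the two totals agree
```

Showing PC1 and PC2 means showing `pc_var[0] + pc_var[1]` of that total; everything else is the unseen part.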
So, let’s look at what they got out of this genetic data.
The 2-dimensional reduction caught a lot of people’s eyes because it nearly re-creates a geographical map of Europe based solely on the genetic data. There are some nice features, like the elongation of Italy from France and Switzerland down to Greece and Cyprus.
Here are some more dataviz thoughts on it:
- It’s difficult to know what to do in a scatterplot with many categories like this. Including a two-letter code, if it is unambiguous, is a pretty good option.
- The letters have to be quite small and are jumbled on top of one another. I wonder if it would help to show a random sample in one version of this scatterplot. Or maybe convex hulls (bagplot, if you prefer) in another.
- The colours mean nothing to me (Vienna)
- The big opaque markers for mean/median/whatever are far too overbearing
- Why does Scotland get special treatment? I mean, that’s nice of the authors, but I don’t want the rest of Europe getting envious.
- Some countries have few data; reducing the size of the mean/median/whatever marker would help convey this.
- I always want to see some measure of discarded information per observation in dimension reduction. Perhaps the same scatterplot with L1/L2 norm of the other principal components encoded as colour of the marker, or something like that. I want to know if there are pockets of data where this summary doesn’t do them justice.
- The authors write “The rotation of axes used in Fig. 1 is 16 degrees counterclockwise and was determined by finding the angle that maximizes the summed correlation of the median PC1 and PC2 values with the latitude and longitude of each country.” Fair enough, but why has nobody ever heard of Procrustes analysis? Go Google it. Procrustes is the basis of analysis of shapes. If you can match the countries in principal component space with the countries in Europe (some map projection anyway), then you can calculate what isotropic or anisotropic transform is required to get them as close to each other as possible. It’s especially useful if you have more than one dimension reduction and want to quantitatively compare them in terms of how they match up to the gold standard (the map).
- Before you do any more criticism, read the paper, at least the intro section. The authors acknowledge and explain many limitations in a very clear and interesting way.
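The Procrustes suggestion above can be sketched with `scipy.spatial.procrustes`. The numbers here are made-up stand-ins for the per-country median (PC1, PC2) values and the matching (longitude, latitude) co-ordinates — not the paper’s data:

```python
import numpy as np
from scipy.spatial import procrustes

# Hypothetical per-country summaries: median (PC1, PC2) positions...
pcs = np.array([[0.02, 0.11], [-0.05, 0.08], [0.10, -0.03], [-0.01, -0.09]])
# ...and the matching (longitude, latitude) of the same four countries.
geo = np.array([[2.35, 48.86], [13.40, 52.52], [12.50, 41.90], [-3.70, 40.42]])

# procrustes translates, scales and rotates one configuration onto the
# other; `disparity` is the leftover sum of squared differences after the
# best fit, so smaller means the PC map matches geography better.
mtx_geo, mtx_pcs, disparity = procrustes(geo, pcs)
print(disparity)
```

With several competing dimension reductions, you could compute one disparity per method against the same map and compare them directly.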
There are many other ways of reducing dimensionality, like correspondence analysis or t-SNE. I’ll come back to them here one day. They all get a broad-brush overview in my dataviz book, which is planned to come out next year.