Data visualization with statistical reasoning: many variables

This post is in a series that takes one image from my book Data Visualization: Charts, Maps and Interactive Graphics, published September 2018 by CRC Press, and explores the statistical thinking behind it.

In the chapter Many Variables, I look at the problem of visualising data like this:

Data collected by Jim Sidanius at Harvard

Each row is a student who has answered a questionnaire on their satisfaction with teaching at their university, and each column is one of the questions they were asked. Often in data visualisation, we use a familiar two-dimensional format where one variable is represented by horizontal position and another by vertical position. This is easy to read, but to include more variables you have to use some tricks.

I’ll explore this with a simple example where we have three variables (which we can almost visualise in three dimensions as a cloud of dots, each observation getting a left-right location, a front-back location, and an up-down location from those three variables), and want to show it in two dimensions so that it can be printed on a page or shown on a screen.

First, imagine this cloud of dots. It is going to look like a murmuration of starlings — one of those huge, swirling, self-organising flocks, containing so many birds that they just appear to be minuscule dots — and when the photographer points the camera at it from a certain direction and clicks, the light arriving from the starlings lands on a two-dimensional surface inside the camera and is captured.

A murmuration of starlings at Rigg. Photograph by Walter Baxter. CC-BY-SA 2.0

From that direction, you get an idea of where the birds are to the photographer’s left, right, up and down orientations, but no idea of whether they are closer to, or further away from, the lens. You obtain a two-dimensional image but at some cost. You have projected the light from the starlings onto the camera at a certain angle, in straight lines, and this idea of a projection is one we need to grapple with.

In the book, I show this image, where I created a cloud of 1000 points, shaped like the planet Saturn. I used R and you can access the code at my webpage for the book. Each point has three coordinates: front-back, left-right and up-down. But we need to represent it on a page or screen, in just two dimensions.
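The book’s actual R code is on that webpage; purely as an illustration (not the book’s code), here is a minimal sketch in Python with NumPy of how a Saturn-like cloud of 1000 points could be built — a spherical “planet” plus a flat annulus of “rings”:

```python
import numpy as np

rng = np.random.default_rng(42)

# The "planet": 700 points scattered uniformly on a unit sphere,
# made by normalising Gaussian draws to length 1
n_sphere = 700
v = rng.normal(size=(n_sphere, 3))
sphere = v / np.linalg.norm(v, axis=1, keepdims=True)

# The "rings": 300 points in a thin annulus in the equatorial plane
n_ring = 300
radius = rng.uniform(1.5, 2.2, size=n_ring)
angle = rng.uniform(0, 2 * np.pi, size=n_ring)
rings = np.column_stack([radius * np.cos(angle),
                         radius * np.sin(angle),
                         rng.normal(scale=0.02, size=n_ring)])

# 1000 points, each with front-back, left-right, up-down coordinates
saturn = np.vstack([sphere, rings])
print(saturn.shape)  # (1000, 2 + 1) i.e. (1000, 3)
```

The split of 700 planet points to 300 ring points is my guess for illustration; the point is simply that every observation carries three coordinates.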

The Saturn data shown in two dimensions by principal components analysis (PCA) on the left, and t-distributed stochastic neighbor embedding (t-SNE) on the right. Image by Robert Grant, copyright CRC Press.

There are many instances like this with even higher-dimensional data. Every variable that gives you values for observations in your data can be thought of as a dimension.

Here are the same projections, but using colour to identify North (orange) and South (blue) in the left image and longitude (rainbow colours, sorry for offending any dataviz cognoscenti, but this is intended to show how the analysis works, rather than give real insights into the shape of Saturn) in the right image.

This is the PCA projection, so it is just looking straight down from above the North pole. It appears oval just because of the aspect ratio when I created the graphs (I didn’t want them to take too much vertical space on the screen). Longitudes are separated but latitudes are mixed up. Points at the North pole are mingled with those from the South pole, despite being at opposite ends of the “planet”. PCA is doing its best to project into 2 dimensions, and with a sphere, there is no reason to pick one projection over another. However, the rings add extra data which push PCA towards looking directly down.

PCA chooses the direction along which you point the proverbial camera in order to capture as much variance as possible in the resulting image, and sometimes that’s what you want, but sometimes you need to think about it and overrule PCA to show something more meaningful. Notably, it might be that two dimensions just don’t capture enough of the variance, and more than one image is called for. I’ll leave the details of how you make that decision, and what to do for multiple images, aside for now. It’s best at that point to involve an expert in multivariate statistics anyway, rather than trying to wing it.
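That “camera direction” is nothing mysterious: the principal components are just the directions of greatest variance in the centred cloud, which NumPy’s singular value decomposition hands you directly. A sketch (illustrative, not the book’s code):

```python
import numpy as np

def pca_2d(points):
    """Project an (n, d) cloud onto its two highest-variance directions."""
    centred = points - points.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance captured
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    # Straight, parallel lines of projection onto the top two directions
    return centred @ vt[:2].T

rng = np.random.default_rng(0)
# A cloud stretched mostly along one axis, a little along a second,
# and barely at all along the third
cloud = rng.normal(size=(1000, 3)) * [3.0, 1.0, 0.2]

flat = pca_2d(cloud)
print(flat.shape)  # (1000, 2)
```

By construction, the first column of `flat` carries at least as much variance as the second — PCA has pointed the camera down the flattest axis of the cloud.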

PCA is like the photograph I described above. Every point is projected by a straight line to the camera, and those lines are all straight and parallel. Mathematically, we call this isotropic: every region of space gets treated the same way. But sometimes we can understand patterns in the data better by warping the lines of projection, and that’s called anisotropic. Some regions might be squashed together and others stretched apart. Support vector machines or Procrustes analysis of shapes use anisotropic projections too, for different purposes in the world of statistics.

Now, let’s look at the t-SNE projections.

t-SNE is an iterative procedure; it tries various warpings, and keeps moving towards a better separation of points that are distant in the full 3-dimensional space (or higher, if your data have more than three variables). There’s no shortcut calculation that can take it straight to the optimal warp, and in fact there’s no guarantee that its iterations will arrive at the optimum in the end. But in dataviz, there generally is no optimum anyway; we have to compromise and present our message clearly without misleading the reader.

One parameter that controls its iterations is called perplexity, and essentially measures distances by referring each point to a certain number of neighbours. Increase perplexity and you force it to try to be fair in representing distances across a wider region. That can sometimes reveal insights about the data structure in high-dimensional space, and sometimes a low perplexity is better. Above, I used the default perplexity of 30 (30 neighbouring points out of the total 1000).

As you can see, the warping has kept the idea that the rings are separate from the planet, and twisted the planet so that north and south poles are separated. In doing so, it has broken the rings apart. Some distances in the rings are now mis-represented compared to PCA, but others in the planet are improved. Because there are more points in the planet than in the rings, the planet won and the rings got distorted.

Unfortunately, the longitude is not well represented. Individual colours appear in two or three distinct patches; you can see this for the pink-orange zone or the teal zone. Perhaps this is because the perplexity needs to be increased. That would allow the neighbourhood of each point to stretch out further, and it might keep the colours together. Let’s try 60:

That seems a little better, in that there’s a pink stripe running through the projected points, but it’s not great. Let’s go up to 90:

It’s hard to say whether this has helped. It’s at least not degraded the North-South separation on the left, so let’s try 120:

I think this is a little better. Now 150:

Here the colour (longitude) separation is not very different to perplexity 120. As we increased it, the image turned round and the colours flipped from side to side, but that doesn’t matter. One thing about this final image that makes me uneasy is that the rings are increasingly getting pulled into the mass of points that make up the planet, and that feels like a distortion. I would probably stick with 90 or 120 for that reason.

The same problems we had to think about (but could quite easily overcome) with Saturn return with reinforcements as the number of variables / dimensions increases. Soon, the compromises you have to make become so severe that a single image is just not an option.

The bottom line

If you have data with many variables, and you want to show how the observations cluster together, which points are similar to one another, etc., then don’t give up. There are many dimension reduction techniques you can use. In the book, I also describe correspondence analysis, which does this for categorical variables. They are not hard to achieve with a little bit of code, as you’ll see on the webpage, and you can get correspondence analysis or principal components analysis through drop-down menus if you prefer, in at least Stata and SPSS that I know of. For t-SNE, you need R or Python or Julia.
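In Python, for example, both reductions are one-liners with scikit-learn (assuming it is installed; parameter names as in recent versions — check your version’s docs):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # 200 observations, 5 variables

# Linear, isotropic projection onto the two top-variance directions
pca_2d = PCA(n_components=2).fit_transform(X)

# Iterative, anisotropic embedding; perplexity must be below n_samples
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(pca_2d.shape, tsne_2d.shape)  # (200, 2) (200, 2)
```

Fixing `random_state` makes the t-SNE result repeatable, which matters because (as above) different runs of the iterations can land in different configurations.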

Try different approaches. Try different parameters. Keep in mind what you want to show, and highlight particular data points (for your own contemplation) so that you can understand what’s going on and make an informed choice of final image. Be prepared to explain the procedure to your audience. Keep it simple so they stay engaged; talk about things like taking photographs of flocks of birds rather than appealing to matrix algebra, and don’t wave it off as a mysterious and magical process. Always put yourself in your audience’s shoes, and if you can, user-test your visualisation and explanation before launching it.
