Special mention for Gwilym Lockwood:

**Methodviz** of the year goes to John Zech for “What are radiological deep learning models actually learning,” an investigation into what one particular deep learning model did when it was charged with predicting respiratory diseases by looking at chest X-rays. If you are excited about deep learning, AI and such in the medical world, *stop!* and read this. There’s a helluva punch line; I’ll leave it to John.

Here are three approaches to showing uncertainty around a single curve in two-dimensional space.

There are two variables, and implicitly a third, which might be time; the curve moves from bottom left (where, initially, there is no uncertainty) to top right (and the uncertainty increases along the way). The data for this image are artificial, but I was thinking of hurricane tracks as well as any time series forecast where uncertainty is so important and increases further into the future.

The three approaches are, in essence:

- Show one shape with an X% chance that the “truth” will turn out to lie outside it. By “truth”, I mean a population parameter, if you are doing inferential statistics, or future data if you are doing prediction. It could be something else too, like a rank, model, or missing value, but the same principle applies. Take this idea of the chance of the truth lying in some location and imagine it as a surface, rising where the chance is high and dropping where it is low. This surface rises up to the summit, which is our best guess. We could add the best guess to the image; it is the central line in this image, but could also be a point. Then, our shape is a contour line: the surface is at the same height all around that contour. By the way, that surface is something that we really do deal with statistically. It is a posterior probability density function if you are doing Bayesian statistics, or a likelihood function otherwise. But there are ways of getting these uncertainty measures without full-on likelihood or Bayes. The bootstrap gives you this by just picking the central X% of resampled statistics. Shortcut (asymptotic) inferential formulas sometimes rely on approximations. Approximate Bayesian Computation (ABC) generates phony data according to different parameter values, and compares them to what you actually observed.
- Show several of these contours, like a topographic map. You might prefer to colour in the area according to the height of the surface instead, in which case it might be clearer to do some kind of smoothing; I’ll have another blog post soon on the subject of smoothing in data visualisation.
- Instead of showing the values (height) of the probability / likelihood, draw values at random from that function and show them. If you are bootstrapping or doing Bayesian stats by simulation, then this is simple because you can just draw the bootstrap stats or the retained simulations. The posterior / likelihood then acts as a *data generating process*, a crucially important mental tool for good data analysis, but something we’ll have to leave for another time.

If you are showing one value at a time, the classic error bar is the result of option 1. Tukey and others have proposed versions with multiple levels of uncertainty, relating to option 2; you could try gradations of colour or line width too. Option 3 would involve a scatter of dots, preferably semi-transparent and/or jittered.

More exotic tweaks to this general idea are also out there — I included examples of a Bank of England fan chart and a funnel plot for comparing clusters of data (hospital mortality, in my example) in the book — and if something like that is the accepted, understood and expected approach in your line of work, then you should go with it. I had to choose what to include to keep the book from getting too long and expensive, and some fun approaches to uncertainty got, unfortunately, spiked, such as visually weighted regression.

Why hurricanes? Lots of interesting dataviz work has been done on them (at least, American hurricanes, because that’s where the dataviz muscle is) in recent years by journalists. Most recently, Alberto Cairo has led an effort to improve them. He says that option 1 from above is poorly understood and introduces a false dichotomy: if you live inside the cone, you’re gonna get whacked, and if you live outside, you’re totally safe. Also, people mistake the size of the cone for the size of the hurricane itself. Option 2 helps a bit, but not totally. Option 3 is good but in some settings (such as weather forecasting), not all the lines carry the same weight (some forecast models are known to be more reliable and sophisticated than others) — how do you show that?

When you are choosing how to visualise uncertainty, there are some important considerations. Here are some that come to mind:

- What is the statistical literacy of your audience? If it’s a mix, you probably need more than one image. Provide something they know how to use, rather than something you’re convinced they’ll love once they’ve learned how to use it (more Bill Gates than Steve Jobs).
- What summary statistic is of interest, if you are doing inferential statistics? Not the statistic you can easily get, or the one with a handy formula for standard errors, but the one your audience needs for decision making.
- If you are going to show contours, error bars, or some other depiction of a given level of (un)certainty at X%, find out what X is meaningful to your audience. For example, if it is a business decision that depends on your information, then ask what level of uncertainty (risk of being wrong) would change the decision, then draw that level.
- Is sampling error (having a sample, not the whole population) the only source of uncertainty? If not, if your estimates are also affected by things like missing data, confounding / endogeneity, or response bias, then consider a Bayesian approach, where you can incorporate all those sources of uncertainty into one posterior probability surface.
- Is it enough to see the uncertainty around each estimate / statistic / prediction on its own, or does your audience need to see how they interact? Sometimes, over-estimating A implies under-estimating B, and in these cases, you need to think about not just the variance (spread) of the uncertainty of A and B individually, but also the covariance between them.

- Is the uncertainty likely to be asymmetric? Imagine you are estimating a small percentage. You shouldn’t use a shortcut formula that will return an interval extending into negative values: you will get laughed at. In cases like these, you can sometimes transform the data / stats before calculation, to induce asymmetry, or you could swap over to the bootstrap.
- What if you are worried that some outlier or clustering in the data is going to spoil the shortcut formula? You can switch to using a formula robust to outliers, like the Huber-White sandwich estimator, or robust to clusters, like the Huber-Rogers clustered sandwich estimator. (Feeling hungry?)
- Consider whether interactivity or animation could help your audience to understand how uncertainty could have come about, and how it could be affecting your results.
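The point about covariance above deserves a sketch: if you bootstrap two statistics from the *same* resamples, their joint uncertainty comes for free. Here is a minimal Python illustration with invented data (the author's own examples use R; the statistics and numbers here are mine, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=200)  # stand-in for a skewed sample

# Compute both statistics from the SAME resamples, so the covariance
# of their errors is preserved, not just each one's variance.
boot_mean = np.empty(2000)
boot_sd = np.empty(2000)
for b in range(2000):
    resample = rng.choice(data, size=data.size, replace=True)
    boot_mean[b] = resample.mean()
    boot_sd[b] = resample.std(ddof=1)

# 2x2 variance-covariance matrix of the two bootstrap estimates
cov = np.cov(boot_mean, boot_sd)
print(cov[0, 1])  # off-diagonal: how errors in A and B move together
```

For skewed data like these, over-estimating the mean tends to go with over-estimating the spread, and the off-diagonal term shows it.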

It’s great to be able to tailor to the audience by giving them a range of images, assumptions, models, etc, to mull over. Moritz Stefaner argued for *worlds, not stories*, and he is right to do so. I want to sound a note of caution, though, about an old statistical concept: multiplicity. You might know it as researcher degrees of freedom or the garden of forking paths.

By letting the audience explore and draw their own conclusions, you co-opt them as fellow researchers, and they become capable of the sins of researchers, especially as they have had no training in how to go about their investigations. If they keep looking long enough, and making enough comparisons, they will find something that looks like it is outside the bounds of uncertainty: two error bars that don’t overlap, or a hospital outside the funnel, or a brief period where a time series departs from the predicted fan of uncertainty. These aberrations could give an important insight, or they could just be noise; the more you look, the more likely you are to spot a pattern, but there’s no guarantee that it is “true”. In fact, the risk of error generally goes up as you keep looking.

So, the X% you set for your dataviz becomes corrupted, depending on how hard the reader looks at it. There is no very clear answer on how to tackle this; you just have to help the reader learn how to read your dataviz. And that brings me to the final point, the exemplarium. This is where you lead the reader into the visual package that you created by talking them through a single example. It happens in US Gun Deaths, and in many of the images in *London: the Information Capital*. This is how you give the reader a key or legend, plus advice on how not to get carried away, plus an explanation of what uncertainty means in this context. I think it’s the only way to provide an inroad when dataviz gets complex, without swamping the reader.

One of the most common concerns that I hear from dataviz people is that they need to visualise not just a best estimate about the behaviour of their data, but also the uncertainty around that estimate. Sometimes, the estimate is a statistic, like the risk of side effects of a particular drug, or the percentage of voters intending to back a given candidate. Sometimes, it is a prediction of future data, and sometimes it is a more esoteric parameter in a statistical model. The objective is always the same: if they just show a best estimate, some readers may conclude that it is known with 100% certainty, and generally that’s not the case.

I want to describe a very simple and flexible technique for quantifying uncertainty called the bootstrap. This tries to tackle the problem that your data are often just a sample from a bigger population, and so that sample could yield an under- or over-estimate just by chance. We can’t tell how far the sample’s estimate is from the true value, because we don’t know the true value, but (and I found this incredible when I first learnt it) statistical theory allows us to work out how likely we are to be off by a certain distance. That lets us put bounds on the uncertainty.

Now, it is worth saying here, before we go on, that this is not the only type of uncertainty you might come across. The poll of voters is uncertain because you didn’t ask every voter, just a sample, and we can quantify that as I’m describing here, but it’s also likely to be uncertain because the voters who agreed to answer your questions are not like the ones who did not agree. That latter source of uncertainty calls for other methods.

The underlying task is to work out what the estimates would look like if you had obtained a different sample from the same population. Sometimes, there are mathematical shortcut formulas that give you this — the familiar standard error, for example — immediately, by just plugging the right stats into a formula. But, there are some difficulties. For one, the datavizzer needs to know about these formulas, which one applies to their purposes, and to be confident in obtaining them from some analytical software or programming them. The second problem is that these formulas are sometimes approximations, which might be fine or might be off, and it takes experience and skill to know the difference. The third is that there are several useful stats, like the median, for which no decent shortcut formula exists, only rough approximations. The fourth problem is that shortcut formulas (I take this term from the Locks) mask the thought process and logic behind quantifying uncertainty, while the bootstrap opens it up to examination and critical thought.

The American Statistical Association’s GAISE guidelines for teaching stats now recommend starting with the bootstrap and related methods before you bring in shortcut formulas. So, if you didn’t study stats, yet want to visualise uncertainty from sampling, read on.

If you do dataviz, and you come from a non-statistical background, you will probably find bootstrapping useful. Here it is in a nutshell. If we had lots of samples (of the same size, picked the same way) from the same population, then it would be simple. We could get an estimate from each of the samples and look at how variable those estimates are. Of course, that would also be pointless because we could just put all the samples together to make a megasample. Real life isn’t like that. The next best thing to having another sample from the same population is having a pseudo-sample by picking from our existing data. Say you have 100 observations in your sample. Pick one at random, record it, and put it back — repeat one hundred times. Some observations will get picked more than once, some not at all. You will have a new sample that behaves like it came from the whole population.

Sounds too easy to be true, huh? Most people think that when they first hear about it. Yet its mathematical behaviour was established back in 1979 by Brad Efron.

Now, work out the estimate of interest from that pseudo-sample, and do this a lot; as the computer’s doing it for you, no sweat, you can generate 1000 pseudo-samples and their estimates of interest. Look at the distribution of those bootstrap estimates. The average of them should be similar to your original estimate, but you can shift them up or down to match (a *bias-corrected* bootstrap). How far away from the original do they stretch? Suppose you pick the central 95% of the bootstrap estimates; that gives you a 95% bootstrap confidence interval. You can draw that as an error bar, or an ellipse, or a shaded region around a line. Or, you could draw the bootstrap estimates themselves, all 1000 of them, and just make them very faint and semi-transparent. There are other, more experimental approaches too.
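The whole pipeline fits in a few lines. Here is a hedged Python sketch with made-up data, using the median, a statistic for which no decent shortcut formula exists:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=170, scale=8, size=100)  # pretend: measured heights

# 1000 pseudo-samples, each yielding one bootstrap estimate of the median
boot_medians = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(1000)
])

# The central 95% of the bootstrap estimates = 95% percentile interval
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(round(lo, 1), round(hi, 1))
```

Draw `lo` and `hi` as the ends of an error bar, or scatter all 1000 `boot_medians` faintly, as described above.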

You can apply the bootstrap to a lot of different statistics and a lot of different data, but use some common sense. If you are interested in the maximum value in a population, then your sample is always going to be a poor estimate. Bootstrapping will not help; it will just reproduce the highest few values in your sample. If your data are very unrepresentative of the population for some reason, bootstrapping won’t help. If you only have a handful of observations, bootstrapping isn’t going to fill in more details than you already have. But, in that way, it can be more honest than the shortcut formulas.

If you want to read more about bootstrapping, you’ll need some algebra at the ready. There are two key books, one by bootstrap-meister Brad Efron with Stanford colleague Rob Tibshirani, and the other by Davison and Hinkley. They are about equally accessible. I own a copy of Davison and Hinkley, for what it’s worth.

You could do bootstrapping in pretty much any software you like, as long as you know how to pick one observation out of your data at random. You could do it in a spreadsheet, though you should be aware of the heightened risk of programming errors. I wrote a simple R function for bootstraps a while back, for my students when I was teaching intro stats at St George’s Medical School & Kingston Uni. If you use R, check that out.


Firstly, the **Bayesian Taster** webinar. This lasts for one hour and costs only ten of those British pounds (currently, 13 USD or 11 EUR). There’s no maths and no coding; this is for complete beginners and is all about common sense. If you get the fundamental concepts right, you won’t get tripped up as it gets more complex later on. We will think about defining analytical problems in probability terms, what probability can be used for, and how practically to go about getting answers (fitting probability models to your data). We’ll look at a range of real-life problems with data and models that are too hard with old-fashioned stats / machine learning, but readily solved with Bayes. I’ll describe the spectrum of available software. This happens on 12 October, around lunch time for Africa and Europe, then later around lunch time for eastern Americas, or breakfast time for western Americas. You can book here (Afro-Euro edition) or here (Americas edition).

Secondly, a half-day online workshop called *Packages for Bayesian Analysis in R*, which is ideal for anyone with some R familiarity, who knows in essence what Bayesian analysis is about, but wants to find out about the options for actually doing it. We will look at a range of packages, from those that are easier to learn but restrict you to a collection of preset models, through to probabilistic programming that allows full flexibility. There will be plenty of mini exercises for you to try out on your own computers to get a feel for it as we go along. This will happen on 26 October, 1300–1700 UK time. You can book for this workshop here.

When I decided to start my own business doing training in Bayesian methods, I read a lot of other people’s introductions to the subject. I wanted to see how others approached the subject, and I wanted to steal the best ideas. I looked at books, videos, websites, blogs… and I’m still going, because they keep coming out and some are buried away in obscure places. Although there are some absolutely outstanding exemplars, the very beginning never quite satisfies me. I mean, the way that the idea of Bayesian statistics is introduced to the reader / listener / whatever.

In this post, I’ll set out what I like and don’t like about introductions to Bayes, and I’ll explain how I do it as I go along.

First, I have to be clear about my intended audience; teaching a room of doctors would be different to a room of maths grad students. Not necessarily better or worse, easier or harder, just different. I aim at people who think about problems quantitatively, but not mathematicians or statisticians. I want to help everyone else who falls between the cracks. They might be healthcare professionals, marketing analysts or machine learning folk who want to get stronger at stats.

So, for starters, my introduction is not very mathematical. It’s not that I don’t appreciate the importance of mathematical ability if you want to be a theoretical statistician, it’s just that my audience don’t intend to become statisticians. Plenty of visual aids help here, and I think that the flipchart or whiteboard is a much more useful tool than slides, because it is interactive and allows students to come up and try out things (for instance, after a small group activity). I like to prepare several pages on the flipchart ahead of time so we can just skip through from one concept to another while it’s fresh in their minds without them being distracted, thinking stuff like, “I wonder if that pen is going to hold out… the ink is looking thin.” An example of this is showing a regression line in variable space (X on the horizontal axis and Y on the vertical), then flipping to parameter space (beta1 on the horizontal and beta0 on the vertical) to show it as a point.

Later, when we get into software, there is huge value in demonstrating how to code something up and look at the results via a projector. A Jupyter notebook is a great way of doing this because you can quickly go back and tweak something and see its effect, although I feel uneasy about getting my learners to spend time gaining familiarity with a tool that few of them will use in earnest. It’s not considered cool, but I still think WinBUGS is a neat way of walking through reading in the data, checking the model, running a few hundred warmup iterations, then going for it. Of course, this is another tool that learners probably won’t use in years to come, but there’s no reason why you can’t do that same sequence of steps in R+Stan, for example.

Bayes theorem at the outset. Why do people do this? It shows us two things, the historical connection (who cares?) and the principle of reversing conditional probability by multiplying likelihood and prior. But that’s not what we do in practice, we simulate, so why not show students that? They can also work out quickly that the formula you would really use is not that simple; there’s also a normalising constant in the denominator that has to be integrated. It also confuses them that this theorem can be used for reversing-probability reasons that are not “Bayesian”, like the classic examples with medical test results. I prefer to introduce the value and flexibility of multiplying conditional probabilities together with a practical example and an explanation more like a particle filter (although I wouldn’t use that term), because that’s closer to the simulation that follows.

Philosophical distinctions about the meaning of probability or randomness (but see below). This is important for clever students but even then will only interest them once they are getting comfortable with the analysis. We should make our learners good at applying the methods first, then they can reflect on theory and finally history.

History. Nobody gives a monkey’s whether Jaynes and Keynes were the inspiration for a Gilbert and Sullivan operetta, or Peirce abducted Neyman’s cat.

Contrived examples. Oh, I like old, well-worn datasets, like irises or the Titanic (more on that another time), but tossing a coin ten times? Come on. (As a self-defensive footnote, I have used coin-tossing with success when introducing the concept of hypothesis testing, but that’s a very different goal and hence a different metaphor for the mental processes and quantification at work. I got that coin-tossing exercise from Beth Chance and Nathan Tintle at ICOTS9.)

A lot of maths: theorem-proof-lemma format for example, or matrix algebra when the learners would get the idea faster from talking about what happens to one observation, one parameter, one iteration at a time. Mathematicians have a habit of setting out the exposition at the beginning in the most general possible terms. But you can’t fully grasp it until later, when the fine details have sunk in. I think it’s better to have carefully chosen examples that illustrate one principle at a time, then gradually accumulate them. Debugging where someone lost the thread is much simpler. And we don’t need to see the proof unless we are studying how to prove similar things in future. You, the teacher, need to do the maths, but keep it out of sight.

Analytical solutions and conjugacy. Yawn. It’s not the 50s. Don’t waste your learners’ time.

1-d density plots of prior, likelihood and posterior that hardly overlap. You all know the sort. They are mathematically correct but unrealistic. That prior is a BAD prior, and your students can see that. The likelihood is out in a low-prior region, and that should give you pause for thought in real life. Don’t drag your students out of real-life critical thinking and into some abstract ritual!

Simulation at the outset, shortcut formulas later (a la GAISE). Note that calling asymptotics “shortcut formulas” gives students the right attitude to conceptualising their analyses in a grounded and critical way; it’s not intended to disparage the value of solid statistical theory.

Prior and posterior predictive checking, where you use your model before it sees any data, and after it has “learnt” from the data, to generate new phony data. Take a look at the phony data and see if they look anything like the real ones. Where the prior does not include the data, you’ve got problems. Likewise, when the posterior doesn’t look like the data in some way. These are intuitive ways of doing an open-ended check on your model.
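As a concrete sketch of a prior predictive check, in Python rather than the author's preferred tools, with a made-up Poisson example:

```python
import numpy as np

rng = np.random.default_rng(7)
observed = rng.poisson(4.0, size=50)  # stand-in for real count data

# Prior predictive: draw a rate from the prior, then phony data from the model.
phony_means = np.array([
    rng.poisson(rng.uniform(0.5, 10), size=observed.size).mean()
    for _ in range(500)
])

# If the observed mean sits outside the spread of phony means,
# the prior (or the model) has problems.
inside = phony_means.min() <= observed.mean() <= phony_means.max()
print(inside)
```

The same loop after fitting, drawing from the posterior instead of the prior, gives the posterior predictive check.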

A focus on computation, even if it’s not specific to Bayes (floating point accuracy, digital rounding error, or setting RNG seeds are all good examples).

Approximate Bayesian Computation (ABC) as an inroad into thinking about: (1) simulation as a way of combining probability densities and/or likelihoods, (2) letting the computer try different values of the parameter and seeing how they match the data, and (3) the need for a sensible prior to guide the computer away from no-hope regions, as well as problems like no overlap in logistic regression. However, this will work well for people who have already thought about random number generators, less well for those who haven’t. Because we are going to simulate, we need to introduce RNGs and distributions anyway, and throwing ABC on top of that might just be too much.
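A toy ABC rejection sampler makes the idea concrete. This Python sketch uses invented data and a crude closeness rule based only on the mean (real ABC would use better summary statistics):

```python
import numpy as np

rng = np.random.default_rng(3)
observed = rng.poisson(4.0, size=40)  # pretend this is what we measured
obs_mean = observed.mean()

# Propose a parameter from the prior, simulate phony data, and keep the
# proposal only when the phony data look like the real data.
kept = []
for _ in range(20000):
    mu = rng.uniform(0.5, 10)                  # prior for the Poisson rate
    phony = rng.poisson(mu, size=observed.size)
    if abs(phony.mean() - obs_mean) < 0.2:     # crude closeness tolerance
        kept.append(mu)

kept = np.array(kept)
# The kept values approximate the posterior for mu
print(len(kept), round(kept.mean(), 2))
```

Tighten the tolerance and you get a better approximation but fewer kept draws; that trade-off is the heart of ABC.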

x ~ norm(10,3) notation. Rasmus Bååth and I both like this for introductory teaching, and neither of us knows what to call it. Let’s go with “tilde probability notation”. It’s much easier to write once you know a few common distributions, and you can pile them up thus:

```
mu ~ unif(0.5, 10)
y[i] ~ poisson(mu)
```

That’s a little univariate Poisson model. But it easily extends into models that are quite painful to read in algebra.

```
mu ~ unif(0.5, 10)
sigma ~ unif(0.1, 5)
y[i] ~ poisson(mu)
x_mean[i] = ((y[i] > 5) * 5) + ((y[i] <= 5) * y[i])
x_measure_error[i] = ((y[i] > 5) * 0.001) + ((y[i] <= 5) * sigma)
x[i] ~ round(norm(x_mean[i], x_measure_error[i]))
```

That’s a model for Poisson-distributed data that are censored and heaped at 5 and also have some measurement error below that point. Pretty advanced, pretty quickly. Also, once you are familiar with this notation, you can use it in Stan or BUGS or JAGS.
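Read literally, those tilde lines are a recipe for simulation. Here is a Python sketch of the same data-generating process (my translation, not any particular package's syntax):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1000

mu = rng.uniform(0.5, 10)     # mu ~ unif(0.5, 10)
sigma = rng.uniform(0.1, 5)   # sigma ~ unif(0.1, 5)
y = rng.poisson(mu, size=n)   # y[i] ~ poisson(mu)

x_mean = np.where(y > 5, 5, y)           # heaped / censored at 5
x_err = np.where(y > 5, 0.001, sigma)    # measurement error below 5
x = np.round(rng.normal(x_mean, x_err))  # x[i] ~ round(norm(...))

print(x[:10])
```

Running the model "forwards" like this, before trying to fit it, is a good way to check you understand what it claims about the data.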

Emphasising the communication advantage, for example, “our analysis shows that there’s an 81% chance that return on investment will be over $1m within 5 years” (check out Frank Harrell’s blog for some medical examples). Who doesn’t love that? Perverts, that’s who. Or ultra-frequentists, though I’m not sure they exist any more. So that just leaves perverts.
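Once you have posterior draws, statements like that take one line. A hedged sketch with invented numbers (these are not from any real analysis):

```python
import numpy as np

rng = np.random.default_rng(5)
# Pretend these are posterior draws of 5-year return on investment, in dollars.
roi_draws = rng.normal(1.4e6, 0.45e6, size=4000)

# The probability statement is just the share of draws over the threshold.
p_over_1m = (roi_draws > 1e6).mean()
print(f"Estimated {p_over_1m:.0%} chance ROI exceeds $1m within 5 years")
```

No p-values, no "fail to reject": just the probability the audience actually asked for.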

Emphasising flexibility — we can get beyond simple models quickly and without a lot of jiggery-pokery, unlike frequentists, who have to juggle REML and E-M and profile likelihoods and goodness knows what all.

Grounding everything we demonstrate in real-life research needs, like the communication and flexibility above.

Showing data space and parameter space and flipping between them.

Getting quickly into models that are complex enough to be of real-life value. Students know when they are being shown some dumbed-down stuff that they could never use in vocational settings. This is a challenge of course, but you’re not being paid to dodge challenges.

Emphasising ways of thinking about data, models, truth, etc (a la McElreath). It’s extremely important, to be sure, but I don’t know if it helps to hear it early on. I am too immersed in the subject to be able to judge.

Bayes as the only true approach to probability for adherents of religions who contend that everything is predetermined or decided by supernatural force (this would include most Muslims and Calvinists). In essence, if mortal humans know nothing for certain, including the immutability of parameters or the infinite replicability of your experiments, then it follows that data, latent variables, parameters and hyperparameters all move in ways you cannot fully understand, and so are subject to the same mathematics of probability. Is this a helpful assertion? Or not. I tend to think it best not to get involved in such matters, especially as I don’t believe it.

Emphasising networks early on, like David Barber’s (otherwise great) book does. I suppose some people work with those models and that decision-theoretic application, and need it. I just don’t know how it fits into everyone else’s learning curve.

Now that I am free to design my own educational products in Bayes, I find that actually, it always has to be tailored to the audience, unless it’s a quick overview. So, I set myself up to provide face-to-face training and coaching. I might have the odd quick overview as an online course, but the really in-depth stuff has to involve discussion, reflection and interaction, not just with me but with the other learners too. Of course, that means it’s not fair on people in far-flung places, but I can’t reach everyone.

I do training (a group of learners with clear learning outcomes at the outset) and coaching (one person and me, talking about their career and goals, where I mostly ask questions and there are no outcomes at the outset).

I think any training session should avoid getting bogged down in long sessions of chalk ‘n’ talk, but inevitably there has to be some of that. So, I keep them to 30 minutes long if I can, and alternate them with some small group activity. You could have several small group activities over the course of the day. I think it’s a good idea to put the groups together so that they are diverse, and that requires finding out a little about learners’ work experience, qualifications, etc before we begin. That mimics the diverse composition of data science teams in this day and age, and I tell people that from the outset so that they know they should respect and listen to one another in the group to get the most out of the learning experience.

What about this maths avoidance? It strikes me as odd that there are many introductory textbooks and courses for statistics that play down the maths, on the basis that the learners are going to operate a computer and construct a model in code, not by matrix algebra and calculus. Yet, this doesn’t happen for Bayesian statistics. This is perhaps down to two historical factors. Firstly, Bayes was an advanced subject so only people who already had degrees in statistics or mathematics would encounter it. Secondly, Bayesians spent decades being mocked and sidelined, and responded by foregrounding their mathematical rigour in the hope of beating their critics. But nowadays, we want all sorts of people who analyse data to think about using Bayes from the beginning of their careers, so we should offer them the same option. If you want the maths, there are plenty of options for you, but I would like to offer something a little different, a little more inclusive.


In the chapter *Many Variables*, I look at the problem of visualising data like this:

Each row is a student who has answered a questionnaire on their satisfaction with teaching at their university, and each column is one of the questions they were asked. Often in data visualisation, we use a familiar two-dimensional format where one variable is represented by horizontal position and another by vertical position. This is easy to read but to include more variables, you have to use some tricks.

I’ll explore this with a simple example where we have three variables (which we can almost visualise in three dimensions as a cloud of dots, each observation getting a left-right location, a front-back location, and an up-down location from those three variables), and want to show it in two dimensions so that it can be printed on a page or shown on a screen.

First, imagine this cloud of dots. It is going to look like a murmuration of starlings — one of those huge, swirling, self-organising flocks, containing so many birds that they just appear to be minuscule dots — and when the photographer points the camera at it from a certain direction and clicks, the light arriving from the starlings lands on a two-dimensional surface inside the camera and is captured.

From that direction, you get an idea of where the birds are to the photographer’s left, right, up and down orientations, but no idea of whether they are closer to, or further away from, the lens. You obtain a two-dimensional image but at some cost. You have projected the light from the starlings onto the camera at a certain angle, in straight lines, and this idea of a projection is one we need to grapple with.

In the book, I show this image, where I created a cloud of 1000 points, shaped like the planet Saturn. I used R and you can access the code at my webpage for the book. Each point has three coordinates: front-back, left-right and up-down. But we need to represent it on a page or screen, in just two dimensions.

There are many instances like this with even higher-dimensional data. Every variable that gives you values for observations in your data can be thought of as a dimension.

Here are the same projections, but using colour to identify North (orange) and South (blue) in the left image and longitude (rainbow colours, sorry for offending any dataviz cognoscenti, but this is intended to show how the analysis works, rather than give real insights into the shape of Saturn) in the right image.

This is the PCA projection, so it is just looking straight down from above the North pole. It appears oval just because of the aspect ratio when I created the graphs (I didn’t want them to take too much vertical space on the screen). Longitudes are separated but latitudes are mixed up. Points at the North pole are mingled with those from the South pole, despite being at opposite ends of the “planet”. PCA is doing its best to project into 2 dimensions, and with a sphere, there is no reason to pick one projection over another. However, the rings add extra data which push PCA towards looking directly down.

PCA chooses the direction along which you point the proverbial camera in order to capture as much variance as possible in the resulting image, and sometimes that’s what you want, but sometimes you need to think about it and overrule PCA to show something more meaningful. Notably, it might be that two dimensions just don’t capture enough of the variance, and more than one image is called for. I’ll leave the details of how you make that decision, and what to do for multiple images, aside for now. It’s best at that point to involve an expert in multivariate statistics anyway, rather than trying to wing it.

PCA is like the photograph I described above. Every point is projected by a straight line to the camera, and those lines are all straight and parallel. Mathematically, we call this *isotropic*: every region of the space is treated in the same way. But sometimes we can understand patterns in the data better by warping the lines of projection, and that’s called *anisotropic*. Some regions might be squashed together and others stretched apart. Support vector machines and Procrustes analysis of shapes use anisotropic projections too, for different purposes in the world of statistics.

Now, let’s look at the t-SNE projections.

t-SNE is an iterative procedure; it tries various warpings, and keeps moving towards a better separation of points that are distant in the full 3-dimensional space (or higher, if your data have more than three variables). There’s no shortcut calculation that can take it straight to the optimal warp, and in fact there’s no guarantee that its iterations will arrive at the optimum in the end. But in dataviz, there generally is no optimum anyway; we have to compromise and present our message clearly without misleading the reader.

One parameter that controls its iterations is called *perplexity*, and essentially measures distances by referring each point to a certain number of neighbours. Increase perplexity and you force it to try to be fair in representing distances across a wider region. That can sometimes reveal insights about the data structure in high-dimensional space, and sometimes a low perplexity is better. Above, I used the default perplexity of 30 (30 neighbouring points out of the total 1000).
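The perplexity experiment can be sketched with scikit-learn’s `TSNE` (again, my original code is in R; point counts here are reduced, and the perplexity values trimmed, just to keep the sketch quick to run — in practice you would plot each embedding and compare them, as in the images that follow):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)

# A smaller Saturn: 150 sphere points plus 50 ring points
v = rng.normal(size=(150, 3))
sphere = v / np.linalg.norm(v, axis=1, keepdims=True)
theta = rng.uniform(0, 2 * np.pi, 50)
ring = np.column_stack([1.75 * np.cos(theta), 1.75 * np.sin(theta),
                        np.zeros(50)])
saturn = np.vstack([sphere, ring])

# Try several perplexities; each run warps the cloud differently,
# so the choice is a judgement call, not an optimisation
embeddings = {}
for perplexity in (5, 30, 60):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(saturn)

print(embeddings[30].shape)   # (200, 2)
```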

As you can see, the warping has kept the idea that the rings are separate from the planet, and twisted the planet so that north and south poles are separated. In doing so, it has broken the rings apart. Some distances in the rings are now mis-represented compared to PCA, but others in the planet are improved. Because there are more points in the planet than in the rings, the planet won and the rings got distorted.

Unfortunately, the longitude is not well represented. Individual colours appear in two or three distinct patches; you can see this for the pink-orange zone or the teal zone. Perhaps this is because the perplexity needs to be increased. That would allow the neighbourhood of each point to stretch out further, and it might keep the colours together. Let’s try 60:

That seems a little better, in that there’s a pink stripe running through the projected points, but it’s not great. Let’s go up to 90:

It’s hard to say whether this has helped. It’s at least not degraded the North-South separation on the left, so let’s try 120:

I think this is a little better. Now 150:

Here, the colour (longitude) separation is not very different to perplexity 120. As we increased it, the image turned round and the colours flipped from side to side, but that doesn’t matter. One thing about this final image that makes me uneasy is that the rings are increasingly getting pulled into the mass of points that make up the planet, and that feels like a distortion. I would probably stick with 90 or 120 for that reason.

The same problems we had to think about (but could quite easily overcome) with Saturn return with reinforcements as the number of variables / dimensions increases. Soon, the compromises you have to make become so severe that a single image is just not an option.

**The bottom line**

If you have data with many variables, and you want to show how the observations cluster together, which points are similar to one another, and so on, then don’t give up. There are many dimension-reduction techniques you can use. In the book, I also describe correspondence analysis, which does this for categorical variables. They are not hard to achieve with a little bit of code, as you’ll see on the webpage, and you can get correspondence analysis or principal components analysis through drop-down menus if you prefer, in at least Stata and SPSS that I know of. For t-SNE, you need R or Python or Julia.

Try different approaches. Try different parameters. Keep in mind what you want to show, and highlight particular data points (for your own contemplation) so that you can understand what’s going on and make an informed choice of final image. Be prepared to explain the procedure to your audience. Keep it simple so they stay engaged; talk about things like taking photographs of flocks of birds rather than appealing to matrix algebra, and don’t wave it off as a mysterious and magical process. Always put yourself in your audience’s shoes, and if you can, user-test your visualisation and explanation before launching it.

I saw a few recurring problems.

- Their employing organisation refused any transmission; I had to turn up in person with a memory stick. This is OK, but often their local IT department had blocked encrypted USB drives, so it had to be carried across the country in my pocket on an unencrypted drive. This is very risky and you shouldn’t do it!
- They were forbidden from using email and had to use something like Google Drive instead. Well, you just gave those medical records to Google, who are under no obligation to delete them when you click delete on your remote copy. They also share stuff with intelligence agencies, perhaps for good reasons, but you are responsible for the protection of sensitive data and you can’t bury your head in the sand about this sharing.
- They added a password to an Excel spreadsheet and sent me that. Remember that they are often using old versions of Office, so the encryption is very poor and easily cracked. Then they’d send the password in another email. I don’t know where folk wisdom like that comes from. Emails reside indefinitely on a whole chain of servers from thee to me, which might criss-cross the world in doing so, getting tapped along the way. So Johnny Hacker can just pick up both emails and type in your password.
- They asked my advice, and I sent them some links for OpenSSL, but that can be confusing for people who aren’t total computer nerds, hard to install on Windows networked machines, and so on.

To add to that, there is some doubt as to whether or not an encryption algorithm like AES might already have been cracked.

To be clear, if you use commercial encryption software, you are probably discharging your duties and won’t get in trouble. One day, that encryption will be broken, so it’s a question of whether you feel that you’ve done what’s required of you and the future is not your problem (go commercial) or that you have to take personal responsibility for posterity too (use xormydata).

The point of xormydata is to make it easy to send and receive data files, securely, without any of that silly stuff like sending passwords in a separate email. It doesn’t rely on a sophisticated cipher algorithm; it takes your data file at the binary level:

1001 0110

and combines it with a “code file”, which acts like a password:

1100 1011

using exclusive or (XOR). This is a logical operation, like OR and AND. It works like this: if the bit in the data file and the corresponding bit in the code file are the same, the result is 0; if they are different, it is 1. With the bytes above, you would get:

0101 1101
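You can check the worked example yourself; in Python, `^` is the bitwise XOR operator:

```python
data = 0b10010110   # the data-file byte above
code = 0b11001011   # the code-file byte above
out = data ^ code

print(f"{out:08b}")   # 01011101

# XOR is self-inverse: applying the same code byte again recovers the data
assert out ^ code == data
```

That self-inverse property is what makes the same operation both encrypt and decrypt.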

You can read more musing about the process, and how you should use it, on the Github page. *You should also read the warning there about how it can get you in serious trouble.*

You download xormydata.cpp from Github, or clone the repository.

That is a C++ source file. You need to compile it so it can run on your computer. Typically, your computer might have a compiler such as “clang” or “g++”. If you have Linux or Mac, you can just go straight to the terminal, cd to the folder where you saved xormydata.cpp, and type:

g++ xormydata.cpp -o xormydata

This should produce a new, executable file in the same folder, called xormydata. If you are using Windows, you probably need to install a C++ compiler first, and if you are networked and controlled by a central admin, you’ll probably need their help to get permissions to do this. They will be suspicious. One compromise might be to use an old, unwanted laptop for this encryption and decryption, though obviously that’s a bit of a pain.

Now you are ready to go. You need a collection of code files (see the Github page), and your recipient needs xormydata and their own code files. Crucially, there is no need for you and your recipient to communicate about the code files you are using (like sending passwords).

Alice has a data file (patient_HIV_status.xls), which she wants to send to researcher Bob. They both install xormydata and away they go! Alice is going to use a music mp3 file (Schools_Out.mp3) as the code file, so she types

./xormydata patient_HIV_status.xls Schools_Out.mp3 data_for_Bob.xor 118309

The order of this command is

- the command itself, ./xormydata in Linux/Mac and xormydata.exe in Windows
- the input data file
- the code file
- the name of the desired output file
- optionally, a number indicating where (in bytes) to start using the code file’s 1s and 0s. I strongly recommend you include this because there can be predictable sections of metadata at the start of certain file types.
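The core of what such a command does can be sketched in a few lines of Python (this is an illustrative re-implementation, not the actual C++ tool; the function name and the stand-in pad are invented for the sketch):

```python
def xor_file_bytes(data: bytes, code: bytes, start: int = 0) -> bytes:
    """XOR each data byte with the code file, beginning at byte `start`.

    The code file must have at least start + len(data) bytes;
    a one-time pad must never repeat or run out.
    """
    if start + len(data) > len(code):
        raise ValueError("code file too short for this data file")
    return bytes(d ^ code[start + i] for i, d in enumerate(data))

# Round trip: because XOR is its own inverse, the same call decrypts
secret = b"patient 001: status positive"
pad = bytes(range(256)) * 600        # stand-in for Schools_Out.mp3
cipher = xor_file_bytes(secret, pad, start=118309)
assert xor_file_bytes(cipher, pad, start=118309) == secret
```

Note how the start byte simply offsets where in the code file the 1s and 0s are drawn from, which is why skipping any predictable metadata at the start of the code file matters.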

Now, she sends Bob “data_for_Bob.xor”.

Bob is using video files as codes. He types

./xormydata data_for_Bob.xor Go_Pro_commuting.avi data_back_to_Alice.xor 7199003

The file is now double-encrypted, with both Alice’s code file and Bob’s code file applied. Alice removes her code thus:

./xormydata data_back_to_Alice.xor Schools_Out.mp3 final_data_to_Bob.xor 118309

Now, it just has Bob’s code applied, and she sends it back to him. He types:

./xormydata final_data_to_Bob.xor Go_Pro_commuting.avi patient_HIV_status.xls 7199003

and the original file is revealed. This is a *triple-pass* system, which is simple (at the cost of sending stuff three times) and requires no handing over of passwords and such, but it is not perfect. Charles can intercept Alice’s emails and pretend to be Bob (a man-in-the-middle attack), or Charles can just go snooping on their email servers afterwards; by XOR’ing the three encrypted files together the right way, even without knowing the code files, he can get the original data back. So, **if you are worried about people intercepting your stuff and trying hard to break into it, then you probably shouldn’t be using xormydata**. I suggest you don’t just use vanilla email to send your xor’d files, but maybe an end-to-end encrypted service like ProtonMail. That will ensure that your data-transmitting messages are indistinguishable from the ones where you discuss where to go for your colleague’s leaving party.

Also, this is intended as a *one-time pad*, which means you use that code file once only (or, at least, that code file at that start byte). You should keep track of the pairings of code files and data files so you can get them back later, and of course, don’t store that list somewhere where people can get at it. Does it need to be digital at all? Can you just write it in a notebook?

**How do I know I can trust you, Mr Grant?** You don’t; that’s life, kid. But you can read the source code, it’s only 120 lines.

**If this is so simple, and you’re, like, not even a professional programmer, how come nobody else is doing it already?** I don’t know. The crypto world generally went off triple-pass systems decades back, because of the risk of a man-in-the-middle attack. It’s not cool.

**This is still hard work … isn’t there something with just a one-click option?** Not if you want it to be secure into the future, and secure from even the big guys. There’s no free lunch.

**My organisation wants to use a trusted commercial package instead; what can I do?** Not a lot in my experience, though I suppose you could xormydata it and then put it through the commercial package.

**Isn’t this going to be used by bad guys too?** I hope not, but potentially, yes. The same way that you can use a hammer to build a hospital or whack someone over the head. This is technology; if we avoided risk of abuse we would not even have adopted the flint hand axe.

This developed out of my Master’s dissertation in the Medical Statistics course at the London School of Hygiene and Tropical Medicine. I was comparing different composite measures of hospital quality, and then I went on to explore ways of assessing and visualising the uncertainty in those measures.

**What are composite performance indicators?**

In the context of New Public Management, we have a bunch of hospitals (you can substitute schools, prisons, privatised railways or privatised deportation agencies or whatever), and politicians have set some very broad-brush goals for them (perhaps, that they should have low mortality and low re-admission rates, and that they should reduce any debt year-on-year). Some agency or Death Panel (the sort of thing I used to do for a living When We Were Very Young) expands this into some measurable indicators. They might have to prioritise things so that it isn’t too burdensome, and they end up with things like:

- % of patients with fungal toenail infections seen by a fungo-podiatrist within 24 hours of being diagnosed
- number of nurses per patient on the fungal toenail infection ward
- % of patients turning up a second time for their FTI, after you said you’d fixed it

(with apologies to anyone who suffers from fungal infections in the toenail, and feels I am making light of their plight; someone had to take the fall (why not you?))

Great, now we have three numbers but someone is sure to say that it doesn’t help patients choose a hospital and doesn’t help funders direct the money to the best performers. You might be tempted to make a composite indicator by some mathematical process. It can often be as crude as averaging them.

One more thing I’ll mention here is that, following Donabedian, it is typical to classify indicators as structure (like the 2nd one above, measuring the facilities), process (like the 1st one, measuring whether you do the right things), or outcome (like the 3rd, measuring how the patient is doing after your care).

__Sources of uncertainty__

**Sampling error**

The most obvious way in which your composite indicator can give you the wrong answer is because it is assessed on the basis of a sample of patients, and not all of them. This is sampling error, and we have a lot of statistical theory to tell us how big it might be. But there are other problems too.

**Order of averaging**

Reeves and colleagues wrote a paper in 2007 which hardly anyone has heard of — but they should have. They explored what happens when you have multiple indicators assessed on multiple patients, as is often the case. Do you summarise the indicators into one number for each patient, and then summarise the patients, or do it the other way round? It turns out that you can get quite different composite scores.
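A tiny illustration of the point (the numbers are made up, and the effect here comes from one patient having a missing indicator, which is a common reason the two routes diverge):

```python
# Two indicators, two patients; patient_B was never assessed on ind_2
scores = {
    "patient_A": {"ind_1": 1.0, "ind_2": 0.0},
    "patient_B": {"ind_1": 1.0},
}

# Route 1: summarise each patient first, then average over patients
per_patient = [sum(s.values()) / len(s) for s in scores.values()]
route1 = sum(per_patient) / len(per_patient)   # (0.5 + 1.0) / 2 = 0.75

# Route 2: summarise each indicator first, then average over indicators
ind1 = [s["ind_1"] for s in scores.values()]                  # [1.0, 1.0]
ind2 = [s["ind_2"] for s in scores.values() if "ind_2" in s]  # [0.0]
route2 = (sum(ind1) / len(ind1) + sum(ind2) / len(ind2)) / 2  # 0.5

print(route1, route2)   # 0.75 0.5
```

Same data, two defensible composites, and quite different answers.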

**Weighting and other calculations**

To combine your indicators, you have some formula that takes multiple numbers as input and produces one number as output. That formula might give more weight to one input than another. You could choose weights on the basis of clinical importance, or you could opt for a variance-maximising summary such as the first principal component. You might also introduce implicit weights by steps like dichotomising some of the inputs before averaging them.

That choice obviously affects the composite scores. The tricky thing is that you cannot avoid a judgement of relative weights. Even if you just average the inputs, you will still be giving more weight to some than others, specifically, those with higher inter-hospital variance will come to have a bigger impact on ranking. *There is no value-free composite*.
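A sketch of that implicit weighting, with figures invented for illustration: indicator A varies a lot between hospitals, B barely at all, and the composite ranking simply follows A despite the nominally equal weights:

```python
hospitals = ["H1", "H2", "H3", "H4"]
ind_A = [0.10, 0.40, 0.70, 0.95]   # high inter-hospital variance
ind_B = [0.52, 0.50, 0.51, 0.49]   # low inter-hospital variance

# A plain, "value-free" average of the two indicators
composite = [(a + b) / 2 for a, b in zip(ind_A, ind_B)]

rank_by_A = sorted(hospitals, key=lambda h: ind_A[hospitals.index(h)])
rank_by_comp = sorted(hospitals, key=lambda h: composite[hospitals.index(h)])

print(rank_by_comp == rank_by_A)   # True: A dominates the ranking
```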

__Poster__

So, I made a poster and it was shown at a visualisation conference at the Open University in Milton Keynes in 2011. And here it is below. I haven’t managed to do anything further on this subject since then. If you would like to take it on, feel free. Get in touch if you want to discuss it.

Bottom line: there are none (at least in 2013). I then looked at codes used by GPs (family doctors) in the UK for dementia and incontinence, which I had analysed with colleagues. I found some variation by GP but again there wasn’t enough appetite to take it further. I don’t have those data or the output of that analysis any more. That’s the price of confidentiality.

I also organised a session at the RSS conference in Sheffield on “Checking and cleaning in big data”. It was not very well attended as everyone had probably gone to hear about something trendier. But the people who were there appreciated the problem, and wanted to learn from experts, and that was pleasing. My invited speakers were Ton de Waal from Statistics Netherlands and Liz Ford from Brighton & Sussex Medical School. You should look them up if you are into this kind of thing. I had someone else from Reuters lined up to talk about automated processing of text streams in real-time but they moved jobs and were contractually gagged, alas.

Anyway, here’s the write-up on the preliminary review. You might find it stimulating. I think it’s an interesting and under-valued avenue for research. I would have liked to have developed some Bayesian model that incorporated the hierarchical structure of the data by the professional doing the coding, and then included latent variables for coding habits. These could have been developed from a preliminary study to hand-classify coding habits and maybe dimension-reduce them into a manageable number of factors.

Over to you now.

**Data linkage**

The goal of data linkage is to combine information from different databases into one. When there is not a unique identifying variable for each subject, special techniques have to be employed to find likely matches and to obtain unbiased results from any analysis that follows. Established data linkage methods, whether probabilistic or not, typically lead to the creation of a single linked dataset which is then analysed as if it were perfectly matched. This effectively ignores any uncertainty arising from the matching, and can introduce bias if the incorrect matches differ from the correct ones in terms of some of the variables used in the analysis. However, Bayesian approaches by McGlincy (“A Bayesian Record Linkage Methodology for Multiple Imputation of Missing Links”, 2004) and Goldstein, Harron and Wade (“The analysis of record-linked data using multiple imputation with data value priors”, 2012) have capitalised on the ease with which computational methods such as MCMC can perform analysis and editing / imputation in a single step. Both approaches allow data to be imputed from conditional distributions if no match is sufficiently probable. Goldstein, Harron and Wade used a multiple imputation approach to create several potential matched datasets in order to capture the uncertainty that arises from the matching process. In none of these papers is there any mention of the possibility of human coding necessitating a multilevel structure to the linkage probabilities and weights.
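The probabilistic side of linkage can be sketched in the classic Fellegi-Sunter style: each compared field contributes a log-odds weight depending on whether it agrees. The m and u probabilities below are invented for illustration:

```python
import math

# m = P(field agrees | records truly match)
# u = P(field agrees | records do not match)
fields = {"surname": (0.95, 0.01), "birth_year": (0.98, 0.05)}

def match_weight(agreement: dict) -> float:
    """Sum log2 likelihood-ratio weights over the compared fields."""
    w = 0.0
    for field, (m, u) in fields.items():
        if agreement[field]:
            w += math.log2(m / u)              # agreement adds evidence
        else:
            w += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return w

# Records agreeing on both fields score well above zero;
# pairs agreeing on neither score well below
print(match_weight({"surname": True, "birth_year": True}) > 0)   # True
```

A single linked dataset keeps only the top-scoring pairs; the Bayesian approaches above instead carry several plausible matches forward.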

**Automated edit/impute procedures**

Large surveys require a computerised approach to checking data for errors and correcting them where possible. A statistical approach can be traced back to Fellegi and Holt’s seminal paper (“A Systematic Approach to Automatic Edit and Imputation”, 1976). Census agencies, particularly in the USA and the Netherlands, have led the way in developing methods and software, but adoption among a broader statistics community has been rare. De Waal, Pannekoek and Scholtus (“Handbook of statistical data editing”, 2011) provide a comprehensive review of edit/impute methods. A number of common forms of human error are detailed but none of the methods incorporate the identity of the individual recording the data, perhaps because national surveys typically do not have more than one record per individual. There is however a passing reference (p. 28) to a certain type of error being made consistently throughout different variables.

The Fellegi-Holt paradigm aims to produce “internally consistent records, not the construction of a data set that possesses certain distributional properties” (de Waal, Pannekoek, Scholtus, p. 63).
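In the Fellegi-Holt spirit, an edit is a condition that a consistent record must not satisfy. A toy sketch (rule names, fields and thresholds are invented for illustration):

```python
# Each edit flags an impossible combination of values
edits = {
    "child_married": lambda r: r["age"] < 16 and r["marital"] == "married",
    "negative_age": lambda r: r["age"] < 0,
}

def failed_edits(record: dict) -> list:
    """Return the names of all edits this record violates."""
    return [name for name, rule in edits.items() if rule(record)]

record = {"age": 12, "marital": "married"}
print(failed_edits(record))   # ['child_married']
```

The hard part, which Fellegi and Holt formalised, is deciding which minimal set of fields to change so that a failing record passes every edit at once.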

De Waal, Pannekoek and Scholtus note that influential and unusual observations are still generally identified by computer and considered by experts, possibly by contacting the source.

**Coding bias**

Because of the prominence of coding systems in medical data (for example, ICD or Read codes), a search of the Medline database was conducted for the terms “coding bias” (13 retrieved, none relevant) and “interviewer bias” (40 retrieved, likewise).

These searches were augmented by searches for the same terms on Google Scholar and Google Web Search, and consideration of references in any partly relevant documents.

Jameson & Reed (“Payment by results and coding practice in the National Health Service”, 2007) and Joy, Velagala & Akhtar (“Coding: An audit of its accuracy and implications”, 2008) suggest that coding can lead to a considerable change in a healthcare provider’s income within the British NHS’s Payment By Results scheme. This has been emphasised as a system-wide problem by the Audit Commission and NHS Connecting For Health.

Systematic investigation of the bias arising from coding is much rarer. Lindenauer and colleagues (“Association of diagnostic coding with trends in hospitalizations and mortality of patients with pneumonia, 2003-2009.”, 2012) conducted a thorough analysis of coding trends over time for hospital patients with pneumonia and/or sepsis, and found that the use of pneumonia codes had declined between 2003 and 2009, while codes for sepsis secondary to pneumonia, and respiratory failure with pneumonia, had increased. While mortality rates (adjusted for age, sex and co-morbidities) in each category had dropped significantly over the same time period, taken together as a single category, the mortality rate had not significantly changed. The authors suggest that patients that would have been at high risk of dying with a pneumonia code in 2003 were increasingly given sepsis or respiratory failure codes (thus artificially improving mortality rates in the pneumonia group), where they became comparatively low-risk patients. Meanwhile, advances in treatment for sepsis had improved mortality in the other two groups’ higher-risk patients. Commenting on the medical website Medscape (http://www.medscape.com/viewarticle/765523), Shorr described the coding bias exposed by this study as “not comparing apples with apples and oranges with oranges [but]… mixing things up and making fruit salad”.

If you are a Stata beginner writing do-file code, and you want it to be more efficient, more reliable, and to take you less time, with no copying and pasting of almost-identical blocks, then this is for you. It’s happening on Friday 4 May in the afternoon, UK time, and you can book here.

We ran it a couple of months back and got some very positive feedback from participants.

You will learn how to save time and avoid errors by writing bespoke commands for your own use, getting Stata to loop through your data for repetitive work, and including automated checks to keep Stata running smoothly.

By the end of the course, all participants will feel comfortable undertaking the following tasks:

- Automating repetitive and time-consuming tasks
- Providing non-technical colleagues with a Stata analysis they can easily run
- Using Stata’s loops to save time and work with big data
- Protecting your automated analysis against common data issues
- Creating your own Stata commands