Data visualization with statistical reasoning: formats for showing uncertainty

This blog post is one of a series highlighting specific images from my book Data Visualization: charts, maps and interactive graphics. This is from Chapter 8, “Visualizing Uncertainty”, and very much follows on from the previous post on the bootstrap.

Here are three approaches to showing uncertainty around a single curve in two-dimensional space.

There are two variables, and implicitly a third, which might be time; the curve moves from bottom left (where, initially, there is no uncertainty) to top right (and the uncertainty increases along the way). The data for this image are artificial, but I was thinking of hurricane tracks as well as any time series forecast where uncertainty is so important and increases further into the future.

The three approaches are, in essence:

1. Show one shape with an X% chance that the “truth” will turn out to lie outside it. By “truth”, I mean a population parameter, if you are doing inferential statistics, or future data if you are doing prediction. It could be something else too, like a rank, model, or missing value, but the same principle applies. Take this idea of the chance of the truth lying in some location and imagine it as a surface, rising where the chance is high and dropping where it is low. This surface rises up to the summit, which is our best guess. We could add the best guess to the image, which is the central line in this image, but could also be a point. Then, our shape is a contour line: the surface is at the same height all around that contour. By the way, that surface is something that we really do deal with statistically. It is a posterior probability density function if you are doing Bayesian statistics, or a likelihood function otherwise. But there are ways of getting these uncertainty measures without full-on likelihood or Bayes. The bootstrap gives you this by just picking the central (100 - X)% of resampled statistics. Inferential shortcut (asymptotic) formulas sometimes work on approximations. Approximate Bayesian Computation (ABC) generates phony data according to different parameter values, and compares it to what you actually observed.
2. Show several of these contours, like a topographic map. You might prefer to colour in the area according to the height of the surface instead, in which case it might be clearer to do some kind of smoothing; I’ll have another blog post soon on the subject of smoothing in data visualisation.
3. Instead of showing the values (height) of the probability / likelihood, draw values at random from that function and show them. If you are bootstrapping or doing Bayesian stats by simulation, then this is simple because you can just draw the bootstrap stats or the retained simulations. The posterior / likelihood then acts as a data generating process, a crucially important mental tool for good data analysis, but something we’ll have to leave for another time.
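The three options can be worked out as numbers before any drawing happens. What follows is a minimal Python sketch, not the code behind the book's image: it assumes a made-up random walk standing in for the artificial data, and the names `simulate_paths` and `band` are hypothetical. It also simplifies by taking pointwise bands at each step rather than true two-dimensional contours.

```python
import random

def simulate_paths(n_paths=2000, n_steps=20, seed=1):
    """Artificial curves: each starts at exactly 0 and drifts upward,
    accumulating noise so that uncertainty grows along the way."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        y, path = 0.0, [0.0]
        for _ in range(n_steps):
            y += 1.0 + rng.gauss(0.0, 0.5)  # drift plus noise
            path.append(y)
        paths.append(path)
    return paths

def band(paths, coverage):
    """Pointwise central band: at each step, the lower and upper
    percentiles that enclose `coverage` of the simulated curves."""
    tail = (1.0 - coverage) / 2.0
    lower, upper = [], []
    for step in range(len(paths[0])):
        values = sorted(p[step] for p in paths)
        lower.append(values[int(tail * (len(values) - 1))])
        upper.append(values[int((1.0 - tail) * (len(values) - 1))])
    return lower, upper

paths = simulate_paths()

# Option 1: one shape, here a 90% band (10% chance of lying outside it).
single_band = band(paths, 0.90)

# Option 2: several nested bands, like contours on a topographic map.
nested_bands = {c: band(paths, c) for c in (0.50, 0.80, 0.95)}

# Option 3: a random handful of the simulated curves themselves.
spaghetti = random.Random(2).sample(paths, 30)
```

Each result is just a list of numbers per step, so it could be handed to whatever plotting tool you prefer: a filled region for option 1, stacked regions of increasing transparency for option 2, and thin semi-transparent lines for option 3.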

If you are showing one value at a time, the classic error bar is the result of option 1. Tukey and others have proposed versions with multiple levels of uncertainty, relating to option 2; you could try gradations of colour or line width too. Option 3 would involve a scatter of dots, preferably semi-transparent and/or jittered.
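For the one-value case, here is a rough sketch of how the same three options fall out of a bootstrap. It assumes the statistic of interest is a mean and uses an invented sample; `bootstrap_means` and `central_interval` are hypothetical helper names, and the percentile interval shown is only the simplest of several bootstrap intervals.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=5000, seed=1):
    """Resample the data with replacement and record each resample's mean."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

def central_interval(stats, coverage):
    """Percentile interval: the central `coverage` share of the bootstrap stats."""
    ordered = sorted(stats)
    tail = (1.0 - coverage) / 2.0
    lo = ordered[int(tail * (len(ordered) - 1))]
    hi = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return lo, hi

# Made-up sample, purely for illustration.
data_rng = random.Random(0)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(40)]

boot = bootstrap_means(sample)

# Option 1: the classic error bar, here at 95%.
error_bar = central_interval(boot, 0.95)

# Option 2: several levels at once, to be drawn with graded colour or width.
levels = {c: central_interval(boot, c) for c in (0.50, 0.80, 0.95)}

# Option 3: a scatter of the bootstrap statistics themselves.
dots = random.Random(3).sample(boot, 200)
```

The only thing that changes between the options is what you do with `boot`: summarise it once, summarise it at several levels, or show a subset of it directly.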

More exotic tweaks to this general idea are also out there — I included examples of a Bank of England fan chart and a funnel plot for comparing clusters of data (hospital mortality, in my example) in the book — and if something like that is the accepted, understood and expected approach in your line of work, then you should go with it. I had to choose what to include to keep the book from getting too long and expensive, and some fun approaches to uncertainty got, unfortunately, spiked, such as visually weighted regression.

Why hurricanes? Lots of interesting dataviz work has been done on them (at least, American hurricanes, because that’s where the dataviz muscle is) in recent years by journalists. Most recently, Alberto Cairo has led an effort to improve them. He says that option 1 from above is poorly understood and introduces a false dichotomy: if you live inside the cone, you’re gonna get whacked, and if you live outside, you’re totally safe. Also, people mistake the size of the cone for the size of the hurricane itself. Option 2 helps a bit, but not totally. Option 3 is good but in some settings (such as weather forecasting), not all the lines carry the same weight (some forecast models are known to be more reliable and sophisticated than others) — how do you show that?

When you are choosing how to visualise uncertainty, there are some important considerations. Here are some that come to mind:

• What is the statistical literacy of your audience? If it’s a mix, you probably need more than one image. Provide something they know how to use, rather than something you’re convinced they’ll love once they’ve learned how to use it (more Bill Gates than Steve Jobs).
• What summary statistic is of interest, if you are doing inferential statistics? Not the statistic you can easily get, or the one with a handy formula for standard errors, but the one your audience needs for decision making.
• If you are going to show contours, error bars, or some other depiction of a given level of (un)certainty at X%, find out what X is meaningful to your audience. For example, if it is a business decision that depends on your information, then ask what level of uncertainty (risk of being wrong) would change the decision, then draw that level.
• Is sampling error (having a sample, not the whole population) the only source of uncertainty? If not, if your estimates are also affected by things like missing data, confounding / endogeneity, or response bias, then consider a Bayesian approach, where you can incorporate all those sources of uncertainty into one posterior probability surface.
• Is it enough to see the uncertainty around each estimate / statistic / prediction on its own, or does your audience need to see how they interact? Sometimes, over-estimating A implies under-estimating B, and in these cases, you need to think about not just the variance (spread) of the uncertainty of A and B individually, but also the covariance between them.
• Is the uncertainty likely to be asymmetric? Imagine you are estimating a small percentage. You shouldn’t use a shortcut formula that will return an interval extending into negative values: you will get laughed at. In cases like these, you can sometimes transform the data / stats before calculation and transform the interval back afterwards, which gives you an asymmetric interval on the original scale, or you could swap over to the bootstrap.
• What if you are worried that some outlier or clustering in the data is going to spoil the shortcut formula? You can switch to using a formula robust to outliers, like the Huber-White sandwich estimator, or robust to clusters, like the Huber-Rogers clustered sandwich estimator. (Feeling hungry?)
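The small-percentage problem is easy to see with numbers. This is a minimal sketch assuming invented counts (2 events out of 100) and a logit transform, which is one common choice of transformation rather than the only one; the function names are hypothetical. The first function is the usual shortcut (Wald) interval, the second does the same calculation on the logit scale and transforms back.

```python
import math
from statistics import NormalDist

def wald_interval(events, n, coverage=0.95):
    """Shortcut interval: estimate plus or minus z standard errors."""
    p = events / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - coverage) / 2.0)
    se = math.sqrt(p * (1.0 - p) / n)
    return p - z * se, p + z * se

def logit_interval(events, n, coverage=0.95):
    """Same idea on the logit scale, then back-transformed.
    Needs 0 < events < n, otherwise the logit is undefined."""
    p = events / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - coverage) / 2.0)
    logit = math.log(p / (1.0 - p))
    se_logit = 1.0 / math.sqrt(n * p * (1.0 - p))  # delta-method standard error
    expit = lambda v: 1.0 / (1.0 + math.exp(-v))   # inverse of the logit
    return expit(logit - z * se_logit), expit(logit + z * se_logit)

events, n = 2, 100  # invented counts: an estimate of 2%
wald_lo, wald_hi = wald_interval(events, n)
logit_lo, logit_hi = logit_interval(events, n)

print(f"Wald:  {wald_lo:.4f} to {wald_hi:.4f}")
print(f"Logit: {logit_lo:.4f} to {logit_hi:.4f}")
```

With these counts the Wald lower limit dips below zero, while the back-transformed interval stays between 0 and 1 and stretches further above the estimate than below it.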