Best dataviz of 2014

I expect everyone in the dataviz world would tell you this year was better than ever. It certainly seemed that way to me. I’m going to separate excellent visualisation for the purpose of communicating data from that for communicating methods.

In the first category, the minute I saw “How the Recession Reshaped the Economy, in 255 Charts”, it was so clearly head and shoulders above everything else that I could have started writing this post right then. It’s beautiful, intriguing and profoundly rich in information. And it’s quite unlike anything I’d seen in D3 before; or rather, it brings together a few hot trends, like scrolling to go through a deck, in exemplary style.

[Image: screenshot of the NYT “How the Recession Reshaped the Economy, in 255 Charts” graphic]

Next, the use of JavaScript as a powerful programming language to do all manner of clever things in your web browser. Last year I was impressed by Rasmus Bååth’s MCMC in JavaScript, allowing me to do Bayesian analyses on my cellphone. This year I went off to ICOTS in Flagstaff, AZ, and learnt about StatKey, a pedagogical collection of simulation / randomisation / bootstrap methods – but you can put your own data in, so why not use them in earnest? It is entirely written in JavaScript, and you know what that means – it’s open source, so take it and adapt it, making sure to acknowledge the work of this remarkable stats dynasty!

[Image: screenshot of StatKey]

So, happy holidays. If the good Lord spares me, I expect to enjoy even more amazing viz in 2015.

Filed under JavaScript, Visualization

An audit of audits

In England, and to some extent other parts of the UK (it’s confusing over here), clinical audits with a national scope are funded by HM Government via the Healthcare Quality Improvement Partnership (HQIP). Today, they have released a report from ongoing work to find out how these different audits operate. You can download it here. I am co-opted onto one of the sub-groups of the NHS England committee that decides which projects to fund, and as a statistician I always look for methodological rigour in these applications. The sort of thing that catches my eye, or more often worries me by its absence: plans for sampling, plans for data linkage, plans for imputing missing data, plans for risk adjustment and how these will be updated as the project accumulates data. Also, it’s important that the data collected is available to researchers, in a responsible way, and that requires good record-keeping, archiving and planning ahead.

I’ve just looked through the audit-of-audits report for statistical topics (which are not its main focus) and want to pick up a couple of points. In Table 3, we see that the statistical analysis plan is the area most likely to be missed out of an audit’s protocol. It’s amazing really, considering how central that is to their function. 24/28 work streams provide a user manual, including a data dictionary, to the poor devils who have to type in their patients’ details late at night when they were supposed to have been at their anniversary party long ago (that’s how I picture it anyway); this really matters because the results are only as good as what got typed in at 1 am. Four of them take a sample of patients, rather than aiming for everyone, and although they can all say how many they are aiming for, only one could explain how they check for external validity and none could say what potential biases existed in their process. 20/28 use risk adjustment, 16 of whom had done some form of validation.

Clearly there is some way to go, although a few audits achieve excellent standards. The problem is in getting those good practices passed along. Hopefully this piece of work will continue to get support and to feed into steady improvements in the audits.

Filed under healthcare

Slice bivariate densities, or the Joy Division “waterfall plot”

This has been on my to-do list for a long old time. Lining up slices through a bivariate density seems a much more intuitive way of depicting it than contour plots or some ghastly rotating 3-D thing (urgh). Of course, there is the danger of features being hidden, but you know I’m a semi-transparency nut, so it’s no surprise I think that’s the answer to this too.

[Image: example sliced-density (“waterfall”) plot produced by the function below]

Here’s an R function for you:

# x, y: data
# slices: number of horizontal slices through the data
# lboost: coefficient to increase the height of the lines
# gboost: coefficient to increase the height of the graph (ylim)
# xinc: horizontal offset for each successive slice
#   (typically something like 1/80)
# yinc: vertical offset for each successive slice
# bcol: background color
# fcol: fill color for each slice (polygon)
# lcol: line color for each slice
# lwidth: line width
# extend: Boolean to extend lines to the edge of the plot area
# densopt: list of additional arguments passed verbatim to density()
# NB if you want to cycle slice colors through vectors, you need to
#   change the function code; it sounds like a pretty bad idea to me,
#   but each to their own.
slicedens <- function(x, y, slices=50, lboost=1, gboost=1, xinc=0, yinc=0.01,
                      bcol="black", fcol="black", lcol="white", lwidth=1,
                      extend=FALSE, densopt=NULL) {
  ycut <- min(y) + ((0:slices) * (max(y) - min(y)) / slices)
  height <- gboost * ((slices * yinc) + max(density(x)$y))
  plot(c(min(x), max(x) + ((max(x) - min(x)) / 4)),
       c(0, height),
       xaxt="n", yaxt="n", ylab="", xlab="")
  rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col=bcol)
  for (i in slices:1) {
    miny <- ycut[i]
    maxy <- ycut[i + 1]
    gx <- (i - 1) * (max(x) - min(x)) * xinc
    gy <- (i - 1) * height * yinc
    dd <- do.call(density, append(list(x=x[y >= miny & y < maxy]), densopt))
    polygon(dd$x + gx, lboost * dd$y + gy, col=fcol)
    lines(dd$x + gx, lboost * dd$y + gy, col=lcol, lwd=lwidth)
    if (extend) {
      lines(c(par("usr")[1], dd$x[1] + gx),
            rep(lboost * dd$y[1] + gy, 2), col=lcol)
      lines(c(dd$x[length(dd$x)] + gx, par("usr")[2]),
            rep(lboost * dd$y[length(dd$y)] + gy, 2), col=lcol)
    }
  }
}
# Example 1:
y<-runif(5000,min=-1,max=1)
x<-runif(5000,min=-1,max=1)+rnorm(5000,mean=1/(y+1.1),sd=0.8-(y*0.5))
slicedens(x,y,lboost=0.2,fcol=rgb(0,0,0,200,maxColorValue=255))
# Example 2:
data(iris)  # the iris data ships with R; no library() call needed
slicedens(x=iris$Sepal.Width,y=iris$Sepal.Length,slices=12,
 lboost=0.02,fcol=rgb(0,0,0,200,maxColorValue=255),
 extend=TRUE,densopt=list(kernel="cosine",adjust=0.5))

Some places call this a waterfall plot. Anyway, the white-on-black color scheme is clearly inspired by the Joy Division album cover. Enjoy.

Edit 9 October 2014: added the “extend” and “densopt” arguments.

Filed under R, Visualization

Transparent hurricane paths in R

Arthur Charpentier has written a really nice blog post about obtaining hurricane tracks and plotting them. He then goes on to do other clever Markov process models, but as a dataviz guy who knows almost nothing about meteorology, I want to suggest just a small improvement to the maps. Almost every graphic with a lot of data benefits from transparency, and it’s easy to add in R: you can just specify a color with a function like rgb(). Here’s the opaque version:

[Image: hurricane tracks drawn in opaque red]

and here it is with color="red" replaced by color=rgb(255,0,0,18,maxColorValue=255).

[Image: the same hurricane tracks drawn in semi-transparent red]
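To see the effect in isolation, here’s a minimal standalone sketch – made-up random walks standing in for hurricane tracks, not Arthur’s data or code – showing how a heavily transparent red lets the dense regions build up:

# A minimal sketch of the transparency idea: 300 random walks stand in
# for hurricane tracks (illustrative only, not Arthur's code or data).
set.seed(123)
plot(c(0, 1), c(0, 1), type="n", xlab="", ylab="")
for (i in 1:300) {
  lines(seq(0, 1, length.out=25),
        cumsum(rnorm(25, 0.04, 0.03)),
        col=rgb(255, 0, 0, 18, maxColorValue=255), lwd=2)
}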

The only other thing I did to Arthur’s code was to wrap the crucial TOTTRACK line in a try() function because some of the files came back with error 404, and you don’t want to stop just for a few ol’ hurricanes:

try(TOTTRACK<-rbind(TOTTRACK,extract.track(y)),silent=FALSE)

Longer-term, it would be worth trying to replace the tracks with splines to avoid jumpiness, and it does look like there’s some dirt in the data: what are those tracks going straight from the Azores to Iceland?

Filed under R

Save your simulation study seeds

Here in the Northern hemisphere, gardeners are gathering seeds from their prize-winning vegetables and storing them away for next year’s crop. Today at the 20th London Stata Users’ Group meeting, I learnt a similar trick. It’s strange I never thought of it before; regular readers will know how keen I am on simulation studies and techniques. Sometimes you want to go back and investigate one particular simulated dataset, either because the results were particularly interesting or because your program didn’t behave the way you hoped, and you think there will be a clue on how to debug it in that one weird dataset.

Bill Gould’s sage advice was to gather the seed for the random number generator (RNG) at the beginning of each iteration and store it as a string. The seed initialises the RNG (which is never really random*, just unpredictable) and if you set it, you can reproduce the results. A basic example in Stata:

clear all

local iter=10
set obs 100
gen str64 myseed=""
gen slopes=.
gen x=rnormal(0,1)
gen mu=1+(2*x)
gen y=rnormal(mu,1)

forvalues i=1/`iter' {
 local myseed=c(seed)   // c(seed) returns the current RNG state as a string
 qui replace myseed="`myseed'" in `i'
 qui replace x=rnormal(0,1)
 qui replace mu=1+(2*x)
 qui replace y=rnormal(mu,1)
 qui regress y x
 matrix A=e(b)
 local beta=A[1,1]
 qui replace slopes=`beta' in `i'
}

local revisit=myseed[2]
set seed `revisit'
qui replace x=rnormal(0,1)
qui replace mu=1+(2*x)
qui replace y=rnormal(mu,1)
regress y x
dis "`revisit'" 

In this example, we run a linear regression 10 times and save the slope parameters and the seeds. We can then pick out simulation number 2 and recreate it without retracing any other steps. Nice! Here it is in R (but R RNGs are much more complex and I have to say that the help documentation is far from helpful):

iter <- 10
size <- 100
slopes <- rep(NA, iter)
seeds <- vector("list", iter)
if (!exists(".Random.seed")) runif(1)  # touch the RNG so .Random.seed exists in a fresh session
for (i in 1:iter) {
  seeds[[i]] <- .Random.seed           # save the full RNG state before generating this dataset
  x <- rnorm(size, 0, 1)
  y <- rnorm(size, 1 + (2 * x), 1)
  slopes[i] <- lm(y ~ x)$coefficients[2]
}
.Random.seed <- seeds[[2]]             # restore the state saved at the start of iteration 2
                                       # (works at top level; inside a function, assign into .GlobalEnv)
x <- rnorm(size, 0, 1)
y <- rnorm(size, 1 + (2 * x), 1)
lm(y ~ x)
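
A simpler variation (my sketch, not part of Bill Gould’s advice) is to seed each iteration with a known value rather than saving the state, so that run i can be replayed with nothing more than set.seed(i):

# Variation: give each iteration a known seed instead of storing .Random.seed
iter <- 10
size <- 100
slopes <- rep(NA, iter)
for (i in 1:iter) {
  set.seed(i)                       # iteration i is fully determined by its seed
  x <- rnorm(size, 0, 1)
  y <- rnorm(size, 1 + (2 * x), 1)
  slopes[i] <- coef(lm(y ~ x))[2]
}
set.seed(2)                         # replay simulation number 2
x <- rnorm(size, 0, 1)
y <- rnorm(size, 1 + (2 * x), 1)
lm(y ~ x)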

* – I’m glossing over the philosophical meaning of ‘random’ here

Filed under R, Stata

Should every nonparametric test be accompanied by a bootstrap confidence interval?

Well, duh. Obviously. Because (a) every test should have a CI and (b) bootstrap CIs are just awesome. You can get a CI around almost any statistic, and they account for non-normality and boundaries.

But you might have to be a little careful in the interpretation, because they might not be measuring the same thing as the test.

Take a classic Wilcoxon rank-sum / [Wilcoxon-]Mann-Whitney independent-samples test (don’t you just love those consistent and memorable names?). This ranks all the data and compares the ranks across the two groups. Every bit of the distribution contributes, and there isn’t an intuitive summary statistic; what you’re testing is the W statistic. Do you know what a W of 65000 looks like? No, neither do I. If there’s a difference in location somewhere, the test will probably pick it up.

It’s so much simpler for the jolly old t-test. You take means and compare them. You get CIs around those means with a simple formula. And everybody knows what a mean is, even if they don’t really want to grapple with a t-statistic and Satterthwaite’s degrees of freedom.

So, in the Mann-Whitney case, the most sensible measure might be the difference between the medians. There is no formula for a CI for this, though undoubtedly we could get a pretty bad approximation by the usual techniques. So, we reach for the bootstrap. In fact, perhaps we should just be using it all the time…?
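
As a sketch of the pairing (with made-up skewed data, not anything from a real project), here is a rank-sum test alongside a percentile bootstrap CI for the difference in medians:

# Made-up illustration: Mann-Whitney test plus a percentile bootstrap CI
# for the difference in medians.
set.seed(2014)
g1 <- rexp(60, rate=1)      # two skewed samples, invented for this example
g2 <- rexp(60, rate=0.7)
wilcox.test(g1, g2)         # the rank-sum test and its W statistic
meddiff <- function(a, b) median(a) - median(b)
boots <- replicate(2000, meddiff(sample(g1, replace=TRUE),
                                 sample(g2, replace=TRUE)))
quantile(boots, c(0.025, 0.975))  # percentile bootstrap CI for the median difference

With other data you could easily see the test come out significant while an interval like this still crosses zero, which is exactly the pickle described below.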

So the problem here is that you could have a significant Mann-Whitney result but a median-difference CI that crosses zero. Interpreting that is not so easy, and I found one of my students in just that pickle recently. It was my fault really; I’d suggested the bootstrap CI. How could we deal with this situation? Running the risk of cliché, it’s not a problem but an opportunity. Because the test and the CI look at the data in slightly different ways, you’re actually getting more insight into the distribution, not less. Consider this situation:

[Hand-drawn sketch: two distributions with the same median but different shapes. Caption: “Spend hours making it just so in R? No. Use a pencil.”]

Here, the groups have the same median but should get a significant Mann-Whitney result if the sample size is not tiny. You can surely imagine the opposite too, with a bimodal distribution where the median flips from one clump to another through only a tiny movement in the distribution as a whole.

So, in conclusion:

  • my enthusiasm for bootstrapping is undimmed
  • there is still no substitute for drawing lots of graphs to explore your data (and for this, pencils are probably best avoided)

Filed under learning

Measuring nothing and saying the opposite: Stats in the service of the Scottish independence debate

I was half-heartedly considering writing about the ways that GDP can be twisted to back up any argument, when what should come along but this unedifying spectacle.

The Unionists (No campaign) have produced a league table of GDP, showing how far down Scotland would be. So, the argument goes, you should vote for them. This is, however, irrelevant to whether Scots would be better off. Scotland’s GDP would be lower simply because it would be a small country. GDP is the total sum of economic activities. If you have fewer people, you do less. That doesn’t make you poor, as Monaco or the Vatican City will tell you. If Manhattan declared independence from the rest of the USA, its GDP would go down. If Scotland not only stayed in the UK but convinced North Korea to join too, the UK’s GDP would go up. On this logic, every time a border is removed, people immediately get rich.

Meanwhile, the Separatists (Yes campaign) have produced a league table of GDP per capita, showing how far up Scotland would be. So, the argument goes, you should vote for them. This is, however, irrelevant to whether Scots would be better off, despite being a far less bad guess than GDP in toto. It tells the individual voter (let’s call him Mr Grant) practically nothing about whether the Grants would be better off, because that depends on a million other factors. Small countries with some very wealthy people, relying on foreign investment, will have inflated GDP per capita. It’s just the same thing kids learn in primary school now: when there are outliers, the mean is not so useful.
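
To see that last point in miniature, a made-up example (mine, nothing to do with actual Scottish figures):

# 99 households on 20 (thousand, say) and one on 2000:
incomes <- c(rep(20, 99), 2000)
mean(incomes)    # 39.8 -- the "per capita" figure looks rosy
median(incomes)  # 20   -- the typical household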

Anyway, the whole measure is used to form arguments about wellbeing, which is just nonsense. Otherwise all our ex-colonies would be kicking themselves at trading the jackpot of Saxe-Coburg-Gotha rule for silly stuff like identity and self-determination. (Except the USA, Canada, Australia and Ireland, who have higher GDP per capita than us and so are presumably happier. We should try to join them instead, as that will make us happy – though they will feel sad when we arrive.) No wonder there are still plenty of people who, having asked what I do for a living, gleefully say “lies, damned lies…”

PS: This Langtonian doesn’t get a vote because I live in London – that “giant suction machine” – and here’s a great post at Forbes about the UK joining the USA and becoming the lowest of the low. 

Filed under Uncategorized