Should every nonparametric test be accompanied by a bootstrap confidence interval?

Well, duh. Obviously. Because (a) every test should have a CI and (b) bootstrap CIs are just awesome. You can get a CI around almost any statistic, they account for non-normality and boundaries.

But you might have to be a little careful in the interpretation, because they might not be measuring the same thing as the test.

Take a classic Wilcoxon rank-sum / [Wilcoxon-]Mann-Whitney independent-samples test (don’t you just love those consistent and memorable names?). This ranks all the data and compares them across two groups. Every bit of the distribution is contributing, and there isn’t an intuitive statistic; what you’re testing is the W statistic. Do you know what a W of 65000 looks like? No, neither do I. If there’s a difference somewhere in terms of location, it might come up.

It’s so much simpler for the jolly old t-test. You take means and compare them. You get CIs around those means with a simple formula. And everybody knows what a mean is, even if they don’t really want to grapple with a t-statistic and Satterthwaite’s degrees of freedom.

So, in the Mann-Whitney case, the most sensible measure might be the difference between the medians. There is no formula for a CI for this, though undoubtedly we could get a pretty bad approximation by the usual techniques. So, we reach for the bootstrap. In fact, perhaps we should just be using it all the time…?

So the problem here is that you could have a significant Mann-Whitney but a median difference Ci that crosses zero. Interpreting that is not so easy, and I found one of my students in just that pickle recently. It was my fault really; I’d suggested the bootstrap CI. How could we deal with this situation? Running the risk of cliché, it’s not a problem but an opportunity. Because the test and the CI look at the data in slightly different ways, you’re actually getting more insight into the distribution, not less. Consider this situation:

Spend hours making it just so in R? No. Use a pencil.
Spend hours making it just so in R? No. Use a pencil.

Here, the groups have the same median but should get a significant Mann-Whitney result if the sample size is not tiny. You can surely imagine the opposite too, with a bimodal distribution where the median flips from one clump to another through only a tiny movement in the distribution as a whole.

So, in conclusion:

  • my enthusiasm for bootstrapping is undimmed
  • there is still no substitute for drawing lots of graphs to explore your data (and for this, pencils are probably best avoided)


  1. Nice post Robert! As you’ve illustrated, there are certain tests (e.g. Wilcoxon rank sum) which test hypotheses which are not about parameters, but about other things like the whole distribution of a variable. If there is no parameter, there can’t be a corresponding confidence interval.
    I looked into what the rank sum test was actually testing a while back, because in some books/papers it says it is a test of location / test of equal medians whereas in others it says it is a test that the two distributions are identical in every way. The blog post is here, and contains links to a few papers you might be interested in too:

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s