Well, duh. Obviously. Because (a) every test should have a CI and (b) bootstrap CIs are just awesome. You can get a CI around almost any statistic, and they account for non-normality and boundaries.

But you might have to be a little careful in the interpretation, because they might not be measuring the same thing as the test.

Take a classic Wilcoxon rank-sum / [Wilcoxon-]Mann-Whitney independent-samples test (don’t you just love those consistent and memorable names?). This ranks all the data and compares them across two groups. Every bit of the distribution is contributing, and there isn’t an intuitive statistic; what you’re testing is the W statistic. Do you know what a W of 65000 looks like? No, neither do I. If there’s a difference somewhere in terms of location, it might come up.
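To make the ranking step concrete, here is a minimal sketch in plain Python of how W can be computed: pool the two groups, rank everything, and add up the ranks belonging to one group. The function name `rank_sum` and the toy numbers are mine, purely for illustration; real software typically also handles exact p-values and tie corrections, which this does not.

```python
def rank_sum(x, y):
    """Wilcoxon rank-sum W: the sum of the pooled-sample ranks that fall in x.

    Tied values share the average of the ranks they span (midranks).
    """
    pooled = sorted(list(x) + list(y))
    # Map each distinct value to its midrank in the pooled sample.
    midrank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # 0-based positions i..j-1 hold ranks i+1..j; their mean is (i+1+j)/2
        midrank[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(midrank[v] for v in x)


# Made-up toy data, not from any real study
x = [1.1, 2.3, 2.9, 4.0]
y = [3.5, 4.2, 5.8, 6.1, 7.7]

w = rank_sum(x, y)
# Mann-Whitney U for x: W minus the smallest value W could possibly take
u = w - len(x) * (len(x) + 1) / 2
print(w, u)  # prints 11.0 1.0
```

Note that W depends on the group sizes: for groups of m and n observations it runs from m(m+1)/2 up to m(m+1)/2 + mn, which is part of why a raw value like 65000 tells you so little on its own.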

It’s so much simpler for the jolly old t-test. You take means and compare them. You get CIs around those means with a simple formula. And everybody knows what a mean is, even if they don’t really want to grapple with a t-statistic and Satterthwaite’s degrees of freedom.

So, in the Mann-Whitney case, the most sensible measure might be the difference between the medians. There is no formula for a CI for this, though undoubtedly we could get a pretty bad approximation by the usual techniques. So, we reach for the bootstrap. In fact, perhaps we should just be using it all the time…?
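For anyone who hasn't done this before, a rough sketch of a percentile bootstrap CI for the difference in medians might look like the following. It uses only the Python standard library; the function name, the defaults (2000 resamples, fixed seed) and the simulated `treated` / `control` data are my assumptions for illustration. The plain percentile method is the simplest option, and others (BCa, for instance) may behave better in small samples.

```python
import random
import statistics


def bootstrap_median_diff_ci(x, y, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for median(x) - median(y).

    Each group is resampled independently, with replacement, at its own size.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        xb = rng.choices(x, k=len(x))
        yb = rng.choices(y, k=len(y))
        diffs.append(statistics.median(xb) - statistics.median(yb))
    diffs.sort()
    # Simple percentile limits: the empirical alpha/2 and 1 - alpha/2 quantiles
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    estimate = statistics.median(x) - statistics.median(y)
    return estimate, lo, hi


# Simulated (made-up) data: two groups of 40, shifted by half a standard deviation
rng = random.Random(42)
treated = [rng.gauss(0.5, 1) for _ in range(40)]
control = [rng.gauss(0.0, 1) for _ in range(40)]

est, lo, hi = bootstrap_median_diff_ci(treated, control)
print(f"median difference {est:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

Resampling within each group separately keeps the two-independent-samples design intact, which is the same design the Mann-Whitney test assumes.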

So the problem here is that you could have a significant Mann-Whitney but a median difference CI that crosses zero. Interpreting that is not so easy, and I found one of my students in just that pickle recently. It was my fault really; I’d suggested the bootstrap CI. How could we deal with this situation? Running the risk of cliché, it’s not a problem but an opportunity. Because the test and the CI look at the data in slightly different ways, you’re actually getting more insight into the distribution, not less. Consider this situation:

Here, the groups have the same median but should get a significant Mann-Whitney result if the sample size is not tiny. You can surely imagine the opposite too, with a bimodal distribution where the median flips from one clump to another through only a tiny movement in the distribution as a whole.
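If you want to convince yourself, here is one way to build such a pair of groups, as a sketch with entirely made-up numbers. Group A is spread evenly and symmetrically about zero; group B has its lower half packed just below zero and its upper half stretched out to about 5, so both sample medians are zero. The p-value uses the large-sample normal approximation with no tie or continuity correction, which is my simplification to keep the example dependency-free; in practice you would use something like `scipy.stats.mannwhitneyu`.

```python
import math
import statistics


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney p-value by the large-sample normal approximation.

    No tie or continuity correction: adequate for this illustration only.
    """
    m, n = len(x), len(y)
    # U counts the (x, y) pairs in which x is the larger; ties count one half
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    z = (u - m * n / 2) / math.sqrt(m * n * (m + n + 1) / 12)
    p = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return u, p


h = 50  # half the size of each group, so 100 observations per group

# Group A: evenly spread and symmetric about zero
a_pos = [(i + 0.5) / h for i in range(h)]
group_a = [-v for v in a_pos] + a_pos

# Group B: lower half packed just below zero, upper half stretched out to about 5
group_b = ([-0.001 * (i + 1) / h for i in range(h)]
           + [0.001 / h + 5 * i / h for i in range(h)])

u, p = mann_whitney_p(group_b, group_a)
print("median A:", statistics.median(group_a))
print("median B:", statistics.median(group_b))
print("Mann-Whitney p-value:", p)
```

The two sample medians coincide, so the difference in medians is zero and any sensible interval around it will include zero, yet the rank-based test comes out far below 0.05 because B's upper half sits above almost all of A.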

So, in conclusion:

my enthusiasm for bootstrapping is undimmed

there is still no substitute for drawing lots of graphs to explore your data (and for this, pencils are probably best avoided)

I'm a medical statistician at Kingston University and St George's, University of London. I'm interested in Bayesian latent variable models, data visualization and stats curriculum reform. I use (and sometimes blog about) R, Stata, BUGS, JAGS, Stan, and program sometimes with C++, Julia and JavaScript. I make the StataStan interface and Stata2D3 package. I sit on committees for statistical computing at the Royal Statistical Society, and clinical audit and confidential enquiries for NHS England. I am the Honorary Statistician at Princess Alice Hospice, a research-active hospice in Surrey, England, and I teach statistics with Stata software on Harvard Medical School's GCSRT, ICRT, CSRT-PT and UKCSRT blended learning programs for clinicians. Sleep is for wimps.

3 Comments

Nice post Robert! As you’ve illustrated, there are certain tests (e.g. Wilcoxon rank sum) which test hypotheses which are not about parameters, but about other things like the whole distribution of a variable. If there is no parameter, there can’t be a corresponding confidence interval.
I looked into what the rank sum test was actually testing a while back, because in some books/papers it says it is a test of location / test of equal medians whereas in others it says it is a test that the two distributions are identical in every way. The blog post is here, and contains links to a few papers you might be interested in too: http://thestatsgeek.com/2014/04/12/is-the-wilcoxon-mann-whitney-test-a-good-non-parametric-alternative-to-the-t-test/

Thanks to @ThomasSpeidel, who tweeted that Rasmus Bååth’s blog includes a nice alternative to t-testin’ and Mann-Whitney / median testin’, especially: http://www.sumsar.net/blog/2014/05/jeffreys-substitution-posterior/ & http://www.sumsar.net/blog/2014/02/bayesian-first-aid-two-sample-t-test/