A brief one this week, as I’m working on the dataviz book.

I’m a medical statistician, so I get asked about sample size calculations *a lot*. This is despite them being nonsense much of the time (wholly exploratory studies, no hypothesis, pilot study, feasibility study, qualitative study, validating a questionnaire…). In the case of randomised, experimental studies, they’re fine, and especially if there’s a potentially dangerous intervention or lack thereof. But we have a culture now where reviewers, ethics committees and such ask to see one for any quant study. No sample size, no approval.

So, I went back through six years of e-mails (I throw nothing out) and found all the sample size calculations. Others might have been on paper and lost forever, and there are many occasions where I’ve argued successfully that no calculation is needed. If it’s simple, I let students do it themselves. Those do not appear here, but what we do have (79 numbers from 21 distinct requests) give an idea of the spread.

You see, I am so down on these damned things that I started thinking I could just draw sizes from the distribution in the above histogram like a prior, given that I think it is possible to tweak the study here and there and make it as big or as small as you like. If the information the requesting person lavishes on me makes no difference to the final size, then the sizes must be identically distributed even conditional on the study design etc., and so a draw from this prior will suffice. (Pedants: this is a light-hearted remark.)

You might well ask why there are multiple — and often very different — sizes for each request, and that is because there are usually unknowns in the values required for calculating error rates, so we try a range of values. We could get Bayesian! Then it would be tempting to include another level of uncertainty, being the colleague/student’s desire to force the number down by any means available to them. Of course I know the tricks but don’t tell them. Sometimes people ask outright, “how can we make that smaller”, to which my reply is “do a bad job”.

And in those occasions where I argue that no calculation is relevant, and the reviewers still come back asking for one, I just throw in any old rubbish. Usually 31. (I would say 30 but off-round numbers are more convincing.) It doesn’t matter.

If you want to read people (other than me) saying how terrible sample size calculations are, start with “Current sample size conventions: Flaws, harms, and alternatives” by Peter Bacchetti, in BMC Medicine 2010, 8:17 (open access). He pulls his punches, minces his words, and generally takes mercy on the calculators:

“Common conventions and expectations concerning sample size are deeply flawed, cause serious harm to the research process, and should be replaced by more rational alternatives.”

In a paper called “Sample size calculations: should the emperor’s clothes be off the peg or made to measure”, which wasn’t nearly as controversial as it should have been, Geoffrey Norman, Sandra Monteiro and Suzette Salama (no strangers to the ethics committee), point out that they are such guesswork, we should just save people’s anxiety, delays waiting for a reply from the near-mythical statistician, and brain work, and let them pick some standard numbers. 65! 250! These sound like nice numbers to me; why not? In fact, their paper backs up these numbers pretty well.

In the special case of ex-post “power” calculations, see “The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis” by John M. Hoenig and Dennis M. Heisey, in The American Statistician (2001); 55(1): 19-24.

This is not a ‘field’ of scientific endeavour, it is a malarial swamp of steaming assumptions and reeking misunderstandings. Apart from multiple testing in its various guises, it’s hard to think of a worse problem in biomedical research today.

Absolutely! I’m right with you on this.

The one bit I would disagree with, though, is “In the case of randomised, experimental studies, they’re fine…” I don’t think they are. For one thing, the standard sample size calculation only makes sense if you’re going to conclude that a treatment “works” or “doesn’t work” using a significance test, which is what most do but really shouldn’t. Also, people project strange and magical beliefs onto the sample size; it’s “the number needed to answer the question” and failure to achieve it is a fatal flaw. There isn’t generally much thought about or understanding of what people are doing or what it means. The sample size is just part of the checklist/cookbook approach to clinical trials.