An interesting thread has been discussed in the last couple of days on the Allstat mailing list. Here is Andy Cooper’s original post:

Dear All,

I have the following question which I am hoping native statisticians (I am a physicist by training) can help me address. To set the background as to why I am asking this question: a manuscript recently submitted to a Physics journal got rejected because 1 of the 3 reviewers claims that a result presented in the given manuscript is wrong. I would therefore be very grateful to hear the opinion of statisticians.

The issue in question is as follows: Suppose we have 6 statistics (e.g z-statistics), each derived from an independent data set (i.e 6 independent data sets in total). We can assume that the number of degrees of freedom is the same in each data set, so that the corresponding P-values are also comparable. We can further assume that each independent data set is a sample from an underlying population. Under the null hypothesis (z=0), the P-values would be distributed uniformly between 0 and 1. Now, the observed P-values are in fact (3e-9, 0.04, 0.05, 0.03, 0.02, 0.005), i.e they are all less than 0.06. It is clear, at least to me, that the chance that these P-values are drawn from a uniform distribution is pretty small (<1e-8). Yet the reviewer in question claims that there is no overall significance. His/her argument is based on the Bonferroni correction: using a threshold of 0.05/6~0.008 only 2 P-values pass this threshold, which he/she then goes on to claim is not meaningful enough.

My response to the reviewer’s comment is that the use of a Bonferroni correction to establish the overall significance of the 6 P-values is wrong. The Bonferroni correction is ill-suited for this particular application since it is overly conservative, leading to a large fraction of false negatives. Remarkably, the editor of the Physics journal in question finds the reviewers arguments (i.e using the Bonferroni correction) as “persuasive”.

I would be most grateful for your comments.

He then clarified:

I should have clarified of course that the directionality of the statistic is consistent across the 6 data sets.

This post kicked the proverbial hornet’s nest. It seems many scientists, whether statisticians or otherwise, have encountered this sort of response from a reviewer. The reviewer is just wrong, and so is the editor*. If I were commenting on a clinical trial of cardiac surgery, I wouldn’t shoot from the hip about where to stick the aorta or how many mils of Pumpmax to infuse. If I did, I would expect the surgeons to tell me to get lost and stick to the stats.

So why do people feel able to pronounce on statistics when they have plainly had only a few hours’ training squeezed into their undergraduate degree? When reviewing papers, there is often a box that says “*This requires review by a statistician: yes/no*”. In my experience this gets ticked in a completely unpredictable way. It seems to me that teaching statistics like a flowchart or checklist is to blame. It is very tempting to do this when faced with another cohort of medical / psychology / whatever students. They memorise the basics, they pass the exam, and what happens after that is someone else’s problem. In this case one could imagine the reviewer having remembered some sage advice like “when you do lots of tests, use Bonferroni and everything will be OK”. And having stashed that half-truth away, they leave with a false sense of confidence in their mastery of statistics. Lecturers are also faced with students who feel very anxious about their mathematical abilities, and those students are often repeatedly reassured, rewarded for doing the basics and packed in cotton wool until they are no longer anxious. But life is not like that. Life involves data that do not match any of the methods in your textbook; life involves going back to first principles to find the method, and even though a lot of the time some very clever person has programmed the computer to do it for you, you should never forget that the maths is lurking just below the surface. When you hit a snag, you should go down the corridor and knock on the door of someone who knows how to do the maths. If you’re reviewing a paper, tick the statistician box. What you should never do is bumble on regardless, substituting eminence for evidence.

Professional statisticians are of course outnumbered by the dilettanti and may not be able to provide insight and review on every study that is done, but perhaps it is our duty to be more publicly critical, a bit nastier, dare I say it – a bit more like a cardiac surgeon. After all, if you stick the aorta in the wrong place, you can only kill a few people before the hospital hands you the contents of your desk in a cardboard box and escorts you off the premises. If you publish the wrong stats, you can kill millions, and get away with it.

* – because analysing 6 data sets gives you 6 answers, which is a lot of information. Analysing the same data 6 ways then picking the interesting stuff is not very much information, and to guard against reading too much into it, we have post-hoc adjustments, of which Bonferroni is the Fisher-Price “My First Post Hoc Adjustment” (with no offence to Fisher-Price and their fine products). The thing that really is beyond the grasp of bumblers like Andy Cooper’s reviewer is that the decision about whether or not to adjust depends on *context* and *intention*; it is as much a philosophical debate as it is methodological. Adjusted p-values seek to replace your human judgement of whether you can trust the (non-)significance of the results, and that makes them much closer in spirit to a subjective Bayesian posterior distribution than most analysts would care to admit.
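
For anyone who wants to put numbers on this, here is a minimal sketch in Python using the six P-values from Andy's post. It assumes Fisher's method for combining independent P-values, which is one standard choice and not necessarily the calculation Andy had in mind. The function name `fisher_combined_p` and the 0.05 level are my own choices for illustration. The sketch also counts how many P-values pass the reviewer's per-test Bonferroni threshold.

```python
import math

# The six P-values quoted in Andy Cooper's post, one per independent data set.
p_values = [3e-9, 0.04, 0.05, 0.03, 0.02, 0.005]
alpha = 0.05  # assumed significance level, as in the reviewer's 0.05/6
m = len(p_values)

# The reviewer's approach: compare each P-value with alpha / m.
bonferroni_threshold = alpha / m
n_individually_significant = sum(p < bonferroni_threshold for p in p_values)

# Even on Bonferroni's own terms, the global null ("all six are null")
# is rejected as soon as the smallest P-value is below alpha / m.
global_null_rejected_by_bonferroni = min(p_values) < bonferroni_threshold


def fisher_combined_p(ps):
    """Fisher's method: X = -2 * sum(ln p_i) is chi-squared with 2k degrees
    of freedom under the global null, given k independent P-values."""
    k = len(ps)
    half_x = -sum(math.log(p) for p in ps)  # this is X / 2
    # For even degrees of freedom (2k), the chi-squared upper tail has the
    # closed form exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!
    return math.exp(-half_x) * sum(
        half_x ** j / math.factorial(j) for j in range(k)
    )


combined_p = fisher_combined_p(p_values)

print(f"Bonferroni threshold: {bonferroni_threshold:.4f}")
print(f"P-values below it: {n_individually_significant} of {m}")
print(f"Global null rejected by Bonferroni: {global_null_rejected_by_bonferroni}")
print(f"Fisher combined P-value: {combined_p:.3g}")
```

Two of the six P-values pass the per-test threshold, which is the reviewer's observation, yet the combined P-value under these assumptions comes out far below the 1e-8 that Andy quotes. The two approaches answer different questions: Bonferroni asks which individual tests survive, while a combination method asks whether all six could plausibly be null at once.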