This is some unpublished work that I started a few years ago and never finished.

Power is the chance that you will, in a future study that you are designing, get a significant hypothesis test result if the true value of the test statistic in question is equal to some minimally important value. Sometimes, people carry out studies and they are disappointed to get p>0.05: non-significant. You know, p=0.049 means you collect a Nobel Prize and p=0.051 means you collect a P45 (the tax document you get when you lose your job in the UK). So, when their fears are realised and they get non-significance, they go looking for excuses they can tell the boss. One is that the study maybe turned out (through no fault of their own, of course) to have lacked power, and they ask for retrospective power calculations: not when designing the study but after it has been conducted. That is meaningless, says the statistician, and they go away despondent. I have to say that, to me, it seems a reasonable question — it just can’t be answered by power.

In the last few days, Shravan Vasishth and I have batted the idea around on Twitter. He proposed a calculation on his blog based on treating the probability of H0 vs HA as a random variable, and that spurred me on to type up what notes I had. My approach is to look at a hypothetical identical future study, but the *really* interesting aspect is that you would only ask this question of your friendly statistician when you think there’s a chance of snatching victory from the jaws of defeat, and that introduces a complex bias. Forcing people to think about and justify this bias might actually make the practice of retrospective power calculation, or as I prefer to put it, the false non-significance rate, quite a positive one. Here’s the text, but I recommend reading it in PDF so the maths makes more sense.

## “Retrospective power” revisited: a Bayesian false non-significance rate

### Introduction and notation

Power, the probability of obtaining a significant hypothesis test result if the population test statistic is equal to a minimally important value, is a ubiquitous concern in many fields of applied statistics, including my own, biomedical research. It is usually operationalised as a frequentist concept, and so calculating it after the study has been conducted — so-called retrospective power — is meaningless.

Let theta be the true value of the test statistic, and sigma its true standard error given the sample size of the completed study. delta is the minimally important value (the target difference) upon which the original sample size calculation for the study was based, and theta^ and sigma^ are the completed study’s estimates of theta and sigma respectively. Here, for simplicity, they are regarded as transformed so that the null hypothesis is H0: theta = 0. I will also assume normality of the sampling distribution for illustrative purposes, although the formulas do not require it.

The power, 1 − P(theta^ < 1.96 sigma^ | theta = delta), is either 0 or 1 once you know theta^ and sigma^. Nevertheless, there seems to be a widespread urge to answer this question: “could my study’s non-significant result have been a mistake?” That seems a reasonable question, but answering it requires something other than power.
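To make the prospective calculation concrete, here is a minimal sketch in Python (my own illustration, not part of the original note) of classical power under the normality assumption above; the function name `power` and the numbers are assumptions for the example.

```python
from statistics import NormalDist

def power(delta, sigma, alpha=0.05):
    """Classical (prospective) power of a two-sided z-test when the true
    effect is delta and the estimate's standard error is sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)    # 1.96 for alpha = 0.05
    snr = delta / sigma                        # standardised effect size
    # P(|theta^| > z * sigma^ | theta = delta), under normality
    return (1 - NormalDist().cdf(z - snr)) + NormalDist().cdf(-z - snr)

# A study sized for 80% power at the 5% level has delta/sigma ≈ 2.80
print(round(power(2.80, 1.0), 3))
```

Before the study, this is a genuine probability; after it, the event inside the bracket has either happened or not, which is why the retrospective version is degenerate.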

### Previous work

Ioannidis considered possible ways of assessing the probability of the truth or falsehood of study results in his widely cited 2005 paper “Why most published research findings are false”. His formulas take into account various other forms of bias, such as unacknowledged multiplicity and publication and reporting bias. However, he considers only a dichotomised finding (significant / not) and true value (effect / no effect), which limits the applicability of the approach to individual studies. This was one aspect criticised afterwards by Goodman & Greenland [http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0040168].

In a 2008 paper, expanding on an article in American Scientist magazine [http://www.americanscientist.org/issues/page2/of-beauty-sex-and-power], and published only on the first author’s own website, Gelman and Weakliem considered underpowered studies and set out the probabilities of various types of error: the familiar types I and II, as well as errors of magnitude (type M) and of sign (type S). They conclude that the framework of type I and II errors has not been helpful. In most cases they consider, the probabilities of type M or S errors turn out to be so high as to call any conclusion of the study into question. In the same year, Ralph O’Brien spoke at the JSM conference on “crucial type I and II error rates”. This proposal reverses the familiar formulas through Bayes’ Theorem. The contemporaneous discussion at http://andrewgelman.com/2008/12/26/what_is_the_poi/ sets the scene.
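Type S and M errors are easy to illustrate by simulation. The sketch below is my own Python with illustrative numbers, not code from Gelman and Weakliem’s paper; it estimates both quantities for a study whose true effect is only half a standard error.

```python
import random
from statistics import NormalDist

def type_s_m(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    """Monte Carlo estimates of the type S (wrong sign) rate and the mean
    type M exaggeration factor, among significant results only."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    sig = wrong_sign = 0
    exaggeration = 0.0
    for _ in range(n_sims):
        est = rng.gauss(true_effect, se)       # one hypothetical study's estimate
        if abs(est) > z * se:                  # a significant result
            sig += 1
            wrong_sign += est * true_effect < 0
            exaggeration += abs(est) / abs(true_effect)
    return wrong_sign / sig, exaggeration / sig

# A badly underpowered study: the true effect is only half a standard error
s_rate, m_factor = type_s_m(0.5, 1.0)
```

For such a study the simulation gives a type S rate of several percent and a mean exaggeration factor somewhere around four to five, which is the sense in which any conclusion from the study is called into question.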

### Objective

To address the question of whether a study’s non-significant result could be a type II error, we must deal with a theoretical identical future study, in the same prospective way that classical power is calculated. The completed study’s estimates of the parameter and its standard error are fixed values, and the true population values are unknown, but we can establish a distribution for the estimates of an identical future study.

The question “what is the probability that my study’s non-significant result is wrong?” can then be rephrased as: “given what our data tell us about the true parameter, what is the probability that the true effect is as big as, or bigger than, the target difference, and yet an identical study would yield a non-significant result?” This is a form of Bayesian posterior predictive model checking.

### Some theory

Let theta* and sigma* be the estimates arising from an identical study. We are interested in:

P(theta* < 1.96 sigma* | theta^, sigma^, theta ≥ delta)

which we can calculate from the posterior distribution P(theta, sigma | theta^, sigma^) and the conditional P(theta*, sigma* | theta, sigma), integrating out the unknown (theta, sigma).
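Under a flat prior, and treating sigma as known and equal to sigma^ (a simplification of the integral; the function name and numbers are my own illustration), this quantity can be estimated by straightforward Monte Carlo:

```python
import random

def false_nonsig_rate(theta_hat, sigma_hat, delta, n_sims=100_000, seed=1):
    """Monte Carlo estimate of P(theta* < 1.96 sigma^ | theta^, sigma^, theta >= delta),
    with a flat prior on theta and sigma treated as known (sigma = sigma^)."""
    rng = random.Random(seed)
    kept = nonsig = 0
    for _ in range(n_sims):
        theta = rng.gauss(theta_hat, sigma_hat)    # draw from the posterior for theta
        if theta < delta:                          # condition on theta >= delta
            continue
        kept += 1
        theta_star = rng.gauss(theta, sigma_hat)   # identical future study's estimate
        nonsig += theta_star < 1.96 * sigma_hat
    return nonsig / kept

# A study that just missed significance (theta^ = 1.5 sigma^), designed for delta = 2 sigma^
print(round(false_nonsig_rate(1.5, 1.0, 2.0), 2))
```

Each iteration draws a plausible true effect given the data, restricts attention to effects at least as big as the target difference, and then asks whether a replicate study would nevertheless come out non-significant.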

To emphasise the distinction from the type II error rate, I propose the clear term “false non-significance rate” for this.

There is a further complication to consider, which mirrors Ioannidis’s various biases. Retrospective power is usually only considered when theta^ < 1.96 sigma^ (that is, a non-significant result), and either sigma^ is larger than expected (including problems of sample size) or theta^ is close to, but smaller than, delta. This introduces a bias, because we will only ask the retrospective power question for a subset of the possible values of (theta^, sigma^). To counter this requires us to introduce a prior on (theta, sigma) and so derive a posterior distribution for (theta*, sigma*). This could be informed by previous research or opinion in the usual way, but it does not make sense for it to be diffuse. Sensitivity analysis with various priors is advisable.

It is almost always the case that there is some other information on anticipated findings, so that (theta^, sigma^) is not the only information about (theta, sigma) and hence (theta*, sigma*). We should incorporate this as a prior distribution, because without one it is hard to interpret retrospective power in the context of other studies, in the way that we expect people to interpret study findings.
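When the sampling distribution is normal, one convenient (though by no means obligatory) choice is a conjugate normal prior, so the update is closed-form. A sketch, with the prior parameters mu0 and tau0 as my own notation:

```python
def posterior_normal(theta_hat, sigma_hat, mu0, tau0):
    """Conjugate normal-normal update: prior theta ~ N(mu0, tau0^2) combined
    with the study estimate theta_hat, whose standard error is sigma_hat."""
    w0, w1 = 1 / tau0**2, 1 / sigma_hat**2     # prior and data precisions
    mu1 = (w0 * mu0 + w1 * theta_hat) / (w0 + w1)
    tau1 = (w0 + w1) ** -0.5
    return mu1, tau1

# A sceptical prior centred on zero shrinks a just-missed estimate towards the null
mu1, tau1 = posterior_normal(1.5, 1.0, 0.0, 1.0)
```

The resulting (mu1, tau1) then replace (theta^, sigma^) wherever theta is integrated out, so the false non-significance rate is computed against a posterior that acknowledges the selection bias, rather than taking the single study at face value.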

### Some discussion

As a former medical statistician in a university and hospital setting, I was regularly called on to advise on sample size and power calculations. I was, and still am, convinced that the great majority of these calculations were uninformative acts of sophistry, performed for the comfort of the tutor, ethics committee or funding body, and based on such an accumulation of assumptions as to be meaningless.

My message to all colleagues and students in such a situation (it is not their fault that they expect a simple answer) is to think very carefully and in depth about what they are trying to investigate, and what they would do having found various potential results. This critical thought helps to inoculate them against the lure of simplicity that comes from one calculation on one hypothesis, under one set of assumptions. The calculations set out in this paper are no different, and require careful justification for all the assumptions behind them.

Indeed, I approach publication of this proposition with some trepidation, lest what is intended as a stimulating exercise in defining slippery concepts is reduced instead to a catch-all formula that permits retrospective power calculations to proceed under a new name, unhindered by cerebral activity. I hope that by encoding the most difficult part of this, the retrospective power bias, as an informative prior distribution, researchers will be forced to slow down and consider very carefully what has happened with their study and what they are seeking to achieve by such a calculation, rather as the QWERTY keyboard was intended to slow typists enough to stop the keys of a manual typewriter from jamming.

A final note of caution concerns the target difference delta, which appears in all the formulas here. A review of methods to establish this value (Cook et al, HTA) is essential reading for everyone working with sample size and power calculations, because our recommendations for designing future studies and interpreting completed ones are undermined by irrelevant or unreliable target differences.