Category Archives: Bayesian

Dataviz of the week, 12/4/17

This week, a chart with some Bayesian polemic behind it. Alexander Etz put this on Twitter:


He is working on an R package to provide easy Bayesian adjustments for reporting bias with a method by Guan & Vandekerckhove. Imagine a study reporting three p-values, all just under the threshold of significance, and with small-ish sample sizes. Sound suspicious?


Sounds like someone’s been sniffing around after any pattern they could find. Trouble is, if they don’t tell you about the other results they threw away (reporting bias), you don’t know whether to believe them or not. Or there are a thousand similar studies but this is the (un)lucky one and this author didn’t do anything wrong in their own study (publication bias).

Well, you have to make some assumptions to do the adjustment, but at least, being Bayesian, you don’t have to assume one number for the bias: you can have a distribution over it. Here, the orange distribution is the posterior for the true effect once the bias has been accounted for (in this case, a result with p>0.05 has a 0% chance of getting published, which is not unrealistic in some circles!). This is standard probabilistic stuff but it doesn’t get done, because the programming seems so daunting to a lot of people. The more easy tools – with nice, helpful visualisations – the better.
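To give a flavour of the idea, here is a toy sketch of my own in Python – not the Guan & Vandekerckhove model itself, and nothing to do with Etz’s package. One observed z = 2.1 from a model z ~ N(mu, 1), an assumed prior mu ~ N(0, 2²), and the (extreme) selection rule that p > 0.05 means a 0% chance of publication, here one-sided for simplicity. Conditioning on publication means dividing the likelihood by P(z > 1.96 | mu):

```python
# Toy illustration (my own, not the actual method/package): grid-approximate
# the posterior for a true effect mu, naive vs selection-adjusted.
import math

def npdf(x):   # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def ncdf(x):   # standard normal CDF
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

z_obs = 2.1                                   # just under p = 0.05
grid = [i / 100 for i in range(-500, 601)]    # grid over mu
prior = [npdf(mu / 2) / 2 for mu in grid]     # assumed N(0, 2^2) prior

# Naive posterior: ignore how the result came to be published
naive = [p * npdf(z_obs - mu) for p, mu in zip(prior, grid)]
# Adjusted: condition on publication, i.e. divide by P(z > 1.96 | mu)
adjusted = [p * npdf(z_obs - mu) / (1 - ncdf(1.96 - mu))
            for p, mu in zip(prior, grid)]

def post_mean(w):
    return sum(mu * wi for mu, wi in zip(grid, w)) / sum(w)

print(post_mean(naive))     # near 2.1, shrunk a little by the prior
print(post_mean(adjusted))  # far smaller: under total selection, one
                            # barely-significant study tells us very little
```

The striking thing is how much the adjusted posterior is dominated by the prior: that is exactly the point about barely-significant results under severe reporting bias.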

Leave a comment

Filed under Bayesian, R, Visualization

Complex systems reading

Tomorrow I’ll be giving a seminar in our faculty on inference in complex systems (like the health service, or social services, or local government, or society more generally). It’s the latest talk on this subject, which is now really gelling into something of a manifesto. Rick Hood and I intend to send off the paper version before Xmas, so I won’t say more about the substance of it here (and the slides are just a bunch of aide-memoire images), other than to list the references, which contain some of my favourite sources on data+science:


I deliberately omit the methodologically detailed papers from this list, but in the main you should look into Bayesian modelling, generalised coarsening, generalised instrumental variable models, structural equation models, and their various intersections.

Leave a comment

Filed under Bayesian, research

Frequentist accuracy of Bayesian estimates

Today I sat in on the Royal Statistical Society’s webinar discussing Brad Efron’s recent paper “Frequentist Accuracy of Bayesian Estimates”. Prof Efron introduced the paper, Andrew Gelman gave a response with three questions, and there were questions from the floor. It all worked rather well in terms of logistics, and I hope the RSS does more online broadcasts of events – or broadcasts of online events. It is quite easy to do this sort of activity now, and you don’t have to do a training course beforehand: it was Prof Efron’s first ever webinar. I think the trick is to make use of existing simple technology and not to attempt to reinvent the wheel, or feel so anxious about it that one hires an expensive agency to take care of the event. I base this on the very positive experience of Harvard Med School’s GCSRT and associated blended learning courses, which built a strong curriculum out of mostly free tools like Facebook, Dropbox and Google Hangouts. Not only do the students get an online learning environment, they learn how to create their own environment for collaboration.

I wanted to write this and post it quickly because the paper is available to all for free until 4 November 2015. The recording of the webinar is on the RSS journals webinar page, and will move to YouTube soon.

The point of this paper is that when we make some estimate and an interval around it by Bayesian methods, we can interpret the interval in a frequentist framework provided the prior is informative and informed by prior evidence. That is to say, given that prior, we could reasonably imagine the study and its analysis taking place repeatedly forever in identical circumstances. Our interval, whether we call it confidence or credible, would be calculated afresh each time and 95% of them would contain the true value.
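That claim is easy to check by simulation. Here is a minimal sketch (toy numbers of my own choosing): if nature really does redraw the parameter from the prior on each repetition, the 95% credible interval has 95% frequentist coverage. Model: theta ~ N(0, 1), one observation x ~ N(theta, 1), so the posterior is N(x/2, 1/2):

```python
# Coverage of Bayesian credible intervals when the prior is "true":
# theta ~ N(0,1), x ~ N(theta,1), posterior N(x/2, 1/2).
import random, math

random.seed(1)
reps, hits = 20000, 0
half_width = 1.96 * math.sqrt(0.5)    # 95% interval half-width
for _ in range(reps):
    theta = random.gauss(0, 1)        # nature redraws theta from the prior
    x = random.gauss(theta, 1)        # one observation
    if abs(theta - x / 2) < half_width:
        hits += 1
print(hits / reps)                    # close to 0.95
```

Run it with a fixed prior but a fixed true theta instead, and the coverage can be anything – which is exactly why the empirically-informed prior matters to the frequentist interpretation.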

Here comes a philosophical aside; you can omit it without loss of understanding, as they say. I confess some confusion regarding what that true value really is, because it seems to contain both a sample of data and a prior, but for a population the prior would cease to influence it. Is it a statistic summarising the posterior distribution, or the population? Does it matter, or am I splitting hairs? Furthermore, is it any more reasonable to imagine indefinite collection of data, influenced by an empirical prior, without the prior ever being updated? Perhaps the datasets have to be collected simultaneously without communication, but even if we admit the plausibility of this (remember that we cannot make inferences about one-off events like elections without bending the rules), we are faced with the problem of subjectivity lurking within.

If information is available but is just not known, does that make events ‘random’ in the frequentist sense that they can be described by probability distributions? Most old-school frequentists would say no, because to say yes means that when researcher A learns of researcher B’s data, researcher A’s inferences collapse like a Schrödinger wave function. What if A knows about B but nobody knows that A knows? It could be random for B but not for A: subjectivity, goddam it! Everybody for the lifeboats! This is not to mention the fact that the (human) subject of the research knows even before you ask them. On the other hand, to say no is to deny their own frequentist framework, because the very existence of such information means that experiments cannot be truly repeated, and so there is no such thing as probability (and they might be onto something there… but that’s another story).

This, friends, is why frequentism is a steaming pile of poop. But we will leave that aside, because nothing in statistics stands up to philosophical scrutiny except (I think!) personally subjective Bayes, and probably (boom boom!) Jaynes’s fuzzy logic. It doesn’t make sense but it is useful nonetheless.

So, on to the paper. If you don’t have an empirically informed prior (and not many of us use them), what would the meaning be of the frequentist properties of intervals? In such circumstances, we have to some extent made a subjective choice of prior. This immediately made me think of Leamer’s fragility analysis, in his paper “Let’s Take The Con Out Of Econometrics” (I’m grateful to Andrew Chesher for pointing me to this very entertaining old paper, something of a classic of econometrics but unknown to statistical me). Efron explores this question via the multivariate gradient of the log-probability of the data over values of its summary statistic(s). This gives standard errors via the delta method, or you can get a bootstrap too. There is a very nice motivating example where a summary of the CDF for a particular participant in a lasso regression predicting diabetes progression gets both a Bayesian and a frequentist CI, and the ratio of their widths is explored. It turns out that there is a formula for this ratio yielding a matrix whose eigenvalues give a range of possible values, and this sets bounds on the frequentist-Bayesian disagreement for any given parameter – potentially a useful diagnostic.
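As I read it (a hedged sketch of my own, not Efron’s code), the general accuracy formula says: the delta-method frequentist sd of a posterior summary is the posterior covariance between the statistic of interest and the x-gradient of the log-likelihood, scaled by the sampling variability of x. In a conjugate normal toy model we can check this against the known closed form. Assumed model: x ~ N(theta, sigma²), prior theta ~ N(0, tau²), statistic of interest E[theta | x]:

```python
# Sketch of the accuracy formula in a toy conjugate model. For
# x ~ N(theta, sigma^2), d/dx log f_theta(x) = (theta - x) / sigma^2.
# The closed-form frequentist sd of the posterior mean is
# sigma * tau^2 / (sigma^2 + tau^2) = 0.8 for these numbers.
import random

random.seed(42)
sigma, tau, x = 1.0, 2.0, 1.3
post_var = 1 / (1 / sigma**2 + 1 / tau**2)            # = 0.8
post_mean = post_var * x / sigma**2

# In practice these would be MCMC draws; here we can sample exactly.
draws = [random.gauss(post_mean, post_var**0.5) for _ in range(200_000)]
lam = [(th - x) / sigma**2 for th in draws]           # grad of log-likelihood

m_t = sum(draws) / len(draws)
m_l = sum(lam) / len(lam)
cov = sum((t - l0) * 0 + (t - m_t) * (l - m_l)
          for t, l0, l in zip(draws, draws, lam)) / len(draws)
freq_sd = cov * sigma         # sqrt(cov * V * cov) with V = sigma^2, scalar
print(freq_sd)                # close to the closed-form 0.8
```

The appeal is that the same recipe works from any set of posterior draws, whatever the model, without re-running anything.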

Prof Gelman liked the non-ideological nature of the paper (I do too, despite the rant above), but wondered what the practical conclusion was: to trust Bayes because it is close to frequentist under some specific conditions, or to do the comparison with theoretical simulation each time one does a Bayesian analysis (and to use empirical priors… well, that might not work very often). What if one used Weakly Informative Priors (WIPs), falling somewhere between informative and non-informative? Could inference be done on some measure of the width of the prior? Efron liked this idea.

Gelman also liked the “creative incoherence” between the method employed for the bootstrap, and the estimator it aims at. This is a bit like a confidence interval for difference of medians alongside a rank-based test, as discussed in these very pages a few months back. Why is coherence good? Why not be incoherent and see more than the two sides of the coin familiar from asymptotics: test and interval? Why not employ different methods to understand complementary answers about a Bayesian model?

Leave a comment

Filed under Bayesian

Responses to the BASP psychologists’ p-value ban: the one that got away

When StatsLife collected responses to the BASP p-value ban (see blogs hither and yon), I suggested they contact Ian Hunt, a wise and philosophically minded critical voice in the wilderness of cookbook analysts. I also know that he and I take rather divergent opinions on deduction and induction and such, but I hold his arguments in the highest respect because they are carefully constructed. Alas! he couldn’t send in a response in time, but here it is, reproduced with his kind permission:

How good are the reasons given by the editors of Basic and Applied Social Psychology (BASP) to ban hypothesis tests and p-values?

I argue that BASP’s reasoning for banning “statistical inference” is weak.

First, they (the editors) offer a non sequitur: ban p-values because “the state of the art remains uncertain” (2015 editorial) and “there exists no inferential statistical procedure that has elicited widespread agreement” (2014 editorial). I argue that inter-subjective (dis)agreement is not decisive.

Secondly, they imply that good inductive inferences require posterior probabilities. This is contentious, especially since both posteriors and p-values are just deductions.

Thirdly, they plead for larger sample sizes “because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem.” This is contradictory: the evidence for this claim is best shown by the statistical inferences being banned.

Fourthly, they correctly assume that with a large enough sample size many “significant effects” (or “null rejections” or low p-values or interesting things or whatever) can be identified by looking at canny descriptive statistics and adroitly drawn charts. But I believe p-values ARE descriptive statistics – with which both frequentists and Bayesians can work.

Finally, BASP “welcomes the submission of null effects”. But without tests and concomitant power profiles, the evidential value of a “null effect” is unclear.

BASP’s editors appear to conclude that modern statistics is inductive and akin to “the art of discovery” (as David Hand puts it). Fair enough. But I conclude that careful deductive inferences, in the form of hypothesis tests with clear premisses and verifiable mathematics, still have a role in discovering interesting things.

Now, it would be unfair of me to say anything more on this here but I believe you can hear Ian talking on this very subject at this year’s RSS conference, which is in Exeter. Personally, I’ve never been to Exeter, and I don’t think this is going to be the year for it either, but as Southern towns go, I suspect it’s neither as depressing as Weymouth nor as humourlessly overrated as Salisbury. (That counts as enthusiasm round here.) I recommend the conference to you. It’s just about optimal size and always interestingly diverse.

Leave a comment

Filed under Bayesian, learning

Introducing StataStan


I have been working on a Stata add-on command to fit Bayesian models using Stan, and this is now out for testing. In this post, I want to introduce it, explain why it’s important, and encourage you all to try it out and give your feedback. I have already used it ‘in anger’ in two substantive projects, predicting student achievement and diagnosing tuberculosis of the eye, and it works fine – at least on my computers.

Stata version 14 includes, for the first time, Bayesian commands which provide the Metropolis-Hastings algorithm and the Gibbs sampler. These are the procedures available in the popular packages BUGS and JAGS. However, there are many situations where they grind along very slowly. The most common cause is a lot of correlated parameters, for example in a Bayesian multilevel model where each cluster gets its own parameter. Broadly speaking, you can picture it like this: the Gibbs sampler tries out each parameter one at a time, conditional on the others, which is like trying to move diagonally across a grid of orthogonal streets – it will take a while to get to the other side. Sometimes you can rotate the grid (orthogonal parameterisation), but sometimes you can’t. What you really need is a different algorithm, which is where Hamiltonian Monte Carlo (HMC) comes in. Radford Neal’s chapter in the Handbook of Markov Chain Monte Carlo provides an excellent overview. Further improvements in efficiency were achieved with the No-U-Turn Sampler (NUTS), proposed by Hoffman and Gelman and explained in full in their 2014 paper. This was the starting point for Stan.
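You can see the street-grid problem in a few lines of Python (a toy of my own, not Stata’s or Stan’s code). Gibbs sampling a bivariate normal with correlation 0.99 produces a chain whose lag-1 autocorrelation is about rho² ≈ 0.98, so thousands of draws carry very little independent information:

```python
# Gibbs sampler on two highly correlated parameters: tiny diagonal steps.
# Target (illustrative): bivariate normal, unit variances, rho = 0.99.
import random

random.seed(7)
rho = 0.99
cond_sd = (1 - rho**2) ** 0.5            # sd of each full conditional
x, y = 0.0, 0.0
xs = []
for _ in range(5000):
    x = random.gauss(rho * y, cond_sd)   # draw x | y
    y = random.gauss(rho * x, cond_sd)   # draw y | x
    xs.append(x)

# Lag-1 autocorrelation of the x-chain: near rho^2 ~ 0.98, so the
# effective sample size is a tiny fraction of 5000.
m = sum(xs) / len(xs)
num = sum((a - m) * (b - m) for a, b in zip(xs[:-1], xs[1:]))
den = sum((a - m) ** 2 for a in xs)
print(num / den)
```

HMC proposals, by contrast, can move along the long diagonal axis of the target in a single step, which is the whole attraction.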

Stan is collaboratively-built, open-source software to run HMC, NUTS and more. HQ is at Columbia University, but the developers are based all over the world. Personally, I think it’s amazing, and they don’t pay me to say that. It is stable and superfast and tries to help you get the model code right, which is more than can be said for its Gibbs predecessors. There’s a very active and supportive online community on Google Groups. At heart, it’s a C++ library, but you don’t have to tangle with that stuff because there’s a command-line interface and interfaces for R, Python, Julia and Matlab. Now there is one for Stata too.

There is currently only one place where you can get StataStan: its GitHub repository, where you can get the files you need. It is under alpha-testing, meaning that we have not run it on every combination of Stata version, flavor and operating system, and need to be assured that there are no fundamental incompatibilities before we move to beta-testing. This is where you come in: please try it out and let us know if it’s OK or not, also what version and flavor of Stata you have (like “12/SE”) and what operating system you’re using, including version and 32/64-bit if using Windows.

When it goes to beta-testing I’ll add it to my website so you could download from there inside Stata, and we’ll put links on the Stan homepage. When it passes that, and all the important wishes have been incorporated, I’ll send it to the SSC repository. I will update this blog post as we pass each of those hurdles. The latest stable and under-development versions will always be on GitHub, so if you are registered with them, you can contribute to StataStan. Don’t be shy, there’s plenty to do.

Now, here’s a toy example of using it, where you have a model file called bernoulli.stan (this is contained in the examples folder when you install Stan) that contains this model:

data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);
  for (n in 1:N)
    y[n] ~ bernoulli(theta);
}
That means there are two parts to the data: N, the total number of observations, and y, a vector of integers of length N. Our model is that y arose from a Bernoulli process with probability parameter theta, and we are putting a flat prior on theta, anywhere from zero to one. That prior is specified as a beta distribution in the Stan example folder, but you could make it even more efficient with a uniform distribution (HMC is more forgiving of uniform priors than M-H/Gibbs). Then in Stata, you can make some silly data like this:

set obs 10
generate y=0
replace y=1 in 1/2

That’s basically two 1s and eight 0s. OK, now get ready to set the world alight by estimating the probability of success when you’ve just got 2 out of 10 in an experiment. StataStan will pass the variables you specify over to Stan, as well as global macros, so let’s put the total number of observations into a macro called N:

quietly count
global N=r(N)

Now we can call the stan command:

stan y, modelfile("bernoulli.stan") ///
cmd("/root/cmdstan/cmdstan-2.6.2") globals("N") load mode

The options for the stan command are explained at the top of the file, and in the GitHub ‘README’ file. Soon, they will move into a Stata help file and a pdf manual. In this case we say that we want to send the variable called y, we name the file that contains the model, point out the location of Stan (you’ll need to change this for your computer), say that we want to send the global macro called N to Stan, along with the variable y, that we want to load the chains of HMC steps back into Stata when it’s done, and that we want to find the posterior mode as well as the default mean and median.
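Because this toy model is conjugate, you can check Stan’s answer by hand: a Beta(1,1) prior with 2 successes out of 10 gives a Beta(3, 9) posterior. A quick sanity check (standard conjugate algebra, nothing StataStan-specific):

```python
# Conjugate check for the toy example: Beta(1,1) prior + 2/10 successes
# -> posterior Beta(3, 9).
a, b = 1 + 2, 1 + 8
post_mean = a / (a + b)            # 3/12 = 0.25
post_mode = (a - 1) / (a + b - 2)  # 2/10 = 0.20, the sample proportion
print(post_mean, post_mode)
```

If StataStan’s reported posterior mean and mode are not close to 0.25 and 0.20 on this example, something has gone wrong with the installation.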

It’s also possible to specify the model inline, which is to say inside your do-file, so you don’t have to mess around with Stata and a text editor open side by side. You can read about the different ways to achieve this in the README file on the GitHub page.

Note: on a Windows computer, you won’t see progress in the Stata output window; all the output will appear when it’s done. It’s a Windows thing; what can I say? Compiling even this toy example can take a couple of minutes, so don’t panic if you see nothing happening, but I’ve got a cunning plan to get around this and will add it to StataStan soon.


Filed under Bayesian, computing, Stan, Stata

Roman dataviz and inference in complex systems

I’m in Rome at the International Workshop on Computational Economics and Econometrics. I gave a seminar on Monday on the ever-popular subject of data visualization. Slides are here. In a few minutes, I’ll be speaking on Inference in Complex Systems, a topic of some interest from practical research experience my colleague Rick Hood and I have had in health and social care research.

Here’s a link to my handout for that: iwcee-handout

In essence, we draw on realist evaluation and mixed-methods research to emphasise understanding the complex system and how the intervention works inside it. Unsurprisingly for regular readers, I try to promote transparency around subjectivities, awareness of philosophy of science, and Bayesian methods.


Filed under Bayesian, healthcare, learning, R, research, Stata, Visualization

The rise of the imputers (maybe)

I thought this article in BMC Med Res Meth sounded interesting: “The rise of multiple imputation: a review of the reporting and implementation of the method in medical research“. I feel that we should know more about the adoption of analytical techniques, how they spread and are facilitated or blocked, but such reviews are almost unheard of. Unfortunately this one doesn’t answer a whole bunch of questions I have, largely because it just looks at two big-name medical journals in the years after software became widely available. I want to know what happens to get people using a new method (earlier in the history) and what gets printed all the way from top predators to bottom feeders.


Sadly, the crucial table 4, which lists the technical specifications of the methods used, is totally confusing. The half-page of minuscule footnotes (tl;dr) doesn’t help. All I can say is that not many of the papers used 100+ imputations, which is an interesting point, because it seems recently (see a paper by Ian White which I can’t remember right now) that Don Rubin’s original experiments with a handful of imputed datasets might not be enough in some circumstances, and also, with computing power on every desk (it’s not 1977 any more), there’s no reason not to pump it up. Yet the original advice lives on.
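As a reminder of where the number of imputations m enters, here are Rubin’s combining rules in a few lines of Python; the between-imputation variance carries a (1 + 1/m) penalty, which is what shrinks as you add imputations. The numbers below are made up purely for illustration:

```python
# Rubin's rules for pooling m imputed-data analyses of one coefficient.
def pool(estimates, variances):
    m = len(estimates)
    qbar = sum(estimates) / m                    # pooled point estimate
    w = sum(variances) / m                       # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between
    total = w + (1 + 1 / m) * b                  # Rubin's total variance
    return qbar, total

# Five (invented) imputed-data estimates and their variances:
est = [1.10, 0.95, 1.05, 1.20, 0.90]
var = [0.04, 0.05, 0.04, 0.05, 0.04]
qbar, t = pool(est, var)
print(round(qbar, 2), round(t, 3))
```

With m = 5 the penalty factor is 1.2; with m = 100 it is 1.01, which is the arithmetic behind the case for pumping it up.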

It would be interesting to look at tutorial papers, reports and websites for various professions, as these are often badly written by someone who has A Little Knowledge, and to see how these are referenced. I suspect a lot of people are going around referencing the original paper or the Rubin & Little book having never read either of them. Then someone should do some interviews with applied researchers who have published papers including MI. There you go, I’ve just given you a PhD topic. Off you go and do the thing.

Leave a comment

Filed under Bayesian, computing, healthcare