Introducing StataStan


I have been working on a Stata add-on command to fit Bayesian models using Stan, and this is now out for testing. In this post, I want to introduce it, explain why it’s important, and encourage you all to try it out and give your feedback. I have already used it ‘in anger’ in two substantive projects, predicting student achievement and diagnosing tuberculosis of the eye, and it works fine – at least on my computers.

Stata version 14 includes, for the first time, Bayesian commands, which provide the Metropolis-Hastings algorithm and the Gibbs sampler. These are the procedures available in the popular software BUGS and JAGS. However, there are many situations where they can grind along very slowly. The most common cause is a large number of correlated parameters, for example in a Bayesian multilevel model where each cluster gets its own parameter. Broadly speaking, you can picture it like this: the Gibbs sampler tries out each parameter one at a time, conditional on the others. This is like trying to move diagonally across a grid of orthogonal streets: it will take a while to get to the other side. Sometimes you can rotate the grid (orthogonal parameterisation), but sometimes you can’t. What you really need is a different algorithm, which is where Hamiltonian Monte Carlo (HMC) comes in. Radford Neal’s chapter in the Handbook of Markov Chain Monte Carlo provides an excellent overview. Further improvements in efficiency were achieved with the No-U-Turn Sampler (NUTS), proposed by Hoffman and Gelman and explained in full in their 2014 paper. This was the starting point for Stan.
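If you want to see that street grid in action, here’s a toy Stata sketch (the numbers are invented for illustration, and it has nothing to do with Stan itself): a Gibbs sampler for a bivariate normal with correlation 0.99, drawing each coordinate conditional on the other. Successive draws barely move:

* Toy Gibbs sampler: bivariate normal, correlation 0.99.
* Each full-conditional is normal, so we can draw it directly.
clear
set seed 2015
set obs 1000
local rho = 0.99
local csd = sqrt(1 - `rho'^2)    // conditional standard deviation
generate x = 0
generate y = 0
generate t = _n
forvalues i = 2/1000 {
    local j = `i' - 1
    quietly replace x = `rho'*y[`j'] + `csd'*rnormal() in `i'
    quietly replace y = `rho'*x[`i'] + `csd'*rnormal() in `i'
}
line x t if t <= 200    // the chain creeps along the diagonal

Run it a few times and see how far across the grid the chain actually gets in 1,000 draws: not very.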

Stan is collaboratively built, open-source software to run HMC, NUTS and more. HQ is at Columbia University, but the developers are based all over the world. Personally, I think it’s amazing, and they don’t pay me to say that. It is stable and superfast and tries to help you get the model code right, which is more than can be said for its Gibbs predecessors. There’s a very active and supportive online community on Google Groups. At heart, it’s a C++ library, but you don’t have to tangle with that stuff because there’s a command-line interface and interfaces for R, Python, Julia and Matlab. Now there is one for Stata too.

There is currently only one place where you can get StataStan: https://github.com/stan-dev/statastan/tree/alpha-test, where you will find the main stan.do file and a stan-example.do file. It is in alpha testing, meaning that we have not run it on every combination of Stata version, flavor and operating system, and we need to be sure there are no fundamental incompatibilities before we move to beta testing. This is where you come in: please try it out and let us know whether it works, along with what version and flavor of Stata you have (like “12/SE”) and what operating system you’re using, including the version and whether it’s 32- or 64-bit if you use Windows.

When it goes to beta testing I’ll add it to my website, so you can download it from there inside Stata, and we’ll put links on the Stan homepage. When it passes that, and all the important wishes have been incorporated, I’ll send it to the SSC repository. I will update this blog post as we pass each of those hurdles. The latest stable and under-development versions will always be on GitHub, so if you are registered there, you can contribute to StataStan. Don’t be shy, there’s plenty to do.

Now, here’s a toy example of using it. Suppose you have a model file called bernoulli.stan (this is contained in the examples folder when you install Stan), containing this model:

data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);
  for (n in 1:N)
    y[n] ~ bernoulli(theta);
}

That means there are two parts to the data: N, the total number of observations, and y, a vector of integers (each 0 or 1) of length N. Our model is that y arose from a Bernoulli process with probability parameter theta, and we are putting a flat prior on theta anywhere from zero to one. That prior is specified as a beta(1,1) distribution in the Stan example folder, but you could just as well write it as a uniform distribution (HMC is more forgiving of uniform priors than M-H/Gibbs). Then in Stata, you can make some silly data like this:

clear
set obs 10
generate y=0
replace y=1 in 1/2

That’s basically two 1s and eight 0s. OK, now get ready to set the world alight by estimating the probability of success when you’ve just got 2 out of 10 in an experiment. StataStan will pass the variables you specify over to Stan, as well as global macros, so let’s put the total number of observations into a macro called N:

quietly count
global N=r(N)

Now we can call the stan command:

stan y, modelfile("bernoulli.stan") ///          the model file shown above
    cmd("/root/cmdstan/cmdstan-2.6.2") ///       the location of CmdStan: change this for your computer
    globals("N") load mode

The options for the stan command are explained at the top of the stan.do file and in the GitHub README file; soon they will move into a Stata help file and a PDF manual. In this case, we send the variable y, name the file that contains the model, point to the location of CmdStan (you’ll need to change this path for your computer), send the global macro N along with y, ask for the chains of HMC draws to be loaded back into Stata when sampling is done, and ask for the posterior mode as well as the default mean and median.

It’s also possible to specify the model inline, which is to say inside your do-file, so you don’t have to mess around with Stata and a text editor open side-by-side. You can read about the different ways to achieve this in the stan-example.do file on the GitHub page.
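As a rough sketch of what that looks like (I’m assuming the inline and thisfile() options behave as documented in stan-example.do, and “my-analysis.do” stands for whatever your own do-file is called; check that file for the definitive syntax):

* Sketch only: see stan-example.do for the exact option names.
* The model sits in a comment block inside the do-file itself.
/*
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);
  for (n in 1:N)
    y[n] ~ bernoulli(theta);
}
*/
stan y, inline thisfile("my-analysis.do") globals("N") ///
    cmd("/root/cmdstan/cmdstan-2.6.2") load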

Note: on a Windows computer, you won’t see progress in the Stata output window; all the output will appear when it’s done. It’s a Windows thing; what can I say? Compiling even this toy example can take a couple of minutes, so don’t panic if you see nothing happening, but I’ve got a cunning plan to get around this and will add it to StataStan soon.


Visualizing HDI: a fine D3 exemplar

This interactive visualisation of Human Development Index values, by country and over time, was released last week.
[Image: visualizing-hdi]

For me, it follows in the mould of The State of Obesity, but is much more transparent in how it is constructed when you look at the source code. That makes it a good exemplar — in fact, perhaps the exemplar currently available — for introducing people to the possibility of making interactive dataviz for their research projects.

That transparency is important, not just for teaching about dataviz, but for the whole community making and innovating interactive data visualisation. Oh for those early days of D3, when nobody was terribly fluent with it, and websites would have all the code right there, easy to read and learn from. Now the code is tucked away in obscure links upon links, uglified and mashed up with other JS libraries, familiar to the maker but (probably) not to you. There are obvious commercial pressures to tuck the code away somewhere, and you can actually obtain software to obfuscate it deliberately. At the same time, having everything in one file, hard-coded for the task at hand, may be easy to learn from, but it isn’t good practice in any kind of coding culture, so if you want to be respected by your peers and land that next big job, you’d better tuck it all away in reusable, super-flexible in-house libraries.

And yet, the very availability of simple D3 code was what kick-started the current dataviz boom. Everyone could learn from everyone else really quickly because everything was open source. I don’t like to think that was a short-lived phase in the early part of the technology’s life cycle, but maybe it was…

Anyway, that’s enough wistful nostalgia (I learnt yesterday that I am the median age for British people, so I am trying not to sound like an old duffer). Here are the things I don’t like about it:

  1. It requires a very wide screen; there’s no responsiveness (remember your audience may not work in a web design studio with a screen as big as their beard).
  2. Life expectancy gets smoothed while the other variables don’t, which just looks a bit odd.
  3. Why have colors for continents? Doesn’t it distract from the shading? Don’t we already know which one is which?
  4. Why give up on South Sudan, Somalia (which seems to be bunged together with Somaliland and Puntland in one big “hey, it’s somewhere far away, they won’t notice” sort of way) and North Korea? Aren’t these countries’ estimates kind of important in the context, even if they are not very good? Do you really believe the Chinese estimates more than these just because they’re official?

But all in all, these are minor points and a nice amount of grit for the mill of students thinking about it. I commend it to you.


Roman dataviz and inference in complex systems

I’m in Rome at the International Workshop on Computational Economics and Econometrics. I gave a seminar on Monday on the ever-popular subject of data visualization. Slides are here. In a few minutes, I’ll be speaking on Inference in Complex Systems, a topic of some interest arising from practical research experience that my colleague Rick Hood and I have had in health and social care research.

Here’s a link to my handout for that: iwcee-handout

In essence, we draw on realist evaluation and mixed-methods research to emphasise understanding the complex system and how the intervention works inside it. Unsurprisingly for regular readers, I try to promote transparency around subjectivities, awareness of philosophy of science, and Bayesian methods.


Overpowered and underpowered chi-squared tests – or are they?

This is a very quick thought for teaching, in passing. People often talk about chi-squared tests being overpowered when n is large. It occurs to me that a good way to broach this concept intuitively is to point out that they are no different to t-tests and the like, but do not provide a meaningful point estimate. When you see that the mean difference in blood pressure from drug X is 0.3 mmHg, with p < 0.001, you know it is clinically meaningless. When you see χ² = 3.89, nobody knows what to think. So perhaps the best thing to do is to mention this alongside non-parametric rank-based procedures, when you explain that those don’t give you an estimate or confidence interval either.
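To make that concrete in class, here’s a quick Stata sketch with made-up numbers: with enormous samples, clinically trivial differences sail past p < 0.001 on a t-test and a chi-squared test alike, but only the t-test hands you an effect size to laugh at.

* Made-up numbers, purely for illustration.
* Two-sample t-test: a 0.3 mmHg difference, n = 100,000 per arm.
* Syntax: ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2
ttesti 100000 120.3 15 100000 120.0 15
* Highly significant, clinically meaningless -- and you can see why.

* Chi-squared on a 2x2 table: 50.3% vs 50.0% success,
* a million observations in each group.
tabi 503000 497000 \ 500000 500000, chi2
* Also "significant", but the chi2 statistic itself tells you
* nothing about how big (or trivially small) the difference is.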


We were there when they made Dear Data

 

Easy links: dear-data.com deardata-deliveries.tumblr.com

[Images dd001–dd006: photos from the Dear Data project]

Procedural notes that can be skipped:

I had previously intended to write something about the shapes employed by Giorgia Lupi and the Accurat studio – and indeed I still will. But that takes some time, and it got leapfrogged by Dear Data. This post came at a good time because I didn’t get around to it straight away (we’re now at week 35 of the project) and by the time I did, some other ideas had bubbled up in conversations, focussing my attention on the process of design, critique and refinement (which is getting added to my reading pile for the summer). These ideas are so alien to statisticians that I am not sure any of them will have read this far into this post, but they (we) are the ones that need to up their (our) game in communication. Nobody else will do it for us!

The other building block that came along in time was finally finding really nice writing paper and resolving to draft everything by hand from now on, preferably at a time when I’m physically away from a computer. It has already proven very productive. People seem to have different approaches that work (like starting with bullet points, or cutting out phrases, or mind maps), but mine is to start writing at sentence one, like Evelyn Waugh, and just carry straight on. There is no draft; why should there be? Finding that technique and place to write is really valuable; don’t devalue it by trying to squeeze it into a train journey or between phone calls. It’s the principal way in which you communicate your work, and probably the most overlooked.


On cryptic multiplicity (for the last time, surely!)

Since some psychologists with no statistical qualifications decided to clean up psychology by banning statistics, other journals and assorted rags are seeing an opportunity to publish something kinda scientific, kinda popular, kinda easy. Here are three that spectacularly miss the point. I don’t mind people being wrong on the internet, but when peer-reviewed journals tell you how to fix bad stats in peer-reviewed journals by using bad stats, it probably ought to be pointed out. Maybe it’s how Tom Lehrer felt on hearing of Kissinger’s Nobel Prize.

Exhibit one: “The extent and consequences of P-hacking in science”, written by biologists in PLoS Biology. They are examining overt multiplicity – running loads and loads of tests and reporting only the super funky ones – which is quite rare. Most people are agreed on this. The real problem is cryptic multiplicity, or “the garden of forking paths”. Get rid of the overt abusers and you would still have Ioannidis’s problem of everything ever published being absolute nonsense (I paraphrase).

Exhibit two: “P-values are just the tip of the iceberg”, written by biostatisticians in Nature. This is pretty good, but it falls into that medical habit of envisioning all hypotheses as terribly clear-cut (mean systolic blood pressure at 6 weeks, per protocol, differs between the drug and placebo groups, &c &c), and they really are not. If we could only sort out study designs and data cleaning, they argue, then we could resume using p-values with impunity, safe in the knowledge that they would lead us to the Universal Truth with alpha type 1 errors and beta type 2 errors. They name their own online course in statistics as the solution; I suppose there’s nothing wrong with that in a comment article, but we Brits feel a tad, you know, uncomfortable with the old self-promotion sort of thing (for a moment I imagined there would be a toll-free number to enrol in the footnotes, but there wasn’t; anyway, it’s $470, you can pay up front or as-you-go with Coursera, and they take all the usual credit cards). There is, tellingly, a diagram which shows how you collect your data, then clean it, then do some analyses and find what looks sexy, and then make a hypothesis, and then test it. Cool!

Spot the problem?

  • Yes? Well done. You don’t really need to read this. Get on with your work.
  • No? Scroll back up and read that garden o’ forking paths paper. Read it good.

Now, let’s be generous. I criticise Leek and Peng with some trepidation, because they know their onions. It could be that the comment piece was slashed about in the editing room by Nature’s staffers, and maybe the diagram was also the product of fevered interns. Heaven knows it wouldn’t be the first time.

Exhibit three: “What is medicine’s five sigma?”, in which the editor of the Lancet (a doctor, that most humble and self-effacing of professions) writes that p-values are awful, so the thing to do is to accept only papers with p < 0.0000003. This misses the point so profoundly it’s like treating an ingrown toenail by amputation. Of the other leg.

I’m not even going to bother you with the psychology magazines.

Conclusion: plenty of experience of critiquing papers and running analyses does not, it seems, guarantee an understanding of what a hypothesis test really does and how it goes wrong familywise, sciencewise, or whatever. The recommended solution is to read some philosophy of science. The late, great Peter Lipton is my homeboy on this one. We should pre-specify everything we can, and not allow those around us to get away with vague scientific hypotheses that map to many possible statistical hypotheses (forks in the path). But we must also remember that sometimes you can’t pre-specify everything, and that ultimately you are doing a sort-of-deductive process after an inductive guess at an interesting topic and before an inductive set of practical conclusions, so we should acknowledge this creativity and not brush it under the carpet.

By the way, my new book “Frog Sperm for Dummies” is published next week. Order it now on Amazon! (with apologies to Michael Healy)


UK election facts clarified with interactive graphics

I’ve been impressed with this website (constituencyexplorer.org.uk) put together by Jim Ridgway and colleagues at Durham, with input from the House of Commons Library and dataviz guru Alan Smith from the ONS. In part, it is aimed at members of parliament, so they can test their knowledge of facts about constituencies and learn more along the way. But it makes for a fun quiz for residents too. Everything is realised in D3, so it runs everywhere, even on your phone. There are a few features I really like: the clean design, the link between map, list and dotplot in the election results:

[Image: constituency-explorer-results]

… the animation after choosing a value with the slider, highlighting the extra/shortfall icons and the numbers dropping in: nice!

[Image: constituency-explorer-quiz]

… the simple but quite ambitious help pop-up:

[Image: constituency-explorer-prolegomena]

… and the way that the dotplots are always reset to the full width of the variable, so you can’t be misled by small differences appearing bigger than they are. The user has to choose to zoom after seeing the full picture.

All in all, a very nice piece of work. I must declare that I did contribute a few design suggestions in its latter stages of development, but I really take no credit for its overall style and functionality. Budding D3 scripters could learn a lot from the source code.

And while we’re on the topic, here are some more innovative electoral dataviz:

electoralcalculus.co.uk

bbc.co.uk/news/election-2015-32336071

And finally, take a moment to ask election candidates to commit to one afternoon of free statistical training, a great initiative from the RSS – and frankly, not much to ask. Unfortunately, none of my local (Croydon Central) would-be lawmakers have bothered to write back yet. But here are the parties most interested in accurate statistics, in descending order (by mashing up this and this):

  1. National Health Alliance: 4/13
  2. Pirate Party: 1/6
  3. Green Party: 47/568
  4. Labour Party: 51/647
  5. Plaid Cymru: 3/40
  6. Liberal Democrats: 47/647
  7. Ulster Unionist Party: 1/15
  8. Christian People’s Alliance: 1/17
  9. Conservative and Unionist Party: 27/647
  10. Scottish National Party: 2/59
  11. United Kingdom Independence Party: 15/624
