Category Archives: Stata

Introducing StataStan

I have been working on a Stata add-on command to fit Bayesian models using Stan, and this is now out for testing. In this post, I want to introduce it, explain why it’s important, and encourage you all to try it out and give your feedback. I have already used it ‘in anger’ in two substantive projects, predicting student achievement and diagnosing tuberculosis of the eye, and it works fine – at least on my computers.

Stata version 14 includes, for the first time, Bayesian commands that provide the Metropolis-Hastings algorithm and the Gibbs sampler. These are the procedures available in the popular software BUGS and JAGS. However, there are many situations where they can grind along very slowly. The most common cause is a lot of correlated parameters, for example in a Bayesian multilevel model where each cluster gets its own parameter. Broadly speaking, you can picture it like this: the Gibbs sampler tries out each parameter one at a time, conditional on the others. This is like trying to move diagonally across a grid of orthogonal streets: it will take a while to get to the other side. Sometimes you can rotate the grid (orthogonal parameterisation), but sometimes you can’t. What you really need is a different algorithm, which is where Hamiltonian Monte Carlo (HMC) comes in. Radford Neal’s chapter in the Handbook of Markov Chain Monte Carlo provides an excellent overview. Further improvements in efficiency were achieved with the No-U-Turn Sampler (NUTS), proposed by Hoffman and Gelman and explained in full in their 2014 paper. This was the starting point for Stan.
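If you want to see that street-grid problem in code, here is a minimal sketch (in Python purely for illustration; nothing to do with Stata or Stan internals): a Gibbs sampler on a bivariate normal, where the chain's lag-1 autocorrelation climbs towards 1 as the correlation between the two parameters does.

```python
import random
import math

def gibbs_bivariate_normal(rho, n_iter, seed=1):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each update draws one coordinate conditional on the other, so the
    chain moves in axis-aligned steps -- like walking the street grid."""
    random.seed(seed)
    sd = math.sqrt(1 - rho ** 2)  # conditional standard deviation
    x1, x2 = 0.0, 0.0
    chain = []
    for _ in range(n_iter):
        x1 = random.gauss(rho * x2, sd)  # draw x1 | x2
        x2 = random.gauss(rho * x1, sd)  # draw x2 | x1
        chain.append((x1, x2))
    return chain

def lag1_autocorr(xs):
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs)
    cov = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1))
    return cov / var

chain = gibbs_bivariate_normal(rho=0.99, n_iter=5000)
print(lag1_autocorr([x for x, _ in chain]))  # close to 1: slow mixing
```

With rho near 1 the successive draws barely move, which is exactly why you want HMC in that situation.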

Stan is collaboratively-built, open-source software to run HMC, NUTS and more. HQ is at Columbia University, but the developers are based all over the world. Personally, I think it’s amazing, and they don’t pay me to say that. It is stable and superfast and tries to help you get the model code right, which is more than can be said for its Gibbs predecessors. There’s a very active and supportive online community on Google Groups. At heart, it’s a C++ library, but you don’t have to tangle with that stuff because there’s a command-line interface and interfaces for R, Python, Julia and Matlab. Now there is one for Stata too.

There is currently only one place where you can get StataStan: https://github.com/stan-dev/statastan/tree/alpha-test, where you can get the main stan.do file and a stan-example.do file. It is under alpha-testing, meaning that we have not run it on every combination of Stata version, flavor and operating system, and need to be assured that there are no fundamental incompatibilities before we move to beta-testing. This is where you come in: please try it out and let us know whether it works, along with what version and flavor of Stata you have (like “12/SE”) and what operating system you’re using, including version and 32/64-bit if you’re on Windows.

When it goes to beta-testing I’ll add it to my website so you can download it from there inside Stata, and we’ll put links on the Stan homepage. When it passes that, and all the important wishes have been incorporated, I’ll send it to the SSC repository. I will update this blog post as we pass each of those hurdles. The latest stable and under-development versions will always be on GitHub, so if you have an account there, you can contribute to StataStan. Don’t be shy, there’s plenty to do.

Now, here’s a toy example of using it, where you have a model file called bernoulli.stan (you’ll find it in the examples folder when you install Stan) that contains this model:

data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);
  for (n in 1:N)
    y[n] ~ bernoulli(theta);
}

That means there are two parts to the data: N, the total number of observations, and y, a vector of integers of length N. Our model is that y arose from a Bernoulli process with probability parameter theta, and we are putting a flat prior on theta anywhere from zero to one. That prior is specified as a beta distribution in the Stan example, but you could make it even more efficient with a uniform distribution (HMC is more forgiving of uniform priors than M-H/Gibbs). Then in Stata, you can make some silly data like this:

clear
set obs 10
generate y=0
replace y=1 in 1/2

That’s basically two 1s and eight 0s. OK, now get ready to set the world alight by estimating the probability of success when you’ve just got 2 out of 10 in an experiment. StataStan will pass the variables you specify over to Stan, as well as global macros, so let’s put the total number of observations into a macro called N:

quietly count
global N=r(N)

Now we can call the stan command:

stan y, modelfile("bernoulli.stan") ///
cmd("/root/cmdstan/cmdstan-2.6.2") globals("N") load mode

The options for the stan command are explained at the top of the stan.do file, and in the GitHub ‘README’ file. Soon, they will move into a Stata help file and a pdf manual. In this case we say that we want to send the variable called y; we name the file that contains the model; we point out the location of Stan (you’ll need to change this for your computer); we ask for the global macro called N to be sent to Stan along with the variable y; we ask for the chains of HMC steps to be loaded back into Stata when it’s done; and we ask for the posterior mode as well as the default mean and median.
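Incidentally, this toy model is conjugate, so you can check what Stan gives you by hand: a Beta(1,1) prior updated with 2 successes and 8 failures yields a Beta(3,9) posterior. A quick check (in Python here, purely for illustration):

```python
# Conjugate update: Beta(1,1) prior + 2 successes, 8 failures -> Beta(3,9).
a, b = 1 + 2, 1 + 8
posterior_mean = a / (a + b)            # 3/12 = 0.25
posterior_mode = (a - 1) / (a + b - 2)  # 2/10 = 0.20
print(posterior_mean, posterior_mode)
```

Stan’s reported mean and mode should land close to 0.25 and 0.20 respectively, give or take Monte Carlo error.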

It’s also possible to specify the model inline, which is to say inside your do-file, so you don’t have to mess around with Stata and a text editor open side-by-side. You can read about the different ways to achieve this in the stan-example.do file on the GitHub page.

Note: on a Windows computer, you won’t see progress in the Stata output window; all the output will appear when it’s done. It’s a Windows thing; what can I say? Compiling even this toy example can take a couple of minutes, so don’t panic if you see nothing happening, but I’ve got a cunning plan to get around this and will add it to StataStan soon.

2 Comments

Filed under Bayesian, computing, Stan, Stata

Roman dataviz and inference in complex systems

I’m in Rome at the International Workshop on Computational Economics and Econometrics. I gave a seminar on Monday on the ever-popular subject of data visualization. Slides are here. In a few minutes, I’ll be speaking on Inference in Complex Systems, a topic of some interest from practical research experience my colleague Rick Hood and I have had in health and social care research.

Here’s a link to my handout for that: iwcee-handout

In essence, we draw on realist evaluation and mixed-methods research to emphasise understanding the complex system and how the intervention works inside it. Unsurprisingly for regular readers, I try to promote transparency around subjectivities, awareness of philosophy of science, and Bayesian methods.

4 Comments

Filed under Bayesian, healthcare, learning, R, research, Stata, Visualization

Extract data from Dementia Care Mapping spreadsheets into Stata

March 2017 edit: I removed chunks of code now that I’m freelancing – if you’re interested in an automated, fast, reliable way of getting LOTS of DCM data into one place, contact me! I can do it for you or provide software to do it. Stata and R are both options and then data can go into SPSS or whatever, or it can be a stand-alone executable program.

I’ve recently been involved in three projects using the Dementia Care Mapping data collection tool. This is a neat way of getting really detailed longitudinal observations on the activity, mood, interactions and wellbeing of people with dementia in group settings. The makers of the DCM provide a spreadsheet which looks like this:

Copied from the Bradford DCM manual. These are not real data (I hope!)

that is, there is a row of BCC behaviour codes and a row of ME quality of life codes for each person, and they are captured typically every five minutes. Date and time and other stuff that might be important are floating above the main body of the data. Subsequent worksheets provide descriptive tables and graphs, but we will ignore those as we want to extract data into Stata for more detailed analysis. (But let me, in passing, point out this work in Bradford, which is moving towards electronic collection.)

The good news is that Stata versions 12 and 13 have improved import commands, with specific support for Excel spreadsheets. You can specify that you want to take only particular cells, so that allows us to pull in the stuff at the top like date and time, and then go back and get the bulk of the data.
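The code I used is no longer posted (see the note above), but the two-pass idea is generic. Here is a sketch in Python on a made-up miniature of such a sheet (all names, cell positions and codes are hypothetical, and I’ve used CSV text to keep it self-contained): first pluck the metadata from fixed cells at the top, then go back for the data block and reshape it to long form.

```python
import csv
import io

# Hypothetical miniature of a DCM-style sheet: metadata floats above the
# data block, then one BCC row and one ME row per participant.
raw = """Home:,Rosewood,,Date:,2014-06-01
,,Start:,10:00,
Participant,Row,10:00,10:05,10:10
P1,BCC,A,B,A
P1,ME,+3,+1,+3
P2,BCC,F,F,N
P2,ME,+1,+1,-1
"""

rows = list(csv.reader(io.StringIO(raw)))

# Pass 1: pluck the metadata cells from their fixed positions at the top.
visit_date = rows[0][4]
start_time = rows[1][3]

# Pass 2: go back for the bulk of the data, reshaping to long form:
# one record per (participant, time) observation with its BCC and ME codes.
times = rows[2][2:]
long_data = []
for bcc_row, me_row in zip(rows[3::2], rows[4::2]):
    pid = bcc_row[0]
    for t, bcc, me in zip(times, bcc_row[2:], me_row[2:]):
        long_data.append({"id": pid, "time": t, "bcc": bcc, "me": me})

print(visit_date, len(long_data))  # 2014-06-01 6
```

In Stata the same two passes are done with two import excel calls, one restricted to the metadata cells and one to the data block; in Python you would point pandas’ read_excel at the real file instead of the CSV stand-in.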

In the most recent project, there were several care homes, and each was visited several times, and within each visit there was at least a before, during, and after spreadsheet. Thankfully, my colleague who did the collection had very consistently filed everything away so the file and folder structure was very consistent, and that is crucial if you want to automate the process of extracting and compiling the data.

1 Comment

Filed under Stata

Save your simulation study seeds

Here in the Northern hemisphere, gardeners are gathering seeds from their prize-winning vegetables and storing them away for next year’s crop. Today at the 20th London Stata Users’ Group meeting, I learnt a similar trick. It’s strange I never thought of it before; regular readers will know how keen I am on simulation studies and techniques. Sometimes you want to go back and investigate one particular simulated dataset, either because the results were particularly interesting or because your program didn’t behave the way you hoped, and you think there will be a clue on how to debug it in that one weird dataset.

Bill Gould’s sage advice was to gather the seed for the random number generator (RNG) at the beginning of each iteration and store it as a string. The seed initialises the RNG (which is never really random*, just unpredictable) and if you set it, you can reproduce the results. A basic example in Stata:

clear all

local iter=10
set obs 100
gen str64 myseed=""
gen slopes=.
gen x=rnormal(0,1)
gen mu=1+(2*x)
gen y=rnormal(mu,1)

forvalues i=1/`iter' {
 local myseed=c(seed)                  // capture the RNG state *before* this iteration's draws
 qui replace myseed="`myseed'" in `i'  // store it as a string in row `i'
 qui replace x=rnormal(0,1)
 qui replace mu=1+(2*x)
 qui replace y=rnormal(mu,1)
 qui regress y x
 matrix A=e(b)
 local beta=A[1,1]                     // fitted slope for this simulated dataset
 qui replace slopes=`beta' in `i'
}

local revisit=myseed[2]
set seed `revisit'
qui replace x=rnormal(0,1)
qui replace mu=1+(2*x)
qui replace y=rnormal(mu,1)
regress y x
dis "`revisit'" 

In this example, we run a linear regression 10 times and save the slope parameters and the seeds. We can then pick out simulation number 2 and recreate it without retracing any other steps. Nice! Here it is in R (but R RNGs are much more complex and I have to say that the help documentation is far from helpful):

set.seed(2014) # initialise the RNG so .Random.seed exists in a fresh session
iter<-10
size<-100
slopes<-rep(NA,iter)
seeds<-list(NULL)
for(i in 1:iter) {
 seeds[[i]]<-.Random.seed
 x<-rnorm(size,0,1)
 y<-rnorm(size,1+(2*x),1)
 slopes[i]<-lm(y~x)$coefficients[2]
}
.Random.seed<-seeds[[2]]
x<-rnorm(size,0,1)
y<-rnorm(size,1+(2*x),1)
lm(y~x)
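The same trick is easy in plain Python too, if that’s your thing: random.getstate() snapshots the full generator state and setstate() restores it. A minimal sketch (OLS slope computed by hand to keep it dependency-free):

```python
import random

random.seed(2014)
iters = 10
states, slopes = [], []
for i in range(iters):
    states.append(random.getstate())  # snapshot BEFORE this iteration's draws
    x = [random.gauss(0, 1) for _ in range(100)]
    y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]
    # ordinary least squares slope, by hand
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    slopes.append(b)

# Replay simulation 2 (index 1) alone, without redoing any other iteration:
random.setstate(states[1])
x = [random.gauss(0, 1) for _ in range(100)]
y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]
```

Refitting the slope on the replayed x and y reproduces slopes[1] exactly, because the generator is bit-for-bit back where it was.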

* – I’m glossing over the philosophical meaning of ‘random’ here

11 Comments

Filed under R, Stata

Meta-analysis methods when studies are not normally distributed

Yesterday I was reading Kontopantelis & Reeves’s 2010 paper “Performance of statistical methods for meta-analysis when true study effects are non-normally distributed: A simulation study“, which compares fixed-effects and a variety of random-effects models under the (entirely realistic) situation where the studies do not happen to be drawn from a normal distribution. In theory they would be, if they were just perturbed from a global mean by sampling error; that leads us to the fixed-effects model. The random-effects model says there’s other stuff making the inter-study variation even bigger, and the trouble is that by definition you don’t know what that ‘stuff’ is (or you would have modelled it – wouldn’t you???)

The random effects options they consider (with my irreverent nutshell descriptions in brackets; don’t write and tell me they are over-simplifications, that’s the point) are DerSimonian-Laird (pretend we know the intra-study SD), Biggerstaff-Tweedie (intra-study SDs are themselves drawn from an inverse-chi-squared distribution), Sidik-Jonkman (account for unknown global SD by drawing the means from a t-distribution), “Q-based” (test Cochran’s Q for heterogeneity, if significant, use D-L, if not, use FE), maximum likelihood for both mean and SD, profile likelihood for mean under unknown SD, and a permutation version of D-L which was proposed by Follmann & Proschan, and seeing as everyone has been immortalized in the meta-analysis Hall of Infamy, I’m going to do it to them too. All in all a pretty exhaustive list. They tested them out in 10,000 simulations with data from a variety of skewed and leptokurtic population distributions, and different numbers of studies.
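To make the ‘pretend we know the intra-study SD’ description concrete, here is the DerSimonian-Laird method-of-moments estimator sketched in Python (a bare-bones illustration; for real work use metaan, metafor and friends):

```python
def dersimonian_laird(effects, variances):
    """DerSimonian-Laird random-effects meta-analysis.

    effects:   per-study effect estimates
    variances: per-study sampling variances, treated as known --
               that's the 'pretend we know the intra-study SD' step."""
    k = len(effects)
    w = [1 / v for v in variances]  # fixed-effect (inverse-variance) weights
    s1 = sum(w)
    fe_mean = sum(wi * yi for wi, yi in zip(w, effects)) / s1
    # Cochran's Q, then the method-of-moments between-study variance,
    # truncated at zero:
    q = sum(wi * (yi - fe_mean) ** 2 for wi, yi in zip(w, effects))
    tau2 = max(0.0, (q - (k - 1)) / (s1 - sum(wi ** 2 for wi in w) / s1))
    # Re-weight with the between-study variance added in:
    w_re = [1 / (v + tau2) for v in variances]
    re_mean = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return re_mean, tau2

# Hypothetical toy data: five study effects with their sampling variances.
print(dersimonian_laird([0.1, 0.3, 0.35, 0.6, 1.2], [0.04, 0.05, 0.04, 0.06, 0.1]))
```

When the studies are perfectly homogeneous, Q falls below k-1, tau-squared truncates to zero and the whole thing collapses back to the fixed-effects answer, which is the ‘Q-based’ switcheroo in disguise.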

Conclusion one is that they are all pretty robust to all but the most bizarre deviations from normality. Having said that, the authors offer a graph dividing the world into D-L-optimal and profile-likelihood-optimal regions, which I found a bit odd because, firstly, DerSimonian-Laird is never true, it’s just an approximation; secondly, profile likelihood requires bespoke, painful programming each time; and thirdly, they just said it doesn’t matter. I rather like the look of Sidik-Jonkman in the tables of results, but that may be a cognitive bias in me that prefers old-skool solutions along the lines of “just log everything and do a t-test, it’ll be about right and then you can go home early” (a strange attitude in one who spends a lot of time doing Bayesian structural equation models). I also like Follmann-Proschan for their auto-correcting permutations, but if a non-permuting method can give me a decent answer, why bother?

Interestingly, the authors have provided all these methods in an Excel plug-in (I can’t recommend that, but on your head be it) and the Stata package metaan, which I shall be looking into next time I have to smash studies together. In R, you can get profile likelihood (and, I think, Biggerstaff-Tweedie) from the metaLik package, and maximum likelihood, Sidik-Jonkman and a REML estimator too from metafor. Simpler options are in rmeta and some more esoteric ones in meta. However, it still seems to me that the most important thing to do is to look at the individual study effects and try to work out what shape they follow and what factors in the study design and execution could have put them there. This could provide the reader with much richer information than just one mega-result (oops sorry, inadvertently strayed a little too close to Eysenck there) that sweeps the cause of heterogeneity under the carpet.

1 Comment

Filed under R, Stata

stata2leaflet v0.1 is released

Use Stata? Want to make an interactive online map with markers at various locations, colored according to some characteristic, with pop-up information when they’re clicked on? Easy. Head over to my website and download stata2leaflet. It’s in a kind of alpha testing version, so what I really want are your suggestions on making it even better. You’ll see my plans for v0.2 there too.

You have data like this:

s2l1

You type this:

stata2leaflet mlat mlong mlab, mcolorvar(mcol) replace nocomments ///
title("Here's my new map") ///
caption("Here's some more details")

A file appears which looks like this inside:

s2l2

And opens in your browser like this (static version because of WordPress.com restrictions – clickety click for the real McCoy):

stata2leaflet_example

Leave a comment

Filed under Stata, Visualization

Animated graphs hit the Stata blog

Chuck Huber of StataCorp (the voice behind those great YouTube videos) has just been blogging about animated graphs. He looks into using Camtasia software as well as my ffmpeg approach. And even if you’re not interested in making any such graph, go and look at some of his wonderful GIFs which would make great teaching tools, for example around power and sample size.

The more I use ffmpeg, the more I appreciate it. Working with video files is a real pain nowadays. There used to be more compatibility across software and operating systems and browsers, but now they all seem to be closing ranks. This is a good overview; although the terror of 2011 turned out to be a little overstated, the direction of travel is there and the HTML5 video tag remains flawed through lack of support from the software houses. Just today I’d been messing about moving video files from one computer to another in the vague hope that somewhere I would find the right combination of permissions that could open them, edit them and save them again. It was a struggle. The closest I got was the oldest OS I had on a laptop: XP (no, I’m not going to update it because the support ended yesterday! It was the last good one!). Then in the end I realised I could just do it all from the command line with ffmpeg. Plus you get to look like a badass hacker if anyone looks over your shoulder!

ffmpeg in action (compare and contrast with your favourite proprietary video software NOT in action). Borrowed from http://www.dedoimedo.com/computers/edit-multimedia-flash.html

Leave a comment

Filed under animation, Stata