Animated graphs hit the Stata blog

Chuck Huber of StataCorp (the voice behind those great YouTube videos) has just been blogging about animated graphs. He looks into using Camtasia software as well as my ffmpeg approach. And even if you’re not interested in making any such graph, go and look at some of his wonderful GIFs which would make great teaching tools, for example around power and sample size.

The more I use ffmpeg, the more I appreciate it. Working with video files is a real pain nowadays. There used to be more compatibility across software and operating systems and browsers, but now they all seem to be closing ranks. This is a good overview; although the terror of 2011 turned out to be a little overstated, the direction of travel is there and the HTML5 video tag remains flawed through lack of support from the software houses. Just today I’d been messing about moving video files from one computer to another in the vague hope that somewhere I would find the right combination of permissions that could open them, edit them and save them again. It was a struggle. The closest I got was the oldest OS I had on a laptop: XP (no, I’m not going to update it because the support ended yesterday! It was the last good one!). Then in the end I realised I could just do it all from the command line with ffmpeg. Plus you get to look like a badass hacker if anyone looks over your shoulder!

ffmpeg in action (compare and contrast with your favourite proprietary video software NOT in action). Borrowed from


Filed under animation, Stata

Look down the datascope

Maarten Lambrechts has a great post over at his blog. It’s all about interactive dataviz, regarding it as a datascope, that – like a telescope – lets you look deep into the data and see stuff you couldn’t otherwise. You must read it! But just to give you the punchline:

A good datascope

  1. unlocks a big amount of data
  2. for everyone
  3. with intuitive controls
  4. of which changes are immediately represented by changes in the visual output
  5. that respects the basic rules of good data visualization design
  6. and goes beyond what can be done with static images.

Maybe I should add a 7th rule: a facet or view of the datascope should be saveable and shareable.

Thanks to Diego Kuonen for sharing on Twitter


Filed under Visualization

Beeps and progress alerts to your phone

Recently I encountered an R package called pingr, made by Rasmus Bååth (the same guy who did MCMC in a web page, my visualization of 2013). You install it, you type ping(), and it goes ping. Nice.

Hear me now


In fact there are nine built-in pingr noises. It’s more useful than it may seem; I was using it within minutes of reading the blog post because I had a series of Bayesian models running on my laptop while I wrote some stuff on my desktop PC. When the models finished, they went ping, making everything as efficient as possible. It got me thinking about beeping alerts in all sorts of data analysis software.

In Stata, you can just type ‘beep’. Job done. In fact, that locates the system general alert sound (in Windows at least) and plays it. I spent some time extracting data from a primary care database recently, where there were several computers grinding through the big data for different researchers in a windowless room. Every now and then, a lion’s roar would emanate from one of them. I found it a bit disconcerting but played it cool until someone told me they had replaced the Windows alert beep with this .wav file for a laugh.

SPSS used to have sound alerts in the General Options menu, but they were quietly (?) dropped sometime around version 20. The pain was that the alert was either on, beeping every time some output was added, or off; there didn’t seem to be a syntax command for beeping. However, there is now one (STATS SOUND) in the extension commands package; it’s not clear whether one has to pay extra for it, and frankly, I’m not going to bother finding out.

When I’m able to glance at the computer regularly, perhaps because I’m eating what passes for lunch in Stats HQ, I particularly like R’s txtProgressBar with style=3. Stata users can easily display dots in a similar fashion, although it’s interesting to look online and see the alternative solutions, such as displaying progress in the window title, which could have advantages in some situations.
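For anyone who hasn’t tried it, here is a minimal sketch of that progress bar in base R (the Sys.sleep is just a stand-in for real work):

```r
# style = 3 gives a percentage alongside a full-width bar on one line
pb <- txtProgressBar(min = 0, max = 200, style = 3)
for (k in 1:200) {
  Sys.sleep(0.01)            # stand-in for the real computation
  setTxtProgressBar(pb, k)   # redraws the bar in place
}
close(pb)
```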

My latest long-running simulation made me try something quite different. I wanted progress reports but I was going to be in another room. If something went wrong, I would go back to the office and try to fix it. On my (Android) cellphone I have an app called Minutes. It’s a basic text editor that syncs very easily to Dropbox. So all I needed to do was have the stats software write periodically to a text file in the Minutes folder, and the update appears on my phone!


How 21st-century is that! This is how I’ve done it in R:

fileConn <- file("progress.txt", "w")  # in the Dropbox-synced Minutes folder
progress <- 0
for (k in 1:1000) {
	if (floor(k/100) > progress) {
		writeLines(paste("Now on iteration ", k, sep=""), fileConn)
		progress <- floor(k/100)
	}
	# more complicated stuff follows, and then ...

and in Stata:

local progress=0
forvalues i=1/1000 {
	if (floor(`i'/100)>`progress') {
		file open minutes using "progress.txt", write text replace
		file seek minutes tof
		file write minutes "Now on iteration `i'"
		file close minutes
		local progress=floor(`i'/100)
	}
	// some complicated time-consuming stuff...
Notice how the file is written to the drive each time you writeLines in R, even without closing the fileConn, but in Stata you have to close inside the if branch. Also, R will carry on trying to run commands after an error, so it’ll (probably) go ping, while Stata will stop and therefore you will hear no beep.

It will get a little more complicated to catch errors, but not much. If your program grinds to an unpleasant halt, your progress.txt file will just be stuck there on the last number, and it could be a while before you get suspicious and go to check. One simple solution is to write all your output to the progress.txt file, but this will slow things down if you can’t avoid (or don’t want to avoid) writing lots of lines to the output; this was the case for my simulation with rstan. You only want one special line written in case of an error that says

I'm afraid I can't do that, Dave.
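One way to get that one special line written is to wrap the long-running job in base R’s tryCatch. A minimal sketch (the function name and file path are just illustrative):

```r
# wrap the long-running job so any error is reported to the synced file
run_with_alert <- function(expr, file = "progress.txt") {
  tryCatch(expr, error = function(e) {
    writeLines(c("I'm afraid I can't do that, Dave.",
                 conditionMessage(e)), file)
  })
}
```

Thanks to lazy evaluation, run_with_alert(source("big_simulation.R")) would then leave the HAL quote, plus the actual error message, waiting on your phone.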

You could send an SMS too, if you prefer…


Filed under R, SPSS, Stata

Including trials reporting medians in meta-analysis

I’ve been thinking a lot about how best to include trials that report incomplete stats (or just not the stats you want) in a meta-analysis. This led me to a 2005 paper by Hozo, Djulbegovic & Hozo. It’s a worthwhile read for all meta-analysts. They set out estimators for the mean & variance given the median, range & sample size. The process by which they got these estimators was a cunning use of inequalities.
However, I was left wondering about uncertainty around the estimates. Because I’ve been taking a Bayesian approach, I really want a conditional distribution for the unknown stats given what we do know. There is one point where the authors try a little sensitivity analysis by varying the mean and standard deviation that came from their estimators, and they found a change in the pooled estimate from their exemplar meta-analysis that is too big to ignore. They do give upper and lower bounds, but that’s not the same thing.
Another interesting problem is that the exemplar meta-analysis seems to have some substantial reporting bias; the studies reporting medians get converted to smaller means than those that reported means. A fully Bayesian approach would allow you to incorporate some prior information about that.
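For reference, here is a sketch of the point estimators as they are usually quoted from the paper, given minimum a, median m, maximum b and sample size n. The function names are mine, and I have left out the small-sample corrections, so check the original before using this in anger:

```r
# Hozo et al. (2005) rough estimators of mean and SD
# from the median, range and sample size
hozo_mean <- function(a, m, b) (a + 2 * m + b) / 4
hozo_sd   <- function(a, b, n) {
  # rules of thumb: range/4 for moderate samples, range/6 for large ones
  if (n > 70) (b - a) / 6 else (b - a) / 4
}
```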


Filed under Bayesian

Data detective work: work out the numerator or denominator given a percentage

Here’s some fun I had today. If you are looking at some published stats and they tell you a percentage but not the numerator & denominator, you can still work them out. That’s to say, you can get your computer to grind through a lot of possible combinations and find which are compatible with the percentage. Usually you have some information about the range in which the numerator or denominator could lie. For example, I was looking at a paper which followed 63 people who had seen a nurse practitioner when they attended hospital, and the paper told me that 18.3% of those who responded had sought further healthcare. But not everyone had answered the question; we weren’t told how many but obviously it was less than or equal to 63. It didn’t take long to knock an R function together to find the compatible numerators given a range of possible denominators and the percentage, and later I did the opposite. Here they are:

# deducing numerator from percentage and range of possible denominators
whatnum <- function(denoms, target, dp) {
	nums <- rep(NA, length(denoms))
	for (i in 1:length(denoms)) {
		d <- denoms[i]
		# every candidate numerator from 0 to d that rounds to the target proportion
		compatible <- (0:d)[round((0:d)/d, digits=dp) == target]
		if (length(compatible) > 1) warning(paste("More than one numerator is compatible with denominator ", d, "; minima are returned", sep=""))
		if (length(compatible) > 0) nums[i] <- min(compatible)
	}
	return(nums)
}
# and the opposite
whatdenom <- function(nums, target, dp) {
	denoms <- rep(NA, length(nums))
	for (i in 1:length(nums)) {
		n <- nums[i]
		# candidate denominators from n up to the largest that could still round to target
		dcand <- max(n, 1):ceiling(n/(target - 0.5*10^(-dp)))
		compatible <- dcand[round(n/dcand, digits=dp) == target]
		if (length(compatible) > 1) warning(paste("More than one denominator is compatible with numerator ", n, "; minima are returned", sep=""))
		if (length(compatible) > 0) denoms[i] <- min(compatible)
	}
	return(denoms)
}

By typing in the percentage and the range of plausible denominators, I could find straight away that the only possibility was 11/60.
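As a standalone sanity check on that deduction (not using the functions above), you can brute-force every numerator/denominator pair up to the 63 attendees and see which ones round to 18.3%:

```r
# all n/d pairs with d up to 63 whose percentage rounds to 18.3
cands <- expand.grid(n = 0:63, d = 1:63)
hits <- cands[abs(round(100 * cands$n / cands$d, 1) - 18.3) < 1e-9, ]
hits  # a single row: n = 11, d = 60
```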
That particular paper also had a typo in table 4 ("995.3%") which meant it could be 99.5% or 99.3% or 95.3%. I could run each of those through and establish that it could only possibly have been 95.3%. Handy for those pesky papers that you want to stick in a meta-analysis but are missing the raw numbers!


Filed under R

More peas, dear?

As soon as I woke I knew it was going to be one of those days. The first words I heard were from the BBC: researchers had discovered that eating five portions of fruit and vegetables a day (optimistic already) was not enough – we must all eat seven. I was suspicious. I said as much to Mrs Grant as I stumbled towards the kitchen: “residual confounding, socio-economic status”. She ignored me.

Eventually I got round to printing the paper in JECH and read it on the train into town (to hear the wise and insightful Sir David Cox talk at the RSS), with increasing alarm. Every day I see bad stats, of course, but the press coverage for this one makes it potentially very harmful by putting people off even attempting any increased fruit/veg consumption. I’ve no doubt that fruit & veg is good, but I don’t believe there is any decent evidence for 5 or 7 or 10 portions. Now, to be fair, the paper itself expresses some caution. Regular readers will know what’s coming next. UCL’s press release spins it a little bit, mentioning the 7 portions quite a lot. And then the press picked it up and spun it a bit more, into killjoy-docs-say-eat-two-pounds-of-broccoli-or-face-certain-death. It’s kind of nobody’s fault but it went wrong anyway. Like the Iraq War.

Well, that’s my kind words of comfort for the authors of the paper. From here on, it’s going to hurt.

I think there are six major flaws that make this study close to totally uninformative:

  1. Residual confounding, particularly by socio-economic status. SES is measured in the data source, the Health Survey for England (HSE) as the “head of household” having a manual or non-manual job. That’s all there is, and to put that into a regression as a covariate and pretend that SES has been taken out of the equation is sheer nonsense. For me, there is a smoking gun: eating more frozen or tinned fruit & veg is associated with significantly higher hazard of death. That just doesn’t make sense unless it is actually a confounded association. It is so ludicrous that they should have stopped at that point and considered things very carefully.
  2. The fruit & veg consumption in HSE relates to the 24 hours prior to the survey. We know that will balance out over the population, and also that it is no worse than other self-reported measures, but it remains biased.
  3. Several subgroup analyses (but not an exhaustive list) appear and get repeatedly quoted, with little or no rationale whatever for their selection. In every subgroup analysis that appears in the paper, the effect is significant and stronger than in the whole dataset. This may be above board – I don’t know – but it looks very much like cherry-picking.
  4. A linear assumption of hazard ratio for one more portion is assumed at one point, without theoretical justification or reference to the data. Presumably that’s where the idea of ten portions came from; we can just extrapolate a linear trend off the end of the observed data. Eat enough veg and you live forever; eat enough frozen veg and you die immediately.
  5. These are people who have chosen, of their own accord, to eat an awful lot of fruit & veg, or at least say they do. That’s not the same as the UK population, encouraged and cajoled into eating X portions per day.
  6. Some variables are missing in as many as 62% of the participants. They are included as their own category, which we have known since Rubin (1976) is a very bad idea.

I don’t relish going out on a limb and attacking other people’s work that I’m not intimately acquainted with, but I do so here because it is potentially very harmful. If people are discouraged from even attempting to eat more veg because the bar has been set unrealistically high, that is a massive public health own-goal. Even if points 2-6 turn out to be fine, point 1 certainly isn’t, and that is the worst one.

Here is my own transcript of the BBC Radio 4 Today programme interview, and you will see the researcher is not entirely blameless in encouraging a certain bold interpretation of linear and universal benefit. Success and fame is a corrupting influence (or so I’m told).

JH: We’re used to being told that eating five portions of fruit and veg a day was good for us, and we should try to do it, although apparently two thirds of adults in this country don’t. Now researchers say that the benefits are even greater than we thought, and that eating seven or even more portions a day may have considerable benefits. They link high consumption of fruit and veg to longer life; it’s as simple as that. Lola Oyebode is the lead author of the research which is published today in the Journal of Epidemiology and Community Health, and she joins us now – good morning.
LO: Good morning.
JH: Now, tell us what you found, just putting it as simply as you can.
LO: We looked at the general population of England, and we grouped them by how many portions of fruit and vegetables they ate a day, so we looked at people who ate less than one, one to three, three to five, five to seven, and seven plus, and what we found was in each group, the more fruit and vegetables you ate, the better the benefit to your health, with the group who were eating seven or more portions a day having the lowest risk of mortality.
JH: Ah, well, lowest risk of mortality – but you seem to be suggesting there are wide benefits, that it’s a general prescription for good health.
LO: What we looked at was mortality and we looked at mortality from any cause, death from cancer and death from heart disease and stroke, so those were our outcomes.
JH: So, fruit and vegetables are enemies of the big killers?
LO: That’s right.
JH: Now, when you say seven portions, what do you mean? Seven carrots?
LO: The advice is that you have a variety of fruit and vegetables, so not to eat seven of the same sort.
JH: No…I’m not suggesting that most people would like to eat seven carrots!
LO: Well, actually, I would, but…
JH: Right, well, we’ll keep your personal habits out of it!
LO: A portion is about eighty grams, so that’s one large fruit, or a handful of smaller fruit or veg.
JH: Let’s just talk about the difference between fruit and vegetables. Is there any difference?
LO: Yes, what we found was that vegetables had a greater benefit than equivalent amounts of fruit, but we did still find that fruit gave significant benefit to health.
JH: What is it that causes this benefit to happen?
LO: Well, what we think is that the sugar content in fruit makes it not quite as good as vegetables, and that both fruit and vegetables have lots of micro-nutrients, which are important for the body to work properly, and also lots of fibre, which is good for health.
JH: So, in other words, what you’re saying is that if you want to increase your chances of living a long life rather than an artificially short life, if you cut down on red meat, fatty food, and all the rest of it, and increase your intake of fruit and vegetables, that’s the best thing you can do?
LO: Yes, that’s right.
JH: Does that sum it up?
LO: It does. Well, so we found that all additional portions of fruit and vegetables were of benefit, so even those eating one to three portions were doing significantly better than the people eating less than one portion. So, how ever many you’re eating now, eat more!
JH: And what about the age of the people, and the impact that it has? In other words, I mean if you are seventy, is it still worth increasing the amount of fruit and veg that you eat?
LO: Well, we included seventy year olds in our study – we looked at adults aged thirty five and over in the general population.
JH: So it’s true for everybody – get stuck in. Good news for greengrocers. Thank you very much indeed, Lola Oyebode.



Filed under Uncategorized

Kosara on stories vs worlds

Robert Kosara has written recently on his blog Eager Eyes about the tension or synergy between stories (showing the consumer of data what the message is, or leading them to the points of interest, or telling them a really compelling instance) and worlds (opening it up for exploration and leaving them to it). This is something I was reflecting on last week at the RSS (for which, by the way, videos are coming soon, hopefully in 2 weeks’ time). I read somewhere in The Politics of Large Numbers (and have never been able to find the page again – perhaps I dreamt it, in the same way I was convinced for a couple of years that Flavor Flav was dead before realising I dreamt that particular news broadcast) that a great debate raged through the innovative French statistical service set up after the Revolution. Some claimed that the role of the statistician was to present data without filtering or interpreting – or even summarising. Yes, some argued against percentages and averages; a very French sort of intellectual aggression!

Step up and show these people how to work out their own particular time that they find interesting

But the prolegomena makes or breaks a visualisation, as I raved about here recently. The best example might be Budget Forecasts, Compared With Reality, and while some bad ones come to mind, I don’t really want to single one out. I’m sure you have your own bugbear, or you can just visit wtfviz.

For me, there’s really no difference between introducing a visualization and introducing a table of stats, or a page of results text. Effective communication will involve some story and some world, and not just one-size-fits-all, as Gelman and Unwin pointed out. It’s interesting, though, that all this attention and investigation and debate goes on for graphics, while nobody pays any attention to what tables or words we should use to get people engaged with, understanding and remembering our/their stories. Where are the research studies comparing layouts of logistic regression coefficient tables in terms of comprehension and recall?


Filed under Visualization