Monthly Archives: April 2013

H7N9 bird flu in China

The paper published two days ago in the New England Journal of Medicine, which details the latest epidemiological information from this outbreak, has some really thoughtfully produced graphics. It also provokes some in-depth statistical pondering. It’s worth a look. I can’t reproduce the figures here without waiting for copyright permissions first, so I’ll just link you straight to the paper, and you can see them and the accompanying text for yourself.

Figure 1 seems to suggest that the first three provinces (Shanghai, Zhejiang and Jiangsu) to have more than an isolated case saw a similar rise then fall in the numbers. See those colored bars rise and then fall again? Maybe there is a localised outbreak, transmission for a few days, and then it dies out. Well, no, I don’t think so, although it’s tempting to infer a common history like that. Two reasons argue against it for me. One, the cases are surprisingly widespread geographically (see Figure 2). The distance from eastern Henan to Shanghai is 800 km, which is the same as Land’s End to Dumfries, or New York to Quebec City. Two, the stacking of the bars makes the ones on top look at a glance like they are rising even if it is just the bars underneath that are moving.
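The second point is easy to check with made-up numbers. In the sketch below (the counts are invented for illustration and are not the data from the paper), the upper series is flat every day, yet its top edge in a stacked bar chart rises and falls purely because the lower series moves.

```python
# Hypothetical daily case counts, invented for illustration only.
lower_series = [1, 3, 5, 3, 1]   # province drawn at the bottom of the stack
upper_series = [2, 2, 2, 2, 2]   # province drawn on top: constant every day


def stacked_top_edge(lower, upper):
    """Height at which the upper bars end when drawn on top of the lower bars."""
    return [lo + up for lo, up in zip(lower, upper)]


top_edge = stacked_top_edge(lower_series, upper_series)
print("upper series itself:", upper_series)
print("top edge of stack:  ", top_edge)
```

The eye tends to follow the top edge, not the height of each coloured segment, which is why the stacked layout can suggest a rise and fall that is not there.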

It seemed to me that there were a lot of places away from the coast with small numbers of cases where the patients are still alive. Now, this is very flawed because I should account for the days since symptoms appeared, and I don’t know that, but I made a Poisson Q-Q plot using the data from Figure 2. Shanghai looks quite different from the other locations:

Image

In fact, if you base the quantiles on the mean death risk from all the sites except Shanghai, they all lie along the line, which suggests they are Poisson-distributed but that something else is going on in Shanghai: either a higher death rate, or a lower proportion of the cases that survive and recover being captured. I don’t think it is that Shanghai started having cases first, so they have had longer in which to die (sorry to be morbid folks, it’s what I do for a living), because the median time from onset to death is 11 days (IQR 7-20) and we have cases going back to March in three provinces, while the bulk of Shanghai’s cases only really got going at the same time as everywhere else, 4 weeks ago.
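For anyone who wants to try a similar check, here is a rough sketch in Python. It is not the code behind the plots in this post, and the counts are invented placeholders, not the figures from the paper. It pools the death risk over every site except one, turns that into an expected number of deaths per site, and asks how surprising each observed count would be under a Poisson distribution. The site names and function names are assumptions made for the example.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam); returns 0 for k < 0."""
    if k < 0:
        return 0.0
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))


def upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - poisson_cdf(k - 1, lam)


def pooled_risk(cases, deaths, exclude):
    """Deaths per case, pooled over all sites except the excluded one."""
    kept = [site for site in cases if site != exclude]
    return sum(deaths[s] for s in kept) / sum(cases[s] for s in kept)


# Placeholder counts, invented for illustration (NOT the data from Figure 2).
cases = {"site_a": 10, "site_b": 20, "site_c": 10, "site_d": 30}
deaths = {"site_a": 1, "site_b": 2, "site_c": 1, "site_d": 12}

risk = pooled_risk(cases, deaths, exclude="site_d")
for site in cases:
    expected = risk * cases[site]
    p = upper_tail(deaths[site], expected)
    print(f"{site}: observed {deaths[site]}, expected {expected:.1f}, "
          f"P(X >= observed) = {p:.4f}")
```

A site with a very small tail probability is the one that would sit well off the line in a Q-Q plot. Like the plot itself, this ignores time since onset, so it is only a first look.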

Image

 

One more thing struck me: how much information we are given about the patients. We would never write all that potentially identifying information here. Is it all right if (a) the data come from a country where they are not so keen on anonymity in research, (b) if the future of humanity is at stake and a snippet of information in there could be the clue that saves us (at this stage, I can’t honestly tell you that my choice of words is entirely flippant), or (c) if they said it was all right? Discuss.

Leave a comment

Filed under Uncategorized

Fancharts in R

Guy Abel has done some nice work recreating David Spiegelhalter’s Bank of England fancharts in R. All the code is online here.

Image

Some degree of varying confidence

Did you numerate people chuckle at Chuck Hagel when he said today:

“…the US intelligence community assesses with some degree of varying confidence that the Syrian regime has used chemical weapons…”

Well, stop it. That sounds quite honest and accurate to me. When you assess an explanation based on some data, you have a whole range of alternative explanations at your disposal. You can choose one that is likely (though that depends on assumptions about the mechanisms that generate the data given the explanation in the first place, a.k.a. probability) or one that is lovely (Peter Lipton’s turn of phrase, not mine), although what is lovely to me as a simple, elegant explanation that also sorts out some long-standing puzzles elsewhere in our experience (yup, they did), might not be your choice of lovely (the CIA did it to justify an invasion). Ideally we would all agree on an explanation that is likely and lovely, but the world doesn’t seem to be like that.

I make certain assumptions I’m comfortable with, you make different ones. I test hypotheses I think are interesting and plausible, you test different ones. I am intrigued by a p=0.07 result and report it, you ignore it on principle. We are both using “frequentist” statistics yet we end up with different answers. And to make matters ‘worse’, whoever of us publishes their analysis first will inspire the other one to do something different tomorrow. So my confidence will not only be rather contingent but also varying.

Science is subjective. Get over it!

A couple of new online interactive maps

Two interesting new online mapping / geographical sites just spotted. Crash Stories is a crowd-sourcing website for data that just doesn’t exist elsewhere: traffic accident near-misses (only in NYC at present).

Crash Stories screen shot

Of course, it’s a tradeoff with data quality. I accidentally put down an accident before I’d learnt which button did what (I didn’t save it but maybe others have, judging by the near-misses in the middle of the East River).

Also just out is this analysis of how integrated (or not) cities in the USA are in terms of ethnic mix. Nice Google Spreadsheet bubble plot! I’ve only just discovered how easy it is to make one of those. And they kindly provide the embed code so I could add their maps into my blog – except that WordPress didn’t like it, so you have to click here instead…

Atlantic Cities drew my attention to both of these. You should all read Atlantic! It’s really good stuff.

Big Data Mining workshop 14-15 May 2013

Imperial College London are hosting a one-off workshop entitled “Big Data Mining”, under the auspices of EPSRC and the Royal Statistical Society Stat Comp section. This promises to bring together the academics who dream up new approaches and the commercial software producers who can make them come to life for the majority of researchers. It should be a great gathering.

Amanda Cox of the New York Times on the power of data visualization

Writing in the Harvard Business Review blog, Scott Berinato has interviewed top data visualizer Amanda Cox from the New York Times. Thanks to Nathan Yau for spotting this and posting it on Flowingdata.

Cox raises a couple of interesting points for me, points that are rarely made. Firstly, that statistics graduates leave university with almost no skills in creative problem solving and computing:

I come from a statistics background, and I’m finding statistics students’ portfolios are crazy weak compared to the computer science students, even though they’re playing with the same problems. I think it’s because comp sci students are encouraged to play, whereas stats majors it’s, “here’s your rule book, now make things.” I don’t think that’s the good model for making better visualization.

I think that’s absolutely true. I know because I am a stats (via maths) graduate myself, and everything I know about programming and visualization is self-taught in recent years. I mean no disrespect to my former teachers, it’s just that you can’t cover everything in the time available and the accepted norm is to teach the rule book. For the great majority of my fellow students, that’s exactly what they wanted: practical data analysis. But if you want to be able to do cutting-edge analyses, or create cutting-edge visualizations, you need different skills which are all about playing around with computers.

Secondly, that the very recent acceleration in the evolution of online interactive visualizations is in no small part down to the sharing of the nuts and bolts of how they are made. This is in part through collaborative sites like Github, but also importantly because JavaScript-based websites have everything up front: you can save the HTML and .js files containing the data and the instructions to visualize it, then examine and learn from them at your own leisure. 

Then some of the more tech competent people starting using D3 javascript and now we’re having fun with data again. In some ways it feels of the web in the way that the Flash stuff never did. Now, when someone does something interesting, how they did it is really just sitting out there on the Internet, so you get this great sharing and building off of each other.

In fact, I’m working on a little example to post here soon, on how you can access the data behind such a visualization and play around with it to make your own version.
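Until then, the basic idea can be sketched in a few lines of Python. This assumes the saved .js file assigns its data to a variable as a literal that happens to be valid JSON, which is common but by no means guaranteed; the variable name `data` and the sample script are hypothetical.

```python
import json
import re


def extract_json_variable(js_source, var_name):
    """Return the value assigned to var_name, assuming it is written as valid JSON."""
    pattern = re.compile(r"(?:var|let|const)\s+" + re.escape(var_name) + r"\s*=\s*")
    match = pattern.search(js_source)
    if match is None:
        raise ValueError(f"no assignment to {var_name!r} found")
    # raw_decode parses one JSON value starting at the given position and
    # ignores whatever JavaScript follows it.
    value, _end = json.JSONDecoder().raw_decode(js_source, match.end())
    return value


# Hypothetical contents of a saved script, invented for illustration.
sample_js = """
var margin = 20;
var data = [{"year": 2010, "value": 3.5}, {"year": 2011, "value": 4.0}];
draw(data);
"""

rows = extract_json_variable(sample_js, "data")
for row in rows:
    print(row["year"], row["value"])
```

If the script uses JavaScript object syntax that is not valid JSON (unquoted keys, single quotes, trailing commas), this will raise an error and a proper JavaScript parser would be needed instead.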

A great example of careful causal analysis in observational data: restless legs and depression

Last year, Li and colleagues at Harvard brought out a paper looking into restless legs syndrome and the risk of depression (Am J Epidemiol 2012;176(4):279–288). This paper really caught my eye because it is one of those associations that we see in medical research but where it is incredibly hard to demonstrate causality. It turned out to be a really nice example of the many hurdles one has to overcome in retrospective analysis of observational data.

They used data from the Nurses’ Health Study to follow an impressive 56,399 women over 6 years. There had been a number of cross-sectional and case-control studies before but nothing prospective.

One major challenge in looking at retrospective data like this is defining when someone becomes at risk and when an endpoint of interest really occurs. It is quite straightforward in a hospital setting but in the community you are reliant on the participants’ own perceptions of their health, and their attitudes to seeking help. The first task was to exclude women with pre-existing (prevalent) depression, which took out 11% of the total 2002 study participants. A small number reported a diagnosis of depression at the next round of questionnaires but weren’t sure when the symptoms had begun, so they got excluded too. So far, so good. The only exclusion that worries me is that they took out 9% of the 2002 total participants because they had incomplete data on depression and medication, and that seems a high proportion to me. The question that comes to mind immediately is whether these 7000+ women are different in any way from those who did provide complete data. If they are, this could be a problem of non-ignorable missingness. We are also told that participants were regarded as having depression if they had a clinical diagnosis and were regular users of antidepressants. That seems quite stringent to me and I would have liked to have seen a sensitivity analysis allowing either one of those through and reporting the results as an upper bound on the incidence of depression. (You have to remember that not everyone will seek or get a diagnosis, yet not everyone taking antidepressants is doing so for depression; some are used for neuropathic pain etc.)
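A sensitivity analysis over the definition of depression is simple to set up. Here is a minimal sketch, assuming each participant is reduced to two yes/no flags (the records are invented, and this is not the authors' code): count depression under the strict definition (diagnosis and antidepressants) and under the lenient one (either), and report the second as an upper bound.

```python
def incidence(records, rule):
    """Proportion of participants classed as depressed under the given rule."""
    hits = sum(1 for diagnosed, medicated in records if rule(diagnosed, medicated))
    return hits / len(records)


def strict(diagnosed, medicated):
    """Definition as described for the paper: diagnosis AND antidepressant use."""
    return diagnosed and medicated


def lenient(diagnosed, medicated):
    """Upper-bound definition: diagnosis OR antidepressant use."""
    return diagnosed or medicated


# (clinical diagnosis, regular antidepressant use) for hypothetical participants.
records = [
    (True, True), (True, False), (False, True), (False, False), (False, False),
    (True, True), (False, False), (False, False), (False, True), (False, False),
]

strict_incidence = incidence(records, strict)
lenient_incidence = incidence(records, lenient)
print(f"strict definition:  {strict_incidence:.2f}")
print(f"lenient definition: {lenient_incidence:.2f}")
```

The gap between the two figures shows how much the headline result could depend on the choice of definition. The real analysis would of course re-run the full model under each one, not just compare proportions.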

Next, the most powerful and widely used tool we have to remove confounding is the multiple regression model. Remember that the aim is to obtain an estimate of a causal effect between restless legs and depression, so we have to control for other factors which we know or believe can cause depression, and which are correlated (causation not required) with restless legs. This is the classic epidemiological definition of a confounder and it is frequently forgotten by analysts who stick anything and everything into the regression model. This paper uses one of those kitchen sink models, unfortunately. Most of the factors that go into the model are sensible but I don’t see how smoking in itself can cause depression or how serum cholesterol is correlated with restless legs.
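To see what adjusting for a genuine confounder does, here is a small sketch with invented counts. It uses stratification with a Mantel-Haenszel risk ratio in place of the multiple regression used in the paper, simply because it fits in a few lines. The stratum labels and all the numbers are assumptions made for illustration.

```python
# Each stratum: (exposed cases, exposed total, unexposed cases, unexposed total).
# Counts are invented so that the confounder is linked to both exposure and outcome.
strata = {
    "confounder present": (40, 80, 10, 20),
    "confounder absent": (2, 20, 8, 80),
}


def crude_risk_ratio(strata):
    """Risk ratio ignoring the confounder (all strata collapsed into one table)."""
    a = sum(s[0] for s in strata.values())
    n1 = sum(s[1] for s in strata.values())
    c = sum(s[2] for s in strata.values())
    n0 = sum(s[3] for s in strata.values())
    return (a / n1) / (c / n0)


def mantel_haenszel_risk_ratio(strata):
    """Risk ratio pooled across strata using Mantel-Haenszel weights."""
    numerator = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata.values())
    denominator = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata.values())
    return numerator / denominator


crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)
print(f"crude risk ratio:    {crude:.2f}")
print(f"adjusted risk ratio: {adjusted:.2f}")
```

In these made-up data the exposure has no effect within either stratum, yet the crude ratio is well above 1 because the exposed group is concentrated in the high-risk stratum. Adjustment removes that, but only because the stratifying variable meets the definition of a confounder.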

There are some more problems with the predictors in the models. Iron deficiency was not measured in the study (we shouldn’t blame the authors for this) and is replaced with the proxy measure of taking iron supplements (but this is questionable). As surely as we need to identify confounders by the classic definition, we must also bear in mind that adjusting for things that don’t fit that definition can make matters worse and not better (the key text here is Greenland et al 1991 Am J Epidemiol).

Another problem is that sleep problems were only captured once in the study, in 2002, and by adding this into the model as a single predictor the authors were essentially assuming that it never changes. This is arguably acceptable as there is no other good alternative with the data, but sleep problems are so central to the relationship between restless legs and depression that some people might argue it’s best to pack up and go home. I would suggest that doing some analysis with this imperfect but large dataset is laudable and potentially informative, although it would be nice to have more sensitivity analysis.

One of the measures used to capture depression (CESD-10) contained a question about restless sleep, which clearly is going to pick up restless legs with or without depression and might inflate the relationship between these conditions. The authors have taken a pragmatic and sensible approach to this by removing that question and otherwise recalculating the CESD in the same way as before. This is reasonable to my mind because any scale is just a rough guide anyway, but I’m sure there will be plenty of scale fiends out there who get upset whenever anyone tinkers with a scale which has undergone the ritualistic initiation of Cronbach’s alpha, factor analysis &c &c.

The sensitivity analyses that do appear in the results include a cautious exclusion of any women who were diagnosed with depression in the first three years of the study period. If they already had some symptoms at the time of the 2002 sleep questions, then including them will inflate the relationship between restless legs and depression. It will be as though none of them had any symptoms and then they all acquired them quite quickly. Whether this really is a problem depends on your interpretation of the study. If you want to get causal effects of restless legs on depression symptoms, exclusion is probably wise, but if you are interested in the diagnosis, it doesn’t make sense. This cohort has both forms of outcome data on depression but often you don’t have the choice, especially for less common conditions. Context always determines the appropriateness of the analytical choices, although most people are taught that there is one correct method for every problem and it leads to the unequivocal truth. Real-life research is messier and more complex than that, although from my point of view it is also more interesting.
