My odds ratios have gone weird

Earlier today, a colleague came into my office to ask my advice about a logistic regression he was carrying out. One of the odds ratios for a covariate was very close to 1, so a tiny effect, yet highly significant, and this was puzzling him. After a bit of questioning, we worked out it was a computational problem, and got his analysis fixed up and back on the road. Here’s how the thought process went:

  1. Is the dataset absolutely massive? With huge numbers, even small, unimportant effects can appear to be significant. It’s a bit like being over-powered, and it rather goes to show how hypothesis tests are not the be-all-and-end-all that some textbooks and teachers would have you believe. You have to stop and think about what it really means.
  2. Are there also some really huge odds ratios? The answer was yes, there was one predictor with OR=25,000 “so we decided to ignore that one”. Hmmm. OK, that’s pretty diagnostic of non-convergence. Logistic regressions (and Poisson, negative binomial, and other GLMs, Cox regression, etc etc) are not like linear regression: it’s not a case of solving the equations and getting the answer. The computer has to iteratively zoom in on the most likely set of odds ratios. Sometimes it goes wrong! Not often, but sometimes.
  3. Is there a predictor which is the same for almost everybody? It turned out this was true, in fact it was the one with OR=25,000. Well, that sounds like the culprit. If you’re trying to guess the effect that something has on the outcome, and it hardly varies over the whole dataset, then you have very little information on which to base that effect size. One odds ratio is pretty much as likely as any other, and the iterative algorithm that does the regression can head off into some pretty weird places, because it doesn’t have enough information to know that it’s not getting closer to the answer. By the time it gets to 25,000, the landscape is completely featureless and flat; it’s a long way back to the maximum likelihood point and your computer is essentially not in Kansas any more.
  4. That uninformative predictor is not going to help us much anyway, so let’s take it out and see what happens. When he did this, my colleague’s regression model came out quickly with a sensible looking answer. So it sounds like that uninformative predictor was the problem. Hopefully, most of the time your software will spot the problem and change tack, but sometimes it gets fooled, especially if you have lots of predictors and some don’t vary much. Some software allows you to change the algorithm options, and that might help. Doing a completely different analysis like a stratified Mantel-Haenszel odds ratio, or a Bayesian model by MCMC, might help, but if your data is defective, there’s not much you can do. When you write it up, you should report what happened. There’s no shame in finding a problem and fixing it!
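
To make steps 2 to 4 concrete, here is a minimal sketch in Python of an iterative fit wandering off. It is not my colleague's data or software: the counts are invented, the function name fit_logistic_newton is made up for illustration, and I am assuming the nearly-constant predictor happens to take its rare value only among people who had the outcome (one common way this goes wrong, though not the only one). The fit is plain Newton-Raphson on a model with an intercept and one binary predictor.

```python
import math

def fit_logistic_newton(groups, n_iter=25):
    """Newton-Raphson for logit(p) = b0 + b1*x on grouped binary data.

    groups: list of (x, n_people, n_events) tuples.
    Returns a list of (b0, b1, log_likelihood), one entry per iteration.
    """
    b0, b1 = 0.0, 0.0
    history = []
    for _ in range(n_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix (minus the Hessian)
        for x, n, y in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = y - n * p
            w = n * p * (1.0 - p)
            g0 += resid
            g1 += x * resid
            h00 += w
            h01 += x * w
            h11 += x * x * w
        # Newton step: solve the 2x2 system by hand
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
        loglik = sum(y * (b0 + b1 * x) - n * math.log1p(math.exp(b0 + b1 * x))
                     for x, n, y in groups)
        history.append((b0, b1, loglik))
    return history

# Invented data: 1000 people, only 5 of them have x = 1.
# In 'separated', all 5 had the outcome, so nothing pins the odds ratio down.
separated = [(0, 995, 200), (1, 5, 5)]
# In 'well_behaved', one of the 5 did not, and the fit settles.
well_behaved = [(0, 995, 200), (1, 5, 4)]

history_bad = fit_logistic_newton(separated)
history_ok = fit_logistic_newton(well_behaved)

for i in (0, 4, 9, 24):
    b0, b1, ll = history_bad[i]
    print(f"iteration {i + 1:2d}: OR = {math.exp(b1):.3g}, log-likelihood = {ll:.8f}")

print("well-behaved OR:", round(math.exp(history_ok[-1][1]), 3))
```

In the first dataset the log odds ratio keeps climbing at every iteration while the log-likelihood hardly changes, which is the flat landscape described in step 3. Software that stops when the log-likelihood stops improving will report whatever enormous odds ratio it had reached at that point. Changing a single person in the second dataset is enough to give a finite answer.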

Filed under Uncategorized

Residual confounding talk at RSS 2013 conference

I am speaking today at the first UK Causal Inference meeting in Manchester on my multiple imputation approach to residual confounding. I learnt yesterday that I’ll also be presenting this method at the RSS conference in Newcastle in September, which will be exciting. And I am hoping to present software for it at the London Stata users’ group, also in September. If you are interested but can’t make any of those dates, email me and I’ll tell you more. The paper is out to review at present…


H7N9 bird flu in China

The paper out two days ago in the New England Journal of Medicine that details the latest epidemiological information from this outbreak has some really thoughtfully produced graphics. It also provokes some in-depth statistical pondering. It’s worth a look. I can’t reproduce the figures here without waiting for copyright permissions first, so I’ll just link you straight to the paper thus, and you can see them and the accompanying text for yourself.

Figure 1 seems to suggest that the first three provinces (Shanghai, Zhejiang and Jiangsu) to have more than an isolated case saw a similar rise then fall in the numbers. See those colored bars rise and then fall again? Maybe there is a localised outbreak, transmission for a few days, and then it dies out. Well, no, I don’t think so, although it’s tempting to infer a common history like that. Two reasons argue against it for me. One, the cases are surprisingly widespread geographically (see Figure 2). The distance from eastern Henan to Shanghai is 800 km, which is the same as Land’s End to Dumfries, or New York to Quebec City. Two, the stacking of the bars makes the ones on top look at a glance like they are rising even if it is just the bars underneath that are moving.

It seemed to me that there were a lot of small numbers of cases away from the coast where the patient is still alive. Now, this is very flawed because I should include the days since symptoms appeared, and I don’t know that, but I made a Poisson Q-Q plot using the data from Figure 2. Shanghai looks quite different to the other locations:

[Image: Poisson Q-Q plot]

In fact, if you base the quantiles on the mean death risk from all the sites except Shanghai, they all lie along the line, which suggests they are Poisson-distributed but that something else is going on in Shanghai: either a higher death rate, or a lower proportion of the cases that survive and recover being captured. I don’t think it is that Shanghai started having cases first, so they have had longer in which to die (sorry to be morbid folks, it’s what I do for a living), because the median time from onset to death is 11 days (IQR 7-20) and we have cases going back to March in three provinces, while the bulk of Shanghai’s cases only really got going at the same time as everywhere else, 4 weeks ago.
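
For readers who want to try this sort of check themselves, here is a rough sketch in Python. It is not the Q-Q plot above and it does not use the figures from the paper: the site names other than Shanghai and all the counts are invented placeholders. It pools the death risk over every site except one reference site, works out the deaths each site would expect under that pooled risk, and asks how surprising the observed count is if deaths are Poisson with that mean.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed term by term."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return min(total, 1.0)

def upper_tail(k, mu):
    """P(X >= k) for X ~ Poisson(mu)."""
    if k <= 0:
        return 1.0
    return 1.0 - poisson_cdf(k - 1, mu)

# Hypothetical (cases, deaths) per site; NOT the numbers from the paper.
sites = {
    "Shanghai": (30, 12),
    "Site A": (40, 6),
    "Site B": (25, 4),
    "Site C": (10, 1),
    "Site D": (5, 1),
}
reference = "Shanghai"

# Pooled death risk from every site except the reference one
pooled_cases = sum(c for name, (c, d) in sites.items() if name != reference)
pooled_deaths = sum(d for name, (c, d) in sites.items() if name != reference)
risk = pooled_deaths / pooled_cases

# For each site: expected deaths under the pooled risk, and P(X >= observed)
results = {}
for name, (cases, deaths) in sites.items():
    expected = cases * risk
    results[name] = (expected, upper_tail(deaths, expected))
    print(f"{name:9s} observed {deaths:2d}, expected {expected:5.2f}, "
          f"P(X >= observed) = {results[name][1]:.4f}")
```

With these made-up counts, the reference site stands out with a small tail probability while the others look unremarkable. Like the plot, this ignores the time since symptoms appeared, so the same caveat applies.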

Image

One more thing struck me: how much information we are given about the patients. We would never write all that potentially identifying information here. Is it all right if (a) the data come from a country where they are not so keen on anonymity in research, (b) the future of humanity is at stake and a snippet of information in there could be the clue that saves us (at this stage, I can’t honestly tell you that my choice of words is entirely flippant), or (c) they said it was all right? Discuss.


Fancharts in R

Guy Abel has done some nice work recreating David Spiegelhalter’s Bank of England fancharts in R. All the code is online here

Image


Some degree of varying confidence

Did you numerate people chuckle at Chuck Hagel when he said today:

“..the US intelligence community assesses with some degree of varying confidence that the Syrian regime has used chemical weapons…”

Well, stop it. That sounds quite honest and accurate to me. When you assess an explanation based on some data, you have a whole range of alternative explanations at your disposal. You can choose one that is likely (though that depends on assumptions about the mechanisms that generate the data given the explanation in the first place, a.k.a. probability) or one that is lovely (Peter Lipton’s turn of phrase, not mine), although what is lovely to me as a simple, elegant explanation that also sorts out some long-standing puzzles elsewhere in our experience (yup, they did), might not be your choice of lovely (the CIA did it to justify an invasion). Ideally we would all agree on an explanation that is likely and lovely, but the world doesn’t seem to be like that.

I make certain assumptions I’m comfortable with, you make different ones. I test hypotheses I think are interesting and plausible, you test different ones. I am intrigued by a p=0.07 result and report it, you ignore it on principle. We are both using “frequentist” statistics yet we end up with different answers. And to make matters ‘worse’, whoever of us publishes their analysis first will inspire the other one to do something different tomorrow. So my confidence will not only be rather contingent but also varying.

Science is subjective. Get over it!


A couple of new online interactive maps

Two interesting new online mapping / geographical sites just spotted. Crash stories is a crowd-sourcing website for data that just doesn’t exist elsewhere: traffic accident near-misses (only in NYC at present).

Crash Stories screen shot

Of course, it’s a tradeoff with data quality. I accidentally put down an accident before I’d learnt which button did what (I didn’t save it but maybe others have, judging by the near-misses in the middle of the East River).

Also just out is this analysis of how integrated (or not) cities in the USA are in terms of ethnic mix. Nice Google Spreadsheet bubble plot! I’ve only just discovered how easy it is to make one of those. And they kindly provide the embed code so I could add their maps into my blog – except that WordPress didn’t like it, so you have to click here instead…

Atlantic Cities drew my attention to both of these. You should all read Atlantic! It’s really good stuff.


Big Data Mining workshop 14-15 May 2013

Imperial College London are hosting a one-off workshop entitled “Big Data Mining”, under the auspices of EPSRC and the Royal Statistical Society Stat Comp section. This promises to bring together the academics who dream up new approaches and the commercial software producers who can make them come to life for the majority of researchers. It should be a great gathering.
