Earlier today, a colleague came into my office to ask my advice about a logistic regression he was carrying out. One of the covariates had a very small effect, with an odds ratio close to 1, yet it was highly significant, and this was puzzling him. After a bit of questioning, we worked out it was a computational problem, and got his analysis fixed up and back on the road. Here’s how the thought process went:
- Is the dataset absolutely massive? With huge numbers, even small, unimportant effects can appear to be significant. It’s a bit like being over-powered, and it rather goes to show how hypothesis tests are not the be-all-and-end-all that some textbooks and teachers would have you believe. You have to stop and think about what it really means. However, in this case the dataset wasn’t humongous: n was about 500.
- Are there also some really huge odds ratios? The answer was yes, there was one predictor with OR=25,000 “so we decided to ignore that one”. Hmmm. OK, that’s pretty diagnostic of non-convergence. Logistic regressions (and Poisson, negative binomial and other GLMs, Cox regression, and so on) are not like linear regression: it’s not a case of solving the equations in one go and getting the answer. The computer has to iteratively zoom in on the most likely set of odds ratios, the maximum likelihood estimates. Sometimes it goes wrong! Not often, but sometimes.
- Is there a predictor which is the same for almost everybody? It turned out this was true; in fact, it was the one with OR=25,000. Well, that sounds like the culprit. If you’re trying to estimate the effect that something has on the outcome, and it hardly varies over the whole dataset, then you have very little information on which to base that estimate. One odds ratio is pretty much as likely as any other, and the iterative algorithm that does the regression can head off into some pretty weird places, because it doesn’t have enough information to know that it’s not getting closer to the answer. By the time it gets to 25,000, the landscape is completely featureless and flat; it’s a long way back to the maximum likelihood point and your computer is essentially not in Kansas any more.
- That uninformative predictor is not going to help us much anyway, so let’s take it out and see what happens. When he did this, my colleague’s regression model came out quickly with a sensible-looking answer. So it sounds like that uninformative predictor was the problem. Hopefully, most of the time your software will spot the problem and change tack, but sometimes it gets fooled, especially if you have lots of predictors and some don’t vary much. Some software allows you to change the algorithm options, and that might help. Doing a completely different analysis, like a stratified Mantel-Haenszel odds ratio or a Bayesian model by MCMC, might help too, but if your data is defective, there’s not much you can do. When you write it up, you should report what happened. There’s no shame in finding a problem and fixing it!
- Is there a categorical predictor where the baseline category has very few observations? This is another classic problem, although it wasn’t the case in the dataset I was asked about. If your software chooses the lowest (or highest) category as the default baseline, and if that category contains very few observations, then you will also not have enough information to guide you to the answer. Try to switch baselines, possibly by recoding the data if the software won’t let you select a different one.
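
If you'd like to run a couple of these checks before fitting anything, here is a minimal sketch in plain Python. The data are made up (they are not my colleague's), and the names `grade`, `exposure`, `dominant_share` and `empty_cells` are purely illustrative. It looks for a predictor that is the same for almost everybody, for combinations of predictor and outcome that never occur, and for a sparse baseline category.

```python
from collections import Counter

# Made-up toy data, n = 500. Each row is (grade, exposure, outcome).
rows = (
    [(1, 0, 1)] * 2 + [(1, 0, 0)] * 2
    + [(2, 0, 1)] * 98 + [(2, 0, 0)] * 148
    + [(3, 0, 1)] * 100 + [(3, 0, 0)] * 145
    + [(3, 1, 1)] * 5
)
grade = [r[0] for r in rows]
exposure = [r[1] for r in rows]
outcome = [r[2] for r in rows]

def dominant_share(values):
    """Fraction of observations that take the single most common value."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)

def empty_cells(predictor, response):
    """(predictor level, outcome level) pairs that never occur together."""
    seen = Counter(zip(predictor, response))
    return [
        (x, y)
        for x in sorted(set(predictor))
        for y in sorted(set(response))
        if seen[(x, y)] == 0
    ]

# A predictor that is the same for almost everybody?
print("share of most common exposure value:", dominant_share(exposure))

# A level of the predictor where one outcome never happens?
print("empty exposure/outcome cells:", empty_cells(exposure, outcome))

# A sparse baseline category? Look at the counts per level.
print("grade counts:", sorted(Counter(grade).items()))
print("most populated grade:", Counter(grade).most_common(1)[0][0])
```

In this toy dataset the exposure is 0 for 99% of people, and nobody with the exposure lacks the outcome, which is the kind of pattern that sends the odds ratio off towards infinity. Grade 1 has only four observations, so it would make a poor baseline; the most populated level is a safer default.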
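
To see the flat landscape for yourself, here is a rough sketch in plain Python, again with made-up numbers: 500 people, of whom only 5 have the predictor, and all 5 have the outcome. It fits the logistic regression with a bare-bones Newton-Raphson loop. That is one textbook way of doing the iterations; I am assuming nothing about what any particular package does internally, and real software usually adds safeguards that this sketch lacks.

```python
import math

# Made-up grouped data: (x, number of people, number with the outcome).
# x = 1 for only 5 of the 500 people, and all 5 of them have the outcome.
groups = [(0, 495, 200), (1, 5, 5)]

def log_likelihood(b0, b1):
    """Binomial log-likelihood for logit(p) = b0 + b1 * x."""
    total = 0.0
    for x, n, y in groups:
        eta = b0 + b1 * x
        total += y * eta - n * math.log1p(math.exp(eta))
    return total

def newton_step(b0, b1):
    """One Newton-Raphson update of (b0, b1)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, n, y in groups:
        eta = b0 + b1 * x
        p = 1.0 / (1.0 + math.exp(-eta))
        q = 1.0 / (1.0 + math.exp(eta))   # 1 - p, without cancellation
        r = y * q - (n - y) * p           # same as y - n * p
        w = n * p * q
        g0 += r                           # gradient
        g1 += r * x
        h00 += w                          # information matrix (2 x 2)
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return b0 + d0, b1 + d1

b0, b1 = 0.0, 0.0
history = []  # (iteration, log odds ratio, log-likelihood)
for iteration in range(1, 16):
    b0, b1 = newton_step(b0, b1)
    ll = log_likelihood(b0, b1)
    history.append((iteration, b1, ll))
    print(f"iter {iteration:2d}  log OR = {b1:6.2f}  "
          f"OR = {math.exp(b1):14.1f}  logL = {ll:.6f}")
```

The log odds ratio keeps climbing by roughly 1 at every iteration while the log-likelihood hardly moves after the first few steps. Where the odds ratio ends up depends mostly on when the software decides to stop, which is why a figure like 25,000 tells you about the algorithm rather than about the data. I stopped at 15 iterations on purpose: run this sketch for much longer and the fitted probability rounds to exactly 1, at which point the arithmetic breaks down.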