Worth reading this article at the New Statesman. I too enjoy seeing the UK Statistics Authority bite the hand that feeds it.
Monthly Archives: May 2013
Update 10 June 2013 – I replaced the two graph images. No idea why, but they weren’t displaying properly in some browsers.
Many of you may already know this occasionally updated analysis by Bob Muenchen – if not, go and check it out. The results, and the discussion in the comments, are fascinating. The strange pattern I see is that the total number of Google Scholar citations which mention any stats software peaked in 2007 and has been rapidly declining since then.
So, I wondered whether we used to have a fashion for naming the software in learned papers, and have more recently given up on it. You know the sort of thing: “All analyses were conducted in Stata/SE version 11.2 (StataCorp, 2010)”. Textbooks on writing for publication say that is good practice, but how many of us still bother? If that is what has happened, we gave up on it very fast.
As a percentage of this whole, SPSS has been declining since 2005, SAS has stayed pretty much in the same ballpark, while R and Stata have increased in very similar fashion. R might be slightly ahead, but I wouldn’t want to call it. It looks like the switch is from SPSS to Stata or R. Either people are learning new packages, or the SPSS generation is retiring, though my students’ unquenchable appetite for all things point-and-click (and Andy Field) suggests it is not the latter. I suspect that IBM’s hilarious re-re-re-branding exercises haven’t exactly helped. I mean, how can you cite the software in the paper when you don’t even know what it’s called?
There is probably a hidden side to SAS in pharma company reports that never make it into the public eye, which accounts for its continuing dominance in the job market. And that in turn is down to the perceived preference of the FDA, which turns out not to be real: it’s just the preference of the boss. And as one of the commentators on Bob Muenchen’s blog pointed out, how can you say SAS is validated when you can’t see the code to check what it is doing?
This video from Facebook Stories certainly caught my eye. There’s a lot going on here, probably too much really. It made me feel dizzy, which I guess would be the effect of listening to these songs if I were ever to actually encounter one.
Firstly, what’s with the time scale? Is it going up and down with time of day? And is the pulsing of color intensity just an effect to look kind of like a beat? What is the conclusion? Is there really a geographical difference? Hip hop in the hood and country in the, well, country? Maybe I am too old for it. But you have to hand it to them for clever programming.
23 June 2016 edit: before you attempt any real work with IP geolocation, you must, must, must read this article: http://fusion.net/story/287592/internet-mapping-glitch-kansas-farm/ . Be aware of the limitations.
Whoa! Now this is cool. It turns out there is a database at freegeoip.net which you can query for the location of a particular IP address. And as it has a neat little API for batches of IPs, you can get R to fetch them en masse. That’s exactly what Heuristic Andrew has just done with a function that uses the rjson package to pull down data from the JSON version of the database. If you wanted to run a lot of these, you could compile your own local version of the database provided you have a Python interpreter, as described on the GitHub page.
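In R you would just call Heuristic Andrew's function, but the idea is simple enough to sketch in Python. Here is a minimal illustration, with a canned response standing in for the HTTP call — the field names follow the JSON output described above, but the values here are invented:

```python
import json

# A canned response in the shape freegeoip returns (field names taken
# from the API's JSON output; the values below are made up).
raw = '''{"ip": "203.0.113.7", "country_name": "United Kingdom",
          "region_name": "St Helens", "city": "Kingston",
          "latitude": 51.41, "longitude": -0.30}'''

def parse_geoip(text):
    """Pull the fields we care about out of one JSON record."""
    rec = json.loads(text)
    return (rec["city"], rec["country_name"],
            rec["latitude"], rec["longitude"])

city, country, lat, lon = parse_geoip(raw)
```

The real work is just an HTTP GET followed by this kind of JSON unpacking, which is exactly what the rjson-based function does in R.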
I couldn’t resist playing around with it. As you see above, the Google Map on the freegeoip.net home page looks up your own IP. In my case, I spotted a problem straight away. This location is a little residential street, and certainly doesn’t look like the location of some big university server. In fact, the database returns the “city” as Kingston (sounds good so far) and the “region_name” as St Helens (errr, not really; that’s a few hundred miles north-west of here). But Google Maps has found an apartment block called St Helens, near Kingston. Oh dear.
So, lesson 1: don’t trust the Google Map search on the home page.
However, there are also latitudes and longitudes, so I went to my website at www.animatedgraphs.co.uk (WordPress jealously guards the IP addresses of visitors; as ever, data = $$$) and got all the recent IPs of visitors. I shoved these into Heuristic Andrew’s function and found four that returned a 404 error from the database. That’s not Andrew’s fault; it seems the database just doesn’t know them. As he says, the function is very new and needs some better error handling. I’m sure that will come in time. For now, I just ditched those four and carried on. The country names supplied to me by my ISP matched the freegeoip ones perfectly, so I took the latitudes and longitudes and put them on a map:
Now the Kingston University server is apparently located near Stoke-on-Trent, in fact near the village of Kingstone (hmmm…), at a location which is not exactly an internet hub:
So, lesson 2: don’t trust the latitude and longitude too much, although most of them seem fine.
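The "ditch the failures and carry on" step is just defensive wrapping around the lookup function. A rough Python sketch of the idea, with a stand-in lookup (the real one would hit the API and fail on a 404; the IPs and coordinates below are invented for illustration):

```python
def batch_geolocate(ips, lookup):
    """Look up each IP, skipping any that the database rejects."""
    located, failed = [], []
    for ip in ips:
        try:
            located.append((ip, lookup(ip)))
        except KeyError:  # stand-in for the API's 404 response
            failed.append(ip)
    return located, failed

# Fake lookup table: the database "knows" two of the three addresses.
known = {"203.0.113.7": (51.4, -0.3), "198.51.100.2": (36.9, -88.3)}
ok, missing = batch_geolocate(
    ["203.0.113.7", "198.51.100.2", "192.0.2.99"], known.__getitem__)
```

The unknown IPs end up in their own list, so you can report how many were dropped rather than silently losing them.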
To take a less pathological example: a lot of traffic has come my way from Benton, Kentucky. (Thank you, whoever you are. I hope you found the website useful.) The (latitude, longitude) given by freegeoip is (36.8596, -88.3367), which is in a field a mile out of town. The four-decimal-place precision is clearly rather spurious.
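To see just how spurious: a degree of latitude is roughly 111 km, so the fourth decimal place is resolving distances of about eleven metres — far finer than IP geolocation can deliver. A quick back-of-envelope calculation:

```python
import math

# One degree of latitude is roughly 111 km, so 0.0001 degrees is
# about 11 metres; a degree of longitude shrinks by cos(latitude).
lat = 36.8596  # Benton, Kentucky, per the database
metres_per_degree = 111_000  # rough figure, good enough here
lat_step = 0.0001 * metres_per_degree
lon_step = 0.0001 * metres_per_degree * math.cos(math.radians(lat))
# lat_step is about 11 m, lon_step about 9 m at Benton's latitude
```

Eleven metres would place you at a particular house; city-level accuracy is the most you can realistically hope for.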
I spotted this map in the Times Higher Education (25 April 2013, p. 11):
Mmm, Britain in Opal Fruits (Starburst, to the young among you). It certainly caught my eye. There is something to be said for posing a visual puzzle, and also something to be said for novelty in visualization. But then I went and looked at the original in the Report on Oxbridge Access and Wales by the Member of Parliament Paul Murphy, and that was so much better: informative, clear, detailed. Old-fashioned, sure, but the patterns came through.
As my grandad used to say, the auld way’s the best way. But full marks for trying something new.
Earlier today, a colleague came into my office to ask my advice about a logistic regression he was carrying out. It seems that one of the odds ratios for a covariate was very small, close to 1, yet highly significant, and this was puzzling him. After a bit of questioning, we worked out it was a computational problem, and got his analysis fixed up and back on the road. Here’s how the thought process went:
- Is the dataset absolutely massive? With huge numbers, even small, unimportant effects can appear significant. It’s a bit like being over-powered, and it rather goes to show how hypothesis tests are not the be-all and end-all that some textbooks and teachers would have you believe. You have to stop and think about what a result really means. However, in this case it wasn’t humongous: it was about n = 500.
- Are there also some really huge odds ratios? The answer was yes: there was one predictor with OR = 25,000, “so we decided to ignore that one”. Hmmm. OK, that’s pretty diagnostic of non-convergence. Logistic regressions (and Poisson, negative binomial and other GLMs, Cox regression, and so on) are not like linear regression: it’s not a case of solving the equations and getting the answer. The computer has to iteratively zoom in on the most likely set of odds ratios. Sometimes it goes wrong! Not often, but sometimes.
- Is there a predictor which is the same for almost everybody? It turned out this was true, in fact it was the one with OR=25,000. Well, that sounds like the culprit. If you’re trying to guess the effect that something has on the outcome, and it hardly varies over the whole dataset, then you have very little information on which to base that effect size. One odds ratio is pretty much as likely as any other, and the iterative algorithm that does the regression can head off into some pretty weird places, because it doesn’t have enough information to know that it’s not getting closer to the answer. By the time it gets to 25,000, the landscape is completely featureless and flat; it’s a long way back to the maximum likelihood point and your computer is essentially not in Kansas any more.
- That uninformative predictor is not going to help us much anyway, so let’s take it out and see what happens. When he did this, my colleague’s regression model came out quickly with a sensible-looking answer. So it sounds like that uninformative predictor was the problem. Hopefully, most of the time your software will spot the problem and change tack, but sometimes it gets fooled, especially if you have lots of predictors and some don’t vary much. Some software allows you to change the algorithm options, and that might help. Doing a completely different analysis, like a stratified Mantel-Haenszel odds ratio or a Bayesian model by MCMC, might help too, but if your data are defective, there’s not much you can do. When you write it up, you should report what happened. There’s no shame in finding a problem and fixing it!
- Is there a categorical predictor where the baseline category has very few observations? This is another classic problem, although it wasn’t the case in the dataset I was asked about. If your software chooses the lowest (or highest) category as the default baseline, and if that category contains very few observations, then you will also not have enough information to guide you to the answer. Try to switch baselines, possibly by recoding the data if the software won’t let you select a different one.
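The flat-landscape problem above can be seen concretely with a small made-up dataset exhibiting quasi-complete separation: the predictor x equals 1 only when the outcome is 1. Holding the intercept fixed for simplicity, the log-likelihood keeps creeping upwards however large the coefficient gets, so there is no finite maximum for the algorithm to find:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Invented data as (x, y) pairs: whenever x = 1, y = 1, so x
# "perfectly predicts" success in its own small corner of the data.
data = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 1)] * 5

def loglik(a, b):
    """Log-likelihood of a logistic model P(y=1) = sigmoid(a + b*x)."""
    ll = 0.0
    for x, y in data:
        p = sigmoid(a + b * x)
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return ll

a = math.log(10 / 40)  # intercept fixed at the x = 0 group's log-odds
lls = [loglik(a, b) for b in (1, 5, 10, 25)]
# The likelihood keeps climbing as b grows: no finite maximum exists,
# which is why the fitted odds ratio wanders off towards infinity.
```

Each step up is tiny, but the sequence never turns back down: the iterative algorithm just keeps inflating the coefficient (and its odds ratio) until it gives up or reports something absurd like 25,000.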
I am speaking today at the first UK Causal Inference meeting in Manchester on my multiple imputation approach to residual confounding. I also learnt yesterday that I’ll be presenting this method at the RSS conference in Newcastle in September, which will be exciting. And I am hoping to present software for it at the London Stata users’ group, also in September. If you are interested but can’t make any of those dates, email me and I’ll tell you more. The paper is out for review at present…