People often say that Bland and Altman’s paper, in which they set out the eponymous plot for comparing two methods of measurement in medical statistics, is the most-cited stats paper ever. I thought I would poke around on Google Scholar and see what the citations looked like there.

In terms of total citations, and given all the shortcomings of this as a measure of anything, there are two papers ahead of B&A, and they needn’t feel cheated, as we’re talking about titans of statistics here. Here are the rankings for the seven papers I could think of testing:

Cox (1972) Regression models and life-tables: 35,512 citations

Dempster, Laird & Rubin, or DLR (1977) Maximum likelihood from incomplete data via the EM algorithm: 34,988

Bland & Altman (1986) Statistical methods for assessing agreement between two methods of clinical measurement: 27,181

Geman & Geman (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images: 15,106

Efron (1979) Bootstrap methods: another look at the jackknife: 9,686

Tibshirani (1996) Regression shrinkage and selection via the lasso: 8,744

Nelder & Wedderburn (1972) Generalized linear models: 3,818

Can anyone think of any other landmark papers to look up?

Nelder & Wedderburn invented GLMs, so you’d think they should be pretty darn near the top. I suppose there are two reasons they are not. Firstly, the most popular of these models, logistic regression and Poisson regression, are so commonplace that people no longer cite them. Secondly, the book by McCullagh and Nelder (following Robert Wedderburn’s tragic death at the age of 28) attracts most of the citations. Adding all the variants on the book in Google Scholar, you get 24,297 citations. Together with the paper’s 3,818, that would take GLMs up to third place, overtaking B&A, but then that is rather unfair on others with much-cited books like Little & Rubin or David Cox.
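For anyone who wants to check the third-place claim, here is a rough sketch in Python. It assumes that the 24,297 figure covers the McCullagh and Nelder book alone and gets added to the 3,818 for the original paper; that reading is my assumption, though it is the one under which the combined total passes B&A. The dictionary keys are just labels made up for illustration.

```python
# Total Google Scholar citations as quoted in the post
totals = {
    "Cox (1972)": 35512,
    "DLR (1977)": 34988,
    "Bland & Altman (1986)": 27181,
    "Geman & Geman (1984)": 15106,
    "Efron (1979)": 9686,
    "Tibshirani (1996)": 8744,
    "Nelder & Wedderburn (1972)": 3818,
}

# Assumption: this figure is for the McCullagh & Nelder book (all variants)
# and does not already include the 1972 paper.
BOOK_CITATIONS = 24297
GLM_KEY = "GLMs (paper + book)"

# Replace the paper-only entry with the combined paper-plus-book total
combined = {k: v for k, v in totals.items() if k != "Nelder & Wedderburn (1972)"}
combined[GLM_KEY] = totals["Nelder & Wedderburn (1972)"] + BOOK_CITATIONS

# Rank from most to least cited
ranking = sorted(combined, key=combined.get, reverse=True)
for place, name in enumerate(ranking, start=1):
    print(f"{place}. {name}: {combined[name]:,}")
```

On these numbers the combined GLM entry lands between DLR and B&A, but the same trick applied to Cox's or Little & Rubin's books would shuffle things again.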

When you consider citations per year since publication, you have to remember that Google Scholar is not measuring the same thing each year. Since it got going, Google have put effort into going back into the archives and getting more books, reports and grey lit on the system. Recent years are going to produce more citations simply because of an inclusion bias, not to mention the fact that a lot more gets written and published each year now (most of it rubbish). But, given all that, B&A come out on top with 1007 citations per year, DLR second with 972, and Cox third with 866.
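The per-year figures are just total citations divided by years since publication. A minimal sketch in Python follows. The reference year of 2013 is an assumption on my part, chosen because it is the value that reproduces the three quoted rates; the function name and dictionary are only illustrative.

```python
# (publication year, total Google Scholar citations) as quoted in the post
papers = {
    "Cox (1972)": (1972, 35512),
    "DLR (1977)": (1977, 34988),
    "Bland & Altman (1986)": (1986, 27181),
    "Geman & Geman (1984)": (1984, 15106),
    "Efron (1979)": (1979, 9686),
    "Tibshirani (1996)": (1996, 8744),
    "Nelder & Wedderburn (1972)": (1972, 3818),
}

# Assumption: years are counted up to 2013, which matches the quoted rates
REFERENCE_YEAR = 2013


def citations_per_year(pub_year, total, reference_year=REFERENCE_YEAR):
    """Crude average: total citations over years elapsed since publication."""
    return total / (reference_year - pub_year)


rates = {name: round(citations_per_year(year, total))
         for name, (year, total) in papers.items()}

# Print from highest to lowest rate
for name in sorted(rates, key=rates.get, reverse=True):
    print(f"{name}: {rates[name]} citations per year")
```

This is a flat average, so it does nothing about the inclusion bias described above: a paper whose citations all arrived in well-indexed recent years will look better than one cited steadily since the 1970s.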

I'm a medical statistician at Kingston University and St George's, University of London. I'm interested in Bayesian latent variable models, data visualization and stats curriculum reform. I use (and sometimes blog about) R, Stata, BUGS, JAGS, Stan, and program sometimes with C++, Julia and JavaScript. I make the StataStan interface and Stata2D3 package. I sit on committees for statistical computing at the Royal Statistical Society, and clinical audit and confidential enquiries for NHS England. I am the Honorary Statistician at Princess Alice Hospice, a research-active hospice in Surrey, England, and I teach statistics with Stata software on Harvard Medical School's GCSRT, ICRT, CSRT-PT and UKCSRT blended learning programs for clinicians. Sleep is for wimps.

4 Comments

Thought I should also look up structural equation models. Jöreskog (of LISREL fame) notched up a lot of citations, but spread over many publications. Oddly, “Evaluating structural equation models with unobservable variables and measurement error” by Fornell and Larcker 1981 has gathered 17,738, which would put it in 4th place. Perhaps because it is a decent all-round SEM (inflexible by today’s standards of course) and, crucially, in a marketing journal.

If you haven’t just got here from there, you should look at Andrew Gelman’s post on this topic at http://andrewgelman.com/2014/03/31/cited-statistics-papers-ever/ where more critique of citations ensues, and some glaring omissions on my part are revealed, like Kaplan & Meier (how on earth did I not think of that…)

Brennan Kahan wrote to ask:

“Rob, can you explain why Bland/Altman plots are cited almost as often as the Cox model, and way more often than the bootstrap for example? I just don’t get it (not that I don’t love Bland/Altman plots, I just don’t get why they’re so highly cited…)”

To which I replied:

“I don’t know why but it is the done thing to cite B-A at every opportunity. It makes one look clever without actually having to grapple with algebra or calculus. (No offence, their papers were and remain beacons of clear writing.) Also, their citations are not too spread across lots of papers. Cox’s book mostly gets cited rather than the 2 partial likelihood papers. And the bootstrap just doesn’t get done anywhere near as often as it should. If all the duffers out there started bootstrapping, they’d probably be citing something else, because Efron is not exactly light bedtime reading!”

And Brennan wrote back with this point, crucial to understanding these citations:

“Good points – also, BA seems to have just the right amount of popularity, where it’s used quite often, but it’s not so standard that it doesn’t really require a citation (like the Cox model, Chi sq test, logistic regression, etc).”