Monthly Archives: December 2013

250 years of Bayes

It’s Christmas Eve in old London Town, and a cold wind is whistling down Fleet Street. 250 years ago today, Richard Price, a philosopher and preacher from the nearby village of Islington, made his way down the street and turned under the arch into the little passageway called Crane Court. He was not going to the Temple Bar Indian restaurant, as some last-minute office workers will be right now, he was going to the Royal Society. Price was carrying with him the edited work of his friend and fellow nonconformist preacher Thomas Bayes, who had died two years earlier, leaving behind among his unpublished writings “An Essay towards solving a Problem in the Doctrine of Chances”. That evening, Price read the paper before the Society, thus committing it to the annals of history. Today, much is said about Bayesian statistics, and very little of it is in any way approachable to anyone who isn’t a total stats nerd. Let me try to explain what this is all about, without mathematical detail, to mark 250 years of Bayes.

Crane Court then

There are two ideas of far-reaching importance in Bayes’ Essay. The first is his eponymous Theorem, which is an uncontroversial, and very useful, bit of simple probability theory. It allows you to flip conditional probabilities round. For example, you are a doctor and your patient has received a positive HIV test result. If you know the probability of getting a positive test result when a patient really has HIV, and you know the prevalence of HIV, then you can flip it round to the much more useful probability of really having HIV given the test result. So far so good. The second idea is set out at the very beginning; Bayes writes that he wants to find:

“the chance that the probability of [a certain event happening] lies somewhere between any two degrees of probability that can be named”

This would not have been controversial at the time, because nobody really knew what they meant by probability. Once the early 20th century arrived, though, debate was hotting up among philosophers of science, many of them just a mile to the north in the neighbourhood of Bloomsbury: John Maynard Keynes and Bertrand Russell in particular (and don’t forget that Ronald Fisher worked in University College, across the square from Keynes’s house). Russell viewed Keynes’s recasting of probability into a form of logic as the best of all the options, and warned that a simple probability determined solely by extant data was not a concept to be muddled up with human (or avian) decision-making:

“The importance of probability in practice is due to its connection with credibility, but if we imagine this connection to be closer than it is, we bring confusion into the theory of probability”

and again:

“The man who has fed the chicken every day throughout its life at last wrings its neck instead, showing that more refined views as to the uniformity of nature would have been useful to the chicken.”

Wise words. He was certainly something of a wag, if his History of Western Philosophy is anything to go by. Despite the unprepossessing title, this tome is studded with little jokes at the expense of his predecessors. Keynes’s neighbour Virginia Stephen (later Mrs Woolf) wrote in her diary simply that “the Bertrand Russells came for tea – very droll” and added that he was “a fervid egoist”. As a statistician interested in philosophy, a dedicated Woolf reader, and one predisposed towards a certain degree of drollery, I am sorry to have missed out on the invitation to tea that day.

Some of these philosophers said that the probability of the event happening is simply the proportion of times it will happen if you try it again and again forever. This is certainly true, but not very practical. These frequentists went on to insist that a probability is a fixed but unknown quantity, which we can estimate but not make any definite assertions about. For them, Bayes’ terminology is a foolish slip from the unsophisticated 1700s; there is no such thing as the chance of a probability lying between X and Y, it either is or isn’t. The fact that we don’t know it does not mean you can treat it as random and start talking about the chance of being between X and Y. Bayes’ “chance” is a hyper-probability, but frequentist probabilities are only meaningful for things that can be repeated indefinitely under identical circumstances. The problem here is that you can’t then talk about the probability of rain tomorrow, or of New York electing a Democrat mayor next time round, because these events only happen once, and that seems to have thrown out the baby with the bath-water.

Others said that something being random simply means that you don’t know it. Data you have yet to collect are random, coming from a distribution that you can see when you draw a histogram of the data so far. Then they went beyond that to say that unknown parameters that govern the universe, like probabilities, are also random, even though it’s harder to learn about their distribution over repeated samples. These people alighted on Bayes’ turn of phrase and took him as their figurehead, calling themselves Bayesians. Some also asserted that probability could only mean a level of personal belief in unknowns, and so it is subjective, but Bayes never suggested this and so I shall mention it no further. (I think it is true and elegant, but not at all useful. Many very wise and learned people would disagree with me there, but their counter-arguments don’t convince me. Let’s not get into it now.) Did he really believe in the chance of a probability, or was that just sloppy terminology? In fact, Bayes wrote, when setting out his definitions at the start of the Essay, “By chance, I mean the same as probability”. So he clearly meant there to be probabilities about probabilities, or in other words that an unknown parameter is just as random as an unknown (future) measurement. He definitely was a Bayesian, although probably (!) not subjective.

We need to bear in mind when reading Bayes that his Essay was edited by Price, who seemed to have found it irresistible to employ mathematics, whenever he could, as an attempt to prove the existence of God, and his presentation of Bayes’ Essay is no exception. Price was no great logician, and like Bayes, he was most likely in the Royal Society because at the time it was a thinkers’ club dominated by liberal nonconformists, who elected their own sort to join their ranks. Price certainly wouldn’t have liked subjective Bayesianism; by his peculiar logic it would have permitted a dangerous relativism pointing us all down the highway to hell.

So, what impact does Bayesianism have on us today? There are two ways in which it makes its presence felt, the first rather unfortunate, the second beyond Bayes’ wildest imagination. By the Second World War, the philosophers had pretty much settled the issue. Frequentism crumbles under scrutiny; it is impossible not to contradict oneself, in large part because the distinctions between data and parameter, known and unknown, population and sample, can differ from individual to individual. Sadly, nobody told the statisticians. Some polemical textbooks warned readers of the dangers of Bayes, and this word of mouth continues to this day. They are not fools, they just haven’t spared the time to think it through enough.

On the other hand, if you are willing to treat unknown parameters as random variates (note the slight semantic distinction from a random variable, one of many such concessions to the statistical frequentists), then you can set your computer to try out different values and let them meander around through the parameter’s own distribution. We have had a really practical way of doing this since 1984, called the Gibbs sampler (there are others…). A strict frequentist could not permit such nonsense; the parameter is either here or it is there, it cannot move about, and to set it going, the analyst must supply a prior distribution. This smacks of subjectivity, even if it is diffuse enough to allow the parameter to roam free, constrained only by the data. Indeed, there is no judgement-free prior, but that is also a fact of life when you come to interpret the results of any data analysis.

The trouble is that frequentist (they often like to call their stuff classical, which is amusing because it just prompts Bayesians to call their work modern) methods run out in some scenarios, or become horrendously complex, while the Bayesian methods march on. Personally, I think the distinguishing feature is that methods like the Gibbs sampler get closer to the data-generating process, and break it into steps, which helps us understand the model of the data that we are fitting to it.

Bayes’ original goal, of finding the chance that a probability is between X and Y, also lends itself to clear, intuitive understanding. You can be told that there is a 78% chance that average temperatures in England have increased since 1960, while a frequentist would have to couch it in terms of going back in time and taking repeated measurements under identical conditions an infinite number of times.

Of course, there are also wise voices in the world of stats warning us not to rely too much on methods like the Gibbs sampler. For one thing, it is not easy to spot when things go wrong, or what is influencing the results. It is also not at all clear how best to compare alternative models for your data and choose the best one. Also, the debate about subjective probability has not gone away. Finally, although the computational tools are impressive, they require much more time and power than their frequentist counterparts, and in the era of big data, you can easily get into a situation where they just won’t work any more. There’s clearly lots of fun to be had refining and developing them for many years to come.

Crane Court now

So we take our leave of Crane Court, the former home of the Royal Society, just as Price did, stepping out into the cold night. I dare say Price – and Bayes – would have wished you, reader, a very happy Christmas, but I think I’ll steer their ghosts round the corner and onto the 341 bus as quick as I can, in case they encounter an office party or see a decorated tree. I don’t think they would have approved.


Filed under Uncategorized

Giving R the strengths of Stata


This is not a partisan post that extols the virtues of one software package over another. I love Stata and R and use them both all the time. They each have strengths and weaknesses and if I could only take one to the desert island, I’d find it hard to choose. For me, the greatest unique selling point in Stata is the flexibility of the macros. If I write something like this:

local now=1988

summarize gdp if year==`now'

I will get a summary of GDP in 1988, the same as if I had typed

summarize gdp if year==1988

And I could do the same thing in R (assuming I have installed the package pastecs):
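The R code seems to have dropped off the page here; a sketch of what it would look like, assuming the data live in a data frame called mydata with columns year and gdp (a hypothetical stand-in for the Stata file):

```r
library(pastecs)  # for stat.desc(), which gives a summarize-style table

# hypothetical stand-in for the Stata data file
mydata <- data.frame(year = 1986:1990,
                     gdp  = c(3.1, 2.7, 4.2, 3.9, 2.5))

now <- 1988
stat.desc(mydata$gdp[mydata$year == now])
```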



All very nice. But they are doing this in two different ways. R holds all sorts of objects in memory: data frames, vectors, scalars, etc, and accesses any of their contents when you name them. Stata can only have one data file open at a time and stores other stuff in temporary memory as matrices, scalars or these macros, which are set up with the commands local or global. When you give it a name like now, it will by default look for a variable in the data file with that name. So, you place the macro name between the backward quote* and the apostrophe in order to alert it to fetch the contents of the macro, stick them into the command and then interpret the whole command together. That is a very flexible way of working because you can do stuff that most programming languages forbid, like shoving your macro straight into variable names:

summarize gdp`now'

// the same as summarize gdp1988

or into text:

display as result "Summary of world GDP in the year `now':"

or indeed into other macros’ names in a nested way:

local now=1988
local chunk "ow"
summarize gdp if year==`n`chunk''

or even into commands!

local dothis "summa"
`dothis'rize gdp if year==`now'

I believe that is also how Python works, which no doubt helps account for its popularity in heavy number crunching (so I hear – I’ve never gone near it).

Now, the difference between these approaches is not immediately obvious, but because R does not differentiate in naming different classes of object, like scalars, matrices or estimation outputs, you can do whatever you like with them (helpful), except just jamming their names into the middle of commands and expecting R to replace the name with the contents. That is the strength of Stata’s two-stage interpretation. How can we give that strength to R?

A popular question among new useRs is “how do I manipulate the left-hand side of an assignment?”

Here’s the typical scenario: you have a series of analyses and want the results to be saved with names like result1, result2 and so on. Nine times out of ten, R will easily produce what you want as a list or array, but sometimes this collection of distinct objects really is what you need. The problem is, you can’t do things like:

mydata <- matrix(1:12,nrow=3)

paste("columntotal", 1:4, sep="") <- apply(mydata, 2, sum)

And hope it will produce the same as:

columntotal1 <- sum(mydata[, 1])
columntotal2 <- sum(mydata[, 2])
columntotal3 <- sum(mydata[, 3])
columntotal4 <- sum(mydata[, 4])

Instead you need assign()! It’s one of a series of handy R functions that can be crudely described as doing something basic in a flexible way, something which you would normally do with a simple operator such as <- but with more options.

for (i in 1:4) {
  assign(paste("columntotal", i, sep=""), sum(mydata[,i]))
}
will do exactly what you wanted above.

If you need to fetch a whole bunch of objects by name, mget() is a function that takes a vector of strings and searches for objects with those names. The contents of the objects get returned in a single list. Now you can easily work on all the disparate objects by lapply() and the like. Now, before you mget too carried away with all this fun, take time to read this excellent post, which details the way that R goes looking for objects. It could save you a lot of headaches.
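A quick sketch, reusing the columntotal objects created above:

```r
mydata <- matrix(1:12, nrow = 3)
for (i in 1:4) {
  assign(paste("columntotal", i, sep = ""), sum(mydata[, i]))
}

# fetch all four objects by name, into one list...
totals <- mget(paste("columntotal", 1:4, sep = ""))

# ...and now they can be worked on together
lapply(totals, sqrt)
```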

All right, now we know how to mess around with object names. What about functions? is your friend here. The first argument it wants is a string which is the name of a function. The second argument is a list containing your function’s arguments, and it passes them along. You could do crazy stuff like this:

omgitsafunction <-"s","um",sep=""),list(mydata))

and it would be the same as:

sum(mydata)
…which raises the possibility of easily making bespoke tables of statistics by just feeding a vector of function names into

loadsafunctions <- c("sum","mean","sd")

for (i in 1:length(loadsafunctions)) {
  print([i], list(mydata)))
}

or more succinctly:

listafunctions <- as.list(loadsafunctions)

lapply(listafunctions,, list(mydata))
Another neat feature of Stata is that you can prefix any line of code with capture: and it will absorb error messages and let you proceed. In R you can do this with try(). This is never going to work:

geewhiz <- list(2.3, "a", 100, "xyz")
lapply(geewhiz, log)
But maybe you want it to run, skip the strings and give you the numeric results (of course, you could do this by indexing with is.numeric(), but I just want to illustrate a general point, and try() is even more flexible):

lapply(geewhiz,function(x) try(log(x),TRUE))

will work just fine. Note the one line function declaration inside lapply(), which is there because lapply wants a function, not the object returned by try(log()).

attach() and (more likely) with() are useful functions if you need to work repetitively on a batch of different data frames. After all, what’s the first thing newbies notice is cool about R? You can open more than one data file at a time. So why not capitalise on that? That takes you into territory that Stata can only reach by repeatedly reading and writing from the hard drive (which will slow you down).
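For example (a toy sketch with two small data frames standing in for two open data files):

```r
df1 <- data.frame(year = 1986:1988, gdp = c(3.1, 2.7, 4.2))
df2 <- data.frame(year = 1986:1988, gdp = c(1.9, 2.2, 2.8))

# run the same analysis on each data frame in turn,
# with no writing to disk and reading back in between
results <- sapply(list(df1, df2),
                  function(d) with(d, mean(gdp[year >= 1987])))
results
```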

subset() is another good one. Really it just does the same as the indexing operator [, but because it’s a function, you can link it up with mget() and/or and get it to work its way through all sorts of subsets of different objects under different conditions, running different analyses on them. Nice!
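For instance, because subset() is a function, conditions can be stored as quoted expressions and fed through (a made-up sketch):

```r
mydata <- data.frame(year = 1986:1989, gdp = c(3.1, 2.7, 4.2, 3.9))

# conditions held as quoted expressions...
conds <- list(quote(year < 1988), quote(year >= 1988))

# ...then handed to subset() via, one after another
for (cc in conds) {
  rows <-"subset", list(mydata, cc))
  print(mean(rows$gdp))
}
```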

The final function I want to highlight is substitute(). This allows you to play around with text which needs to be evaluated by another function as if it was typed in by the user, and yet still have it work.
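The example seems to have been lost from the page, so here is a small sketch of the idea (my own, not necessarily the original):

```r
mydata <- matrix(1:12, nrow = 3)

# build the call mean(mydata[, 1]) programmatically, as if the user had typed it...
e <- substitute(mean(x), list(x = quote(mydata[, 1])))
e        # the unevaluated expression
eval(e)  # ...and now evaluate it
```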


Pretty cool huh? I hope this opens up some new ideas for the more advanced Stata user who wants to use R functionality. On the other hand, if you use R all the time, perhaps this will encourage you to take Stata seriously too.

* – thanks to my Portuguese students who taught me this is the acento contrario


Filed under R, Stata

London, Tokyo, Amsterdam, Munich, everybody talk about, errr… big data science analytics

Nathan Yau’s summary of 2013 at flowingdata is a festive must-read. Along with fun graphics like trendy kids’ names, he picked up on some trendy stats’ names too.

The hype around big data should settle soon, but it’ll probably be at least an additional year before we get to call big data just data again.

I’m not convinced by that. The penny may already have dropped among the numerate, but it’s guys in suits that are driving this by paying for it, and like dot-com, they will be happy to carry on paying just in case there’s something to it, in which case they don’t want the other guys to get there first.

Data science versus statistics? That’s going to continue. The good news is that it affects almost no one from a practical sense, so I’ll leave that argument to others.

I chuckled to myself at this. Every day – literally – I get emails from really hard working self-promoters who are pushing this vacuous data-science-it’s-not-just-stats message. I don’t want to name them in case you go and read their nonsense. “Great new article: what’s the difference between data science and science data? Subscribe now to read the awesome answer, written by a leading expert, me.” I don’t see this changing any time soon. It’s a sign of a trendy subject; I should rename this Robert Grant’s $tat$ Blog. But consider if you will the talk of booming numbers of graduates. Dot-com hurt some investors, but the web is in a way stronger place because of the boom in skills. If that happens to $tat$ it will be no bad thing.

There is a serious side to this data science though. What they talk about is looking at the data and thinking, not just following rules blindly and doing dull hypothesis tests. Which sounds like the difference between good statisticians and bad ones to me, nothing new there. If you want to read about data science, save your money and buy a copy of Cox & Snell’s 1981 (!) tour of good, thoughtful practice, “Applied Statistics: Principles and Examples”.



Filed under Uncategorized

CSI Stats: looking for traces of data fraud in R

Recently, I was looking at some published research, and I became concerned that it looked strange. Firstly, it had results wildly different to other similar studies, and more exciting / publishable ones. So, I was looking through the original paper to try and work out what was going on. Something caught my eye in a table of means and standard deviations. All were given to one decimal place, and, as a fascinating article in Significance last year pointed out, you’d expect the digits 0-9 to appear with nearly the same frequency in the decimal points. Any imbalance should be no more than you’d expect by chance. But these ones looked odd. There seemed to be a lot of one digit, and quite a run of them all together in the table. I felt I needed to see just how unlikely it was that this could arise by sheer chance. If it was very unlikely, then perhaps it would be time to start telling others that there might be something wrong here. In other words, I wanted to work out a p-value.

The more you look, the more you see

As it happened, my fears were not backed up by the stats, so what I’ve described here is quite heavily disguised to protect the innocent from any mud sticking. But the process of investigating was quite interesting, so I thought I’d share it. I reached for R because I wanted to do something quite bespoke, and regular readers will not be surprised to hear I used a simulation approach. There probably is some weird test out there with asymptotic Z statistic, named after some long-dead guru of decimal points, but I could do the job more quickly and transparently with simulation.

# Data:
# dp <- c(...)   # decimal points in means (48 digits; not shown, disguised)
# dps <- c(...)  # decimal points in standard deviations

# Euclidean distance function - given 48 numbers we expect 4.8 of each digit
euc <- function(xx) {
  sqrt(sum((table(factor(xx, levels = 0:9)) - 4.8)^2))
}

# simulate for multinomial distn under H0: prob(dp=x)=0.1 for all 0<=x<=9
iter <- 10000
euc.sim <- replicate(iter, euc(sample(0:9, 48, replace = TRUE)))

# Euclidean distance of dp
euc.dp <- euc(dp)
# this is the p-value:
sum(euc.sim > euc.dp)/iter # p=0.04

# Euclidean distance of dps
euc.dps <- euc(dps)
# this is the p-value:
sum(euc.sim > euc.dps)/iter # p=0.98

So, that didn’t look very conclusive; p=0.04 is borderline in my book. But that calculation didn’t take the runs into account. Another way of doing it would be to look at combinations of the ith digit and the (i+1)th.

The more you see, the more you look

# autocorrelation: pairs of the ith and (i+1)th digits

euc2 <- function(xx) {
  sqrt(sum((table(xx[-48], xx[-1]) - 0.48)^2)) # there are 100 possible values now
}

# simulate for multinomial distn under H0: prob(dp=x)=0.1 for all 0<=x<=9
euc.sim2 <- replicate(iter, euc2(sample(0:9, 48, replace = TRUE)))

euc.ac <- euc2(dp)
# this is the p-value:
sum(euc.sim2 > euc.ac)/iter # p=0.06

euc.acs <- euc2(dps)
# this is the p-value:
sum(euc.sim2 > euc.acs)/iter # p=0.72

Even less convincing. Now for the really interesting stuff! Having looked at the same data in two different ways we should expect more than 5% type 1 error rate. I have committed the most basic form of multiple testing, running the same data through more than one pre-specified model. Worse than that, I chose the model by eyeballing the data, and I chose those data after scanning a whole published paper. Very questionable, the sort of thing Andrew Gelman described as researcher degrees of freedom. I had been eating up those degrees of freedom faster than a box of dates on Christmas day. With the lowest p-value being 0.04, pretty much any adjustment for multiplicity will mean there are no significant departures from a multinomial(0.1, 0.1, 0.1 …. 0.1) distribution. So, the verdict was not guilty.

Now, I’m not claiming that my heuristic choice of statistic is the best one to answer the question, or that I’ve tried the best models, or that the simulation equates to a Uniformly Most Powerful test. I’m just saying that it’s an interesting process to go through, from the fraud angle and also the multiplicity angle.


Filed under R

Boris Johnson and the statistics of IQ

I’ve been putting off this post – I don’t really get a kick out of demolishing the statistical claims of politicians – but it intersects with league tables, an old area of interest, so here goes.

Boris Johnson, the mayor of London who enjoys some popularity by cultivating his lovable-eccentric persona, said last week:

Whatever you may think of the value of IQ tests, it is surely relevant to a conversation about equality that as many as 16% of our species have an IQ below 85, while about 2% have an IQ above 130.

You can read the whole speech here; the IQ bit is on page 7. Others have been alarmed by various aspects of this and the surrounding flurry of strange metaphors: society as breakfast cereal, society as centrifuge… but I will confine myself to the stats. “IQ” could mean any number of test procedures and questions. There is considerable debate about what it actually measures. There is also considerable debate about what we might want to measure, even if we could measure it. Bear that in mind, because Johnson sweeps those concerns aside.

What you do is to run your test on a bunch of normal people (a small alarm bell should be ringing here about who counts as normal). Their results might come up with a certain mean and standard deviation, and you want your results to be comparable (read “indistinguishable”) to others, so you add or subtract something to bring the mean to 100, and multiply or divide to bring the SD to 15. Now you have an IQ score. So far, we have uncertainty arising from defining the construct, the validity of the measure, and the differences between tests. But with a nice normal distribution like that, it’s quite tempting to ignore those issues and move straight on to the number-crunching.

If the mean is 100 and the SD 15, then 16% of the distribution will be less than 85.1, and 2% of the distribution will be above 130.8. What Johnson is describing is the normal distribution’s shape. Why is it like that? Because you made it that way, remember, you squashed and shoved it into standardised shape. What else creates a normal distribution in nature? Adding unrelated things together (central limit theorem) and random noise. Saying that 16% are below 85 is no more meaningful than saying half the population is below the average. Everybody now knows to laugh at politicians and their apparatchiks who say things like that. I think that’s a good thing, because we are learning little by little not to have the wool pulled over our eyes by “dilettantes and heartless manipulators”. Don’t fall for Johnson’s nonsense either.
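You can check those quantiles yourself in one line of R:

```r
qnorm(0.16, mean = 100, sd = 15)  # about 85.1: 16% of the curve lies below this
qnorm(0.98, mean = 100, sd = 15)  # about 130.8: only 2% lies above this
```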

A final note: why have the bigger portion at the low end and the smaller at the top? Clearly he is trying to appeal to a listener who thinks of themselves in the top 2%, and considers with distaste the rabble far below. Yet you could measure any old nonsense and talk with the same air of scientific veracity about quantiles of the normal distribution.

PS: I usually put a picture in, because blog readers like pictures. But on this occasion I fear it would only encourage him. You’ll note I don’t call him by his near-trademarked first name either.


Filed under Uncategorized

Best visualization – and programming – of 2013

This is an unusual recipient for the coveted Grant medal, most obviously because it actually came out in 2012. But I picked up on it this year and it is so cleverly realised and breathtakingly far-reaching in its implications that there really could be no other contender for me. Also, it’s not an amazing visualization in the sense of beautiful swirls and swooshes in delicately chosen ColorBrewer palettes (you know the sort). But given the direction of travel into JavaScript-powered interactivity, this just takes it a step further.

The author, Rasmus Bååth, is a PhD student at Lund University in Sweden. He has a bright future ahead! Some people have been talking online about JavaScript as a viable programming language for data analysis. One of the spurs has been a table published on the Julia homepage suggesting that, in several benchmark calculations, js was faster than R, faster than Matlab, and faster than Python. In one case it was faster than compiled C! The only task where it slowed down was matrix multiplication, but here the difference was not as great as for the other tasks. That table caught my eye too and after some reading around, I found Bååth’s website.

So he decided to try analysis in js, allowing users to input their own data and get graphics for all the parameters they might be interested in. And not just with any old simple analysis, Bååth programmed an MCMC Bayesian version of a t-test. The code is part of the web page, and this is then interpreted and run inside your browser – even on a smartphone!


I ran the default setup, burn-in of 20,000 iterations and then saving results from 20,000 more, and that took 34 seconds – on a Samsung S3! Pretty impressive.

If you view source and click through, you can read his own js-mcmc JavaScript file behind the scenes. This is tailored to the task at hand, so don’t expect a programmable BUGS-type interface, but then remember that BUGS was one of the great stats programming achievements of the last 20 years. To have it translated on the fly by JavaScript would be an enormous amount of work.

So, a new paradigm of interactivity could be opening up through your web browser. Once upon a time, we could only zoom into maps or select from a series of graphs. Then, we had graphics dynamically changed by our filtering and selection of data. What’s next? Perhaps we will actually be fine-tuning the analysis and looking at the results instantly, without requiring any software.


Filed under Visualization