It’s Pi Day, and yesterday I saw a tweet from Mathematics Mastery, my sister-in-law’s brainchild, which pointed out that the number zero does not occur in the first 31 digits of pi. I wondered “what’s the chances of that?” and then realised it was a fine example to get students of statistics to think through. Not because the probability is difficult to work out, but because the model and assumptions are not obvious.

Pi is a transcendental number, meaning that it was discovered by Walt Whitman, or something like that. Let’s assume that all the symbols 0-9 appear without any pattern and equally often, so the chance that a particular digit is a particular symbol is 0.1. The chance it’s not “0” is 0.9, and the 30 that follow are independent and identically distributed, so the chance of no “0” anywhere in the first 31 digits comes to 0.9 to the power 31, or 3.8%.

But you’d be just as surprised to find that “3” does not appear. Or “8”. There was nothing special *a priori* about “0”. Students will hopefully spot this if you have shown them real-life examples like “women wear red or pink shirts when they ovulate”. (Your alarm bells might start going here, detecting an approaching rant about the philosophy of inference, but relax. I’m giving you a day off.)

So we crunch up some nasty probability theory (if you’ve taught them that sort of stuff). Adding up the 3.8% for each of the ten symbols puts the chance of one or more symbols being completely absent at just over 38%. That counts the cases where two or more symbols are missing more than once, so you can subtract those unpleasant multiple absences and get back to about 34%, or just simulate it:

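    iter <- 1000000
    pb <- txtProgressBar(min = 1, max = iter, style = 3)
    # count[i, j] holds how often the digit j-1 appears in simulated sequence i
    count <- matrix(NA, iter, 10)
    for (i in 1:iter) {
      setTxtProgressBar(pb, i)
      # 31 independent digits, each equally likely to be 0-9
      # (floor(9.99999 * runif(31)) would make "9" very slightly too rare)
      x <- floor(10 * runif(31))
      for (j in 1:10) count[i, j] <- sum(x == (j - 1))
    }
    close(pb)
    # number of symbols that are entirely absent from each sequence
    noneofone <- apply(count == 0, 1, sum)
    table(noneofone)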
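
If you want something exact to check the simulation against, here is a sketch of the inclusion-exclusion sum behind the 38% and 34% figures. It assumes the same model as above: 31 independent digits, each uniform on 0-9. The notation A_d, for the event that symbol d never appears, is mine and is only there for illustration.

```latex
\begin{align*}
P(A_0) &= 0.9^{31} \approx 0.038 \\
P\Bigl(\bigcup_{d=0}^{9} A_d\Bigr)
  &= \sum_{k=1}^{9} (-1)^{k+1} \binom{10}{k} \Bigl(1 - \frac{k}{10}\Bigr)^{31} \\
  &= 10(0.9)^{31} - 45(0.8)^{31} + 120(0.7)^{31} - \dots \\
  &\approx 0.382 - 0.045 + 0.002 - \dots \approx 0.34
\end{align*}
```

The first term on its own is the "just over 38%" figure, the second is the correction for pairs of missing symbols, and the remaining terms barely move the answer.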

But there’s another issue, and I hope that someone in a class would come up with it. Why 31? That’s just because the 32nd was the first “0”. So isn’t that also capitalising on chance? Yes, I think it is. It is an exploratory look-see analysis that suddenly turned into a hard-nosed let’s-prove-something analysis because we got excited about a pattern. What we really need to examine is the chance of coming up with an n-free run of length 31 or greater, where n is any of the ten symbols we call numbers.

This is starting to sound more like a hypothesis test now, and you can get students to work with a negative binomial distribution to get it. But the important message is not how to do this particular example, or that coincidences, being ill-defined *a priori*, happen a lot (though that’s important too: “million-to-one chances crop up nine times out of ten”, wrote Terry Pratchett). Rather, it is that our belief about the data-generating process determines how we analyse the data, and it is vital to stop and think about where they came from and why we believe that particular mental/causal model *before* diving into the eureka stuff.
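
For students who want to try the negative binomial route, here is a minimal sketch for the simplest case: one symbol, chosen in advance. It assumes independent digits that are uniform on 0-9, and it treats the wait for the first “0” as a negative binomial with a single success, which is the geometric distribution. The variable names are mine. It does not tackle the harder any-of-ten-symbols version.

```r
# Assumption: digits are independent and uniform on 0-9, as in the text.
p_zero <- 0.1      # chance that any one digit is "0"
run_length <- 31   # number of "0"-free digits seen before the first "0"

# Direct calculation: 31 non-zero digits in a row
p_direct <- (1 - p_zero)^run_length

# Geometric tail: pgeom() counts failures before the first success,
# so P(failures >= 31) is the same as P(failures > 30)
p_geom <- pgeom(run_length - 1, prob = p_zero, lower.tail = FALSE)

# The same tail from the negative binomial, waiting for size = 1 success
p_nbinom <- pnbinom(run_length - 1, size = 1, prob = p_zero,
                    lower.tail = FALSE)

round(c(direct = p_direct, geometric = p_geom, neg_binomial = p_nbinom), 4)
```

All three come out at the same 3.8% as before, which is a useful sanity check before moving on to the version where n can be any symbol.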