Monthly Archives: November 2013

Staggering to work

This morning I heard an unusual announcement as I arrived at Balham (“gateway to the South”) railway station. The trains going into London are busiest, the man said, between 8:15 and 8:30, so travelling before or after this would make our journeys quicker and more comfortable.

I immediately thought this would make a good excuse to post this old London Transport poster, with its clever design and charming pictograms (and questionable math):

So, in there seemed to be 35,000 people on the tube between 5:00 and 5:30 pm. According to 2009/10’s London Travel Demand Survey, there were about 125,000 (roughly eyeballing Figure 4.1) crammed in cheek by jowl.

 

Leave a comment

Filed under Visualization

Weighing up the options for drawing an election map

The Guardian’s Nick Evershed wrote a great article on their website in September, just before the last Australian election. (This gives you an idea of the waiting list to get blogged here at Stats HQ.) They look at several different approaches to drawing maps where the population in question is very unevenly distributed. Best of all, they provide extensive links to examples. How nice to see proper in-depth discussion of graphics options in the everyday press! Their own solution is a good one, and I agree with them is the best of all the alternatives:

Leave a comment

Filed under Visualization

It pays to be Bayes

Yes, indeed. I’m looking forward to getting this sent off for publication. Wilson asked my advice about his pilot RCT in stroke rehab, and the outcomes were so complicated it took me a long time to work out what to do to utilise all the information without dropping any. Thankfully I had been reading Song & Lee’s book “Basic and Advanced Bayesian Structural Equation Modeling (with applications in the medical and behavioral sciences)” and was able to fit one of these wonderfully flexible models to the data.

It was truly one of the most satisfying projects I’ve ever contributed to, because his clinical expertise and my stats added to up something that neither of us could have done alone, and actually changed the results substantively! But you’ll have to wait to read our findings…

Leave a comment

Filed under research

Strange attractors in R

Pete Werner has posted some very nice images on his blog, Shifting Sands, that use R’s nuts-and-bolts image() function to draw thousands of tiny dots, one for each position a particle is at as it zooms around some kind of chaotic attractor. I have only glanced over this post quickly, said “whoa, cool” and it’s getting late so I can’t claim to understand exactly what he’s done, but I’m interested in the image more than the chaos. I think it’s nice because it’s simple, and almost transparent (not in the technical RGB sense). Who needs colour?

He then has some weird stuff going on in a video that involves him throwing his hands in the air like he just don’t care, while the computer does something with the resulting data. Worth a look, as long as you turn off the music first!

Leave a comment

Filed under Visualization

The sting

My colleague Nigel Rogers sent me a link to an article in Science magazine which is making waves: Who’s Afraid of Peer Review by John Bohannon. This is entertaining in a guilty-pleasure sort of way, and quite horrifying at the same time. Bohannon wrote a phony paper, designed to be bad but claiming some cancer-curing breakthrough using lichens. He sent it, under invented names intended to sound generically African (but end up reminiscent of Screamin’ Jay Hawkins in “Feast of the Mau Mau”), to lots of open access journals and found some rejected it while others were happy to wave it through in exchange for money. Some claimed to charge nothing but later mysterious administrative fees cropped up.

I feel it plays a little too fast and loose with the facts, but given I have been banging on about the lack of statistical review in journals here, there and everywhere, I thought I should mention it on this blog. The aspect I feel particularly uneasy about is how comfortable Bohannon seems at associating middle-income countries with dirty dealing. The bank account is in China? Say no more, they must be crooks. On the other hand, it’s informative to see how they designed the spoof paper. A claim of a dose-response was made in the text and contradicted in a graph. The study design would have swamped the fictional lichen juice with alcohol, which killed the cancer cells and everything else in range, while the control group got neither. Clearly nonsense, but it would be interesting to know how many highly regarded journals would have picked it up… I’m sure there are scams out there, and no doubt some websites contain falsehoods, but I knew that before reading this. The valuable information would be just how much variation there is in scrutiny, but that was contaminated by the pantomime presentation of the phony paper, in just the same way the alcohol and the lichen juice were mixed up. As one of the stung editors wrote from an office about two miles east from mine, “An element of trust must necessarily exist in research, including that carried out in disadvantaged countries. Your activities here detract from that trust.”

Or is the Science article itself the real sting, to catch closet science supremacists?

1 Comment

Filed under Uncategorized

Simulation (is where it’s happening)

Jim Silverton wrote to the Allstat mailing list recently:

“Hi,

Anyone up for a challenge?

Suppose we have [4] random variables that are random points on the surface of a sphere. What is the probability that the tetrahedron made by joining these points, contain the origin?

I tried over and over – no idea.”

This is really a classic probability “hard” question. If you’re comfortable with geometry and probability (and you happen to think of it in the right way), you’ll be fine. I looked it up online and found it came from the 1992 Putnam math competition (and could well have existed before that). There’s a solution at Wolfram so scary it would make Steven Seagal weep, while Howard and Sisson riff on it in a paper from 1996. You have to mess about with calculus and vector geometry and probability to get the extended answers.

Don't you just love that?

Don’t you just love that? “The analytic result is difficult to compute [but here it is!]” It’s just like that old cartoon where one wild-haired prof has written on the board a load of algebra followed by “and then a miracle happens”, and the other says “I think there’s a flaw in your proof”.

I have to say, I hate this kind of thing. My family have pretty much learned not to show me logic puzzles in the newspaper or anything like that. I don’t see the point. There clearly is an answer, and you just have to struggle to find it. Where’s the satisfaction in that?

But this innocuous post caught my imagination because it seemed to me a good vehicle to promote computing skills and simulation over teeth-grinding math. Wolfram says the answer is 1/8. I wrote this R code to make random tetrahedra 100,000 times and in each case check whether they contained the origin. It takes 61 seconds to run on my office PC. The answer is apparently 0.12391 (95% confidence interval 0.122 to 0.126), which is pretty neat compared to the analytical result of 0.125.


# x is the matrix of 4 points on the unit sphere
# d is in 1:4, indicating which vector to compare with the plane joining
# the other three.
# If the 4th point and the plane are in opposite directions, this supports
# the tetrahedron containing the origin. If all four are opposite, it
# is confirmed (so it seems to me, but I leave proof to others with
# time on their hands!)
m <- function(xx) {
return(sqrt(sum(xx^2)))
}
proj<-function(x,d) {
j<-(1:4)[-d]
pq<-x[j[2],]-x[j[1],]
pr<-x[j[3],]-x[j[1],]
orth<-c((pq[2]*pr[3])-(pq[3]*pr[2]),
(pq[3]*pr[1])-(pq[1]*pr[3]),
(pq[1]*pr[2])-(pq[2]*pr[1]))
unitorth<-orth/m(orth)
plane<-unitorth*sum(x[j[1],]*unitorth)
point4<-unitorth*sum(x[d,]*unitorth)
oppo<-((plane[1]>0 & point4[1]<0)|(point4[1]>0 & plane[1]<0))
return(list(oppo,point4,plane))
}
set.seed(434)
iter<-100000
results<-rep(NA,iter)
vol<-rep(NA,iter)
pb<-txtProgressBar(min=0,max=iter,style=3)
Sys.time()
for (i in 1:iter) {
a <- matrix(runif(12,min=-1,max=1), nrow=4)
mags <- apply(a, 1, m)
mags <- matrix(rep(mags, 3), ncol=3)
x <- a/mags
temp<-rep(NA,4)
for (k in 1:4) {
temp[k]<-proj(x,k)[[1]]
}
results[i]<-all(temp)
vol[i]<-(1/6)*abs(det(cbind(x,rep(1,4))))
setTxtProgressBar(pb,i)
}
table(results)/iter
png("volume.png")
plot(density(vol))
dev.off()
Sys.time()

My approach to checking whether a tetrahedron contains the origin is a little heuristic, as you’ll see. It looks sensible enough though; if you draw a line from the origin to the kth vertex, and also from the origin, orthogonal to the plane joining the other three points, and they are in the opposite direction for all k, then the origin is contained.

My point is simply that simulation gives you enormous flexibility with gnarly problems. That code above throws in the extra stuff they do at Wolfram because the original question wasn’t hard enough. What are the central moments of the distribution of the volumes? Who cares. Let’s just get the distribution straight out of the simulation! After doing the quintuple integral and then some, Wolfram say the mean volume is (4/105)pi = 0.11968. My code says 0.11923.

You want the distribution? Boom! There it is. Let's see you try to find the formula for the sampling distribution of the  third quartiles analytically!

You want the distribution of the volumes of these tetrahedra? Boom! There it is.

There are times when you can get an analytical formula and evaluate that in a split second for each scenario, which is obviously a good move. For example, you could solve a multiple linear regression by MCMC, or by matrix algebra, and the latter is going to be lightning fast, always reliable, but not extendable to anything more complex. There are other times when you can’t get the formula but by thinking about the data-generating process, you can write a simulation program really quickly and get on with your life. How about the sampling distribution of the third quartile of dodecahedral volumes in 11-dimensional space whose vertices are sampled from autocorrelated mixtures of multivariate Cauchy distributions? No problem. Last week I did some sample size calculations for students, all by simulation, because there was no simple applicable formula. It just works, but surprisingly few places teach people these skills.

PS Do you teach this kind of stuff in an introductory stats setting? I’d like to hear from you! We might be able to put together a scientific meeting around this theme. There is interest and support and a central London venue but I need speakers…

4 Comments

Filed under R

Software inter-operability meeting in London

On the afternoon of 14 November, the RSS is hosting a joint meeting with the International Biometric Society at Errol Street. The topic is inter-operability, as in software packages talking to each other to make your life easier.

This is a big deal, not some nerdish tweak. One of the speakers has summed it up so nicely in their abstract I shall simply quote them here:

“Developers of computing software (office suites, data bases etc) often seem to take the view that users should never need to go elsewhere to fulfill their needs, but can spend their entire computing lives within a single system. In statistics, it is harder to support that attitude (although some have tried!). Our subject, its methodology and applications, are so diverse and developing so quickly, that it would be arrogant and foolish for any one group to assume that they can cover the entire user-requirement. So statistical analysts may need to draw on different tools according to each situation. … [Challenges] range from the straightforward issue of file interchangeability through to the ability to act as an environment for other systems or to operate as statistical engines. There will also be a discussion of future challenges, including the ability to link systems to handle large data sets, and to provide secure analyses over the internet.”

You can read the whole programme and register online here.

Leave a comment

Filed under R, Stata