Monthly Archives: March 2014

Kosara on stories vs worlds

Robert Kosara has written recently on his blog Eager Eyes about the tension or synergy between stories (showing the consumer of data what the message is, or leading them to the points of interest, or telling them a really compelling instance) and worlds (opening it up for exploration and leaving them to it). This is something I was reflecting on last week at the RSS (for which, by the way, videos are coming soon, hopefully in 2 weeks’ time). I read somewhere in The Politics of Large Numbers (and have never been able to find the page again – perhaps I dreamt it, in the same way I was convinced for a couple of years that Flavor Flav was dead before realising I dreamt that particular news broadcast) that a great debate raged through the innovative French statistical service set up after the Revolution. Some claimed that the role of the statistician was to present data without filtering or interpreting – or even summarising. Yes, some argued against percentages and averages; a very French sort of intellectual aggression!

Step up and show these people how to work out their own particular time that they find interesting

But the prolegomena makes or breaks a visualisation, as I raved about here recently. The best example might be Budget Forecasts, Compared With Reality, and while some bad ones come to mind, I don’t really want to single one out. I’m sure you have your own bugbear, or you can just visit wtfviz.

For me, there’s really no difference between introducing a visualization and introducing a table of stats, or a page of results text. Effective communication will involve some story and some world, and not just one-size-fits-all, as Gelman and Unwin pointed out. It’s interesting, though, that all this attention and investigation and debate goes on for graphics, while nobody pays any attention to what tables or words we should use to get people engaged with, understanding and remembering our/their stories. Where are the research studies comparing layouts of logistic regression coefficient tables in terms of comprehension and recall?

1 Comment

Filed under Visualization

Statistics is no substitute for thinking (exhibit 1)

Thanks to R-bloggers, I discovered a thoughtful post by Joel Caldwell on a blog called Engaging Market Research. Not obviously my thang, but then they do a lot of cluster analysis, and every now and then somebody asks me to do that or show them how to do it.


Unlike regressions and other analyses that treat all your data as one lump, clustering requires some careful thought. And careful thought is generally lacking in the world of data analysis. “Just show me which button to push”, I imagine students and clients thinking as they tune out my droning voice talking about assumptions and context.

Firstly, we get a false picture from the classic examples such as Old Faithful:

Yup, that'll be two clusters then


and Fisher’s irises:

Three, you say? Three it is then.


Life isn’t like that. It’s really hard to pull data apart into clusters, and people out there who hire statisticians have been given a false idea that the computer will make sense of everything. The thing is, even though there may not be any evidence to favour, say, four clusters over five when you are looking at a long drawn-out blob, there is often a contextual reason, and that’s good enough! Statistics is supposed to help you make decisions under uncertainty. As long as you don’t start to believe that you have discovered some immutable law of the universe, you’ll be fine.
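For completeness, here is how those two textbook answers fall out in base R. Note that k is chosen by eye in both cases, which is exactly the point at issue; the scaling and nstart settings are my own choices:

```r
# Old Faithful: the classic two-cluster picture
data(faithful)
km <- kmeans(scale(faithful), centers = 2, nstart = 25)
km$size  # two well-separated groups

# Fisher's irises: three clusters, lining up (roughly) with the three species
km3 <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
table(km3$cluster, iris$Species)
```

The computer happily returns whatever number of clusters you ask for, in both cases; the "right" answer comes from the caption, not the algorithm.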

Here I really like Joel’s explanation:

The shoe manufacturer can get along with three sizes of sandals but not three sizes of dress shoes. It is not the foot that is changing, but the demands of the customer. Thus, even if segments are no more than convenient fictions, they can be useful from the manager’s perspective. […] These segments are meaningful only within the context of the […] problem

And again:

Nor should we be wedded to the stability of our segment solution when those segments were created by dynamic forces that continue to act and alter its structure. Our models ought to be as diverse as the objects we are studying.

This guy has thought long and hard about what he is trying to do, which is the number one skill the data analyst needs.

Leave a comment

Filed under Uncategorized

Checking and cleaning in big data – at the RSS conference

The invited speakers’ sessions for this year’s Royal Statistical Society conference are now announced, and I’m delighted that a Cinderella topic I proposed is included. I intend to be there on the day to chair the session. “Checking and cleaning in big data” will include talks by:

  • Ton de Waal from Statistics Netherlands, one of the world’s leading experts on automated algorithmic editing and imputation
  • Elizabeth Ford from Brighton & Sussex Medical School, who has recently done some ground-breaking work using free-text information from GP records to validate and complete the standard coded data
  • Jesus Rogel from Dow Jones, who develops cutting-edge techniques for processing real-time streams of text from diverse online sources, to identify unique and reliable business news stories, where speed and reliability are paramount

The inclusion of Big Data here is not just à la mode; there comes a point somewhere in Moderate Data* where you just can’t check the consistency and quality the way that you would like to. Most of us involved in data analysis will know that feeling of nagging doubt. I used to crunch numbers for the National Sentinel Audit of Stroke Care, and as the numbers got up over 10,000 patients and about 100 variables, the team were still making a colossal effort to check every record contributed by hospitals, but that was operating at the limit of what is humanly feasible. To go beyond that you need to press your computer into service. At first it seemed that there were no tools out there to do it, but gradually I discovered interesting work from census agencies, and that led me into this topic. I feel strongly that data analysts everywhere need to know about it, but it remains a neglected topic. After all, checking and cleaning is not cool.

* – I think Moderate Data is just the right name for this common situation. Big Data means bigger than your RAM, i.e. you can’t open it in one file. Moderate means bigger than your head, i.e. you don’t have the time to check it all or the ability to retain many pieces of information while you check for consistency. It also makes me think of the Nirvana song that starts with the producer saying the ubiquitous sheet music instruction “moderate rock”, and what follows is anything but moderate; that’s the Moderate Data experience too: looks OK, turns into a nightmare. To take a personal example, my recent cohort study from the THIN primary care database involved working with Big Data (40,000,000 person-years: every prescription, every diagnosis, every test, every referral, every symptom) to extract what I needed. Once I had the variables I wanted, it was Moderate Data (260,611 people with about 20 variables). I couldn’t check it without resorting to some pretty basic stuff: looking at univariate extreme cases, drawing bivariate scatterplots, calculating differences. Another neat approach to find multivariate outliers is to run principal components analysis and look at the components with the smallest eigenvalues (the ones you usually throw away). That’s one of many great pieces of advice from Gnanadesikan & Kettenring in 1972.
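The Gnanadesikan & Kettenring trick is nearly a one-liner in R. A sketch on a built-in dataset (mtcars is just a stand-in for real data, and the variable names are mine):

```r
# Multivariate outliers show up as extreme scores on the last principal
# components -- the small-eigenvalue ones you usually throw away
pc <- prcomp(mtcars, scale. = TRUE)
last <- ncol(pc$x)                  # index of the smallest-eigenvalue component
scores <- pc$x[, last]
# the cases furthest from the centre on that component are the suspects
head(sort(abs(scores), decreasing = TRUE), 3)
```

The intuition: the small components describe directions in which the bulk of the data barely varies, so a point that sticks out there is breaking the correlation structure the rest of the data obeys.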

Here’s the complete list of sessions:

Tuesday 2nd September

  • Communicating statistical evidence and understanding probabilistic inference in criminal trials
  • New advances in multivariate modelling techniques
  • Bayesian spatio-temporal methodology for predicting air pollution concentrations and estimating its long-term health effects
  • Data visualisation – storytelling by numbers
  • Papers from the RSS Journal
  • Data collection challenges in establishment surveys

Wednesday 3rd September

  • Bayes meets Bellman
  • Statistics in Sport (two sessions)
  • Statistics and emergency care
  • YSS/Significance Writing Competition finalists
  • Quantum statistics
  • Statistical modelling with big data
  • Measuring segregation

Thursday 4th September

  • YSS/RS Predictive challenge finalists
  • Checking and cleaning in big data
  • Who’s afraid of data science?
  • Advances in Astrostatistics
  • Exploiting large genetic data sets

Leave a comment

Filed under Uncategorized

A simple R bootstrap function for beginners

I teach some introductory stats classes with SPSS, and one of the frustrations for me is that you have to pay an extra wad of cash to do any bootstrapping [edit August 2014: actually, the situation in new versions is much happier than this: see footnote below!]. It’s not exactly the complete analysis solution that you might expect from the sales literature. I could go on, but I guess IBM have better lawyers than me.

I think it’s a good idea to introduce the bootstrap early on. It’s not some advanced rocket-science technique any more, and beginners do a lot of medians and quartiles, for which there’s no (proper) standard error. What’s more, it makes random sampling at the heart of inference really explicit, which I think actually makes learning easier.


So, I’ve been considering showing the students a quick diversion into R for the bootstrap. Stata has a great bootstrap syntax, but it’s not available in our computer teaching rooms. We don’t have time to do more in R, and it’s beyond my control to switch the whole software package for the course. The trouble is, the boot package has that weird thing where you have to redefine whatever statistic you’re interested in as a function with two parameters, even if a perfectly good one already exists. I suppose it’s to do with efficiency and vectorising over the replicate index vectors, but it’s the last thing you want to talk beginners through.
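To see what I mean, this is the detour boot demands even for something as simple as a median (toy data; med.fun is the redundant two-parameter wrapper):

```r
library(boot)
set.seed(1)
x <- rnorm(50)
# boot() insists on the statistic as a function of the data and an
# index vector, even though median() already exists
med.fun <- function(d, i) median(d[i])
b <- boot(x, med.fun, R = 1000)
boot.ci(b, type = "perc")  # percentile bootstrap confidence interval
```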

So I thought I could usefully wrap up the usual boot functions in a simpleboot wrapper. It’s not a wise choice for serious analysis, and I haven’t made it pretty, but for teaching it makes things a little more accessible. Any thoughts, let me know.

simpleboot <- function(x, stat, reps=1000) {
  require(boot)
  cat("Bootstrapping can go wrong!\n")
  cat("This simple function will not show you warning messages.\n")
  cat("Check results closely and be prepared to consult a statistician.\n")
  if (stat == "max" | stat == "min") {
    warning("Bootstrap is likely to fail for minima and maxima")
  }
  # boot() wants the statistic as a function of the data and an index vector;
  # build that wrapper from the function name supplied as a string
  p.func <- function(x, i) match.fun(stat)(x[i])
  myboots <- boot(x, p.func, R=reps)
  hist(myboots$t, breaks=25, main="EDF from bootstrap", xlab=stat)
  return(myboots)
}
# example: simpleboot(rnorm(50), "median")

edited August 2014: So, as of version 21 (and maybe a little earlier – I skipped some) there is a “Bootstrap” button on most dialog boxes. I will start teaching how to use this as of the start of term in September 2014.


Filed under learning, R

Visualizations on the Monopoly board

Two items of post from utility companies that recently dropped through our door included little graphics. There was a degree of innovation in them both. The first, from British Gas, is technically OK but probably bad on perceptual grounds:


I got a tape measure out and started checking that they had scaled the flames and light bulbs by their area. (Sad, I know, but this is the fate that befalls all statisticians in the end.) And yes, it seemed they had – if you included the space down to the little shadow underneath. In fact, someone had clearly been very careful to scale it just right, but the gap of clear space and the indistinct shadow are probably not perceived as part of the icon. I think they’re cute, but not so easy to derive facts from.

Next up from Thames Water:


This looks like a really bad idea. As if pies weren’t hard enough to judge anyway, making it into a drop is completely confusing. The categories at the top are possibly expanded in size just for aesthetic reasons. I thought I would check how much the area occupied by “day-to-day running” differed from the nominal 38/125=30%. First, to avoid confusion of colors, I brought out the GIMP and made a simplified version:


and then read it into R and counted the blue and black pixels:
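The count itself is straightforward to reproduce. A sketch, where the blue_share helper, the colour thresholds and the filename are all my own inventions (the png package does the reading):

```r
# fraction of pixels that are blue, out of blue + black
# (thresholds are rough; tune them to the actual colours in the image)
blue_share <- function(img) {  # img: height x width x RGB array, values in [0, 1]
  blue  <- img[, , 3] > 0.5 & img[, , 1] < 0.5
  black <- img[, , 1] < 0.2 & img[, , 2] < 0.2 & img[, , 3] < 0.2
  sum(blue) / (sum(blue) + sum(black))
}

# library(png); blue_share(readPNG("drop.png"))  # hypothetical filename
```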


and that turns out to have 37% of the drop allocated to “day-to-day running”. Bad, bad, bad…


Filed under R, Visualization