Checking and cleaning in big data – at the RSS conference

The invited speakers’ sessions for this year’s Royal Statistical Society conference are now announced, and I’m delighted that a Cinderella topic I proposed is included. I intend to be there on the day to chair the session. “Checking and cleaning in big data” will include talks by:

  • Ton de Waal from Statistics Netherlands, one of the world’s leading experts on automated algorithmic editing and imputation
  • Elizabeth Ford from Brighton & Sussex Medical School, who has recently done some ground-breaking work using free-text information from GP records to validate and complete the standard coded data
  • Jesus Rogel from Dow Jones, who develops cutting-edge techniques for processing real-time streams of text from diverse online sources, identifying unique, reliable business news stories where speed is paramount

The inclusion of Big Data here is not just à la mode; there comes a point somewhere in Moderate Data* where you just can’t check the consistency and quality the way that you would like to. Most of us involved in data analysis will know that feeling of nagging doubt. I used to crunch numbers for the National Sentinel Audit of Stroke Care, and as the numbers got up over 10,000 patients and about 100 variables, the team were still making a colossal effort to check every record contributed by hospitals, but that was operating at the limit of what is humanly feasible. To go beyond that you need to press your computer into service. At first it seemed that there were no tools out there to do it, but gradually I discovered interesting work from census agencies, and that led me into this topic. I feel strongly that data analysts everywhere need to know about it, but it is a neglected topic. After all, checking and cleaning is not cool.

* – I think Moderate Data is just the right name for this common situation. Big Data means bigger than your RAM, i.e. you can’t open it in one file. Moderate means bigger than your head, i.e. you don’t have the time to check it all or the ability to retain many pieces of information while you check for consistency. It also makes me think of the Nirvana song that starts with the producer saying the ubiquitous sheet music instruction “moderate rock”, and what follows is anything but moderate; that’s the Moderate Data experience too: looks OK, turns into a nightmare. To take a personal example, my recent cohort study from the THIN primary care database involved working with Big Data (40,000,000 person-years: every prescription, every diagnosis, every test, every referral, every symptom) to extract what I needed. Once I had the variables I wanted, it was Moderate Data (260,611 people with about 20 variables). I couldn’t check it without resorting to some pretty basic stuff: looking at univariate extreme cases, drawing bivariate scatterplots, calculating differences. Another neat approach to find multivariate outliers is to run principal components analysis and look at the components with the smallest eigenvalues (the ones you usually throw away). That’s one of many great pieces of advice from Gnanadesikan & Kettenring in 1972.
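The smallest-eigenvalue trick above can be sketched in a few lines of numpy. This is a minimal illustration, not the method from any particular paper: it standardises the variables, projects onto the principal components with the smallest eigenvalues (the near-null space, where consistent records should sit close to zero), and scores each record by its standardised squared distance there. The function name, the choice to keep a quarter of the components, and the toy data are all my own assumptions for the example.

```python
import numpy as np

def pca_outlier_scores(X):
    """Score multivariate outliers using the principal components with the
    smallest eigenvalues: records that break the correlation structure stand
    out in this near-null space even when every variable looks fine on its own."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardise each variable
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))  # ascending order
    k = max(1, X.shape[1] // 4)                        # keep smallest quarter (arbitrary choice)
    proj = Z @ eigvecs[:, :k]                          # project onto the "discarded" components
    return (proj ** 2 / eigvals[:k]).sum(axis=1)       # standardised squared distance

# Toy example: two strongly correlated variables, plus one record that
# quietly breaks the correlation while staying in range on each variable.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])
X[0] = [2.0, -4.0]  # wrong-sign relationship: invisible to univariate checks
scores = pca_outlier_scores(X)
print(scores.argmax())  # the inconsistent record has by far the largest score
```

The point of scoring in the small components rather than the large ones is exactly the one made in the text: a record can be unremarkable on every variable separately and still be wildly inconsistent with the joint structure.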

Here’s the complete list of sessions:

Tuesday 2nd September

  • Communicating statistical evidence and understanding probabilistic inference in criminal trials
  • New advances in multivariate modelling techniques
  • Bayesian spatio-temporal methodology for predicting air pollution concentrations and estimating its long-term health effects
  • Data visualisation – storytelling by numbers
  • Papers from the RSS Journal
  • Data collection challenges in establishment surveys

Wednesday 3rd September

  • Bayes meets Bellman
  • Statistics in Sport (two sessions)
  • Statistics and emergency care
  • YSS/Significance Writing Competition finalists
  • Quantum statistics
  • Statistical modelling with big data
  • Measuring segregation

Thursday 4th September

  • YSS/RS Predictive challenge finalists
  • Checking and cleaning in big data
  • Who’s afraid of data science?
  • Advances in Astrostatistics
  • Exploiting large genetic data sets
