Tag Archives: big data

Performance indicators and routine data on child protection services

The parts of social services that do child protection in England get inspected by Ofsted on behalf of the Department for Education (DfE). The process is analogous to the Care Quality Commission inspections of healthcare and adult social care providers, and they both give out ratings of ‘Inadequate’, ‘Requires Improvement’, ‘Good’ or ‘Outstanding’. In the health setting, there’s many years’ experience of quantitative quality (or performance) indicators, often through a local process called clinical audit and sometimes nationally. I’ve been involved with clinical audit for many years. One general trend over that time has been away from de novo data collection and towards recycling routinely collected data. Especially in the era of big data, lots of organisations are very excited about Leveraging Big Data Analytics to discover who’s outstanding, who sucks, and how to save lives all over the place. Now, it may not be that simple, but there is definitely merit in using existing data.

This trend is only just appearing on the horizon for social care, though, because records are less organised and less often electronic, and because there just hasn’t been that culture of profession-led audit. Into this scene came my colleagues Rick Hood (complex systems thinker) and Ray Jones (now retired professor and general Colossus of UK social care). They wanted to investigate recently released open data on child protection services and asked if I would be interested in joining in. I was – and I wanted to consider this question: could routine data replace Ofsted inspections? I suspected not! But I also suspected that question would soon be asked in the cash-strapped corridors of the DfE, and I wanted to head it off with some facts and some proper analysis.

We hired master data wrangler Allie Goldacre, who combed through, tested, verified and combined the various sources:

  • Children in Need census, and its predecessor the Child Protection and Referrals returns
  • Children and Family Court Advisory and Support Service records of care proceedings
  • DfE’s Children’s Social Work Workforce statistics
  • SSDA903 records of looked-after children
  • Spending statements from local authorities
  • Local authority statistics on child population, deprivation and urban/rural locations.

Just because the data were ‘open’ didn’t mean they were usable. Each set had its own quirks, and in some cases individual local authorities had their own problems and definitions. The data wrangling was painstaking and painful! As it’s all in the public domain, I’m going to add the data and code to my website here very soon.

Then, we wrote this paper investigating the system and this paper trying to predict ‘Inadequate’ ratings. The second of these took all the predictors in 2012 (the most complete year for data) and tried to predict Inadequates in 2012 or 2013. We used the marvellous glmnet package in R and got down to three predictors:

  • Initial assessments within the target of 10 days
  • Re-referrals to the service
  • The use of agency workers

Together they get 68% of teams right, and that could not be improved on. We concluded that 68% was not good enough to replace inspection, and called it a day.
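
For the record, here’s a minimal sketch of the kind of elastic-net fit involved, assuming a hypothetical data frame la2012 with one row per local authority and made-up column names; the real code will go up alongside the data here.

    library(glmnet)
    x <- as.matrix(la2012[, candidate_predictors])   # all the candidate 2012 indicators (hypothetical names)
    y <- la2012$inadequate_2012_13                   # 1 = rated Inadequate in 2012 or 2013
    # elastic net: alpha mixes ridge (0) and lasso (1); cross-validation chooses the penalty
    cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)
    coef(cvfit, s = "lambda.1se")                    # sparse coefficients: most predictors shrink to exactly zero
    pred <- predict(cvfit, newx = x, type = "class", s = "lambda.1se")
    mean(pred == y)                                  # in-sample proportion of authorities classified correctly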

But lo! Soon afterwards, the DfE announced that they had devised a new Big Data approach to predict Inadequate Ofsted scores, and that (what a coincidence!) it used the same three indicators. Well I never. We were not credited for this, nor indeed had our conclusion (that it’s a stupid idea) sunk in. Could they have just followed a parallel route to ours? Highly unlikely, unless they had an Allie at work on it, and I get no impression of the nuanced understanding of the data that would result from that.

Ray noticed that the magazine Children and Young People Now were running an article on the DfE prediction, and I got in touch. They asked for a comment and we stuck it in here.

A salutary lesson that cash-strapped Gradgrinds, starry-eyed with the promises of big data after reading some half-cocked article in Forbes, will clutch at any positive message that suits them and ignore the rest. This is why careful curation of predictive models matters: the consumer is generally not equipped to make judgements about using them.

A closing aside: Thomas Dinsmore wrote a while back that a fitted model is intellectual property. I think it would be hard to argue that coefficients from an elastic-net regression are mine and mine only, although the distinction may well be in how they are used, and this will appear in courts around the world now that they are viewed as commercially advantageous.

Filed under research

Stats and data science, easy jobs and easy mistakes

I have been writing some JavaScript, and I was thinking about how web dev / front-end people are obliged to use the very latest tools, not so much for utility as for kudos. This seems mysterious to me but then I realised: it’s because the basic job — make a website — is so easy. The only way to tell who’s really seriously in the game is by how up to date they are. Then, this is the parallel that occurred to me: statistics is hard to get right, and a beginner is found out over and over again on the simplest tasks. On the other hand, if you do a lot of big data or machine learning or both, then you might screw stuff up left, right, and centre, but you are less likely to get caught. Because…

  • nobody has the time and energy to re-run your humungous analysis
  • it’s a black box anyway*
  • you got headhunted by Uber last week

And maybe that’s one reason why there is more emphasis on having the latest shizzle in a data science job that’s more of a mixture of stats and computer science influences. I’m not taking a view that old ways are the best here, because I’m equally baffled by statisticians who refuse to learn anything new, but the lack of transparency and accountability (oh what British words!) is concerning.

* – this is not actually true, but it is the prevailing attitude

Filed under Uncategorized

Dataviz of the week, 25/1/17

We had guideline-bustin’, kiddie-stiflin’, grandparent-over-the-threshold-usherin’ pollution in London at the beginning of the week. This is fairly standard nowadays, sadly. It’s not quite so bad out where I live in the Cronx, but in town it’s the worst in Europe. At the same time, Cameron Beccario pointed out the Beijing effect in his wonderful globe of carbon monoxide levels – far worse than anywhere else in the world, though there are some petrochemical hot spots. I’ve praised this live viz before, but that was before I started having a pick of the week on my office door (then, when the door went, here on the blog), so I’ll mention it again. Nice.

[Screenshot: Cameron Beccario’s live globe of carbon monoxide levels, 25 January 2017]

Filed under Visualization

Answers to questions you hadn’t even thought of

This recent BBC Radio 4 "Farming Today" show (available to listen to online for a limited time) visited Rothamsted Research Station, former home of stats pioneer Ronald Fisher, and considered the role of remote sensing, rovers, drones etc. for agriculture, and, most interestingly perhaps for you readers, the big data that result.

David Flanders, chief executive of Agrimetrics (a partnership of Rothamsted and other academic organisations), said of big data (about 19 minutes into the show):

I think originally in the dark ages of computing, when it was invented, it had some very pedantic definition that involved more than the amount of data that one computer can handle with one program or something. I think that’s gone by the wayside now. The definition I like is that it gives you answers to questions you hadn’t even thought of.

which I found confusing and somewhat alarming. I assume he knows a lot more about big data than I do, as he runs a ‘big data centre of excellence’ and I run a few computers (although his LinkedIn profile features the boardroom over the lab), but I’m not sure why he plays down the computational challenge of data exceeding memory. That seems to me to be the real point of big data. Sure, we have tools to simplify distributed computing, and if you want to do something based on binning or moments, then it’s all pretty straightforward. But efficient algorithms to scale up more complex statistical models are still being developed, and the problem is by no means a thing of the past. Perhaps the emphasis on heuristic algorithms for local optima in the business world has driven this view that distributed data and computation are done and dusted. I am always amazed at how models I feel are simple are sometimes regarded as mind-blowing in the machine learning / business analytics world. It may be because they don’t scale so well (yet) and don’t come pre-packaged in big data software (yet).
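
To illustrate the binning-and-moments point: summaries built from sufficient statistics can be accumulated chunk by chunk, so data bigger than memory are no great obstacle. A toy sketch in R (the file name and column position are invented):

    # accumulate running sums over a CSV too big to load, to get the mean and variance
    con <- file("huge_sensor_readings.csv", open = "r")
    invisible(readLines(con, n = 1))                          # skip the header row
    n <- 0; s <- 0; ss <- 0
    repeat {
      chunk <- readLines(con, n = 1e6)                        # a million lines at a time
      if (length(chunk) == 0) break
      x <- as.numeric(sapply(strsplit(chunk, ","), `[`, 3))   # suppose the value of interest is in column 3
      n <- n + length(x); s <- s + sum(x); ss <- ss + sum(x^2)
    }
    close(con)
    mean_x <- s / n
    var_x <- ss / n - mean_x^2   # fine for a sketch; use Welford's method for numerical stability

Anything that needs the joint structure of all the data at once, like fitting a hierarchical model, is where the real computational work still lies.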

In contrast, the view that, with enough data, truths will present themselves unbidden to the analyst, is a much more dangerous one. Here we find enormous potential for overt and cryptic multiplicity (which has been discussed ad nauseam elsewhere), and although I can understand how a marketing department in a medium-sized business would be seduced by such promises from the software company, it’s odd, irresponsible even, to hear a scientist say it to the public. Agrimetrics’ website says

data in themselves do not provide useful insight until they are translated into knowledge

and hurrah for that. It sounds like a platitude but is quite profound. Only with contextual information, discussion and the involvement of experts from all parts of the organisation generating and using the data do you really get a grip on what’s going on. These three points were originally a kind of office joke, like buzzword bingo, when I worked on clinical guidelines, but I later realised that they were, accidentally, the answer to making proper use of data:

  • engage key stakeholders
  • close the loop
  • take forward best practice (you may feel you’ve seen these before)

or, less facetiously: talk to everyone about these data (not just the boss), get them all involved in discussions to define the questions and interpret the results, and then do the same in translating those results into recommendations for action. No matter how big your data are, this does not go away.

Filed under computing

Teaching Statistics at Google Scale

Not unrelated to the Data Science angle, I read the recent paper "Teaching Statistics at Google Scale" today. I don’t think it actually has anything to do with teaching, but it does have some genuinely interesting examples of the inventive juggling required to make inferences in big data situations. You should read it, especially if you are still not sure whether big data is really a thing.

Filed under computing

Complex interventions: MRC guidance on researching the real world

The MRC has had advice on evaluating "complex interventions" since 2000, updated in 2008. By complex interventions, they mean things like encouraging children to walk to school: not complex in the sense of being made up of many parts, but complex in the sense that the way it happens and the effect it has are hard to predict because of non-linearities, interactions and feedback loops. Complexity is something I have been thinking and reading about a lot recently; it really is unavoidable in most of the work I do (I never do simple RCTs; I mean, how boring is it if your life’s work is comparing drug X to placebo using a t-test?) and although it is supertrendy and a lot of nonsense is said about it, there is some wisdom out there too. However, I always found the 2000/2008 guidance facile: engage stakeholders, close the loop, take forward best practice. You know you’re not in for a treat when you see a diagram like this:

[The flowchart in question, from the 2000/2008 guidance]

Now, there is a new guidance document out that gets into the practical details and the philosophical underpinnings at the same time: wonderful! There’s a neat summary in the BMJ.

What I particularly like about this, and why it should be widely read, is that it urges all of us researchers to be explicit a priori about our beliefs and mental causal models. You can’t measure everything in a complex system, so you have to reduce it to the stuff you think matters, and you’d better be able to justify or at least be clear about that reduction. It acknowledges the role that context plays in affecting the results observed and also the inferences you choose to make. And it stresses that the only decent way of finding out what’s going on is to do both quantitative and qualitative data collection. That last part is interesting because it argues against the current fashion for gleeful retrospective analysis of big data. Without talking to people who were there, you know nothing.

My social worker colleague Rick Hood and I are putting together a paper on this subject of inference in complex systems. First I’ll be talking about it in Rome at IWcee (do come! Rome is lovely in May), picking up ideas from economists, and then we’ll write it up over the summer. I’ll keep you posted.

Filed under research

Checking and cleaning in big data – at the RSS conference

The invited speakers’ sessions for this year’s Royal Statistical Society conference are now announced, and I’m delighted that a Cinderella topic I proposed is included. I intend to be there on the day to chair the session. “Checking and cleaning in big data” will include talks by:

  • Ton de Waal from Statistics Netherlands, one of the world’s leading experts on automated algorithmic editing and imputation
  • Elizabeth Ford from Brighton & Sussex Medical School, who has recently done some ground-breaking work using free-text information from GP records to validate and complete the standard coded data
  • Jesus Rogel from Dow Jones, who develops cutting-edge techniques for processing real-time streams of text from diverse online sources, to identify unique and reliable business news stories, where speed and reliability are paramount

The inclusion of Big Data here is not just à la mode; there comes a point somewhere in Moderate Data* where you just can’t check the consistency and quality the way that you would like to. Most of us involved in data analysis will know that feeling of nagging doubt. I used to crunch numbers for the National Sentinel Audit of Stroke Care, and as the numbers got up over 10,000 patients and about 100 variables, the team were still making a colossal effort to check every record contributed by hospitals, but that was operating at the limit of what is humanly feasible. To go beyond that you need to press your computer into service. At first it seemed that there were no tools out there to do it, but gradually I discovered interesting work from census agencies, and that led me into this topic. I feel strongly that data analysts everywhere need to know about it, but it remains a neglected topic. After all, checking and cleaning is not cool.

* – I think Moderate Data is just the right name for this common situation. Big Data means bigger than your RAM, i.e. you can’t open it in one file. Moderate means bigger than your head, i.e. you don’t have the time to check it all or the ability to retain many pieces of information while you check for consistency. It also makes me think of the Nirvana song that starts with the producer saying the ubiquitous sheet music instruction “moderate rock”, and what follows is anything but moderate; that’s the Moderate Data experience too: looks OK, turns into a nightmare. To take a personal example, my recent cohort study from the THIN primary care database involved working with Big Data (40,000,000 person-years: every prescription, every diagnosis, every test, every referral, every symptom) to extract what I needed. Once I had the variables I wanted, it was Moderate Data (260,611 people with about 20 variables). I couldn’t check it without resorting to some pretty basic stuff: looking at univariate extreme cases, drawing bivariate scatterplots, calculating differences. Another neat approach to find multivariate outliers is to run principal components analysis and look at the components with the smallest eigenvalues (the ones you usually throw away). That’s one of many great pieces of advice from Gnanadesikan & Kettenring in 1972.
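
For anyone wanting to try that last trick, here is a rough sketch in R (the data frame name is invented): the records with extreme scores on the smallest-eigenvalue components are the ones that break the usual correlation structure.

    # principal components on standardised variables
    pc <- prcomp(moderate_data_numeric, center = TRUE, scale. = TRUE)
    last2 <- tail(seq_len(ncol(pc$x)), 2)     # the two components with the smallest eigenvalues
    d <- rowSums(scale(pc$x[, last2])^2)      # squared standardised scores on those components
    head(order(d, decreasing = TRUE), 20)     # row numbers worth inspecting as possible multivariate outliers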

Here’s the complete list of sessions:

Tuesday 2nd September

  • Communicating statistical evidence and understanding probabilistic inference in criminal trials
  • New advances in multivariate modelling techniques
  • Bayesian spatio-temporal methodology for predicting air pollution concentrations and estimating its long-term health effects
  • Data visualisation – storytelling by numbers
  • Papers from the RSS Journal
  • Data collection challenges in establishment surveys

Wednesday 3rd September

  • Bayes meets Bellman
  • Statistics in Sport (two sessions)
  • Statistics and emergency care
  • YSS/Significance Writing Competition finalists
  • Quantum statistics
  • Statistical modelling with big data
  • Measuring segregation

Thursday 4th September

  • YSS/RS Predictive challenge finalists
  • Checking and cleaning in big data
  • Who’s afraid of data science?
  • Advances in Astrostatistics
  • Exploiting large genetic data sets

Filed under Uncategorized