# Tag Archives: big data

## Joy unconfined

There’s been much excitement online about ggjoy, an R package that makes vertically stacked area charts with a common x-axis. It is reminiscent of the Joy Division album cover, so I guess that ticks some cultural boxes in the dataviz community. The original version by Henrik Lindberg had the areas extending into the chart above, which provoked some killjoys to complain, and the height got standardised. Here’s Xan Gregg tackling it with a stacked (small multiple? see below) horizon plot.

Fair enough, though, as Robert Kosara pointed out, it’s more joyless. Bob Rudis did a heatmap – violin plot cross

And Henrik did a classic heatmap

None of which beat the original in my estimation.

You might use this look if you’re visualising continuous x, continuous y and categorical z. z is the stacking variable. Now, z could be ordinal or you might want to order it in some sensible way. What way? Like in the example above, it works best when there is an approximately smooth curve through the chart from top to bottom. Lindberg and Gregg have used different orders. There isn’t a right or wrong, because you emphasise different messages either way. Such is the way of dataviz. I think this is an important point, but for now consider a random z order.

Not so good huh? To play with the order, clone Henrik’s github repo and play with the line

`mutate(activity = reorder(activity, p_peak, FUN=which.max)) %>% # order by peak time`

I did:

` mutate(activity = reorder(activity, p_peak, FUN=function(x) { runif(1) })) %>% # order randomly `
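If you want to see what `reorder` is doing without cloning anything, here is a minimal base R sketch. The toy data frame and its values are invented for illustration (only the column names `activity` and `p_peak` come from the line above), and it assumes the rows are in time order within each activity, so that `which.max` returns the position of the peak.

```r
# Toy data (hypothetical values): three activities, five time points each,
# rows in time order within each activity.
toy <- data.frame(
  activity = factor(rep(c("sleep", "lunch", "tv"), each = 5)),
  time     = rep(1:5, times = 3),
  p_peak   = c(0.9, 0.4, 0.1, 0.1, 0.3,   # sleep peaks at time 1
               0.0, 0.2, 0.8, 0.2, 0.0,   # lunch peaks at time 3
               0.1, 0.1, 0.2, 0.5, 0.9)   # tv peaks at time 5
)

# reorder() applies FUN to the p_peak values of each activity and sorts the
# factor levels by the result; which.max gives the position of the peak.
by_peak <- reorder(toy$activity, toy$p_peak, FUN = which.max)
levels(by_peak)

# Random order: FUN ignores its input and returns one uniform draw per level.
set.seed(1)
at_random <- reorder(toy$activity, toy$p_peak, FUN = function(x) runif(1))
levels(at_random)
```

The data never change, only the factor levels, and the levels are what the plotting code uses to decide the stacking order.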

I think the appeal of these charts is that they emulate 3-D objects. That’s why I’m laid back about the chart extending into the block above.

This is all straightforward stuff with the three variables. Back in the day, I made an R function called slicedens that does something similar with two continuous variables. I chop up one of them (y) into slices, get kernel density plots for x in each slice of y, and off you go. Note the semi-transparency please.
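For the curious, the idea can be sketched in a few lines of base R. This is not the slicedens code itself, just a rough reconstruction of the recipe described above; the function names, the equal-width slices, the shared bandwidth and the simulated data are all my assumptions for illustration.

```r
# Sketch only: chop y into slices, then get a kernel density of x per slice.
slice_density <- function(x, y, n_slices = 10, n_grid = 128) {
  breaks <- seq(min(y), max(y), length.out = n_slices + 1)  # equal-width slices
  slice  <- cut(y, breaks = breaks, include.lowest = TRUE)
  from <- min(x); to <- max(x)      # common x grid for every slice
  bw   <- bw.nrd0(x)                # one shared bandwidth (an assumption)
  heights <- sapply(levels(slice), function(s) {
    xs <- x[which(slice == s)]
    if (length(xs) == 0) return(numeric(n_grid))
    # scale by the slice's share of the data so busy slices draw taller
    density(xs, bw = bw, from = from, to = to, n = n_grid)$y *
      length(xs) / length(x)
  })
  list(x = seq(from, to, length.out = n_grid), heights = heights)
}

# Stack the curves with a vertical offset and semi-transparent fill.
plot_slices <- function(sd_out, overlap = 3) {
  k    <- ncol(sd_out$heights)
  step <- max(sd_out$heights) / overlap   # vertical offset between slices
  plot(NA, xlim = range(sd_out$x),
       ylim = c(0, step * (k - 1) + max(sd_out$heights)),
       xlab = "x", ylab = "", yaxt = "n")
  for (j in k:1) {                        # back slice first, front slice last
    base <- step * (j - 1)
    polygon(c(sd_out$x, rev(sd_out$x)),
            c(base + sd_out$heights[, j], rep(base, length(sd_out$x))),
            col = adjustcolor("steelblue", alpha.f = 0.5), border = "white")
  }
}

set.seed(42)
x <- rnorm(5000)
y <- x + rnorm(5000)                      # simulated data, for illustration only
sd_out <- slice_density(x, y, n_slices = 8)
```

Calling `plot_slices(sd_out)` draws the stack. A larger `overlap` lets each curve reach further into the block above, which is the killjoy-baiting look.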

This attacks the recurring problem of a scatterplot with Big Data where it just turns into one massive blob of ink. Hexbins, contour plots etc are alternatives. Which of these can you batch-process when the data exceed the RAM or are streaming? Hexbins yes, because you just add to the count (and for real-time data, subtract the oldest set of counts). Contour maps no! Slice density yes, for the same reason: a kernel density plot is a sum of a load of little curves. If you want to draw it for a trillion data points, just break it up and accumulate density heights. (“Just,” he says!) If you want a real-time slice density for the last ten seconds of stock market trades, batch it up in seconds (for example) and add them together. Every second, add the new one and subtract the one from ten seconds ago. Simples.
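Here is a minimal sketch of that accumulate-and-subtract idea for a single slice, in base R. It assumes a Gaussian kernel with the grid and bandwidth fixed in advance (you cannot choose a bandwidth from data you have not seen yet); `kernel_sum`, the batch sizes and the simulated stream are all invented for illustration.

```r
# Unnormalised sum of Gaussian kernels on a fixed grid. These sums simply
# add across batches, which is what makes batch processing possible.
kernel_sum <- function(x, grid, bw) {
  rowSums(dnorm(outer(grid, x, "-"), sd = bw))  # one little curve per point
}

grid <- seq(-4, 4, length.out = 200)  # fixed in advance (assumption)
bw   <- 0.3                           # fixed in advance (assumption)

set.seed(1)
batches <- replicate(15, rnorm(1000), simplify = FALSE)  # simulated stream

# 1. Bigger than RAM: accumulate density heights batch by batch.
total <- numeric(length(grid)); n <- 0
for (b in batches) {
  total <- total + kernel_sum(b, grid, bw)
  n     <- n + length(b)
}
dens_all <- total / n

# 2. Real time: keep the last `window` batches, add the new, subtract the old.
window  <- 10
running <- numeric(length(grid)); n_run <- 0
kept    <- list()
for (b in batches) {
  ks      <- kernel_sum(b, grid, bw)
  running <- running + ks
  n_run   <- n_run + length(b)
  kept[[length(kept) + 1]] <- list(ks = ks, n = length(b))
  if (length(kept) > window) {
    old     <- kept[[1]]
    kept    <- kept[-1]
    running <- running - old$ks
    n_run   <- n_run - old$n
  }
}
dens_window <- running / n_run
```

Both results match what you would get from running the kernel sum over all the relevant data in one go, up to floating-point error. For a full slice density you would keep one such running total per slice of y.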

So, the 3-D object is important because this leverages a load of cognitive power to see things in the real world and makes the perception of pattern in the dataviz easier for the reader. Choses visibles et invisibles.

There’s a related question I want to expand on another time: when does a stacked chart become small multiples? This is not just semantics because they get understood in different ways.

Filed under Visualization

## Performance indicators and routine data on child protection services

The parts of social services that do child protection in England get inspected by Ofsted on behalf of the Department for Education (DfE). The process is analogous to the Care Quality Commission inspections of healthcare and adult social care providers, and they both give out ratings of ‘Inadequate’, ‘Requires Improvement’, ‘Good’ or ‘Outstanding’. In the health setting, there’s many years’ experience of quantitative quality (or performance) indicators, often through a local process called clinical audit and sometimes nationally. I’ve been involved with clinical audit for many years. One general trend over that time has been away from de novo data collection and towards recycling routinely collected data. Especially in the era of big data, lots of organisations are very excited about Leveraging Big Data Analytics to discover who’s outstanding, who sucks, and how to save lives all over the place. Now, it may not be that simple, but there is definitely merit in using existing data.

This trend is just appearing on the horizon for social care though, because records are less organised and electronic, and because there just hasn’t been that culture of profession-led audit. Into this scene came my colleagues Rick Hood (complex systems thinker) and Ray Jones (now retired professor and general Colossus of UK social care). They wanted to investigate recently open-sourced data on child protection services and asked if I would be interested to join in. I was – and I wanted to consider this question: could routine data replace Ofsted inspections? I suspected not! But I also suspected that question would soon be asked on the cash-strapped corridors of the DfE, and I wanted to head it off with some facts and some proper analysis.

We hired master data wrangler Allie Goldacre, who combed through, tested, verified and combined the various sources:

• Children in Need census, and its predecessor the Child Protection and Referrals returns
• Children and Family Court Advisory and Support Service records of care proceedings
• DfE’s Children’s Social Work Workforce statistics
• SSDA903 records of looked-after children
• Spending statements from local authorities
• Local authority statistics on child population, deprivation and urban/rural locations.

Just because the data were ‘open’ didn’t mean they were useable. Each set had its own quirks and each local authority had its own problems and definitions in some cases. The data wrangling was painstaking and painful! As it’s all in the public domain, I’m going to add the data and code to my website here, very soon.

Then, we wrote this paper investigating the system and this paper trying to predict ‘Inadequate’ ratings. The second of these took all the predictors in 2012 (the most complete year for data) and tried to predict Inadequates in 2012 or 2013. We used the marvellous glmnet package in R and got down to three predictors:

• Initial assessments within the target of 10 days
• Re-referrals to the service
• The use of agency workers

Together they get 68% of teams right, and that could not be improved on. We concluded that 68% was not good enough to replace inspection, and called it a day.

But lo! Soon afterwards, the DfE announced that they had devised a new Big Data approach to predict Inadequate Ofsted scores, and that (what a coincidence!) it used the same three indicators. Well I never. We were not credited for this, nor indeed had our conclusion (that it’s a stupid idea) sunk in. Could they have just followed a parallel route to ours? Highly unlikely, unless they had an Allie at work on it, and I get no impression of the nuanced understanding of the data that would result from that.

Ray noticed that the magazine Children and Young People Now were running an article on the DfE prediction, and I got in touch. They asked for a comment and we stuck it in here.

A salutary lesson that cash-strapped Gradgrinds, starry eyed with the promises of big data after reading some half-cocked article in Forbes, will clutch at any positive message that suits them and ignore the rest. This is why careful curation of predictive models matters. The consumer is generally not equipped to make the judgements about using them.

A closing aside: Thomas Dinsmore wrote a while back that a fitted model is intellectual property. I think it would be hard to argue that coefficients from an elastic-net regression are mine and mine only, although the distinction may well be in how they are used, and this will appear in courts around the world now that they are viewed as commercially advantageous.


Filed under research

## Stats and data science, easy jobs and easy mistakes

I have been writing some JavaScript, and I was thinking about how web dev / front-end people are obliged to use the very latest tools, not so much for utility as for kudos. This seems mysterious to me but then I realised: it’s because the basic job — make a website — is so easy. The only way to tell who’s really seriously in the game is by how up to date they are. Then, this is the parallel that occurred to me: statistics is hard to get right, and a beginner is found out over and over again on the simplest tasks. On the other hand, if you do a lot of big data or machine learning or both, then you might screw stuff up left, right, and centre, but you are less likely to get caught. Because…

• nobody has the time and energy to re-run your humungous analysis
• it’s a black box anyway*
• you got headhunted by Uber last week

And maybe that’s one reason why there is more emphasis on having the latest shizzle in a data science job that’s more of a mixture of stats and computer science influences. I’m not taking a view that old ways are the best here, because I’m equally baffled by statisticians who refuse to learn anything new, but the lack of transparency and accountability (oh what British words!) is concerning.

* – this is not actually true, but it is the prevailing attitude

Filed under Uncategorized

## Dataviz of the week, 25/1/17

We had guideline-bustin’, kiddie-stiflin’, grandparent-over-the-threshold-usherin’ pollution in London at the beginning of the week. This is fairly standard nowadays, sadly. It’s not quite so bad out where I live in the Cronx, but in town it’s the worst in Europe. At the same time, Cameron Beccario pointed out the Beijing effect in his wonderful globe of carbon monoxide levels – far worse than anywhere else in the world, though there are some petrochemical hot spots. I’ve praised this live viz before, but that was before I started having a pick of the week on my office door (then, when the door went, here on the blog), so I’ll mention it again. Nice.

Filed under Visualization

This recent BBC Radio 4 “Farming Today” show (available to listen online until) visited Rothamsted Research Station, former home of stats pioneer Ronald Fisher, and considered the role of remote sensing, rovers, drones etc for agriculture, and most interestingly perhaps for you readers, the big data that result.

Agrimetrics (a partnership of Rothamsted and other academic organisations) chief executive David Flanders said of big data (about 19 minutes into the show):

I think originally in the dark ages of computing, when it was invented, it had some very pedantic definition that involved more than the amount of data that one computer can handle with one program or something. I think that’s gone by the wayside now. The definition I like is that it gives you answers to questions you hadn’t even thought of.

which I found confusing and somewhat alarming. I assume he knows a lot more about big data than I do, as he runs a ‘big data centre of excellence’ and I run a few computers (although his LinkedIn profile features the boardroom over the lab), but I’m not sure why he plays down the computational challenge of data exceeding memory. That seems to me to be the real point of big data. Sure, we have tools to simplify distributed computing, and if you want to do something based on binning or moments, then it’s all pretty straightforward. But efficient algorithms to scale up more complex statistical models are still being developed, and it is by no means a thing of the past. Perhaps the emphasis on heuristic algorithms for local optima in the business world has driven this view that distributed data and computation are done and dusted. I am always amazed at how models I feel are simple are sometimes regarded as mind-blowing in the machine learning / business analytics world. It may be because they don’t scale so well (yet) and don’t come pre-packaged in big data software (yet).
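To show why the moments case is the easy one, here is a small base R sketch: each chunk (standing in for a machine, or a file that fits in memory) reports three numbers, and the summaries just add. The function name, chunk count and simulated data are made up for illustration.

```r
# Each chunk reports its count, sum and sum of squares.
chunk_summary <- function(x) c(n = length(x), s1 = sum(x), s2 = sum(x^2))

set.seed(2)
values <- rnorm(1e5, mean = 10, sd = 2)               # simulated data
chunks <- split(values, rep(1:20, length.out = 1e5))  # 20 pretend machines

# Combining is just element-wise addition of the summaries.
tot <- Reduce(`+`, lapply(chunks, chunk_summary))

mean_all <- tot[["s1"]] / tot[["n"]]
var_all  <- (tot[["s2"]] - tot[["n"]] * mean_all^2) / (tot[["n"]] - 1)
```

The naive sum-of-squares formula can lose precision when the mean is large relative to the spread, so a production version would more likely merge per-chunk means and centred sums of squares. Either way, nothing here needs all the data in one place, which is exactly what more complex models cannot yet take for granted.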

In contrast, the view that, with enough data, truths will present themselves unbidden to the analyst, is a much more dangerous one. Here we find enormous potential for overt and cryptic multiplicity (which has been discussed ad nauseam elsewhere), and although I can understand how a marketing department in a medium-sized business would be seduced by such promises from the software company, it’s odd, irresponsible even, to hear a scientist say it to the public. Agrimetrics’ website says

data in themselves do not provide useful insight until they are translated into knowledge

and hurrah for that. It sounds like a platitude but is quite profound. Only with contextual information, discussion and involvement of experts from all parts of the organisation generating and using the data do you really get a grip of what’s going on. These three points were originally a kind of office joke, like buzzword bingo, when I worked on clinical guidelines, but later I realised they were accidentally the answer to making proper use of data:

• engage key stakeholders
• close the loop
• take forward best practice (you may feel you’ve seen these before)

or, less facetiously, talk to everyone about these data (not just the boss), get them all involved in discussions to define questions and interpret the results, and then do the same in translating it to recommendations for action. No matter how big your data are, this does not go away.

Filed under computing

## Scaling Statistics at Google Teach

Not unrelatedly to the Data Science angle, I read the recent paper “Teaching Statistics at Google Scale” today. I don’t think it actually has anything to do with teaching, but it does have some genuinely interesting examples of the inventive juggling required to make inferences in big data situations. You should read it, especially if you are still not sure whether big data is really a thing.

Filed under computing

## Complex interventions: MRC guidance on researching the real world

The MRC has had advice on evaluating “complex interventions” since 2000, updated 2008. By complex interventions, they mean things like encouraging children to walk to school, not complex in the sense of being made up of many parts, but complex in the sense that the way it happens and the effect it has is hard to predict because of non-linearities, interactions and feedback loops. Complexity is something I have been thinking and reading about a lot recently; it really is unavoidable in most of the work I do (I never do simple RCTs; I mean how boring is it if your life’s work is comparing drug X to placebo using a t-test?) and although it is supertrendy and a lot of nonsense is said about it, there is some wisdom out there too. However, I always found the 2000/8 guidance facile: engage stakeholders, close the loop, take forward best practice. You know you’re not in for a treat when you see a diagram like this:

Now, there is a new guidance document out that gets into the practical details and the philosophical underpinnings at the same time: wonderful! There’s a neat summary in the BMJ.

What I particularly like about this, and why it should be widely read, is that it urges all of us researchers to be explicit a priori about our beliefs and mental causal models. You can’t measure everything in a complex system, so you have to reduce it to the stuff you think matters, and you’d better be able to justify or at least be clear about that reduction. It acknowledges the role that context plays in affecting the results observed and also the inferences you choose to make. And it stresses that the only decent way of finding out what’s going on is to do both quantitative and qualitative data collection. That last part is interesting because it argues against the current fashion for gleeful retrospective analysis of big data. Without talking to people who were there, you know nothing.

My social worker colleague Rick Hood and I are putting together a paper on this subject of inference in complex systems. First I’ll be talking about it in Rome at IWcee (do come! Rome is lovely in May), picking up ideas from economists, and then we’ll write it up over the summer. I’ll keep you posted.