Answers to questions you hadn’t even thought of

This recent BBC Radio 4 “Farming Today” show (available to listen online until) visited Rothamsted Research Station, former home of stats pioneer Ronald Fisher, and considered the role of remote sensing, rovers, drones etc. in agriculture and, perhaps most interestingly for you readers, the big data that result.

David Flanders, chief executive of Agrimetrics (a partnership of Rothamsted and other academic organisations), said of big data (about 19 minutes into the show):

I think originally in the dark ages of computing, when it was invented, it had some very pedantic definition that involved more than the amount of data that one computer can handle with one program or something. I think that’s gone by the wayside now. The definition I like is that it gives you answers to questions you hadn’t even thought of.

which I found confusing and somewhat alarming. I assume he knows a lot more about big data than I do, as he runs a ‘big data centre of excellence’ and I run a few computers (although his LinkedIn profile features the boardroom over the lab), but I’m not sure why he plays down the computational challenge of data exceeding memory. That seems to me to be the real point of big data. Sure, we have tools to simplify distributed computing, and if you want to do something based on binning or moments, then it’s all pretty straightforward. But efficient algorithms to scale up more complex statistical models are still being developed, and the challenge is by no means a thing of the past. Perhaps the emphasis on heuristic algorithms for local optima in the business world has driven this view that distributed data and computation are done and dusted. I am always amazed at how models I feel are simple are sometimes regarded as mind-blowing in the machine learning / business analytics world. It may be because they don’t scale so well (yet) and don’t come pre-packaged in big data software (yet).
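To make the “binning or moments” remark concrete, here is a minimal sketch of why moments are the easy case: each machine (or each chunk that fits in memory) reduces its data to a count, a mean and a sum of squared deviations, and those summaries can be merged exactly without ever seeing the raw data again. The names `Moments`, `summarise` and `merge` are my own, purely for illustration, and the merge step is the standard pairwise update for combining means and variances, not anything taken from a particular big data package.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Moments:
    """Summary of one chunk: enough to recover the mean and variance."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the chunk mean

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def summarise(chunk):
    """Reduce one in-memory chunk of numbers to its Moments summary."""
    n = len(chunk)
    if n == 0:
        return Moments()
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return Moments(n, mean, m2)


def merge(a, b):
    """Combine two summaries exactly, without revisiting the raw data."""
    n = a.n + b.n
    if n == 0:
        return Moments()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta ** 2 * a.n * b.n / n
    return Moments(n, mean, m2)


def combine(chunks):
    """Summarise every chunk separately, then fold the summaries together."""
    return reduce(merge, (summarise(c) for c in chunks), Moments())
```

Calling `combine([[1, 2, 3], [4, 5], [6, 7, 8, 9]])` gives the same mean and variance as pooling all nine numbers first. The point is that this trick depends on the statistic having a small, mergeable summary; a more complex model's likelihood generally does not break down so neatly, which is where the unfinished work lies.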

In contrast, the view that, with enough data, truths will present themselves unbidden to the analyst, is a much more dangerous one. Here we find enormous potential for overt and cryptic multiplicity (which has been discussed ad nauseam elsewhere), and although I can understand how a marketing department in a medium-sized business would be seduced by such promises from the software company, it’s odd, irresponsible even, to hear a scientist say it to the public. Agrimetrics’ website says

data in themselves do not provide useful insight until they are translated into knowledge

and hurrah for that. It sounds like a platitude but is quite profound. Only with contextual information, discussion and involvement of experts from all parts of the organisation generating and using the data do you really get a grip on what’s going on. These three points were originally a kind of office joke, like buzzword bingo, when I worked on clinical guidelines, but later I realised they were accidentally the answer to making proper use of data:

  • engage key stakeholders
  • close the loop
  • take forward best practice (you may feel you’ve seen these before)

or, less facetiously, talk to everyone about these data (not just the boss), get them all involved in discussions to define questions and interpret the results, and then do the same in translating those results into recommendations for action. No matter how big your data are, this does not go away.
