Monthly Archives: November 2016

Noise pollution map of London (part 1)

I’m working on a noise pollution map of central London. Noise is an interesting public health topic, overlooked and of debatable cause and effect but understandable to everyone. To realise it as interactive online content, I get to play around with Mapbox as well as D3 over Leaflet [1] and some novel forms of visualisation, audio delivery and interaction.

The basic idea is that, whenever the need arises to get from A to B, and I could do it by walking, I record the ambient sound and also capture a detailed GPS trail. Then, I process those two sets of data back at bayescamp and run some sweet tricks to make them into the map. I have about 15 hours of walking so far, and am prototyping the code to process the data. The map doesn’t exist yet, but in a future post on this subject, I’ll include a sketch of what it might look like. The map below shows some of my walks (not all). As I collect and process the files, I will update the image here, so it should be close to live.


I’d like it to become crowd-sourced, in the sense that someone else could follow my procedure for data capture, copy the website and add their own data before sharing it back. GitHub feels like the ideal tool for this. Then, the ultimate output is a tool for people to assemble their own noise-pollution data.

As I make gradual progress in my spare time, I’ll blog about it here with the ‘noise pollution’ tag. To start with, I’ll take a look at:

The equipment

Clearly, some kind of portable audio recorder is needed. For several years, when I made the occasional bit of sound art, I used a minidisc recorder [2] but now have a Roland R-05 digital recorder. This has an excellent battery life and enough storage for at least a couple of long walks. At present, you can get one from Amazon for GBP 159. When plugged into USB, it looks and behaves just like a memory stick. I have been saving CD-quality audio in .wav format, mindful that you can always degrade it later, but you can't go back the other way. That is pretty much the lowest quality the R-05 will capture anyway (barring .mp3 format, which I decided against because I don't want the device dedicating computing power to compressing the sound data), so the files occupy as little space on the device as possible. It will tuck away in a jacket pocket easily so there's no need to be encumbered by kit like you're Chris Watson.

Pretty much any decent microphone, plus serious wind shielding, would do, but my personal preference is for binaurals, which are worn in the ear like earphones and capture a very realistic stereo image. Mine are Roland CS-10EM which you can get for GBP 76. The wind shielding options are more limited for binaurals than a hand-held mic, because they are so small. I am still using the foam covers that come with the mics (pic below), and wind remains something of a consideration in the procedure of capturing data, which I’ll come back to another time.


On the GPS side, there are loads of options and they can be quite cheap without sacrificing quality. I wanted something small that allowed me to access the data in a generic format, and chose the Canmore GT-730FL. This looks like a USB stick, recharges when plugged in, can happily log (every second!) for about 8 hours on a single charge, and lets you plug it in and download your trail in CSV or KML format. The precision of the trail was far superior to my mobile phone's at the time I got it, though the difference is less marked now even with a Samsung J5 (J stands for Junior (not really)). There is a single button on the side, which adds a flag to the current location datum when you press it. That flag shows up in KML format in its own field, but is absent from CSV. They cost GBP 37 at present.

There are two major drawbacks: the documentation is awful (remember when you used to get appliances from Japan in the 80s and none of the instructions made sense? Get ready for some nostalgia) and the data transfer is by virtual serial port, which is straightforward on Windows with the manufacturer's Canway software but a whole weekend's worth of StackOverflow and swearing on Linux/OS X. Furthermore, I have not been able to get the software working on anything but an ancient Windows Vista PC (can you imagine the horror). Still, it is worth it to get that trail. There is a nice blog by Peter Dean (click here), which details what to do with the Canmore and its software, and compares it empirically to other products. The Canway software is quite neat in that it shows you a zoomable map of each trail, and is only a couple of clicks away from exporting to CSV or KML.
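Incidentally, if you want to get at the trail (and those button-press flags) without going through Canway, the KML file is just standard KML. Here is a minimal sketch in Python; the function name is mine, and it assumes nothing beyond the KML 2.2 namespace:

```python
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"

def trail_points(kml_text):
    """Extract (lon, lat) pairs from every <coordinates> element in a KML
    document. KML stores each point as a lon,lat,alt triple, with triples
    separated by whitespace."""
    root = ET.fromstring(kml_text)
    points = []
    for coords in root.iter(KML_NS + "coordinates"):
        for triple in coords.text.split():
            lon, lat = triple.split(",")[:2]
            points.append((float(lon), float(lat)))
    return points
```

The same loop picks up both the trail's LineString and any flagged Placemark points, since both store their positions in `<coordinates>` elements.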

Having obtained the .kml file for the trail plus starting point, the .csv file for the trail in simpler format, and the .wav file for the sound, the next step is synchronising them, trimming to the relevant parts and then summarising the sound levels. For this, I do a little data-focussed programming, which is the topic for next time.
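As a taster of that next post: because the logger writes one fix per second, the synchronisation boils down to collapsing the audio into one summary level per second and lining the two lists up with an offset. A rough sketch of the idea (not my actual pipeline; the function names and the 16-bit full-scale assumption are mine):

```python
import math

def rms_db_per_second(samples, rate):
    """Collapse raw 16-bit audio samples into one RMS level (dB re full
    scale) per second, matching the GPS logger's one-fix-per-second cadence."""
    levels = []
    for start in range(0, len(samples) - rate + 1, rate):
        window = samples[start:start + rate]
        rms = math.sqrt(sum(s * s for s in window) / len(window))
        levels.append(20 * math.log10(max(rms, 1e-9) / 32768))
    return levels

def attach_levels(fixes, levels, offset):
    """Pair each GPS fix (lat, lon) with the audio level `offset` seconds
    into the recording; returns (lat, lon, db) triples."""
    return [(lat, lon, levels[offset + i])
            for i, (lat, lon) in enumerate(fixes)
            if 0 <= offset + i < len(levels)]
```

Finding the right `offset` is where the button-press flags earn their keep, since a flag at a known point in the recording anchors the two clocks together.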


1 – these are JavaScript libraries that are really useful for flexible representations of data and maps. If you aren’t interested in that part of the process, just ignore them. There will be plenty of other procedural and analytic considerations to come that might tickle you more.

2 – unfairly maligned; I heard someone on the radio say recently that, back around 2000, if you dropped a minidisc on the floor, it was debatable whether it was worth the effort to pick it up


Leave a comment

Filed under Visualization

Dataviz of the week, 22 November 2016

This week I was impressed with this blog post by John Nelson, cartographic craftsman, which sets out the design principles behind "firefly maps" and how to make them. They look like this:


(c) John Nelson

Wow, said the owl. I really want to make one, and I suspect you do too.

Leave a comment

Filed under Visualization

Answers to questions you hadn’t even thought of

This recent BBC Radio 4 "Farming Today" episode (available to listen online for a limited time) visited Rothamsted Research Station, former home of stats pioneer Ronald Fisher, and considered the role of remote sensing, rovers, drones etc in agriculture and, perhaps most interestingly for you readers, the big data that result.

Agrimetrics (a partnership of Rothamsted and other academic organisations) chief executive David Flanders said of big data (about 19 minutes into the show):

I think originally in the dark ages of computing, when it was invented, it had some very pedantic definition that involved more than the amount of data that one computer can handle with one program or something. I think that’s gone by the wayside now. The definition I like is that it gives you answers to questions you hadn’t even thought of.

which I found confusing and somewhat alarming. I assume he knows a lot more about big data than I do, as he runs a ‘big data centre of excellence’ and I run a few computers (although his LinkedIn profile features the boardroom over the lab), but I’m not sure why he plays down the computational challenge of data exceeding memory. That seems to me to be the real point of big data. Sure, we have tools to simplify distributed computing, and if you want to do something based on binning or moments, then it’s all pretty straightforward. But efficient algorithms to scale up more complex statistical models are still being developed, and it is by no means a thing of the past. Perhaps the emphasis on heuristic algorithms for local optima in the business world has driven this view that distributed data and computation is done and dusted. I am always amazed at how models I feel are simple are sometimes regarded as mind-blowing in the machine learning / business analytics world. It may be because they don’t scale so well (yet) and don’t come pre-packaged in big data software (yet).

In contrast, the view that, with enough data, truths will present themselves unbidden to the analyst, is a much more dangerous one. Here we find enormous potential for overt and cryptic multiplicity (which has been discussed ad nauseam elsewhere), and although I can understand how a marketing department in a medium-sized business would be seduced by such promises from the software company, it’s odd, irresponsible even, to hear a scientist say it to the public. Agrimetrics’ website says

data in themselves do not provide useful insight until they are translated into knowledge

and hurrah for that. It sounds like a platitude but is quite profound. Only with contextual information, discussion and involvement of experts from all parts of the organisation generating and using the data do you really get a grip on what’s going on. These three points were originally a kind of office joke, like buzzword bingo, when I worked on clinical guidelines, but I later realised they were, accidentally, the answer to making proper use of data:

  • engage key stakeholders
  • close the loop
  • take forward best practice (you may feel you’ve seen these before)

or, less facetiously, talk to everyone about these data (not just the boss), get them all involved in discussions to define questions and interpret the results, and then do the same in translating it to recommendations for action. No matter how big your data are, this does not go away.

Leave a comment

Filed under computing

Dataviz of the week, 15 November 2016

I used to have an office door until this week when we moved to open plan space elsewhere in the medical school. I used to stick a chart of the week on that door, a student’s suggestion that proved to be a bottomless mine of goodies. So, I thought I would carry on visualising here.

We begin with some physical dataviz courtesy of Steven Gnagni and spotted by Helen Drury. Spoor of human activity etc etc. More like this next week.


Which pencil has been used most?


Filed under Uncategorized

How I work, part 1: oppressive regimes

A quick note: in my work, I won’t travel to countries which are oppressive or dangerous. I have a list in mind, but I won’t print it here because these things change, other than to say that, in the absence of actual statute, and notwithstanding the limitations on executive power familiar to all former viewers of The West Wing, I have to assume for the time being that the USA goes on that list, having elected a president on a ticket of torture and extra-judicial killings. So, that unfortunately rules out attendance at StanCon 2017, and JSM 2017, but not JSM 2018, which is in Canada.

My own experience leaves me unconvinced of the ability of complete economic blockade to effect cultural or political change, so I continue to work with anyone whose own focus is not morally reprehensible. As mentioned here, I am writing a book for an American publisher, and have occasional dealings with certain American universities. I will continue to make the stuff that I would have offered up for StanCon and you’ll hear about it here. But I won’t travel there (which is a pity because I really like their wintergreen-flavored toothpaste if nothing else).

When I finish tidying up my website, I will have a section there on “how I work” that spells out things like this.

1 Comment

Filed under Uncategorized

Every sample size calculation

A brief one this week, as I’m working on the dataviz book.

I’m a medical statistician, so I get asked about sample size calculations a lot. This is despite them being nonsense much of the time (wholly exploratory studies, no hypothesis, pilot study, feasibility study, qualitative study, validating a questionnaire…). In the case of randomised, experimental studies, they’re fine, and especially if there’s a potentially dangerous intervention or lack thereof. But we have a culture now where reviewers, ethics committees and such ask to see one for any quant study. No sample size, no approval.

So, I went back through six years of e-mails (I throw nothing out) and found all the sample size calculations. Others might have been on paper and lost forever, and there are many occasions where I’ve argued successfully that no calculation is needed. If it’s simple, I let students do it themselves. Those do not appear here, but what we do have (79 numbers from 21 distinct requests) gives an idea of the spread.


You see, I am so down on these damned things that I started thinking I could just draw sizes from the distribution in the above histogram like a prior, given that I think it is possible to tweak the study here and there and make it as big or as small as you like. If the information the requesting person lavishes on me makes no difference to the final size, then the sizes must be identically distributed even conditional on the study design etc., and so a draw from this prior will suffice. (Pedants: this is a light-hearted remark.)

You might well ask why there are multiple — and often very different — sizes for each request, and that is because there are usually unknowns in the values required for calculating error rates, so we try a range of values. We could get Bayesian! Then it would be tempting to include another level of uncertainty, being the colleague/student’s desire to force the number down by any means available to them. Of course I know the tricks but don’t tell them. Sometimes people ask outright, “how can we make that smaller”, to which my reply is “do a bad job”.
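To show what "try a range of values" looks like in practice, here is the usual normal-approximation formula for comparing two means, run over a few guessed SDs. The delta and SD values are invented for illustration, not from any real study:

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.8):
    """Normal-approximation sample size per arm for a two-sample
    comparison of means: n = 2 * (z_{1-a/2} + z_{power})^2 * (sd/delta)^2,
    rounded up to a whole participant."""
    z = NormalDist().inv_cdf
    n = 2 * (z(1 - alpha / 2) + z(power)) ** 2 * (sd / delta) ** 2
    return math.ceil(n)

# One guessed effect size, a range of plausible SDs:
for sd in (8, 10, 12):
    print(sd, n_per_group(delta=5, sd=sd))
```

Notice how a perfectly defensible wobble in the assumed SD moves the answer by a factor of two or so, which is exactly why one request generates several very different numbers.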

And in those occasions where I argue that no calculation is relevant, and the reviewers still come back asking for one, I just throw in any old rubbish. Usually 31. (I would say 30 but off-round numbers are more convincing.) It doesn’t matter.

If you want to read people (other than me) saying how terrible sample size calculations are, start with “Current sample size conventions: Flaws, harms, and alternatives” by Peter Bacchetti, in BMC Medicine 2010, 8:17 (open access). He pulls his punches, minces his words, and generally takes mercy on the calculators:

“Common conventions and expectations concerning sample size are deeply flawed, cause serious harm to the research process, and should be replaced by more rational alternatives.”

In a paper called “Sample size calculations: should the emperor’s clothes be off the peg or made to measure”, which wasn’t nearly as controversial as it should have been, Geoffrey Norman, Sandra Monteiro and Suzette Salama (no strangers to the ethics committee), point out that they are such guesswork, we should just save people’s anxiety, delays waiting for a reply from the near-mythical statistician, and brain work, and let them pick some standard numbers. 65! 250! These sound like nice numbers to me; why not? In fact, their paper backs up these numbers pretty well.

In the special case of ex-post “power” calculations, see “The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis” by John M. Hoenig and Dennis M. Heisey, in The American Statistician (2001); 55(1): 19-24.

This is not a ‘field’ of scientific endeavour, it is a malarial swamp of steaming assumptions and reeking misunderstandings. Apart from multiple testing in its various guises, it’s hard to think of a worse problem in biomedical research today.

1 Comment

Filed under healthcare