Dataviz of the week, 13/01/2017

This chart from the Upshot team at the New York Times was picked up by Alberto Cairo on Twitter. The large blank space is included, partly because it’s good to have the domain of the variable visible in the axis, but mostly as a kind of mute protest against the gap between experts and public. I shall refrain from getting sidetracked into a discussion of the nature of evidence, complex systems and such.


The interesting thing is how the empty space looks impressive on the page, and not so on the screen – or so I thought, anyway. Empty space on a newspaper page is so unusual and reminds me of the classic journalists’ protest against censorship.

Filed under Visualization

Stats in bed, part 2: Linux on Android

Never let it be said that Robert forgets to come back and finish off a job, although it might take a really long time. Last time (goodness, nearly 3 years ago; the antiquity of part one is shown by the long-dead term “phablet”), I was poking at Ubuntu Touch to see if it might offer a way of doing analysis on the go. Soon after that, I looked into more lightweight Linux implementations.

Firstly, your device will need to be rooted (no giggling at the back), as shown by the open padlock when it starts up. I discussed this last time; there’s plenty of advice online but basically it helps a lot to have a Linux computer (in this as in so many other ways).


Everything that follows happened a couple years ago so use your brains in checking details of apps etc if you want to try it out. I accept no responsibility for anything, ever.

So, the general idea here is to have a Linux virtual machine on your Android device. I started off using an app called GNUroot which was easy to use but had a limitation in getting files off the virtual space into the real world. When I restarted it, it made a new virtual drive, wiping out old files. Ok except for the fairly common crashes which lost all the work done in that session. It couldn’t work directly on the Android part of the machine (so to speak).

The next attempt was more stable but a little more complex. I installed Linux Deploy, which creates a 4GB virtual drive image and keeps that between uses (no more lost files). Instead of having one app that acts as a VM, I SSH’d into it using the app JuiceSSH (there are several like it).


The first step is to open Linux Deploy and press Start. An IP address appears at the top and we will use this to communicate with the VM using SSH.


Then, I went to JuiceSSH and chose (or typed in) the IP. Boom! You’re in Linux!


Awesome. Installing and updating programs tended to be a bit hit and miss, sometimes throwing up odd messages. So, to use R on the terminal, I relied on having the latest unstable build from Linux Deploy.


I could even write files out.


In Linux Deploy, you can have the Android memory appear like a mounted drive to the Linux VM at ../../mnt/0


Coming out of Linux, we find our new file is right there, like magic.


I still find it pleasing to look at the file in a text editor in Android and marvel at how it got there. Simple pleasures.
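A rough sketch of the write-a-file-out step, in Python rather than the R used above. The path `../../mnt/0` is the mount point quoted in the post; the function name `write_results` and the file name `results.csv` are made up for illustration, and the mount point on your own device may well differ.

```python
import csv
from pathlib import Path

# Assumption: the mount point quoted in the post. Check what Linux Deploy
# reports on your own device before relying on it.
ANDROID_MOUNT = Path("../../mnt/0")


def write_results(rows, mount_dir=ANDROID_MOUNT, filename="results.csv"):
    """Write (name, value) rows as CSV into the mounted Android directory.

    Hypothetical helper: returns the path of the file it wrote.
    """
    target = Path(mount_dir) / filename
    with target.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "value"])  # header row
        writer.writerows(rows)
    return target
```

Calling `write_results([("mean", 3.2), ("sd", 1.1)])` from inside the VM should leave a file that an Android text editor can open, provided the mount is set up as described.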


So that’s fun. But a little clunky. No compiling C++, no Stan, or those other new-ish R packages that rely on Rcpp to build faster machine-code bits and bobs, though you can use lots and lots of R stuff. The antiquity of these experiments means I didn’t try Python out but I’m sure it would work just fine. Also, as self-styled clinical research maverick @anupampom pointed out to me, a major advantage is that you can take that linux.img file, stick it on another device with Linux Deploy, and carry on where you left off. Nice. It’s like a Docker container (kind of).

For reasons that may start to come into focus now, I gave up on doing phablet data science about this time. Not that way anyway. But the question of remote, platform-independent analysis and programming remains. And in part three, I’m going to close down this discussion with the real solution to this, which is in platform-independent interpreted languages.

Filed under computing

Dataviz of the week, 06/01/2017

This is a spiral format of four months of time, with two colours (nice choices too) indicating sleep/awake patterns of a newborn baby. omg the first month is hard. I’m only 5½ months into being a dad, and I’ve already forgotten about it.


Made by Andrew Elliott, original on reddit here, brought to my attention by Randy Olson on Twitter here.

On Twitter you’ll see some people objecting to the spiral format, and it’s true that there is distortion with the early days taking up less space on the screen, but you trade that for eye-catching (GU6) and the continuity of time. There is no perfect mapping into visual parameters.
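To put a rough number on that distortion, here is a small sketch that assumes an Archimedean spiral with one full turn per day. That layout is a guess, not something taken from the original chart, and `inner_radius` and `spacing` are invented values.

```python
import math


def turn_length(day, inner_radius=1.0, spacing=0.2, steps=1000):
    """Approximate arc length of the turn drawn for one day (1 = innermost).

    Assumes r = inner_radius + spacing * (number of turns so far), with one
    turn per day. Both parameters are made up for illustration.
    """
    start = 2 * math.pi * (day - 1)
    d_theta = 2 * math.pi / steps
    growth = spacing / (2 * math.pi)  # dr/dtheta for this spiral
    total = 0.0
    for i in range(steps):
        theta = start + (i + 0.5) * d_theta  # midpoint rule
        r = inner_radius + growth * theta
        total += math.sqrt(r * r + growth * growth) * d_theta
    return total


def relative_space(first_day, last_day, **kwargs):
    """How many times longer the last day's turn is than the first day's."""
    return turn_length(last_day, **kwargs) / turn_length(first_day, **kwargs)
```

`relative_space(1, 120)` compares the first day with one roughly four months later. The exact ratio depends entirely on the assumed parameters, but the outer turns are always the longer ones.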

Filed under Visualization

Holiday reading

Enough work, here’s some recommended reading for Christmas and New Year. All of it is free, online, and I enjoyed each one. Many of these come via the New York Times’ “What We’re Reading” email, which is a thing of joy.

The town where everyone got free money

A brief history of buildings that spin

Parenting by the Books: ‘On The Banks of Plum Creek’

Get them on the blower

My Saga: Karl Ove Knausgaard travels through North America


Sound check: the quietest place in the US

The most exclusive restaurant in America

Why do we work so hard?

The coddling of the American mind

The untold story of Silk Road

The town without wi-fi

The strange and curious tale of the last true hermit

A wrenching choice for Alaska towns in the path of climate change

Meet the man who flies around the world for free

The eagle huntress story

Firestorm: poor planning and tactical errors fueled a wildfire catastrophe

Today we are his family

On tickling the dragon’s tail

Filed under Uncategorized

Things I discovered in 2016

Edit: I remembered! It was Jeffrey Rosenthal who also advocated improv for scientists, and I read it in PPF – review appearing in Significance soon.

Being a list, sometimes with minimal explanation, and not to be taken entirely seriously. These might be influenced by living in Croydon, working in London and hanging around with people younger than myself who work in tech.

For many of these, I kept saying to myself “how did I not know about this before”. You might find them useful too. Others are true, but less didactic, and they are scattered like the proverbial Marvellous Aphorisms.

bootstrap.js – because every template I have used has ended up causing more trouble than it saved to begin with. I’m not a full-time web dev and I need something quick. It’s really easy. Do it.

Git + Atom with packages such as merge-conflict. Damn, this is good, but the obscurity of Git to most people is not going to go away any time soon. I had shoved some stuff cack-handedly onto GitHub but it was working with stan-devs last year and this year that really pushed me into using Git for version control in everyday work. You should really consider Atom if you use Git.

node.js (yeah, I had been dodging this too and feeling inadequate)

react but it’s December and nobody really uses react any more

Why do academics not put some time, energy and budget into acquiring presentation skills? People keep telling me I am awesome etc, and I have really not done much to get awesome, so I rather doubt it, and it must be by association with others who are really bad. On a related note:

hipster slides. I mean, ditch beamer and powerpoint and all that crap and just put one massive black and white picture of a kitten on the screen. Preferably with one word across it in humungous letters, possibly a lurid colour. If you don’t have your own typeface designed for you (what a loser!) then use Helvetica (and by implication, do not stand up in front of anyone to talk with an obviously Windows machine unless it is done ironically). Try to have as few slides as possible. Like Van Morrison, I’m working towards having no slides at all. On another related note:

a lot of people are talking about improv classes as the key scientific skill of 2017. OK, nobody is, but there’s Alan Alda and @alice_data and Jeffrey Rosenthal, and the classic books by Keith Johnstone which are sitting on my shelf calling to me. I read them like a million years ago and I feel they might be even more relevant now.

finally started teaching myself Python. Last year I decided to reduce the number of languages, whether scripting or programming, that I have to carry around in my head, so I could be less awful at all of them. I dropped Julia, although I think it will be brilliant one day soon. That meant that I needed to boost my C++ (I learnt some long, long ago) to get R speedy when required, and that had a few spin-offs. I expect the Python skillz will be important too in 2017, if my brain can accommodate it all. The eagle-eyed among you will notice I’m back to the same number I started with (groan).

The secret Stata 14 command graph export myfilename.svg. Yes, SVG. God’s own graphics format. Just imagine what you could do… thanks to Tim Morris for spotting this. Goodness only knows why he was trying out file extensions for a laugh, or what else he tried that didn’t work. .123 anyone? But seriously, thanks StataCorp for taking this step, I know I have been droning on about it for years and now I’m really pleased with it.

Deep Work. You should seriously read this book. I now spend the start of each working day in a cafe of undisclosed location doing some deep work.

Ingrid Burrington’s work on internet infrastructure and what it tells us about secretive practices. Really eye-opening; you should get the book Networks of New York. I nearly lost my copy in the cafe of undisclosed location, but phew, they saved it for me.

Pinker’s Sense Of Style. Likewise.

Laura Marling, who I then listened to almost non stop this year. I’m not exaggerating. Perhaps responsible for the pessimistic tone creeping into recent writing on whether scientific practice will get better at replication, explanation and all that. More on that in the new year.

Rebecca Solnit’s “Field Guide To Getting Lost”. You’ll either get this or you won’t. If you do, you’ll be thanking me before long.

Mike Monteiro’s keynote talk at interaction ’15. Mentally find-and-replace designer with statistician and you have some important messages right there, plus a lot of swearing.

The Dear Data book, obvs

Cole Nussbaumer Knaflic’s book, which is the one I recommend now to viz noobs. It’s nurturing, if a little slow, and has the best coverage of perception issues that I’ve seen.

I read dataviz “classics” by Bertin and Wilkinson. Now I realise people talk about them a lot but haven’t actually read them, like Ulysses. The difference is I quite like Ulysses but these are just weird and not useful. Not good-weird, like EDA. You have to forgive Bertin a little for being a paid-up French semiologist of the 1950s, I mean it was his job not to say anything clear, but old Wilko seems to have written Grammar of Graphics while on a mind-expanding retreat.

Did a stack of reading around neural networks. They’re cool, and of course massively hyped. Feature selection and measuring uncertainty are the things to think about really hard before doing them. I’m doing NVIDIA’s two-day deep learning course in January ’17.

I decided that any complex set of predictor variables (without a clearly pre-defined subset based on contextual information) should be analysed in a number of ways, combining those from a traditional statistics training with those from a machine learning background: some kind of penalised linear model, some kind of tree, and some kind of non-linear feature combination. Maybe lasso, random forest and neural network. Consider boosting.

Did a stack of reading around AI. Interesting. A lot of compsci ML people seem to fly into a rage at the merest suggestion of killer robots (I can see where they’re coming from), and extend that to any ethical discussion (bad move, I think). You should read Nick Bostrom’s book (the ML guys hate it of course). Why does everyone assume it’s a bad thing to have humans wiped out by robots? We’re not really up to the job of running the planet. One thing I should write right now is that ML is not AI and statistical models like logistic regression do not really constitute ML either. You can relax for a few decades.

Every time I thought up some USSR – New Public Management – University life connection, I thought I was pretty damn clever, but of course Craig Brandist did it all before. What a guy. I bet they have a file on him.

I don’t like bananas, and come to that, cucumbers either. If I got to 42 and am still not sure whether I like a fruit, it’s time to stop trying. Likewise I expect to stop doing a lot of things in 2017, very much in the manner of Bilbo Baggins.

Filed under Uncategorized

Best dataviz of 2016

I’m going to return to 2014’s approach of dividing best visualisation of data (dataviz!) from visualisation of methods (methodviz!).

In the first category, as soon as I saw Jill Pelto’s watercolour data paintings I was bowled over. Time series of environmental data are superimposed and form familiar but disturbing landscapes. I’m delighted to have a print of Landscape of Change hanging in the living room at Chateau Grant. Pelto studies glaciers and spends a lot of time on intrepid-sounding field trips, so she sees the effects of climate change first hand in a way that the rest of us don’t. There’s a NatGeo article on her work here.


In the methodviz category, Fernanda Viegas, Martin Wattenberg, Shan Carter and Daniel Smilkov made a truly ground-breaking website for Google’s TensorFlow project (open source deep learning software). This shows you how artificial neural networks of the simple feedforward variety work, and allows you to mess about with their design to a certain extent. I was really impressed with how the hardest aspect to communicate — the emergence of non-linear functions of the inputs — is just simple, intuitive and obvious for users. I’m sure it will continue to help people learn about this super-trendy but apparently obscure method for years to come, and it would be great to have more pages like this for algorithmic analytical methods. You can watch them present it here.
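For a feel of what "non-linear functions of the inputs" means, here is a tiny feedforward network written out by hand. It is only a sketch: the weights are hand-picked rather than learned, none of it is taken from the website itself, and the function names are invented. The hidden layer lets the network compute XOR, which no single linear unit can do.

```python
def relu(z):
    """Rectified linear unit: the non-linearity applied in the hidden layer."""
    return max(0.0, z)


def identity(z):
    """No-op activation for the output layer."""
    return z


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weights . inputs + bias) per unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def xor_network(x1, x2):
    """Two inputs -> two ReLU hidden units -> one linear output.

    Weights are chosen by hand for illustration, not trained.
    """
    hidden = dense([x1, x2], [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu)
    (output,) = dense(hidden, [[1.0, -2.0]], [0.0], identity)
    return output
```

The second hidden unit only switches on when both inputs are 1, and the output layer subtracts it twice, which is what bends the otherwise linear sum back down to zero.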


Filed under Visualization

Explanation and inference with house sparrows

This time I’m going to take a closer look at another of my data visualisations I’ve been filling my spare time with for fun, not profit. I have two bird feeders in our garden and you can watch the consumption of seeds updated with every top-up at this page. This started when I wrote about Dear Data (viz o’ the year 2015) and recommended playing around with any odd local data you can get your hands on. I thought it would just be a cutesy dataviz exercise but it ended up as a neat microcosm of another issue that has occupied me somewhat this year: inference and explanation.

Briefly, statistical inference says “the rate of bird seed consumption is 0.41 cm/day, and if birds throughout suburban England consume at one rate, that should be between 0.33 and 0.50, with 95% probability”, or “the bird seed consumption has changed, in fact more than is plausibly due to random variation, so it is probably some systematic effect”. But explanation is different and is all about why it changed. Explanation doesn’t have to match statistics. A compelling statistical inference with a minuscule p-value could bug the hell out of you because you just can’t see why it should be in the direction it is, or of the strength it is. Or an unconvincing, borderline statistical inference could cry out to you, “I am a fact. Just the way you hoped I would be!”
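One way an estimate-plus-interval statement of that first kind could be produced is a percentile bootstrap, sketched below. This is a swapped-in technique for illustration, not the method behind the numbers quoted above, and the function name and its defaults are assumptions.

```python
import random
import statistics


def bootstrap_interval(values, n_resamples=2000, level=0.95, seed=1):
    """Mean of `values` with a percentile-bootstrap interval around it.

    Illustrative only: resamples the observations with replacement and
    reads the interval off the sorted resampled means.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper
```

Given a list of daily consumption rates in cm/day, `bootstrap_interval(rates)` returns the estimate and the two ends of the interval. It says nothing at all about why the rate is what it is.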

The problem here is that we try to be systematic in doing our statistical inferences, so that we don’t fall prey to cognitive biases: we pre-specify what we’re going to do and have to raise the bar if we start doing lots of tests. However, there’s no systematic approach like that to explanation. In fact, it’s not at all clear where these explanations come from, apart from thunderbolts of inspiration, and it’s only somewhat understood how we judge a good explanation from a poor one (as ever, I refer you to Peter Lipton’s book Inference To The Best Explanation for further reading).
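"Raising the bar" for lots of tests can be made concrete with a Bonferroni correction. It is one common way of doing it rather than anything this post prescribes, and the function names are made up.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold when n_tests are run at overall alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant_after_correction(p_values, alpha=0.05):
    """Flag which p-values still clear the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]
```

With three tests at an overall 0.05, each p-value has to come in at or below about 0.0167, so a result that looked fine on its own at 0.04 no longer counts.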

When you get a great, satisfying explanation, it’s tempting to stop looking, but when you have compelling stats that don’t lead to a nice explanation, you might keep poking at the data, looking for patterns you like better, that suggest just such a nice explanation to you. Then, all the statistical work is no more sound than the explanatory thunderbolts.

Sad to relate, dear Reader, even Robert falls into these traps. On the web page, I wrote an explanation of the pattern of seed consumption, without giving too much thought to it:

I interpret the pattern along these lines: in mid-summer, the consumption increases massively as all the chicks leave the nest and start learning how to feed themselves. The sparrows in particular move around and feed in flocks of up to 20 birds. Once seeds and berries are available in the country though, it is safer for them to move out there than to stay in the suburbs with prowling cats everywhere. But as the new year arrives, the food runs out and they move back in gradually, still in large flocks, before splitting into small territories to build nests. Cycle of life and all that.

That was based on unsystematic observation before I started collecting the data. My hunches were informed by sketchy information about the habits of house sparrows, gleaned from goodness-knows-where, and they were backed up by the first year of data. I felt pretty smug. On this basis, one would feel confident predicting the future. But then, things started to unravel. The pattern no longer fit and there were multiple competing explanations for this. The data alone could not choose between them. Fundamentally, I realised I was sleepwalking into ignoring one of my own rules: don’t treat complex adaptive systems like physical experiments. An ecosystem — some gardens, parks and terrain vague, plus a bunch of songbirds, raptors, insects, squirrels, humans and cats — is a complex adaptive system, and the same issues beset any research in society. Causal relationships are non-linear, highly interdependent, and there are intelligent agents in the system. This all contributes to the same input producing very different outputs on different occasions, because the rules of the system change.

If it is foolish to declare an explanation on the basis of one year’s data that happen to match prior beliefs, it is equally so to declare an explanation for why 2016 shifted from 2015’s pattern after just a few months. It’s also foolish to say that after March 2016 consumption dropped and that coincided with a new roof being built on our garage, so that is the cause — yet I did just that:

The sharp drop in March 2016 was the result of work going on to replace the roof on our garage, which introduced scary humans into the garden all day. [emphasis added]

How embarrassing. Yes, it’s a nice explanation, but it doesn’t really have any competitors because there are no other observed causes, and there are no other observed effects either (I don’t sit out there all day taking notes like Thoreau). It’s only likely insofar as it is in a set of one and there are no likelier competitors. It’s only lovely insofar as it explains the data, and there’s nothing else to explain. And when we get later in the year, the congruence it enjoys with prior beliefs of causal mechanisms deteriorates: surely those birds would get used to the new roof and come back?

But we are all capable of remarkable mental gymnastics to keep our favourite explanation in the running. My neighbour had a tree cut down in June, so that would upset things further, and would be a permanent shift in the system. It was a good year for insects, following a frost-free winter, so there was less pressure to go feeding in comparatively dangerous gardens. And so on, getting ever more fanciful. The evidence for any of these mechanisms is thin to say the least. We can’t guard against this mental laxity because it’s the same process that helped our family members long ago to eat the springbok in the bushes but not get eaten by the corresponding lion, and now it’s hard-wired, but we can at least acknowledge that science* is somewhat subjective, even though we try our best to impose strict lab-notes-style hypothetico-deduction on it (this does not necessarily imply the use of decision theoretic devices like significance tests; the birdfeeder page only does splines which is basically non-parametric descriptive statistics), and not pretend to be able to know the Secrett of Nature by way of Experiment.

(More reading: Terry Speed’s article Creativity In Statistics.)

* – while simple physical sciences — of the sort you and I did in high school — might lead to Secretts, life and social sciences certainly don’t, and in fact the modern physical sciences involve pushing instruments to their limits and then statistics comes in to help sift the signal from the noise, so they too are a step removed from claiming to have proven physical laws.

Also, this exercise has something to say about data collection, or appropriation. I made some ground rules that were not very well specified, but in essence, I thought it best to write down how much seed I put in, rather than adjust for why it had come out. Spilled seeds were not to be separated from eaten seeds. But then in July 2015, I found a whole big feeder gone after filling it up in the morning, and ascribed this (on no evidence) to a squirrel. I thought about disregarding the day, but decided not to in the end. For one thing, it turned out in the cold light of analysis to be not so different to other high-consumption days. Maybe it was just especially ravenous sparrows. Or maybe Cyril the Squirrel had been at work all along. Once again, there was no way to choose one explanation from another. Now, the more you learn about the source of the data in all its messy glory, the more you question. But without that information, you wouldn’t. Another subjective, mutable aspect appears, one which is more relevant in this age of readily available and reused data.

All of these bird seed problems also appear in real research and analysis, but there, drawing the wrong conclusions can cause real harm. In each of the cock-ups above, it is the explanation that causes the problem, not the stats.

As I write this, I feel like I keep banging the same subjectivity and explanation drums, but, frankly, I don’t see much evidence of practice changing. I think the replication efforts of recent years in psychology are somewhat helpful, but are limited to fighting on a very narrow front. It probably helps to terrorise researchers more generally regarding poor practices but what we also need is a friendly acceptance of subjectivity and the role explanation plays. Science is hard.

Filed under Visualization