Jake Conway, Alexander Lex & Nils Gehlenborg have made an R package called UpSetR which, as the name suggests, puts on an iron shirt and chases the devil out of Eart’. The devil in this case being Venn diagrams. Invariably, when people want to count up combinations of stuff, they end up hand-bodging some crappy diagram that isn’t even a real Venn. They use PowerPoint or some other diabolical tool. Now you can do better in R.
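For instance, here’s a minimal sketch, assuming you have UpSetR installed (the set sizes are made up):

```r
# Made-up intersection sizes for three sets, fed to UpSetR's fromExpression()
library(UpSetR)
upset(fromExpression(c(
  A = 25, B = 12, C = 9,
  "A&B" = 6, "A&C" = 4, "B&C" = 3, "A&B&C" = 2
)))
```

Instead of three wonky ellipses, you get a bar chart of intersection sizes over a matrix showing which sets each bar combines.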
All I’m going to do this week is point you to Andy Kirk’s blog. He’s considering bivariate choropleth maps. You what? Each region has two variables: maybe one gets encoded as hue and the other as saturation. No way. Yes way, and it’s not necessarily the train wreck you’d imagine. Check them out.
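If you fancy cooking up such a palette yourself, here’s a minimal base-R sketch (the particular hues and saturations are my own arbitrary picks, not Andy’s):

```r
# A 3x3 bivariate palette: variable 1 binned to three hues,
# variable 2 binned to three saturations
pal <- outer(c(0.6, 0.35, 0.1),        # hues for variable 1
             c(0.25, 0.55, 0.9),       # saturations for variable 2
             hsv, v = 0.9)
image(1:3, 1:3, matrix(1:9, 3), col = pal,
      xlab = "variable 1 (hue)", ylab = "variable 2 (saturation)")
```

Each region then gets the colour from the cell matching its pair of binned values.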
Of course, people have been doing this sort of thing for ages by superimposing objects rather than colouring in (colour being, you know, so beguiling a visual parameter to mess around with, yet so poorly perceived). Bertin’s much-quoted and less-read book has many such examples, which mostly fall flat in my view. As he wrote: “It is the designer’s duty … to flirt with ambiguity without succumbing to it”. Bof!
I have been writing a book review of Efron & Hastie’s Computer Age Statistical Inference (CASI) for Significance magazine. Here’s a tangential half page I wrote but didn’t include.
Students of statistics or data science will typically encounter some discomfiting jumps in attitude as their course progresses. First, they may have a lot of probability theory and some likelihood-based inference for rather contrived problems, which will remind them of their maths classes at school. Ah, they think, I know how to do this: I learn the tricks to manipulate the symbols and get to the QED.

Then, they find themselves suddenly in a course that provides tables of data and asks them to analyse and interpret. It’s become a practical course that connects to the real world and largely leaves the maths behind. Now there’s no QED given, and no tricks. The assessments are more like those in humanities subjects: there’s no right or wrong, and it’s the coherence of their argument that matters. Now they have to remember which options to tick in their preferred stats software. They might think: why did we do the mathematical parts of this course at all if we’re not going to use them?

Next, for some, come machine learning methods. Now the inference and asymptotic assurances are not just hidden in the cogs of the computer but actually absent. How do I know the random forest isn’t giving me the wrong answer? You don’t. It seems at first that when the problem gets really hard, like 21st-century-hard, land-a-job-at-Google-hard, we give up on stats as an interesting mental exercise from the 1930s in favour of “unreasonably effective” heuristics and greedy algorithms.
One really nice thing they do in CASI is to emphasise that all estimation methods, from the sample standard deviation to GAMs, are algorithms. The inference (I prefer to say “uncertainty”) for those algorithms comes along later in the history of the subject. The 1930s methods have had enough time to work out their inference by now, but other methods are still developing their inferential procedures. This unifies things rather better, but most teaching has yet to catch up. One problem is that almost all the effort of reformers following George Cobb, Joan Garfield and others has gone into the very early introduction to the subject. That’s probably the right place to fix first, but we need to broaden out and fix wider data science courses now.
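To make that concrete, here’s a minimal sketch of the “algorithm first, uncertainty later” pattern (the data and settings are made up): a loess smoother has no tidy textbook standard error, but the bootstrap bolts one on afterwards.

```r
set.seed(42)
d <- data.frame(x = runif(200))
d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.3)

# the "algorithm": a loess smoother evaluated at x = 0.5
fit_at <- function(data, x0 = 0.5) {
  predict(loess(y ~ x, data = data), newdata = data.frame(x = x0))
}

# the "inference", bolted on afterwards: a nonparametric bootstrap
boot_est <- replicate(1000, fit_at(d[sample(nrow(d), replace = TRUE), ]))
c(estimate = fit_at(d), boot_se = sd(boot_est))
```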
It’s Clean Air Day in the UK. Air pollution interests me, partly because I worked in medical stats for many years, partly because I don’t want to breathe in a lot of crap, and partly because I don’t want my baby to breathe in a lot of crap. London is really bad, the worst place in Europe. Not Beijing-bad, sure, but really bad, and it’s hard to imagine that Brexit will lead to anything but a relaxation of the rules.
Real World Visuals (formerly CarbonVisuals, who made the amazing mountain of CO2 balls looming over New York) have made a series of simple, elegant but powerful images about volumes of air and what they contain, and the volumes of air saturated with pollution which are left behind by one car over one kilometre travelled.
The tweet is accidentally poetic as it can’t accommodate more than the first four images, which leaves you on a cliffhanger with the massive stack looming behind the mother and girl. You know what it is but you can’t see its enormity yet.
The crowd visualisation of 9,416 dead Londoners as dots is not bad, though I like physical images of numbers of people, like this classic (adapted from http://www.i-sustain.com/old/CommuterToolkit.htm):
Here’s a picture of apparently 8-9000 people marching in Detroit:
All dead by Christmas. And then some.
You might like to compare and contrast with higher-profile causes of death, like terrorism.
I’ve been stockpiling Opal Fruits, which young people tell me are now called Starburst, in anticipation of today’s election results.
This is like one-tenth of the stash. I don’t want to eat them though. You know what you’re going to get if you knock here at Halloween.
I took the New York Times’ hexbin cartogram, imposed a 6×8 rectangular grid and counted the most common party in each block. There was a little bit of fudging and chopping up of the sweets. It is art, no?
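If you’d rather re-create the counting step in software than in sweets, it might look like this hypothetical sketch (the state positions and winners here are invented, not the NYT’s):

```r
set.seed(2016)
states <- data.frame(
  x = runif(51, 0, 6),
  y = runif(51, 0, 8),
  party = sample(c("Dem", "Rep"), 51, replace = TRUE)
)
states$bx <- ceiling(states$x)    # grid column, 1-6
states$by <- ceiling(states$y)    # grid row, 1-8

# most common party per block of the 6x8 grid
modal <- tapply(states$party, list(states$bx, states$by),
                function(p) names(which.max(table(p))))
modal    # one sweet flavour per block
```

Here’s the video: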
You know how people love maps with little shapes encoding some data? Hexagons, circles, squares? Jigsaw pieces? Opal Fruits?
Rip’t from the pages of the Times Higher Education magazine, some years ago.
Or small multiples?
You know how people love charts made from emojis?
Stick them together and what do you get?
This is by Lazaro Gamio. They’re not standard emojis. Six variables get cut into ordinal categories and mapped to various expressions. You can hover on the page (his page, not mine, ya dummy) for more info. Note that some of the variables don’t change much from state to state. Uninsured, college degrees, those change, but getting enough sleep — not so much. It must be in there because it seems fun to map it to bags under the eyes. But the categorisation effectively standardises the variables so small changes in sleep turn into a lot of visual impact. Anyway, let’s not be too pedantic, it’s fun.
This idea goes back to Herman Chernoff, who always made it clear it wasn’t a totally serious proposal, and has been surprised at its longevity (see his chapter in PPF). Bill Cleveland was pretty down on the idea in his ’85 book:
“not enough attention was paid to graphical perception … visually decoding the quantitative information is just too difficult”
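If you fancy trying the idea anyway, the aplpack package has a faces() implementation of Chernoff faces. A minimal sketch (assuming aplpack is installed), using the state.x77 data that ships with R and binning each variable into three ordinal categories first, just as the emoji map does:

```r
library(aplpack)

d <- state.x77[1:10, ]    # first ten states, eight variables
# cut() into three bins per variable -- note how this standardises
# away each variable's spread, exactly the effect described above
binned <- apply(d, 2, function(v) as.numeric(cut(v, breaks = 3)))
faces(binned, labels = rownames(d))
```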
Today I’m sharing a nice little dataset that I think has some good features for teaching. Hope you like it.
I spotted this in the museum in Jasper, Alberta in 2012 and took a photo.
Later, I e-mailed the museum to find out whom I should credit for it, and we eventually established that it originated some time ago from Parks Canada. So thanks to them, and I suggest you credit them as the source if you use it.
No, I don’t have it in a file. I think working from the typewritten page is quite helpful as it keeps people out of stats software for this. They have to think. If you want to click buttons, there are a gazillion other datasets out there. This is a different kind of exercise.
Here we have the number of scars in tree rings that indicate fires in various years. If you look back in time through a tree’s rings, you can plot when it got damaged by fire but recovered. This could give an idea of the number of fires through the years, but only with some biases. It would be an interesting exercise for students who are getting to grips with the idea of a data-generating process. You could prompt them to think up and justify proposed biases, and hopefully they will agree on stuff like:
- there’s some number of fires each year; we might be able to predict it with things like El Niño/La Niña years, the arrival of European settlers and other data sources*
- the most ancient years will have few surviving trees, so more and more fires will get missed as you go back in time (see the little simulation after this list)
- this might not be random, if the biggest (oldest) trees were more likely to get felled for wood
- there will be a point (perhaps when Jasper became a national park) after which backwoods fires are actively prevented and fought, and from then on the size of the fires, if not the number, should drop
- the bigger the fire area, the more scars will be left behind; they have to decide whether to work with the number of fires, or the size (or both…)
- the variables for size of fire will be quite unreliable in the old days, but otherwise the number of scars should link well to the number of fires
- can we really trust the area of burn in the older years? To 2 decimal places in 1665?
- and other things that are very clever and I haven’t dreamt of
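Here’s that little simulation of the survival bias (all numbers invented): even with a constant true fire rate, the record looks like fires pile up towards the present, because older scars sit in trees that have since disappeared.

```r
# Toy simulation: constant true fire rate, but scar survival declines
# with age, so the observed record is thinned more in the distant past
set.seed(1665)
years <- 1600:1970
true_fires <- rpois(length(years), lambda = 0.5)   # constant true rate
p_survive <- plogis((years - 1900) / 50)           # assumed: older scars less likely to survive
observed <- rbinom(length(years), size = true_fires, prob = p_survive)

plot(years, cumsum(observed), type = "l",
     xlab = "year", ylab = "cumulative scars recorded")
lines(years, cumsum(true_fires), lty = 2)          # dashed line: the truth
```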
* – once they are done with the data-generating process, if they are confident enough with analysis, you could give them this dataset of Canada-wide forest fires, which I pulled together a few years ago. It’s not without its own quirks, as you’ll see, but they might enjoy using it to corroborate some of their ideas.
I would ask them to propose a joint Bayesian model for the number of fires and area burnt over the years, including (if they want) predictions for the future (bearing in mind the data ends at 1971). You could also ask for sketched dataviz in a poster presentation, for example.
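If they want a starting point, here’s a minimal sketch of one ingredient of such a model, not the whole joint thing: observed scar counts thinned by a detection probability, with a grid-approximated posterior for the underlying fire rate. Everything here (counts, detection probabilities, prior) is invented for illustration.

```r
# Observed y_t ~ Poisson(lambda * p_t): true rate lambda, detection prob p_t
y <- c(1, 0, 2, 1, 3, 4, 6, 5)                      # made-up scar counts per decade
p_detect <- seq(0.3, 1, length.out = length(y))     # assumed: detection improves towards present

lambda_grid <- seq(0.01, 20, by = 0.01)
log_lik <- sapply(lambda_grid,
                  function(l) sum(dpois(y, l * p_detect, log = TRUE)))
prior <- dgamma(lambda_grid, shape = 2, rate = 0.5) # weakly informative gamma prior
post <- exp(log_lik - max(log_lik)) * prior         # stabilise before normalising
post <- post / sum(post)

plot(lambda_grid, post, type = "l",
     xlab = "true fires per decade (lambda)", ylab = "posterior probability")
```

Extending this to a joint model for counts and burnt area, with the other biases from the list folded in, is exactly the kind of thing I’d hope students argue about.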
Finally, I highly recommend a trip to Jasper. What a beautiful part of the world!