Jake Conway, Alexander Lex & Nils Gehlenborg have made an R package called UpSetR which, as the name suggests, puts on an iron shirt and chases the devil out of Eart’. The devil in this case being Venn diagrams. Invariably, when people want to count up combinations of stuff, they end up hand-bodging some crappy diagram that isn’t even a real Venn. They use PowerPoint or some other diabolical tool. Now you can do better in R.
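If you want to see what an UpSet plot is actually counting — the number of items falling into each exact combination of sets — here’s a rough Python sketch (the sets and items are made up for illustration):

```python
from collections import Counter

# Hypothetical sets: which articles mention which tool
sets = {
    "R":      {"a1", "a2", "a3", "a5"},
    "Python": {"a2", "a3", "a4"},
    "Stata":  {"a3", "a5"},
}

# Tally the exact membership combination of every item
all_items = set().union(*sets.values())
combos = Counter(
    frozenset(name for name, members in sets.items() if item in members)
    for item in all_items
)

# Print combinations, most common first (this is the bar chart UpSet draws)
for combo, n in sorted(combos.items(), key=lambda kv: -kv[1]):
    print("+".join(sorted(combo)), n)
```

That tally of exact intersections, rather than the overlapping blobs of a Venn, is the whole point of the UpSet layout.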
You know how people love maps with little shapes encoding some data? Hexagons, circles, squares? Jigsaw pieces? Opal Fruits?
Rip’t from the pages of the Times Higher Education magazine, some years ago.
Or small multiples?
You know how people love charts made from emojis?
Stick them together and what do you get?
This is by Lazaro Gamio. They’re not standard emojis. Six variables get cut into ordinal categories and mapped to various expressions. You can hover on the page (his page, not mine, ya dummy) for more info. Note that some of the variables don’t change much from state to state. Uninsured, college degrees, those change, but getting enough sleep — not so much. It must be in there because it seems fun to map it to bags under the eyes. But the categorisation effectively standardises the variables so small changes in sleep turn into a lot of visual impact. Anyway, let’s not be too pedantic, it’s fun.
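To see how that categorisation amplifies tiny differences, here’s a rough Python sketch with invented numbers: one variable with a big spread and one that barely moves, both cut into terciles, come out looking equally dramatic.

```python
import statistics

# Hypothetical state-level values (not Gamio's real data):
uninsured = [8.6, 10.3, 12.1, 14.9, 17.2, 20.5]      # varies a lot
enough_sleep = [64.1, 64.8, 65.0, 65.3, 65.9, 66.4]  # barely varies

def tercile(values):
    """Cut a variable into three ordinal categories at its terciles."""
    lo, hi = statistics.quantiles(values, n=3)  # two cut points
    return [0 if v <= lo else (1 if v <= hi else 2) for v in values]

print(tercile(uninsured))     # spread across categories 0/1/2
print(tercile(enough_sleep))  # ALSO spread across 0/1/2 -- tiny
                              # differences get the same visual weight
```

Both variables land in identical category patterns, so a fraction of a percentage point of sleep gets the same bags-under-the-eyes as a ten-point gap in insurance.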
This idea goes back to Herman Chernoff, who always made it clear it wasn’t a totally serious proposal, and has been surprised at its longevity (see his chapter in PPF). Bill Cleveland was pretty down on the idea in his ’85 book:
“not enough attention was paid to graphical perception … visually decoding the quantitative information is just too difficult”
Today I’m sharing a nice little dataset that I think has some good features for teaching. Hope you like it.
I spotted this in the museum in Jasper, Alberta in 2012 and took a photo.
Later, I e-mailed the museum to find out who I should credit for it and we eventually found that it originated some time ago from Parks Canada, so thanks to them and I suggest you credit them as source if you use it.
No, I don’t have it in a file. I think working from the typewritten page is quite helpful as it keeps people out of stats software for this. They have to think. If you want to click buttons, there are a gazillion other datasets out there. This is a different kind of exercise.
Here we have the number of scars in tree rings that indicate fires in various years. If you look back in time through a tree’s rings, you can plot when it got damaged by fire but recovered. This could give an idea of the number of fires through the years, but only with some biases. It would be an interesting exercise for students who are getting to grips with the idea of a data-generating process. You could prompt them to think up and justify proposed biases, and hopefully they will agree on stuff like:
- there’s a number of fires each year; we might be able to predict it with things like El Niño/La Niña years, arrival of European settlers and other data sources*
- the most ancient years will have few surviving trees, so more and more fires will get missed as you go back in time.
- this might not be random, if the biggest (oldest) trees were more likely to get felled for wood
- there will be a point (perhaps when Jasper became a national park) after which fires in the backwoods are actively prevented and fought, at which point the size of the fires, if not the number, should drop
- the bigger the fire area, the more scars will be left behind; they have to decide to work with number of fires, or size (or both…)
- the variables for size of the fire will be quite unreliable in the old days, but otherwise there should be a good link from number of fires to number of scars
- can we really trust the area of burn in the older years? to 2 decimal places in 1665?
- and other things that are very clever and I haven’t dreamt of
* – once they are done with the data generating process, if they are confident enough with analysis, you could give them this dataset of Canada-wide forest fires, which I pulled together a few years ago. It’s not without its own quirks, as you’ll see, but they might enjoy using it to corroborate some of their ideas.
I would ask them to propose a joint Bayesian model for the number of fires and area burnt over the years, including (if they want) predictions for the future (bearing in mind the data ends at 1971). You could also ask for sketched dataviz in a poster presentation, for example.
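If it helps to make those biases concrete before anyone touches a model, here’s a toy simulation (all the rates are invented) of the data-generating process: fires happen at a steady rate, but the chance that a scarred tree survives to be sampled decays going back in time, so the observed record thins out in the distant past.

```python
import random

random.seed(1)

# Toy data-generating process for the fire-scar record (numbers invented).
# True fires occur at a constant rate; the chance that at least one scarred
# tree survives to be sampled decays the further back in time you go.
record = []
for year in range(1600, 1972):
    true_fires = sum(random.random() < 0.15 for _ in range(3))  # crude count
    p_survive = 0.98 ** (1971 - year)  # older scars more likely to be lost
    observed = sum(random.random() < p_survive for _ in range(true_fires))
    record.append((year, true_fires, observed))

early = sum(obs for y, t, obs in record if y < 1700)
late = sum(obs for y, t, obs in record if y >= 1900)
print(early, late)  # far fewer observed fires in the distant past
```

Students could then argue about which knobs (survival decay, fire rate, suppression after park designation) belong in a real model and how they’d identify them from the typewritten page.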
Finally, I highly recommend a trip to Jasper. What a beautiful part of the world!
This came up on Twitter and lots of people were outraged, as you see in the replies and retweets.
Let’s unpack a couple of things.
- appreciate – it’s not clear what he means by this. It could mean “Many software engineers will never be really good at data science using modern machine learning”, which seems like a tautology (same goes for estate agents), but see software engineers below. It could mean “Many software engineers will never truly have an intuitive attraction to the elegant mathematical underpinnings of modern machine learning”, and in that case it is true that there is a connection between maths and, er, maths, but that’s not very interesting. Appreciating in this sense is an ivory tower luxury.
- love – lord above, are you trying to fool me in love? I think high-pressure rote learning in the Asian mould would do the trick too. It seems irrelevant.
Victorian Dad (c) Viz
- as a teen – this is what most people hated about it, the gatekeeping and stereotype-enforcement. It’s clearly bollocks, so let’s not waste time on Someone Said Something Wrong On The Internet. If you want to learn now, here’s my reading page.
- software engineers – if he really is talking about software engineers (isn’t that term, like, a bit 1990s?), then it sounds fair enough despite the inaccuracies and tautologies. Why would they want to or need to have anything to do with modern ML? I’m a statistician, but do enough programming to grasp what it is like to be a day-in, day-out coder. You just grab something that someone wrote — a random forests library perhaps — and plug it in. Why would you appreciate its theory? That’s a waste of time. You don’t go round appreciating the hell out of fibre broadband cables.
- modern machine learning – I don’t know what is meant by this, but it’s interesting to me that there are some things in ML and stats like logistic regression, which have strong, mathematical underpinnings, which is to say that their asymptotics are understood, and other things in ML and not stats, like deep learning with backprop, which are kind of greedy, heuristic and do not have guaranteed or even understood asymptotics. Depending on what he means by this phrase, there might be nothing to appreciate. If there is something to appreciate, then it might not be that modern — logistic regression was pretty much finished theoretically in the 70s, PCA in the 30s.
- math – this is the really interesting thing. Do you need maths to do data science well? It certainly helps with reading those tortuous theory papers (but they’re not that useful compared to messing about with software). It is not as useful as programming (hi, software engineers!) skills. The reason a lot of people get caught out is that they have done some analysis that ran, produced no error messages, but led to the wrong answer, and they had no mental tools to spot it. Maths will not give you those tools; you need to think about data and have messed around getting your hands dirty. I studied maths and enjoyed it and did pretty well, if I say so myself, but that has been of very little use to me. I’ve forgotten most of it.
A page of my A-level maths revision notes. I have never had to do partial fractions. Ever.
If you really do intend to be a methodological stats prof, then you’d better get good with the old x’s and y’s, but otherwise, install R and play.
Perhaps the one really useful skill I acquired is imagining data as points in space, rotating, distorting, projecting. I had to do a lot of that when doing a Masters dissertation project with PCA, MCA, etc. That has genuinely helped me to develop ideas and think about where things are going wrong.
The other important thing to think about is metrics – different ways of quantifying the distance from this data point to that one, because that underpins a lot of stuff that follows, whether stats or ML (notably loss / log-likelihood functions). And I have another blog post on this very topic coming up.
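For the curious, the two most common metrics are a one-liner each; a minimal Python sketch:

```python
import math

def euclidean(p, q):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block (L1) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Swap one for the other and nearest neighbours, cluster assignments and loss surfaces all shift — which is exactly why the choice of metric deserves more thought than it usually gets.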
Font Map is an interactive website by designers Ideo which aims to represent typefaces in 2 dimensions so you can eyeball similar ones. They make a big deal out of “leveraging AI and convolutional neural networks to draw higher-vision pattern recognition”. I’m not sure what that sentence means, though I conclude they got a thrill out of it. (I refer to the opaque boardroom talk; I know perfectly well what these techniques are.) What we see on the screen is a classic horseshoe shape of dimension reduction that happens when you have an underlying continuum that mostly lies along one axis. You see this with principal components analysis, multiple correspondence analysis, multidimensional scaling, whatever. t-SNE screws around with it (read: anisotropically transforms the projected space) to straighten out that hoof.
On this basis, we seem to have one overarching scale from italic to bold. That’s not much of a breakthrough, and although there certainly is merit in a list of similar fonts, you don’t need a whizzy graphic for it. It would also be better done by humans, as some of the fonts are misplaced to my eye. But that’s CNNs for ya; I’d also like some exploration of what features are detected. In a blog post, Ideo’s project lead Kevin Ho explains the method. I don’t know to what extent the number of training images mattered, but that is something to think about if you are doing this sort of thing. Then there’s an image of “early results” through t-SNE that, to my mind, looks better than the final results, because more clusters emerge that way. It’s not clear how he then got to the final result, though it looks like maybe he just went easy on the t-SNE special sauce, or took the k-D (k>2) projection and then smacked it down further through PCA (ML people love PCA, they think it has magical powers). I don’t know. (You should check out this page on t-SNE, once you understand the principle, by those ninjas of interactivity Viegas & Wattenberg, plus Ian Johnson of Google Cloud).
All in all, you know, it’s fun, and it’s important to experiment (as my grandad said about tasting his own urine), but if you talk up the AI angle too much, people who know about it will start to doubt the quality of your work. That’s a pity but it can be guarded against by providing lots of details of your method and viewing it as an ongoing exploration, not a done deal. I say this as advice to young people, not criticism of Kevin Ho’s work because I just don’t know what he did.
I’ve occasionally asked myself odd superimpose-geographies questions like “how far is it from A to B if they were in Winchester?” (because I can feel those distances better) or “would the West Kennet Long Barrow fit inside the Broadgate Centre?” (I’m sure we’ve all thought that). Hans Hack has made an online map like that, with a serious purpose, which superimposes Aleppo and the destroyed parts onto London.
It’s all done in leaflet.js and weighs in at 800 lines of code with a lot of generous — luxurious one might say — spacing, so it is well within your grasp to do something like this. It’s also just pretty, with sparing colour and layering of information with simple controls. There is also a Berlin version. I suppose you have to know the host city for it to hit home but then it’s a powerful message about the scale of it all.
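The core trick of a superimposed-geography map is just recentring one footprint onto another city, with a correction because a degree of longitude covers less ground as you go north. A crude Python sketch, with rough illustrative coordinates (not Hack’s actual code or data):

```python
import math

# Approximate city centres as (lat, lon) -- illustrative values only
aleppo_centre = (36.20, 37.16)
london_centre = (51.51, -0.13)

def recentre(points, old_centre, new_centre):
    """Shift (lat, lon) points so old_centre lands on new_centre,
    rescaling longitude offsets for the change in latitude so that
    ground distances are roughly preserved."""
    lon_scale = (math.cos(math.radians(old_centre[0]))
                 / math.cos(math.radians(new_centre[0])))
    return [
        (new_centre[0] + (lat - old_centre[0]),
         new_centre[1] + (lon - old_centre[1]) * lon_scale)
        for lat, lon in points
    ]

footprint = [(36.19, 37.15), (36.21, 37.17)]  # made-up damage polygon
print(recentre(footprint, aleppo_centre, london_centre))
```

Feed the shifted coordinates into leaflet’s `L.geoJSON` layer over a base map of the host city and you have the bones of the same idea.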