A more techy one this week. Ruth Fong and Andrea Vedaldi have a paper on arXiv called “Interpretable explanations of black boxes by meaningful perturbation”. The argument that some modern machine learning (let’s not start that one again) techniques are black boxes, producing an output without anybody being able to understand how or why, is a serious concern. If you don’t know how it works, how do you know you can believe it, or apply it outside the bounds of your previous data (in the manner of the disastrous Challenger space shuttle launch)?
HT @poolio for tweeting this, otherwise I’d never have heard about it.
The paper is heavy on the maths but thanks to the visual nature of convolutional neural networks (CNNs), which are high-dimensional non-linear statistical models to classify images, you can absorb the message very easily. Take the image, put it through the CNN, get a classification. Here, from the paper’s Figure 1, we see this image classified as containing a flute with probability 0.9973:
Then, they randomly perturb an area of the image and run it again, checking how it has affected the prediction probability. When they find an area that strongly adversely affects the CNN, they conclude that it is here that the CNN is “looking”. Here’s a perturbed image:
You can see it’s the flute that has been blurred. They then show the impact of different regions in this “learned mask” heatmap:
(I’m glossing over the computational details quite a lot here because this post is about dataviz.) It rather reminds me of the old days when I was an undergrad and had to calculate a gazillion different types of residuals and influence statistics, many of which were rather heuristic. You could do this kind of thing with all kinds of black boxes (as Fong & Vedaldi suggest by proposing a general theory of “explanations”), as long as there are some dimensions that are structural (x and y position in the case of image data) and others that can get perturbed (RGB values in this case). I think it would be valuable in random forests and boosted trees.
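If you want to get a feel for the perturb-and-rerun idea, here is a minimal sketch in Python. To be clear about the assumptions: this is a crude sliding-window version, not Fong & Vedaldi's method (theirs learns the mask rather than scanning patches), `toy_predict` is a made-up stand-in for a CNN, and I fill each patch with the image mean instead of blurring it. The names `occlusion_heatmap` and `toy_predict` are mine.

```python
import numpy as np

def occlusion_heatmap(image, predict, patch=4):
    """Score drop when each patch is replaced by the image mean.

    `predict` is any black box mapping an image to a single score.
    """
    h, w = image.shape[:2]
    baseline = predict(image)
    fill_value = image.mean()
    heat = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            perturbed = image.copy()
            perturbed[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = fill_value
            # A big drop means the black box was relying on this patch
            heat[i, j] = baseline - predict(perturbed)
    return heat

def toy_predict(img):
    """Hypothetical black box: it only 'looks' at one 4x4 region."""
    return float(img[8:12, 4:8].mean())

# A 16x16 image that is blank apart from the region the toy model cares about
image = np.zeros((16, 16))
image[8:12, 4:8] = 1.0

heat = occlusion_heatmap(image, toy_predict)
hot = np.unravel_index(np.argmax(heat), heat.shape)
print("most influential patch (row, col):", hot)
```

Swap `toy_predict` for a real classifier's probability for one class and `heat` becomes a rough version of the heatmap above. The structural dimensions (x and y) decide where the patches go, and the perturbable ones (pixel values) are what gets overwritten.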
They also have a cup of coffee where the mask makes sense when artifacts are added (the kind of artifact that is known to mess with CNNs yet not human brains) and a maypole dance that doesn’t so much (and this seems to be powered by the same CNN tendency to spot ovals). This is potentially very informative for refining the CNN structure.
If you are interested in communicating the robustness of CNNs effectively, you should read this too.
On Twitter, @SirSandGoblin is tracking polls before the UK general election in the medium of cross-stitch.
You just have to look. This is clearly the work of a dataviz genius. I have nothing more to say.
This came up on Twitter and lots of people were outraged, as you see in the replies and retweets.
Let’s unpack a couple of things.
- appreciate – it’s not clear what he means by this. It could mean “Many software engineers will never be really good at data science using modern machine learning”, which seems like tautology (same goes for estate agents), but see software engineers below. It could mean “Many software engineers will never truly have an intuitive attraction to the elegant mathematical underpinnings of modern machine learning”, and in that case it is true that there is a connection between maths and, er, maths, but that’s not very interesting. Appreciating in this sense is an ivory tower luxury.
- love – lord above, are you trying to fool me in love? I think high-pressure rote learning in the Asian mould would do the trick too. It seems irrelevant.
Victorian Dad (c) Viz
- as a teen – this is what most people hated about it, the gatekeeping and stereotype-enforcement. It’s clearly bollocks, so let’s not waste time on Someone Said Something Wrong On The Internet. If you want to learn now, here’s my reading page.
- software engineers – if he really is talking about software engineers (isn’t that term, like, a bit 1990s?), then it sounds fair enough despite the inaccuracies and tautologies. Why would they want to or need to have anything to do with modern ML? I’m a statistician, but do enough programming to grasp what it is like to be a day-in, day-out coder. You just grab something that someone wrote — a random forests library perhaps — and plug it in. Why would you appreciate its theory? That’s a waste of time. You don’t go round appreciating the hell out of fibre broadband cables.
- modern machine learning – I don’t know what is meant by this, but it’s interesting to me that there are some things in ML and stats like logistic regression, which have strong, mathematical underpinnings, which is to say that their asymptotics are understood, and other things in ML and not stats, like deep learning with backprop, which are kind of greedy, heuristic and do not have guaranteed or even understood asymptotics. Depending on what he means by this phrase, there might be nothing to appreciate. If there is something to appreciate, then it might not be that modern — logistic regression was pretty much finished theoretically in the 70s, PCA in the 30s.
- math – this is the really interesting thing. Do you need maths to do data science well? It certainly helps with reading those tortuous theory papers (but they’re not that useful compared to messing about with software). It is not as useful as programming skills (hi, software engineers!). The reason a lot of people get caught out is that they have done some analysis that ran, produced no error messages, but led to the wrong answer, and they had no mental tools to spot it. Maths will not give you that tool; you need to think about data and have messed around getting your hands dirty. I studied maths and enjoyed it and did pretty well, if I say so myself, but that has been of very little use to me. I’ve forgotten most of it.
A page of my A-level maths revision notes. I have never had to do partial fractions. Ever.
If you really do intend to be a methodological stats prof, then you’d better get good with the old x’s and y’s, but otherwise, install R and play.
Perhaps the one really useful skill I acquired is imagining data as points in space, rotating, distorting, projecting. I had to do a lot of that when doing a Masters dissertation project with PCA, MCA, etc. That has genuinely helped me to develop ideas and think about where things are going wrong.
The other important thing to think about is metrics – different ways of quantifying the distance from this data point to that one, because that underpins a lot of stuff that follows, whether stats or ML (notably loss / log-likelihood functions). And I have another blog post on this very topic coming up.
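To make the point about metrics concrete, here is a small sketch with made-up points (the coordinates and the names `query`, `a` and `b` are purely illustrative). It asks which of two candidates is nearest to a query point under three common distances.

```python
import numpy as np

def euclidean(p, q):
    return float(np.sqrt(np.sum((p - q) ** 2)))

def manhattan(p, q):
    return float(np.sum(np.abs(p - q)))

def chebyshev(p, q):
    return float(np.max(np.abs(p - q)))

query = np.array([0.0, 0.0])
candidates = {"a": np.array([3.0, 3.0]), "b": np.array([0.0, 4.0])}

nearest = {}
for name, metric in [("euclidean", euclidean),
                     ("manhattan", manhattan),
                     ("chebyshev", chebyshev)]:
    # Pick whichever candidate has the smallest distance under this metric
    nearest[name] = min(candidates, key=lambda k: metric(query, candidates[k]))

print(nearest)
```

Euclidean and Manhattan distance both say `b` is the nearer point, while Chebyshev (largest single-coordinate difference) says `a`. Nothing about the data changed, only the definition of "close", and anything built on top of that definition (clustering, nearest neighbours, a loss function) inherits the choice.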
Font Map is an interactive website by designers Ideo which aims to represent typefaces in 2 dimensions so you can eyeball similar ones. They make a big deal out of “leveraging AI and convolutional neural networks to draw higher-vision pattern recognition”. I’m not sure what that sentence means, though I conclude they got a thrill out of it. (I refer to the opaque boardroom talk; I know perfectly well what these techniques are.) What we see on the screen is a classic horseshoe shape of dimension reduction that happens when you have an underlying continuum that mostly lies along one axis. You see this with principal components analysis, multiple correspondence analysis, multidimensional scaling, whatever. t-SNE screws around with it (read: anisotropically transforms the projected space) to straighten out that hoof.
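You can reproduce the horseshoe yourself with simulated data. This is a sketch under my own assumptions and has nothing to do with Ideo's fonts: 100 items sit along one hidden continuum `t`, and each of 60 features responds with a bell-shaped bump centred somewhere along it (the width of 0.3 is arbitrary). PCA is done with a plain SVD.

```python
import numpy as np

# Hypothetical data: 100 items along one hidden continuum t in [0, 1]
t = np.linspace(0.0, 1.0, 100)
centres = np.linspace(-1.0, 2.0, 60)   # where each feature peaks
width = 0.3                            # assumed width of each bump
X = np.exp(-(t[:, None] - centres[None, :]) ** 2 / (2 * width ** 2))

# PCA via SVD of the column-centred data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = U[:, 0] * S[0]
pc2 = U[:, 1] * S[1]

# PC1 should track the continuum; PC2 should bend into an arch along it
r_linear = abs(np.corrcoef(pc1, t)[0, 1])
r_arch = abs(np.corrcoef(pc2, (t - 0.5) ** 2)[0, 1])
print(f"|corr(PC1, t)| = {r_linear:.2f}")
print(f"|corr(PC2, (t - 0.5)^2)| = {r_arch:.2f}")
```

Plot `pc1` against `pc2` and you should see the arch: the second component is mostly a curved function of the first, so it carries little information beyond "how far from the middle of the continuum are you?".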
On this basis, we seem to have one overarching scale from italic to bold. That’s not much of a breakthrough, and although there certainly is merit in a list of similar fonts, you don’t need a whizzy graphic for it. It would also be better done by humans, as some of the fonts are misplaced to my eye. But that’s CNNs for ya; I’d also like some exploration of what features are detected. In a blog post, Ideo’s project lead Kevin Ho explains the method. I don’t know to what extent the number of training images mattered, but that is something to think about if you are doing this sort of thing. Then there’s an image of “early results” through t-SNE that, to my mind, looks better than the final results, because more clusters emerge that way. It’s not clear how he then got to the final result, though it looks like maybe he just spared the t-SNE special sauce, or took the k-D (k>2) projection and then smacked it down further through PCA (ML people love PCA, they think it has magical powers). I don’t know. (You should check out this page on t-SNE, once you understand the principle, by those ninjas of interactivity Viegas & Wattenberg, plus Ian Johnson of Google Cloud).
All in all, you know, it’s fun, and it’s important to experiment (as my grandad said about tasting his own urine), but if you talk up the AI angle too much, people who know about it will start to doubt the quality of your work. That’s a pity but it can be guarded against by providing lots of details of your method and viewing it as an ongoing exploration, not a done deal. I say this as advice to young people, not criticism of Kevin Ho’s work because I just don’t know what he did.
I’ve occasionally asked myself odd superimpose-geographies questions like “how far is it from A to B if they were in Winchester?” (because I can feel those distances better) or “would the West Kennet Long Barrow fit inside the Broadgate Centre?” (I’m sure we’ve all thought that). Hans Hack has made an online map like that, with a serious purpose, which superimposes Aleppo and the destroyed parts onto London.
It’s all done in leaflet.js and weighs in at 800 lines of code with a lot of generous (luxurious, one might say) spacing, so it is well within your grasp to do something like this. It’s also just pretty, with sparing colour and layering of information with simple controls. There is also a Berlin version. I suppose you have to know the host city for it to hit home but then it’s a powerful message about the scale of it all.
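The geometry behind superimposing one city on another is simple enough to sketch. What follows is my own Python illustration, not Hans Hack's code (his is JavaScript, and I haven't checked how he does the shift). It assumes a flat-earth approximation that is fine at city scale, the city-centre coordinates are rough, and the `outline` points are invented.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; plenty accurate for a sketch

def offset_m(lat, lon, ref_lat, ref_lon):
    """East/north offset in metres from a reference point (small-area approximation)."""
    north = math.radians(lat - ref_lat) * EARTH_RADIUS_M
    east = math.radians(lon - ref_lon) * EARTH_RADIUS_M * math.cos(math.radians(ref_lat))
    return east, north

def place_at(east, north, ref_lat, ref_lon):
    """Inverse of offset_m: put a metre offset down around a new reference point."""
    lat = ref_lat + math.degrees(north / EARTH_RADIUS_M)
    lon = ref_lon + math.degrees(east / (EARTH_RADIUS_M * math.cos(math.radians(ref_lat))))
    return lat, lon

def transplant(points, src_centre, dst_centre):
    """Move (lat, lon) points from one city to another, keeping distances in metres."""
    return [place_at(*offset_m(lat, lon, *src_centre), *dst_centre) for lat, lon in points]

aleppo = (36.20, 37.16)     # approximate centre
london = (51.507, -0.128)   # approximate centre

# Invented outline: the centre, a point to the north, a point to the east
outline = [aleppo, (36.21, 37.16), (36.20, 37.17)]
moved = transplant(outline, aleppo, london)
print(moved)
```

The reason you can't just add a constant to the longitudes is that a degree of longitude shrinks as you go north, so a shape copied degree-for-degree from Aleppo to London would come out squashed east to west. Converting to metres first and back again avoids that.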
As an academic, I started a page on this blog site that documented each peer review I did for a journal. I never quite got round to going back in time from the start, but there isn’t much of interest there that you won’t get from the stuff I did capture. Now that I am hanging up my mortarboard, it doesn’t make sense to be a page any more so I am moving it here. Enjoy the schadenfreude if nothing else.
Statisticians are in short supply, so scientific journals find it hard to get one of us to review the papers that have been submitted to them. And yet the huge majority of these papers rely heavily on stats for their conclusions. As a reviewer, I see the same problems appearing over and over, but I know how hard it is for most scientists to find a friendly statistician to help them make it better. So, I present this log of all the papers I have reviewed, anonymised, giving the month of review, study design and broad outline of what was good or bad from a stats point of view. I hope this helps some authors improve the presentation of their work and avoid the most common problems.
I started this in November 2013, and am working backwards as well as recording new reviews, although the retrospective information might be patchy.
- November 2012, randomised controlled trial, recommended rejection. Sample size was based on an unrealistic Minimum Clinically Important Difference from prior research uncharacteristic of the primary outcome, and thus the study was unable to demonstrate benefit, and unethical because the primary outcome was about efficiency of the health system while benefit to patients had already been demonstrated, yet the intervention was withheld in the control group. Power to detect adverse events was even lower as a result, yet bold statements about safety were made. A flawed piece of work that put hospital patients at risk with no chance of ever demonstrating anything, this study should never have been approved in the first place. Of interest to scholars of evidence-based medicine, this study has now been printed by Elsevier in a lesser journal, unchanged from the version I reviewed. Such is life; I only hope the authors learnt something from the review to outweigh the reward they felt at finally getting it published.
- November 2013, cross-sectional survey, recommended rejection. Estimates were adjusted for covariates (not confounders) when it was not relevant to do so, grammar was poor and confusing in places, odds ratios were used when relative risks would be clearer, t-tests and chi-squareds were carried out and reported without any hypothesis being clearly stated or justified.
- November 2013, exploratory / correlation study, recommended major revision then rejection when authors declined to revise the analysis. Ordinal data analysed as nominal, causing an error crossing p=0.05.
- March 2014, randomised controlled trial, recommended rejection. Estimates were adjusted for covariates when it was not relevant to do so, bold conclusions were made without justification.
- April 2014, mixed methods systematic review, recommended minor changes around clarity of writing and details of one calculation.
- May 2014, meta-analysis, recommended acceptance – conducted to current best practice, clearly written and on a useful topic.
- July 2014, ecological analysis, recommended major revision. Pretty ropy on several fronts, but perhaps most importantly that any variables the authors could find had been thrown into an “adjusted” analysis with clearly no concept of what that meant or was supposed to do. Wildly optimistic conclusions too. Came back for re-review in September 2014 with toned-down conclusions and clarity about what had been included as covariates but the same issue of throwing the kitchen sink in. More “major revisions”; and don’t even think about sending it voetstoots to a lesser journal because I’ll be watching for it! (As of September 2015, I find no sign of it online)
- July 2014, some other study I can’t find right now…
- September 2014, cohort study. Clear, appropriate, important. Just a couple of minor additions to the discussion requested.
- February 2015, secondary analysis of routine data, no clear question, no clear methods, no justification of adjustment, doesn’t contribute anything that we haven’t already known for 20 years and more. Reject.
- February 2015, revision of some previously rejected paper where the authors try to wriggle out of any work by refuting basic statistical facts. Straight to the 5th circle of hell.
- March 2015, statistical methods paper. Helpful, practical, clearly written. Only the very merest of amendments.
- April 2015, secondary analysis of public-domain data. Inappropriate analysis, leading to meaningless conclusions. Reject.
- April 2015, retrospective cohort study, can’t find the comments any more… but I think I recommended some level of revisions
- September 2015, survey of a specific health service in a hard-to-reach population. Appropriate to the question, novel and important. Some amendments to graphics and tables were suggested. Minor revisions.
- March 2016, case series developing a prognostic score. Nice analysis, written very well, and a really important topic. My only quibbles were about assuming linear effects. Accept subject to discretionary changes.
- October 2016, cohort study. Adjusted for stuff that probably isn’t confounding, and adjusting (Cox regression) for competing risks when they should be recognised as such. Various facts about the participants that are not declared. Major revisions.
- October 2016, diagnostic study meta-analysis. Well done, clearly explained. A few things could be spelled out more. Minor revisions.
- November 2016, kind of a diagnostic study…, well-done, well-written, but very limited in scope and hard to tell what the implications for practice might be. Left in the lap of the
- December 2016, observational study of risk factors, using binary outcomes but would be more powerful with time-to-event if possible. Competing risks would have to be used in that case. Otherwise, nice.
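One of the recurring complaints above (the November 2013 survey) was odds ratios being reported where relative risks would be clearer. Here is a small worked example of why it matters, with invented 2x2 tables; none of these numbers come from any paper I reviewed.

```python
def risk_ratio(a, b, c, d):
    """a, b = events and non-events in the exposed; c, d = same in the unexposed."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    return (a / b) / (c / d)

# Hypothetical tables: (events exposed, non-events exposed,
#                       events unexposed, non-events unexposed)
common = (60, 40, 40, 60)    # outcome in 60% vs 40%
rare = (6, 994, 4, 996)      # outcome in 0.6% vs 0.4%

for label, table in [("common outcome", common), ("rare outcome", rare)]:
    print(label, "RR =", round(risk_ratio(*table), 3), "OR =", round(odds_ratio(*table), 3))
```

In both tables the risk in the exposed group is 1.5 times that in the unexposed group. With the common outcome the odds ratio comes out at 2.25, which readers tend to misread as "more than twice as likely"; with the rare outcome the two measures nearly coincide. So for a cross-sectional survey of something common, the odds ratio overstates the effect to anyone who reads it as a risk.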