Tag Archives: machine learning

Discomfiting jumps

I have been writing a book review of Efron & Hastie’s CASI for Significance magazine. Here’s a tangential half page I wrote but didn’t include.

Students of statistics or data science will typically encounter some discomfiting jumps in attitude as their course progresses. first, they may have a lot of probability theory and some likelihood-based inference for rather contrived problems, which will remind them of their maths classes at school. Ah, they think, I know how to do this. I learn the tricks to manipulate the symbols and get to the QED. Then, they find themselves suddenly in a course that provides tables of data and asks them to analyse and interpret. Suddenly it’s become a practical course that connects to the real world and leaves the maths behind for the most part. Now, there’s no QED given, and no tricks. The assessments suddenly are more like humanities subjects, there’s no right or wrong and it’s the coherence of their argument that matters. Now they have to remember which options to tick in their preferred stats software. They might think: why did we do the mathematical parts of this course at all if we’re not going to use them? Next, for some, come machine learning methods. Now, the inference and asymptotic assurances are not just hidden in the cogs of the computer but are actually absent. How do I know the random forest isn’t giving me the wrong answer? You don’t. It seems at first that when the problem gets really hard, like 21st-century-hard, land-a-job-at-Google-hard, we give up on stats as an interesting mental exercise from the 1930s in favour of “unreasonably effective” heuristics and greedy algorithms.

One really nice thing they do in CASI is to emphasise that all estimation, from standard deviations of samples to GAMs, are algorithms. The inference (I prefer to say “uncertainty”) for those algorithms follows later in the history of the subject. The 1930s methods had enough time to work out inference by now, but other methods are still developing their inferential procedures. This unifies things rather better, but most teaching has to catch up. One problem is that almost all the effort of reformers following George Cobb, Joan Garfield and others has been on the very early introduction to the subject. That’s probably the right place to fix first, but we need to broaden out and fix wider data science courses now.

Leave a comment

Filed under learning

Dataviz of the week, 31/5/17

A more techy one this week. Ruth Fong and Andrea Vedaldi have a paper on ArXiv called “Interpretable explanations of black boxes by meaningful perturbation”. The argument that some modern machine learning (let’s not start that one again) techniques are black boxes which produce an output but nobody can understand how and why is a serious concern. If you don’t know how it works, how do you know you can believe it, or apply it outside the bounds of your previous data (in the manner of the disastrous Challenger space shuttle launch)?

HT @poolio for tweeting this, otherwise I’d never have heard about it.

The paper is heavy on the maths but thanks to the visual nature of convolutional neural networks (CNNs), which are high-dimensional non-linear statistical models to classify images, you can absorb the message very easily. Take the image, put it through the CNN, get a classification. Here, from the paper’s Figure 1, we see this image classified as containing a flute with probability 0.9973

Screen Shot 2017-05-31 at 11.39.58

Then, they randomly perturb an area of the image and run it again, checking how it has affected the prediction probability. When they find an area that strongly adversely affects the CNN, they conclude that it is here that the CNN is “looking”. Here’s a perturbed image:

Screen Shot 2017-05-31 at 11.40.13

You can see it’s the flute that has been blurred. They then show the impact of different regions in this “learned mask” heatmap:

Screen Shot 2017-05-31 at 11.40.19

(I’m glossing over the computational details quite a lot here because this post is about dataviz.) It rather reminds me of the old days when I was an undergrad and had to calculate a gazillion different types of residuals and influence statistics, many of which were rather heuristic. You could do this kind of thing with all kinds of black boxes (as  Fong & Vedaldi suggest by proposing a general theory of “explanations”), as long as there are some dimensions that are structural (x and y position in the case of image data) and others that can get perturbed (RGB values in this case). I think it would be valuable in random forests and boosted trees.

They also have a cup of coffee where the mask makes sense when artifacts are added (the kind of artifact that is know to mess with CNNs yet not human brains) and a maypole dance that doesn’t so much (and this seems to be powered by the same CNN tendency to spot ovals). This is potentially very informative for refining the CNN structure.

Screen Shot 2017-05-31 at 11.56.38

If you are interested in communicating the robustness of CNNs effectively, you should read this too.



Leave a comment

Filed under computing, machine learning, Visualization

Two great skills to leverage best-in-class big data science analytics

This came up on Twitter and lots of people were outraged, as you see in the replies and retweets.

Let’s unpack a couple of things.

  • appreciate – it’s not clear what he means by this. It could mean “Many software engineers will never be really good at data science using modern machine learning”, which seems like tautology (same goes for estate agents), but see software engineers below. It could mean “Many software engineers will never truly have an intuitive attraction to the elegant mathematical underpinnings of modern machine learning”, and in that case it is true that there is a connection between maths and, er, maths, but that’s not very interesting. Appreciating in this sense is an ivory tower luxury.
  • love – lord above, are you trying to fool me in love? I think high-pressure rote learning in the Asian mould would do the trick too. It seems irrelevant.


    Victorian Dad (c) Viz

  • as a teen – this is what most people hated about it, the gatekeeping and stereotype-enforcement. It’s clearly bollocks, so let’s not waste time on Someone Said Something Wrong On The Internet. If you want to learn now, here’s my reading page.
  • software engineers – if he really is talking about software engineers (isn’t that term, like, a bit 1990s?), then it sounds fair enough despite the inaccuracies and tautologies. Why would they want to or need to have anything to do with modern ML? I’m a statistician, but do enough programming to grasp what it is like to be a day-in, day-out coder. You just grab something that someone wrote — a random forests library perhaps — and plug it in. Why would you appreciate its theory? That’s a waste of time. You don’t go round appreciating the hell out of fibre broadband cables.
  • modern machine learning – I don’t know what is meant by this, but it’s interesting to me that there are some things in ML and stats like logistic regression, which have strong, mathematical underpinnings, which is to say that their asymptotics are understood, and other things in ML and not stats, like deep learning with backprop, which are kind of greedy, heuristic and do not have guaranteed or even understood asymptotics. Depending on what he means by this phrase, there might be nothing to appreciate. If there is something to appreciate, then it might not be that modern — logistic regression was pretty much finished theoretically in the 70s, PCA in the 30s.
  • math – this is the really interesting thing. Do you need maths to do data science well? It certainly helps with reading those tortuous theory papers (but they’re not that useful compared to messing about with software). It is not as useful as programming (hi, software engineers!) skills. The reason a lot of people get caught out is because they have done some analysis that ran, produced no error messages, but led to the wrong answer, and they had no mental tools to spot it. Maths will not give you that tool; you need to think about data and have messed around getting your hands dirty. I studied maths and enjoyed it and did pretty well, if I say so myself, but that has been of very little use to me. I’ve forgotten most of it.

    A page of my A-level maths revision notes. I have never had to do partial fractions. Ever.

    If you really do intend to be a methodological stats prof, then you’d better get good with the old x’s and y’s, but otherwise, install R and play.

Perhaps the one really useful skill I acquired is imagining data as points in space, rotating, distorting, projecting. I had to do a lot of that when doing a Masters dissertation project with PCA, MCA, etc. That has genuinely helped me to develop ideas and think about where things are going wrong.

The other important thing to think about is metrics – different ways of quantifying the distance from this data point to that one, because that underpins a lot of stuff that follows, whether stats or ML (notably loss / log-likelihood functions). And I have another blog post on this very topic coming up.

Leave a comment

Filed under learning

Dataviz of the week, 10/5/17

Font Map is an interactive website by designers Ideo which aims to represent typefaces in 2 dimensions so you can eyeball similar ones. They make a big deal out of “leveraging AI and convolutional neural networks to draw higher-vision pattern recognition”. I’m not sure what that sentence means, though I conclude they got a thrill out of it. (I refer to the opaque boardroom talk; I know perfectly well what these techniques are.) What we see on the screen is a classic horseshoe shape of dimension reduction that happens when you have an underlying continuum that mostly lies along one axis. You see this with principal components analysis, multiple correspondence analysis, multidimensional scaling, whatever. t-SNE screws around with it (read: anisotropically transforms the projected space) to straighten out that hoof.

Screen Shot 2017-05-09 at 13.45.14

On this basis, we seem to have one overarching scale from italic to bold. That’s not much of a breakthrough, and although there certainly is merit in a list of similar fonts, you don’t need a whizzy graphic for it. It would also be better done by humans, as some of the fonts are misplaced to my eye. But that’s CNNs for ya; I’d also like some exploration of what features are detected. In a blog post, Ideo’s project lead Kevin Ho explains the method. I don’t know to what extent the number of training images mattered, but that is something to think about if you are doing this sort of thing. Then there’s an image of “early results” through t-SNE that, to my mind, looks better than the final results, because more clusters emerge that way. It’s not clear how he then got to the final result, though it looks like maybe he just spared the t-SNE special sauce, or took the k-D (k>2) projection and then smacked it down further through PCA (ML people love PCA, they think it has magical powers). I don’t know. (You should check out this page on t-SNE, once you understand the principle, by those ninjas of interactivity Viegas & Wattenberg, plus Ian Johnson of Google Cloud).

All in all, you know, it’s fun, and it’s important to experiment (as my grandad said about tasting his own urine), but if you talk up the AI angle too much, people who know about it will start to doubt the quality of your work. That’s a pity but it can be guarded against by providing lots of details of your method and viewing it as an ongoing exploration, not a done deal. I say this as advice to young people, not criticism of Kevin Ho’s work because I just don’t know what he did.

Leave a comment

Filed under machine learning, Visualization

I’m going freelance

At the end of April 2017, I will leave my university job and start freelancing. I will be offering training and analysis, focusing on three areas:

  • Health research & quality indicators: this has been the main applied field for my work with data over the last nineteen years, including academic research, audit, service evaluation and clinical guidelines
  • Data visualisation: interest in this has exploded in recent years, and although there are many providers coming from a design or front-end development background, there are not many statisticians to back up interactive viz with solid analysis
  • Bayesian modeling: predictive models and machine learning techniques are big business, but in many cases more is needed to achieve their potential and avoid a bursting Data Science bubble, and this is where Bayes helps to capture expert knowledge, acknowledge uncertainty and give intuitive outputs for truly data-driven decisions

Considering the many “Data Science Venn Diagrams”, you’ll see that I’m aiming squarely at the overlaps from stats to domain knowledge, communication and computing. That’s because there’s a gap in the market in each of these places. I’m a statistician by training and always will be, but having read the rule book and found it eighty years out of date, I’m have no qualms in rewriting it for 21st century problems. If that sounds useful to you, get in touch at robert@robertgrantstats.co.uk

This blog will continue but maybe less frequently, although I’ll still be posting a dataviz of the week. I’ll still be developing StataStan and in particular writing some ‘statastanarm’ commands to fit specific models. I’ll still be tinkering with fun analyses and dataviz like the London Café Laptop Map or Birdfeeders Live, and you’re actually more likely to see me around at conferences. I’ll keep you posted of such movements here.

1 Comment

Filed under Uncategorized

Stats and data science, easy jobs and easy mistakes

I have been writing some JavaScript, and I was thinking about how web dev / front-end people are obliged to use the very latest tools, not so much for utility as for kudos. This seems mysterious to me but then I realised: it’s because the basic job — make a website — is so easy. The only way to tell who’s really seriously in the game is by how up to date they are. Then, this is the parallel that occurred to me: statistics is hard to get right, and a beginner is found out over and over again on the simplest tasks. On the other hand, if you do a lot of big data or machine learning or both, then you might screw stuff up left, right, and centre, but you are less likely to get caught. Because…

  • nobody has the time and energy to re-run your humungous analysis
  • it’s a black box anyway*
  • you got headhunted by Uber last week

And maybe that’s one reason why there is more emphasis on having the latest shizzle in a data science job that’s more of a mixture of stats and computer science influences. I’m not taking a view that old ways are the best here, because I’m equally baffled by statisticians who refuse to learn anything new, but the lack of transparency and accountability (oh what British words!) is concerning.

* – this is not actually true, but it is the prevailing attitude

Leave a comment

Filed under Uncategorized

A statistician’s journey into deep learning

Last week I went on a training course run by NVIDIA Deep Learning Institute to learn TensorFlow. Here’s my reflections on this. (I’ve gone easy on the hyperlinks, mostly because I’m short of time but also because, you know, there’s Google.)

Firstly, to set the scene very briefly, deep learning means neural networks — highly complex non-linear predictive models — with plenty of “hidden layers” that makes them equivalent to regressions with millions or even billions of parameters. This recent article is a nice starting point.

Only recently have we been able to fit such things, thanks to software (of which TensorFlow is the current people’s favourite) and hardware (particularly GPUs; the course was run by manufacturer NVIDIA). Deep learning is the stuff that looks at pictures and tells you whether it’s a cat or a dog. It also does things like understanding your handwriting or making some up from text, ordering stuff from Amazon at your voice command, telling your self-driving car whether that’s a kid or a plastic bag in the road ahead, classifying images of eye diseases, etc etc. You have to train it on plenty of data, which is computationally intensive, and you can do that in batches (so it is readily parallelisable, hence the GPUs), but then you can just get on and run the new predictions quite quickly, on your mobile phone for example. TensorFlow was made by Google then released as open-source software last year, and since then hundreds of people have contributed tweaks to it. It’s recently gone to version 1.0.

If you’re thinking “but I’m a statistician and I should know about this – why did nobody tell me?”, then you’re right, they sneaked it past you, those damned computer scientists. But you can pick up EoSL (Hastie, Tibshirani, Friedman) or CASI (Efron & Hastie) and get going from there. If you’re thinking “this is not a statistical model, it’s just heuristic data mining”, you’re not entirely correct. There is a loss function and you can make that the likelihood. You can include priors and regularization. But you don’t typically get more than just the point estimates, and the big concern is that you don’t know you’ve reached a global optimum. “Why not just bootstrap it?” Well, partly because of the local optima problem, partly because there is a sort of flipping of equivalent sets of weights (which you will recognise if you’ve ever bootstrapped a principal components analysis), but also because if your big model, with the big data, takes 3 hours to fit even on AWS with a whole stack of power GPUs, then you don’t want to do it 1000 times.

It’s often hard to know whether your model is any good, beyond the headline of training and test dataset accuracy (the real question is not the average performance but where the problems are and whether they can be fixed). This is like revisiting the venerable (and boring) field of model diagnostic graphics. TensorFlow Playground on the other hand is an exemplary methodviz and there is also TensorBoard which shows you how the model is doing on headline stats. But with convolutional neural networks, you can do some natural visualisation. Consider the well-trodden MNIST dataset for optical character recognition:


On the course we did some convolutional neural networks for this, and because it is a bunch of images, you can literally look at things like where the filters get activated visually. Here’s 36 filters that the network learned in the first hidden layer
and how they get activated at different places in one particular number zero:
And here we’re at the third hidden layer, where some overfitting appears – the filters get set off by the edge of the digit and also inside it, so there’s a shadowing effect. It thinks there are multiple zeros in there. It’s evident that a different approach is needed to get better results. Simply piling in more layers will not help.

I’m showing you this because it’s a rare example of where visualisation helps you refine the model and also, crucially, understand how it works a little bit better.

Other data forms are not so easy. If you have masses of continuous independent variables, you can plot them against some smoother of the fitted values, or plot residuals against the predictor, etc – old skool but effective. Masses of categorical independent variables is not so easy (it never was), and if you want to feed in autocorrelated but non-visual data, like sound waves, you will have to take a lot on faith. It would be great to see more work on diagnostic visualisation in this field.

Another point to bear in mind is that it’s early days. As Aditya Singh wrote in that HBR article above, “If I analogize [sic] it to the personal computer, deep learning is in the green-and-black-DOS-screen stage of its evolution”, which is exactly correct. To run it, you type some stuff in a Jupyter notebook if you’re lucky, or otherwise in a terminal screen. We don’t yet have super-easy off-the-peg models in a gentle GUI, and they will matter not just for dabblers but for future master modellers learning the ropes – consider the case of WinBUGS and how it trained a generation of Bayesian statisticians.

You need cloud GPUs. I was intrigued by GPU computing and CUDA (NVIDIA’s language extending C++ to compile for their own GPU chips) a couple of years ago and bought some kit to play with at home. All that is obsolete now, and you would run your deep learning code in the cloud. One really nice thing about the course was that NVIDIA provided access to their slice of AWS servers and we could play around in that and get some experience of it. It doesn’t have to be expensive; you can bid for unused GPU time. And by the way, if you want to buy a bangin’ desktop computer, let me know. One careful owner.

You need to think about — and try — lots of optimisation algorithms and other tweaks. Don’t believe people who tell you it is more art than science, that’s BS not DS. You could say the same thing about building multivariable regressions (and it would also be wrong). It’s the equivalent of doctors writing everything in Latin to keep the lucrative trade in-house. Never teach the Wu-Tang style!

It’s hard to teach yourself; I’ve found no single great tutorial code out there. Get on a course with some tuition, either face-to-face or blended.

Recurrent neural networks, which you can use for time series data, are really hard to get your head around. The various tricks they employ, called things like GRUs and LSTMs, may cause you to give up. But you must persist.

You need a lot of data for deep learning, and it has to be reliably labelled with the dependent variable(s), which is expensive and potentially very time-consuming. If you are fitting millions of weights (parameters), this should come as no surprise. Those convnet filters and their results above are trained on 1000 digits, so only 100 examples of each on average. When you pump it up to all 10,000, you get much clearer distinctions between the level-3 filters that respond to this zero and those that don’t.

The overlap between Bayes and neural networks is not clear (but see Neal & Zhang’s famous NIPS-winning model). On the other hand, there are some more theoretical aspects which make the CS guys sweat that statisticians will find straightforward, like regularisation, dropout as bagging, convergence metrics, or likelihood as loss function.

Statisticians should get involved with this. You are right to be sceptical, but not to walk away from it. Here’s some salient words from Diego Kuonen:

1 Comment

Filed under computing, learning, machine learning