Last week I went on a training course run by NVIDIA Deep Learning Institute to learn TensorFlow. Here’s my reflections on this. (I’ve gone easy on the hyperlinks, mostly because I’m short of time but also because, you know, there’s Google.)
Firstly, to set the scene very briefly, deep learning means neural networks — highly complex non-linear predictive models — with plenty of “hidden layers” that makes them equivalent to regressions with millions or even billions of parameters. This recent article is a nice starting point.
Only recently have we been able to fit such things, thanks to software (of which TensorFlow is the current people’s favourite) and hardware (particularly GPUs; the course was run by manufacturer NVIDIA). Deep learning is the stuff that looks at pictures and tells you whether it’s a cat or a dog. It also does things like understanding your handwriting or making some up from text, ordering stuff from Amazon at your voice command, telling your self-driving car whether that’s a kid or a plastic bag in the road ahead, classifying images of eye diseases, etc etc. You have to train it on plenty of data, which is computationally intensive, and you can do that in batches (so it is readily parallelisable, hence the GPUs), but then you can just get on and run the new predictions quite quickly, on your mobile phone for example. TensorFlow was made by Google then released as open-source software last year, and since then hundreds of people have contributed tweaks to it. It’s recently gone to version 1.0.
If you’re thinking “but I’m a statistician and I should know about this – why did nobody tell me?”, then you’re right, they sneaked it past you, those damned computer scientists. But you can pick up EoSL (Hastie, Tibshirani, Friedman) or CASI (Efron & Hastie) and get going from there. If you’re thinking “this is not a statistical model, it’s just heuristic data mining”, you’re not entirely correct. There is a loss function and you can make that the likelihood. You can include priors and regularization. But you don’t typically get more than just the point estimates, and the big concern is that you don’t know you’ve reached a global optimum. “Why not just bootstrap it?” Well, partly because of the local optima problem, partly because there is a sort of flipping of equivalent sets of weights (which you will recognise if you’ve ever bootstrapped a principal components analysis), but also because if your big model, with the big data, takes 3 hours to fit even on AWS with a whole stack of power GPUs, then you don’t want to do it 1000 times.
It’s often hard to know whether your model is any good, beyond the headline of training and test dataset accuracy (the real question is not the average performance but where the problems are and whether they can be fixed). This is like revisiting the venerable (and boring) field of model diagnostic graphics. TensorFlow Playground on the other hand is an exemplary methodviz and there is also TensorBoard which shows you how the model is doing on headline stats. But with convolutional neural networks, you can do some natural visualisation. Consider the well-trodden MNIST dataset for optical character recognition:
On the course we did some convolutional neural networks for this, and because it is a bunch of images, you can literally look at things like where the filters get activated visually. Here’s 36 filters that the network learned in the first hidden layer
and how they get activated at different places in one particular number zero:
And here we’re at the third hidden layer, where some overfitting appears – the filters get set off by the edge of the digit and also inside it, so there’s a shadowing effect. It thinks there are multiple zeros in there. It’s evident that a different approach is needed to get better results. Simply piling in more layers will not help.
I’m showing you this because it’s a rare example of where visualisation helps you refine the model and also, crucially, understand how it works a little bit better.
Other data forms are not so easy. If you have masses of continuous independent variables, you can plot them against some smoother of the fitted values, or plot residuals against the predictor, etc – old skool but effective. Masses of categorical independent variables is not so easy (it never was), and if you want to feed in autocorrelated but non-visual data, like sound waves, you will have to take a lot on faith. It would be great to see more work on diagnostic visualisation in this field.
Another point to bear in mind is that it’s early days. As Aditya Singh wrote in that HBR article above, “If I analogize [sic] it to the personal computer, deep learning is in the green-and-black-DOS-screen stage of its evolution”, which is exactly correct. To run it, you type some stuff in a Jupyter notebook if you’re lucky, or otherwise in a terminal screen. We don’t yet have super-easy off-the-peg models in a gentle GUI, and they will matter not just for dabblers but for future master modellers learning the ropes – consider the case of WinBUGS and how it trained a generation of Bayesian statisticians.
You need cloud GPUs. I was intrigued by GPU computing and CUDA (NVIDIA’s language extending C++ to compile for their own GPU chips) a couple of years ago and bought some kit to play with at home. All that is obsolete now, and you would run your deep learning code in the cloud. One really nice thing about the course was that NVIDIA provided access to their slice of AWS servers and we could play around in that and get some experience of it. It doesn’t have to be expensive; you can bid for unused GPU time. And by the way, if you want to buy a bangin’ desktop computer, let me know. One careful owner.
You need to think about — and try — lots of optimisation algorithms and other tweaks. Don’t believe people who tell you it is more art than science, that’s BS not DS. You could say the same thing about building multivariable regressions (and it would also be wrong). It’s the equivalent of doctors writing everything in Latin to keep the lucrative trade in-house. Never teach the Wu-Tang style!
It’s hard to teach yourself; I’ve found no single great tutorial code out there. Get on a course with some tuition, either face-to-face or blended.
Recurrent neural networks, which you can use for time series data, are really hard to get your head around. The various tricks they employ, called things like GRUs and LSTMs, may cause you to give up. But you must persist.
You need a lot of data for deep learning, and it has to be reliably labelled with the dependent variable(s), which is expensive and potentially very time-consuming. If you are fitting millions of weights (parameters), this should come as no surprise. Those convnet filters and their results above are trained on 1000 digits, so only 100 examples of each on average. When you pump it up to all 10,000, you get much clearer distinctions between the level-3 filters that respond to this zero and those that don’t.
The overlap between Bayes and neural networks is not clear (but see Neal & Zhang’s famous NIPS-winning model). On the other hand, there are some more theoretical aspects which make the CS guys sweat that statisticians will find straightforward, like regularisation, dropout as bagging, convergence metrics, or likelihood as loss function.
Statisticians should get involved with this. You are right to be sceptical, but not to walk away from it. Here’s some salient words from Diego Kuonen: