Tag Archives: machine learning

Two great skills to leverage best-in-class big data science analytics

This came up on Twitter and lots of people were outraged, as you see in the replies and retweets.

Let’s unpack a couple of things.

  • appreciate – it’s not clear what he means by this. It could mean “Many software engineers will never be really good at data science using modern machine learning”, which seems like a tautology (same goes for estate agents), but see software engineers below. It could mean “Many software engineers will never truly have an intuitive attraction to the elegant mathematical underpinnings of modern machine learning”, and in that case it is true that there is a connection between maths and, er, maths, but that’s not very interesting. Appreciating in this sense is an ivory tower luxury.
  • love – lord above, are you trying to fool me in love? I think high-pressure rote learning in the Asian mould would do the trick too. It seems irrelevant.

    [Image: Victorian Dad (c) Viz]

  • as a teen – this is what most people hated about it, the gatekeeping and stereotype-enforcement. It’s clearly bollocks, so let’s not waste time on Someone Said Something Wrong On The Internet. If you want to learn now, here’s my reading page.
  • software engineers – if he really is talking about software engineers (isn’t that term, like, a bit 1990s?), then it sounds fair enough despite the inaccuracies and tautologies. Why would they want to or need to have anything to do with modern ML? I’m a statistician, but do enough programming to grasp what it is like to be a day-in, day-out coder. You just grab something that someone wrote — a random forests library perhaps — and plug it in. Why would you appreciate its theory? That’s a waste of time. You don’t go round appreciating the hell out of fibre broadband cables.
  • modern machine learning – I don’t know what is meant by this, but it’s interesting to me that there are some things in ML and stats, like logistic regression, which have strong mathematical underpinnings, which is to say that their asymptotics are understood, and other things in ML and not stats, like deep learning with backprop, which are kind of greedy, heuristic and do not have guaranteed or even understood asymptotics. Depending on what he means by this phrase, there might be nothing to appreciate. If there is something to appreciate, then it might not be that modern — logistic regression was pretty much finished theoretically in the 70s, PCA in the 30s.
  • math – this is the really interesting thing. Do you need maths to do data science well? It certainly helps with reading those tortuous theory papers (but they’re not that useful compared to messing about with software). It is not as useful as programming skills (hi, software engineers!). The reason a lot of people get caught out is that they have done some analysis that ran, produced no error messages, but led to the wrong answer, and they had no mental tools to spot it. Maths will not give you those tools; you need to think about data and have messed around getting your hands dirty. I studied maths and enjoyed it and did pretty well, if I say so myself, but that has been of very little use to me. I’ve forgotten most of it.
    [Image: a page of my A-level maths revision notes. I have never had to do partial fractions. Ever.]

    If you really do intend to be a methodological stats prof, then you’d better get good with the old x’s and y’s, but otherwise, install R and play.

Perhaps the one really useful skill I acquired is imagining data as points in space: rotating, distorting, projecting. I had to do a lot of that when doing a Master’s dissertation project with PCA, MCA, etc. That has genuinely helped me to develop ideas and think about where things are going wrong.
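To make that “points in space” habit concrete, here is a minimal sketch in base R (nothing from the dissertation itself, just the iris data as a stand-in): PCA is no more than a rotation of the data cloud, after which you keep the axes that carry the most variance.

```r
# PCA as rotation-then-projection of a cloud of points
x <- scale(as.matrix(iris[, 1:4]))        # 150 points sitting in 4-dimensional space
pca <- prcomp(x)                          # pca$rotation is an orthogonal (rotation) matrix
projected <- x %*% pca$rotation[, 1:2]    # rotate the cloud, keep the first two axes

all.equal(unname(projected), unname(pca$x[, 1:2]))   # TRUE: same as prcomp's own scores

plot(projected, col = iris$Species, pch = 19,
     xlab = "first rotated axis", ylab = "second rotated axis")
```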

The other important thing to think about is metrics – different ways of quantifying the distance from this data point to that one, because that underpins a lot of stuff that follows, whether stats or ML (notably loss / log-likelihood functions). And I have another blog post on this very topic coming up.
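As a tiny base-R illustration of the point (made-up numbers, nothing more): change the metric and you change what “close” means, and squared-error loss is itself just a squared Euclidean distance between observed and fitted values.

```r
p1 <- c(0, 0)
p2 <- c(3, 4)
dist(rbind(p1, p2), method = "euclidean")   # 5: straight-line distance
dist(rbind(p1, p2), method = "manhattan")   # 7: city-block distance
dist(rbind(p1, p2), method = "maximum")     # 4: biggest single-coordinate gap

# the link to loss functions: squared-error loss is a squared Euclidean distance
y    <- c(1.2, 0.7, 2.5)
yhat <- c(1.0, 1.0, 2.0)
sum((y - yhat)^2)                    # residual sum of squares
as.numeric(dist(rbind(y, yhat)))^2   # the same number
```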


Filed under learning

Dataviz of the week, 10/5/17

Font Map is an interactive website by designers Ideo which aims to represent typefaces in 2 dimensions so you can eyeball similar ones. They make a big deal out of “leveraging AI and convolutional neural networks to draw higher-vision pattern recognition”. I’m not sure what that sentence means, though I conclude they got a thrill out of it. (I refer to the opaque boardroom talk; I know perfectly well what these techniques are.) What we see on the screen is a classic horseshoe shape of dimension reduction that happens when you have an underlying continuum that mostly lies along one axis. You see this with principal components analysis, multiple correspondence analysis, multidimensional scaling, whatever. t-SNE screws around with it (read: anisotropically transforms the projected space) to straighten out that hoof.
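If you want to see the horseshoe appear, and t-SNE unbend it, here is a rough R sketch of the phenomenon rather than anything to do with Ideo’s actual pipeline. It fakes an underlying 1-D continuum with ten “bump” variables; the Rtsne package is assumed to be installed, and the exact shapes will wobble with the seed and the perplexity.

```r
set.seed(1)
t <- seq(0, 3, length.out = 300)            # the underlying 1-D continuum
# ten variables, each a bump peaking at a different point along the continuum
x <- sapply(seq(0, 3, length.out = 10),
            function(m) dnorm(t, mean = m, sd = 0.5)) +
     matrix(rnorm(300 * 10, sd = 0.02), 300, 10)

pc <- prcomp(x)
plot(pc$x[, 1], pc$x[, 2], col = rainbow(300)[rank(t)], pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA: the arch / horseshoe")

library(Rtsne)                              # install.packages("Rtsne") if needed
ts <- Rtsne(x, perplexity = 30)
plot(ts$Y, col = rainbow(300)[rank(t)], pch = 19, main = "t-SNE on the same data")
```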

[Screenshot: the Font Map website]

On this basis, we seem to have one overarching scale from italic to bold. That’s not much of a breakthrough, and although there certainly is merit in a list of similar fonts, you don’t need a whizzy graphic for it. It would also be better done by humans, as some of the fonts are misplaced to my eye. But that’s CNNs for ya; I’d also like some exploration of what features are detected. In a blog post, Ideo’s project lead Kevin Ho explains the method. I don’t know to what extent the number of training images mattered, but that is something to think about if you are doing this sort of thing. Then there’s an image of “early results” through t-SNE that, to my mind, looks better than the final results, because more clusters emerge that way. It’s not clear how he then got to the final result, though it looks like maybe he just spared the t-SNE special sauce, or took the k-D (k>2) projection and then smacked it down further through PCA (ML people love PCA, they think it has magical powers). I don’t know. (You should check out this page on t-SNE, once you understand the principle, by those ninjas of interactivity Viegas & Wattenberg, plus Ian Johnson of Google Cloud).

All in all, you know, it’s fun, and it’s important to experiment (as my grandad said about tasting his own urine), but if you talk up the AI angle too much, people who know about it will start to doubt the quality of your work. That’s a pity but it can be guarded against by providing lots of details of your method and viewing it as an ongoing exploration, not a done deal. I say this as advice to young people, not criticism of Kevin Ho’s work because I just don’t know what he did.


Filed under machine learning, Visualization

I’m going freelance

At the end of April 2017, I will leave my university job and start freelancing. I will be offering training and analysis, focusing on three areas:

  • Health research & quality indicators: this has been the main applied field for my work with data over the last nineteen years, including academic research, audit, service evaluation and clinical guidelines
  • Data visualisation: interest in this has exploded in recent years, and although there are many providers coming from a design or front-end development background, there are not many statisticians to back up interactive viz with solid analysis
  • Bayesian modeling: predictive models and machine learning techniques are big business, but in many cases more is needed to achieve their potential and avoid a bursting Data Science bubble, and this is where Bayes helps to capture expert knowledge, acknowledge uncertainty and give intuitive outputs for truly data-driven decisions

If you consider the many “Data Science Venn Diagrams”, you’ll see that I’m aiming squarely at the overlaps from stats to domain knowledge, communication and computing. That’s because there’s a gap in the market in each of these places. I’m a statistician by training and always will be, but having read the rule book and found it eighty years out of date, I have no qualms in rewriting it for 21st-century problems. If that sounds useful to you, get in touch at robert@robertgrantstats.co.uk.

This blog will continue but maybe less frequently, although I’ll still be posting a dataviz of the week. I’ll still be developing StataStan and in particular writing some ‘statastanarm’ commands to fit specific models. I’ll still be tinkering with fun analyses and dataviz like the London Café Laptop Map or Birdfeeders Live, and you’re actually more likely to see me around at conferences. I’ll keep you posted on such movements here.


Filed under Uncategorized

Stats and data science, easy jobs and easy mistakes

I have been writing some JavaScript, and I was thinking about how web dev / front-end people are obliged to use the very latest tools, not so much for utility as for kudos. This seems mysterious to me but then I realised: it’s because the basic job — make a website — is so easy. The only way to tell who’s really seriously in the game is by how up to date they are. Then, this is the parallel that occurred to me: statistics is hard to get right, and a beginner is found out over and over again on the simplest tasks. On the other hand, if you do a lot of big data or machine learning or both, then you might screw stuff up left, right, and centre, but you are less likely to get caught. Because…

  • nobody has the time and energy to re-run your humungous analysis
  • it’s a black box anyway*
  • you got headhunted by Uber last week

And maybe that’s one reason why there is more emphasis on having the latest shizzle in a data science job that’s more of a mixture of stats and computer science influences. I’m not taking a view that old ways are the best here, because I’m equally baffled by statisticians who refuse to learn anything new, but the lack of transparency and accountability (oh what British words!) is concerning.

* – this is not actually true, but it is the prevailing attitude


Filed under Uncategorized

A statistician’s journey into deep learning

Last week I went on a training course run by the NVIDIA Deep Learning Institute to learn TensorFlow. Here are my reflections on this. (I’ve gone easy on the hyperlinks, mostly because I’m short of time but also because, you know, there’s Google.)

Firstly, to set the scene very briefly, deep learning means neural networks — highly complex non-linear predictive models — with plenty of “hidden layers”, which makes them equivalent to regressions with millions or even billions of parameters. This recent article is a nice starting point.

Only recently have we been able to fit such things, thanks to software (of which TensorFlow is the current people’s favourite) and hardware (particularly GPUs; the course was run by manufacturer NVIDIA). Deep learning is the stuff that looks at pictures and tells you whether it’s a cat or a dog. It also does things like understanding your handwriting or making some up from text, ordering stuff from Amazon at your voice command, telling your self-driving car whether that’s a kid or a plastic bag in the road ahead, classifying images of eye diseases, etc etc. You have to train it on plenty of data, which is computationally intensive, and you can do that in batches (so it is readily parallelisable, hence the GPUs), but then you can just get on and run the new predictions quite quickly, on your mobile phone for example. TensorFlow was made by Google then released as open-source software last year, and since then hundreds of people have contributed tweaks to it. It’s recently gone to version 1.0.
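For a toy-scale feel of the “regression with an absurd number of parameters” view, here is a hedged sketch in R using the nnet package rather than TensorFlow: a single hidden layer and five units, nothing remotely deep, but the same family of model, and you can count the weights.

```r
library(nnet)                                # single-hidden-layer networks, ships with R
set.seed(42)

train <- iris[sample(nrow(iris), 100), ]
fit <- nnet(Species ~ ., data = train, size = 5,      # 5 hidden units
            decay = 0.01, maxit = 500, trace = FALSE) # decay = weight penalty

length(fit$wts)    # number of weights (parameters): 43 even for this toy

test <- iris[setdiff(seq_len(nrow(iris)), as.integer(rownames(train))), ]
table(predicted = predict(fit, test, type = "class"), actual = test$Species)
```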

If you’re thinking “but I’m a statistician and I should know about this – why did nobody tell me?”, then you’re right, they sneaked it past you, those damned computer scientists. But you can pick up EoSL (Hastie, Tibshirani, Friedman) or CASI (Efron & Hastie) and get going from there. If you’re thinking “this is not a statistical model, it’s just heuristic data mining”, you’re not entirely correct. There is a loss function and you can make that the likelihood. You can include priors and regularization. But you don’t typically get more than just the point estimates, and the big concern is that you don’t know you’ve reached a global optimum. “Why not just bootstrap it?” Well, partly because of the local optima problem, partly because there is a sort of flipping of equivalent sets of weights (which you will recognise if you’ve ever bootstrapped a principal components analysis), but also because if your big model, with the big data, takes 3 hours to fit even on AWS with a whole stack of power GPUs, then you don’t want to do it 1000 times.
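On the loss-equals-likelihood point, here is a two-line check in R (with made-up numbers): the cross-entropy loss used for a binary classifier is exactly the negative Bernoulli log-likelihood, so minimising one is maximising the other.

```r
set.seed(7)
y <- rbinom(20, 1, 0.4)         # observed 0/1 outcomes
p <- runif(20, 0.05, 0.95)      # some model's predicted probabilities

cross_entropy <- -sum(y * log(p) + (1 - y) * log(1 - p))
neg_loglik    <- -sum(dbinom(y, size = 1, prob = p, log = TRUE))

all.equal(cross_entropy, neg_loglik)   # TRUE: the same objective function
```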

It’s often hard to know whether your model is any good, beyond the headline of training and test dataset accuracy (the real question is not the average performance but where the problems are and whether they can be fixed). This is like revisiting the venerable (and boring) field of model diagnostic graphics. TensorFlow Playground, on the other hand, is an exemplary methodviz, and there is also TensorBoard, which shows you how the model is doing on headline stats. But with convolutional neural networks, you can do some natural visualisation. Consider the well-trodden MNIST dataset for optical character recognition:

[Image: sample MNIST handwritten digits]

On the course we did some convolutional neural networks for this, and because it is a bunch of images, you can literally look at things like where the filters get activated. Here are 36 filters that the network learned in the first hidden layer:
[Image: the 36 first-layer filters]
and how they get activated at different places in one particular number zero:
[Image: first-layer filter activations on a single zero]
And here we’re at the third hidden layer, where some overfitting appears – the filters get set off by the edge of the digit and also inside it, so there’s a shadowing effect. It thinks there are multiple zeros in there. It’s evident that a different approach is needed to get better results. Simply piling in more layers will not help.
[Image: third-layer filter activations on the same zero]
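To be clear about what those pictures are showing, an activation map just records how strongly a small filter responds as it slides across the image. Here is a base-R sketch of the idea, not the course’s TensorFlow code, using a crude hand-made “stroke” and a vertical-edge filter.

```r
img <- matrix(0, 12, 12)
img[3:10, 3:4] <- 1             # a crude vertical stroke standing in for a digit

filt <- rbind(c(-1, 1),         # a 2x2 vertical-edge detector
              c(-1, 1))

activate <- function(img, filt) {
  n <- nrow(img) - nrow(filt) + 1
  m <- ncol(img) - ncol(filt) + 1
  out <- matrix(0, n, m)
  for (i in 1:n) for (j in 1:m) {
    patch <- img[i:(i + nrow(filt) - 1), j:(j + ncol(filt) - 1)]
    out[i, j] <- max(0, sum(patch * filt))   # ReLU: keep only positive responses
  }
  out
}

image(t(activate(img, filt)), main = "where the edge filter fires")
```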

I’m showing you this because it’s a rare example of where visualisation helps you refine the model and also, crucially, understand how it works a little bit better.

Other data forms are not so easy. If you have masses of continuous independent variables, you can plot them against some smoother of the fitted values, or plot residuals against the predictor, etc – old skool but effective. Masses of categorical independent variables is not so easy (it never was), and if you want to feed in autocorrelated but non-visual data, like sound waves, you will have to take a lot on faith. It would be great to see more work on diagnostic visualisation in this field.
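The “old skool” plot I mean for the continuous case is residuals against fitted values with a smoother on top; here is a base-R sketch using a deliberately too-simple model on a built-in dataset.

```r
fit <- lm(dist ~ speed, data = cars)   # stopping distance on speed, straight line only
plot(fitted(fit), resid(fit),
     xlab = "fitted values", ylab = "residuals")
lines(lowess(fitted(fit), resid(fit)), col = "red", lwd = 2)
abline(h = 0, lty = 2)
# a curved red line is the model telling you it wants a nonlinear term
```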

Another point to bear in mind is that it’s early days. As Aditya Singh wrote in that HBR article above, “If I analogize [sic] it to the personal computer, deep learning is in the green-and-black-DOS-screen stage of its evolution”, which is exactly correct. To run it, you type some stuff in a Jupyter notebook if you’re lucky, or otherwise in a terminal screen. We don’t yet have super-easy off-the-peg models in a gentle GUI, and they will matter not just for dabblers but for future master modellers learning the ropes – consider the case of WinBUGS and how it trained a generation of Bayesian statisticians.

You need cloud GPUs. I was intrigued by GPU computing and CUDA (NVIDIA’s language extending C++ to compile for their own GPU chips) a couple of years ago and bought some kit to play with at home. All that is obsolete now, and you would run your deep learning code in the cloud. One really nice thing about the course was that NVIDIA provided access to their slice of AWS servers and we could play around in that and get some experience of it. It doesn’t have to be expensive; you can bid for unused GPU time. And by the way, if you want to buy a bangin’ desktop computer, let me know. One careful owner.

You need to think about — and try — lots of optimisation algorithms and other tweaks. Don’t believe people who tell you it is more art than science, that’s BS not DS. You could say the same thing about building multivariable regressions (and it would also be wrong). It’s the equivalent of doctors writing everything in Latin to keep the lucrative trade in-house. Never teach the Wu-Tang style!

It’s hard to teach yourself; I’ve found no single great tutorial code out there. Get on a course with some tuition, either face-to-face or blended.

Recurrent neural networks, which you can use for time series data, are really hard to get your head around. The various tricks they employ, called things like GRUs and LSTMs, may cause you to give up. But you must persist.

You need a lot of data for deep learning, and it has to be reliably labelled with the dependent variable(s), which is expensive and potentially very time-consuming. If you are fitting millions of weights (parameters), this should come as no surprise. Those convnet filters and their results above are trained on 1000 digits, so only 100 examples of each on average. When you pump it up to all 10,000, you get much clearer distinctions between the level-3 filters that respond to this zero and those that don’t.

The overlap between Bayes and neural networks is not clear (but see Neal & Zhang’s famous NIPS-winning model). On the other hand, there are some more theoretical aspects that make the CS guys sweat but that statisticians will find straightforward, like regularisation, dropout as bagging, convergence metrics, or likelihood as the loss function.
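As one example from that list, here is a small R sketch (my own illustration, not anything from the course) of why regularisation looks familiar to statisticians: L2 regularisation, alias weight decay or ridge, is a Gaussian prior in disguise, and the penalised estimate coincides with the Bayesian posterior mode. For simplicity the sketch penalises the intercept as well, which you would not normally do.

```r
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))
y <- X %*% c(1, 2, -1) + rnorm(n)
lambda <- 3

# ridge: minimise RSS + lambda * sum(beta^2), which has a closed form
ridge <- solve(t(X) %*% X + lambda * diag(3), t(X) %*% y)

# Bayes: posterior mode under y ~ N(X beta, 1) and prior beta ~ N(0, 1/lambda)
log_post <- function(b) -0.5 * sum((y - X %*% b)^2) - 0.5 * lambda * sum(b^2)
bayes <- optim(rep(0, 3), log_post, method = "BFGS",
               control = list(fnscale = -1))$par   # fnscale = -1 => maximise

round(cbind(ridge = drop(ridge), bayes = bayes), 4)   # identical to 4 d.p.
```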

Statisticians should get involved with this. You are right to be sceptical, but not to walk away from it. Here are some salient words from Diego Kuonen:
[Image: quote from Diego Kuonen]


Filed under computing, learning, machine learning

Best dataviz of 2016

I’m going to return to 2014’s approach of dividing best visualisation of data (dataviz!) from visualisation of methods (methodviz!).

In the first category, as soon as I saw Jill Pelto’s watercolour data paintings I was bowled over. Time series of environmental data are superimposed and form familiar but disturbing landscapes. I’m delighted to have a print of Landscape of Change hanging in the living room at Chateau Grant. Pelto studies glaciers and spends a lot of time on intrepid-sounding field trips, so she sees the effects of climate change first hand in a way that the rest of us don’t. There’s a NatGeo article on her work here.

[Image: Jill Pelto’s watercolour “Landscape of Change”]

In the methodviz category, Fernanda Viegas, Martin Wattenberg, Shan Carter and Daniel Smilkov made a truly ground-breaking website for Google’s TensorFlow project (open source deep learning software). This shows you how artificial neural networks of the simple feedforward variety work, and allows you to mess about with their design to a certain extent. I was really impressed with how the hardest aspect to communicate — the emergence of non-linear functions of the inputs — is just simple, intuitive and obvious for users. I’m sure it will continue to help people learn about this super-trendy but apparently obscure method for years to come, and it would be great to have more pages like this for algorithmic analytical methods. You can watch them present it here.

[Screenshot: the TensorFlow Playground website]


Filed under Visualization

So you want to be a Data Science superstar

Big house? Five cars? There’s no one universal way to do it, but get a coffee and read on through this bumper post to find your own way with the advice of real experts.

Last summer, Mrs G and I were in that ridiculously long line for the cablecar in San Francisco, like predictable British tourists, and got talking to the guys next to us. One of them, Jason Jackson, was just about to start studies in business including a good dose of quantitative research and data analysis. So, we’ve stayed in touch on Twitter. Recently, he asked me what the single best resource is for getting started in data science, and I found this a surprisingly tough question.

‘Data science’ is a term widely used in business and more computing-oriented circles, while it is not always recognised in slow-moving academia, where ‘statistics’ still holds sway. They are not the same thing. DS is a mix of skills to manipulate, analyse and interpret data, drawn from statistics, computer science and machine learning. It’s hard to be world-class at all of those, but there are probably a few really irritating people like that out there. To be autonomous and not get ripped off as a freelancer or entrepreneur, you should also know how to construct and work with databases and websites, and be able to make some data visualisations. It is probably sensible to devote little, if any, energy to Big Data. I mean, just watch a few YouTube videos about Spark and you’ll be OK.

If you want to study statistics, the route to take and resources to use are well mapped-out, but DS is not so clear. And remember that DS is only one step away from BS; there are plenty of websites promising a lot and providing little. Many of the ‘great resources’ you find online turn out to be vacuous efforts to separate you from your do$h, blatant self-promotion, or just badly-explained home-made videos. I thought it would be a nice opportunity to elicit some opinions from people I respect, even if we all end up disagreeing. So, I sent the following around to anyone I could think of who would have an interesting view on this, including as far as possible people outside the classical statistics world:

Colleagues & friends,
I am writing a blog post and would love it if you would contribute just a few sentences of your views. I was asked recently what the best single resource is for teaching oneself data science (which I take to be a crossover between computer science / programming skills, classical statistics and machine learning). I am really not sure what the answer is, but I think it is a really important one and worth airing some different views. People trained initially in statistics, like me, are often negative about the concept of data science, but I think this is a mistake and we stand to up our game and learn a lot of cool tricks along the way.
It could be an online course, software to play around with, a book or anything else.
For my suggestion, I am going to lay claim to Hastie, Tibshirani & Friedman’s book “The Elements of Statistical Learning” [EoSL], combined with googling “[whatever you are trying to do] in r” and then playing around in R late into the night when you really have other things you should be getting on with.

Why specifically R? Because it has by far the biggest library of packages tackling everything from statistics to machine learning to interfacing with databases to text analysis to you name it. And it’s free.

Let’s start the replies with Bob Carpenter (Columbia), who was not a fan of ‘EoSL’:

I didn’t like Hastie et al.’s book, because I found it nearly impossible to understand from first principles. Now I find it trivially easy, of course, which is probably why they didn’t understand how hard it would be for beginners. More seriously, I would shy away from recommending a pure frequentist approach and recommend something more Bayesian.

On that Bayesian point, I have looked a bit at ‘Bayesian Reasoning and Machine Learning’ by David Barber and like the look of it. I haven’t read it thoroughly though, and I think it would make a better second or third textbook than a first. Bob continued:

For computer scientists getting into stats, I’d recommend Gelman and Hill’s book on multilevel regression. It’s too high level to teach you basic stats and probabilities, but it’s an awesome tutorial on modeling. I liked Bishop’s book [“Pattern Recognition and Machine Learning”] much better than EoSL — but then it’s more algorithm focused and gives a decent intro to probability theory. I’m a computer scientist. But it’s rather incoherent in covering so many different things that aren’t probabilistic (perceptrons, SVMs, etc.)

Well, as I see it, the mixture of probabilistic algorithms and heuristic non-probabilistic ones (particularly around unsupervised learning) is an interesting characteristic of data-science-as-useful-though-incoherent-mashup. And while we’re on the subject of tutorials in modeling, let’s not forget good old Cox & Snell, whose book is still unique and fresh in its over-the-statistician’s-shoulder view of real analysis in action, complexities, compromises and all. Mike Betancourt (Warwick), who, like Bob, came to statistics after training in another field, also came down in favour of Bishop:

Firstly I should note that I hate Elements of Statistical Learning. It’s a cookbook with lots of technical results that apply in unrealistic settings and little intuition that helps in practice. I much prefer Bishop who motivates each algorithm from a generative perspective and then ties that perspective into the examples.

Personally, I looked at both books when I wanted to learn about ML, and chose against Bishop, perhaps because unlike these two, my first degree was in math. Laurent Gatto (Cambridge) suggested some online learning:

I enjoyed the Statistical Learning Stanford Online course [1] and book [2] from the same authors you mentioned. Although I haven’t taken the course myself, I think the set of Data Science Coursera courses from Roger Peng et al. from Johns Hopkins [3] is probably quite good.
[1] http://online.stanford.edu/course/statistical-learning-winter-2014
[2] http://www-bcf.usc.edu/~gareth/ISL/
[3] https://www.coursera.org/specializations/jhu-data-science

You can always spot a true academic by the way they use proper referencing in emails. Or SMS, or Twitter…

The next theme that I got was in favour of getting your hands dirty with real data (which is the sort of thing I had in mind for tinkering late at night when you really should be doing something else). Here’s Laurent again:

I think the most crucial factor to teaching oneself data science (or programming) is a practical use case to guide the student. It’s so easy to get started with a nice resource or book and then get carried away by everyday business. I think a simple enough, yet non-trivial problem to tackle is really helpful to ground the study material in one’s real-life applications.

I think they are absolutely right that just-in-time self-taught programming for a real task and a deadline is very fast and effective. The trick is then keeping up the practice afterwards and polishing the rough edges of programming. And programming in particular is an added layer of difficulty for the novice data scientist (unless you still believe you can get by pointing and clicking in various IBM products which we do not mention on this blog). As statistician Rebecca Killick (Lancaster) put it:

My research is more and more on the borderline between classical statistics and machine learning for which I need good programming skills. I wouldn’t call myself a data scientist but many of my more theoretical colleagues probably consider me to be one. I would contribute the following book: “Machine Learning: An Algorithmic Perspective” by Stephen Marsland, again with the relevant googling of how to do things practically in R (the book gives Python examples). I also learnt much of my Python knowledge from the Appendix (and googling).

Ah yes, Python. That is also very popular in data science circles, probably more among people approaching from a web/computer science angle than a stats angle, and I’ve not got enough brain space to absorb another language, but there’s no denying its popularity, flexibility and power. It’s doubtless faster than R in most settings (though perhaps not faster than judicious use of Rcpp, the ‘seamless’ interface between high-level R and low-level C++, which is my power tool of choice). Here’s Bob Carpenter again:

If you want one recommendation from me for statisticians getting into software, it’s Hunt and Thomas’s book, The Pragmatic Programmer. It’s too high level to teach you to program, but it’s an awesome tutorial on being a solid developer and managing projects of all scales. For domain scientists getting into both, I’d recommend both this and Bishop.

And economist Nick Latimer (Sheffield):

I’ve never really been taught much statistics (aside from little bits on economics courses) or programming. Hence I am not very good at either. However, in teaching myself how to do these things the most useful thing for me was googling Stata error messages and messing around with datasets and code until I got it to do what I wanted it to do (much as you say you did with R). Seeing the code written by other people is also very useful, mainly to show you the many different ways (usually more efficient than mine) to do the same thing.

Likewise biochemist Jon Houseley (Cambridge):

My experience with R books has not been fruitful, and I am also of the Googling “how do I do XXXX in r” school. Most texts on R seem to require more statistical and/or programming knowledge than I possess. However, our bioinformatics unit runs a series of courses for biologists needing to perform basic data analysis in R – the course materials provide step-by-step guides for simple tasks and are freely available here: http://www.bioinformatics.babraham.ac.uk/training.html

Here’s Rasmus Bååth (Lund), a statistician whose hobbies include hard programming challenges like recreating Bayesian software inside a website (for fun):

For the programming part of data science it’s relatively straightforward, there are tons of great blogs (where R-bloggers is the main pusher, http://www.r-bloggers.com/) and great tutorials (if you are completely new to R http://tryr.codeschool.com/ is one of the best!). For the stats part I found it much more difficult to find good resources online, and you’ll easily find lots of conflicting advice (p-value based statistics vs. Bayes comes to mind…). For visualization Cleveland’s old book is a gold mine (http://amzn.com/0963488406 ), and the ggplot2 book (https://github.com/hadley/ggplot2-book) and cookbook (http://www.cookbook-r.com/Graphs/) shows you how to do it in practice. A great source for (practical) statistical theory is also Richard McElreath’s video lectures (https://youtu.be/WFv2vS8ESkk) and upcoming book (http://bit.ly/1NfLlsN).

Bob Carpenter pointed me to a blog post by Peter Norvig (head of research at Google): http://norvig.com/21-days.html. One quote from it that I’m going to throw into the mix here is about taking time and treating it as a serious life-changing challenge.

The key is deliberative practice: not just doing it again and again, but challenging yourself with a task that is just beyond your current ability, trying it, analyzing your performance while and after doing it, and correcting any mistakes. Then repeat. And repeat again. There appear to be no real shortcuts: even Mozart, who was a musical prodigy at age 4, took 13 more years before he began to produce world-class music.

I really recommend reading this post as it has a lot of wise advice in there, and even if you (dear reader) don’t believe me when I tell you unpalatable facts about learning, you might take it from Norvig!

And here’s Con Ariti (LSHTM, ex-CapitalOne):

I recommend the ‘little’ statistical learning book with the R labs. There is also a good example book by O’Reilly publishing I think is called “Doing data science” that has some examples and is based on a course at NYU. It is good for showing how DS is done in the real world and how much could be learnt from statistics!

Con’s ‘little’ book was Laurent Gatto’s reference 2. This is Hastie & friends’ shorter and less theoretical book ‘Introduction to Statistical Learning’ – I like them both, but don’t imagine that by reading the little one you’ll escape the algebra.

Now for a word from medical statistician Charles Opondo (Oxford):

“Best single resource” – the internet! I think the best way is to start with a personal/work/task related problem that one understands well, and by understanding the complexities and limitations of available tools and solutions then one can begin to understand the subject. I think the internet as a whole is the ‘best single source’ because good books, courses and online resources are always replaced with the next best thing, and there’s always bound to be that single source that does one, just one thing, exceptionally better than any book or course ever would.

to which I replied:

Would you advise a beginner to play rather than agonise over the theoretical foundations then?

and he said:

Absolutely – one sometimes finds, upon deeper exploration, that there is no consensus or clarity on some aspects of foundation, and that it is enough to work with methods and approaches as currently understood (talking for myself and my recent exploration of causal inference).

and I couldn’t resist:

Hmmm yes! Especially frustrating for the novice because the writings of the professors give every impression of being unquestionably the final word on the subject.

Finally, there was something of a defence of statistics. Now, I don’t imagine DS is the new stats, or that stats has had its day, but Royal Statistical Society president Peter Diggle (Lancaster, ex-CSIRO) wrote on “Statistics: a data science for the 21st century” as his presidential address, and noted that stats has some crucially important stuff to offer DS:

we can assert that uncertainty is ubiquitous and that probability is the correct way to deal with uncertainty. We understand the uncertainty in our data by building stochastic models, and in our conclusions by probabilistic inference. And on the principle that prevention is better than cure we also minimize uncertainty by the application of the design principles that Fisher laid down 80 years ago, and by using efficient methods of estimation.

So, in conclusion, it seems there is no silver bullet but rather a selection of different approaches when people offer up materials for learning these skills. Regular readers will know I’m a fan of the American Statistical Association’s GAISE guidelines for teaching stats in a modern, evidence-based way. But even that did not foresee the approach of DS. Basically, if t-tests get mentioned in the first third of any course, video, book or website, then you are looking at a reheated statistics course. The old Snedecor 1930s syllabus just doesn’t work, because so many of the ideas it leaves you with are not going to be priorities in a DS application. How do we tackle that, then, to teach statistics rigorously but leave graduates able to flex across into machine learning and programming? Here’s Peter Diggle:

Given a solid mathematical foundation, my suggested list of topics for a Master of Science degree in statistics is
(a) design,
(b) probability and stochastic processes,
(c) likelihood-based inference,
(d) computation, including numerical methods and programming,
(e) communication, including scientific writing for both technical and lay audiences, and
(f) scientific method, and the foundations of at least one substantive area of application.

I love point (e) but I am going to be more radical than Peter. I think Bayes has to be there from the start, along with getting everyone to practise proposing and justifying data-generating processes all the time, not just shoe-horned into (d). Then the exploratory stats and graphics, and any model, work to test and refine that a priori process (and not in a p<0.05 mechanistic way). Here I suggest the interested teacher takes a look at Jim Ridgway’s paper ‘Implications of the Data Revolution for Statistics Education’. Although I think he over-emphasises massive data sets, I like the principles. Others will doubtless disagree and want to teach following a classical model, then expand, but I’m concerned not only with long, luxurious university courses, but also with people teaching themselves and needing to start producing results this week, not in three years’ time. Needless to say, there is no ideal material for this, but you may note that I own a web domain called bayescamp.com – now what should I do with that, do you suppose?

A final point from me is about mathematics. Nobody raised this (except Peter Diggle in the context of a university degree course), but if you don't have confident reading and writing skills in the highly condensed and abstracted language we call algebra (including matrix algebra), you will find it hard to absorb some ideas. Pseudocode can get you some of the way, but it is probably worth setting aside a few weeks to brush up your math. Gentle's book 'Matrix Algebra' is ideal for this, I think. You will need to carve out and defend serious chunks of uninterrupted, undistracted time. The process will hurt, let's not kid ourselves, but it will pay you back for the rest of your life.


Filed under learning, R