Poor review

Firstly, let’s namecheck the wittiest punner in stats for that title, Stephen Senn.

This recent post on Andrew Gelman’s blog is essential reading. I suspect my readers are all over there too, but I’ll mention it here because of this point in wrapping up:

Peer review can serve some useful purposes. But to the extent the reviewers are actually peers of the authors, they can easily have the same blind spots. I think outside review can serve a useful purpose as well.

I’ve seen this a lot in my life as a medical (read ‘health and social care more broadly, with a dash of education’) statistician. There are distinct tribes of healthcare professionals and they do things, including research designs, analytical methods and communicating findings, in their own sweet way. There’s generally no reason; it’s just custom and ritual. If you don’t fit that mould to some extent, you get rejected. Often, I find myself consciously peppering the paper / slides with some shibboleths that will ease my journey to REFland. (Of these, sample size calculations for anything that isn’t a randomised controlled trial are the most common, although I am no stranger to the Totally Unnecessary Reporting Diagram (TURD). I draw the line at Cohen’s D though; D stands for d’oh.)

‘Outside review’ reminds me of the idea of ‘strong inference’, or having your worst enemy analyse your data too and see if they can destroy your conclusions. You don’t have to go that far though, you could just make sure that reviewers extend beyond the specialism and profession of the authors to break that parochialism and question the unquestionable.

People Against Goodness And Normalcy

Essentially, if they can’t understand it, then it’s not written well. I don’t accept any argument that the subject is just too complex for outsiders – because the authors’ interests were once upon a time confined to Lego or the Smurfs, so it must be possible – nor do I claim to have got it right myself – it’s a constant challenge to pitch Bayesian latent variable models just so for a subject-expert audience.

I don’t know about you, but I enjoy reading academic papers outside my own field (OK, no critical theory please, but I’ll consider pretty much anything else). Maybe I should start an occasional series of randomly selected academic papers here, or maybe I just don’t have time for that.

Filed under Uncategorized

Notes after the death of Pierre Boulez

I’m going to take a diversion from the staple statistical fare and mark the passing of a man who has obliquely, and not without contradiction, been a long-running source of inspiration to me. The death of composer and conductor Pierre Boulez was announced in January. There is plenty you can read about him online, so I won’t attempt any kind of obituary; rather, I want to reflect on the art-science intercourse and the unexpected lessons in living and working.

For this post, it was quite hard to decide what to cover and how to structure it. Finally, I felt I should get on with it and follow his style and form. It would be tempting to keep deleting and rewriting it every ten years or so, but I don’t plan to do that. The text is not here but on my website (click here), because it uses a little JavaScript to play the role of a conductor interlocking the rings of Le Marteau Sans Maître.

Filed under Uncategorized

Scaling Statistics at Google Teach

Not unrelatedly to the Data Science angle, I read the recent paper “Teaching Statistics at Google Scale” today. I don’t think it actually has anything to do with teaching, but it does have some genuinely interesting examples of the inventive juggling required to make inferences in big data situations. You should read it, especially if you are still not sure whether big data is really a thing.

Filed under computing

So you want to be a Data Science superstar

Big house? Five cars? There’s no one universal way to do it, but get a coffee and read on through this bumper post to find your own way with the advice of real experts.

Last summer, Mrs G and I were in that ridiculously long line for the cablecar in San Francisco, like predictable British tourists, and got talking to the guys next to us. One of them, Jason Jackson, was just about to start studies in business including a good dose of quantitative research and data analysis. So, we’ve stayed in touch on Twitter. Recently, he asked me what the single best resource is for getting started in data science, and I found this a surprisingly tough question.

‘Data science’ is a term widely used in business and more computing-oriented circles, while it is not always recognised in slow-moving academia, where ‘statistics’ still holds sway. They are not the same thing. DS is a mix of skills to manipulate, analyse and interpret data, drawn from statistics, computer science and machine learning. It’s hard to be world-class at all of those, but there are probably a few really irritating people like that out there. To be autonomous and not get ripped off as a freelancer or entrepreneur, you should also know how to construct and work with databases and websites, and be able to make some data visualisations. It is probably sensible to devote little, if any, energy to Big Data. I mean, just watch a few YouTube videos about Spark and you’ll be OK.

If you want to study statistics, the route to take and resources to use are well mapped-out, but DS is not so clear. And remember that DS is only one step away from BS; there are plenty of websites promising a lot and providing little. Many of the ‘great resources’ you find online turn out to be vacuous efforts to separate you from your do$h, blatant self-promotion, or just badly-explained home-made videos. I thought it would be a nice opportunity to elicit some opinions from people I respect, even if we all end up disagreeing. So, I sent the following round to anyone I thought would have an interesting view on this, including as far as possible people outside the classical statistics world:

Colleagues & friends,
I am writing a blog post and would love it if you would contribute just a few sentences of your views. I was asked recently what the best single resource is for teaching oneself data science (which I take to be a crossover between computer science / programming skills, classical statistics and machine learning). I am really not sure what the answer is, but I think it is a really important one and worth airing some different views. People trained initially in statistics, like me, are often negative about the concept of data science, but I think this is a mistake and we stand to up our game and learn a lot of cool tricks along the way.
It could be an online course, software to play around with, a book or anything else.
For my suggestion, I am going to lay claim to Hastie, Tibshirani & Friedman’s book “The Elements of Statistical Learning” [EoSL], combined with googling “[topic] in r” and then playing around in R late into the night when you really have other things you should be getting on with.

Why specifically R? Because it has by far the biggest library of packages tackling everything from statistics to machine learning to interfacing with databases to text analysis to you name it. And it’s free.

Let’s start the replies with Bob Carpenter (Columbia), who was not a fan of ‘EoSL’:

I didn’t like Hastie et al.’s book, because I found it nearly impossible to understand from first principles. Now I find it trivially easy, of course, which is probably why they didn’t understand how hard it would be for beginners. More seriously, I would shy away from recommending a pure frequentist approach and recommend something more Bayesian.

On that Bayesian point, I have looked a bit at ‘Bayesian Reasoning and Machine Learning’ by David Barber and like the look of it. I haven’t read it thoroughly though, and I think it would make a better second or third textbook than a first. Bob continued:

For computer scientists getting into stats, I’d recommend Gelman and Hill’s book on multilevel regression. It’s too high level to teach you basic stats and probabilities, but it’s an awesome tutorial on modeling. I liked Bishop’s book [“Pattern Recognition and Machine Learning”] much better than EoSL — but then it’s more algorithm focused and gives a decent intro to probability theory. I’m a computer scientist. But it’s rather incoherent in covering so many different things that aren’t probabilistic (perceptrons, SVMs, etc.)

Well, as I see it, the mixture of probabilistic algorithms and heuristic non-probabilistic ones (particularly around unsupervised learning) is an interesting characteristic of data-science-as-useful-though-incoherent-mashup. And while we’re on the subject of tutorials in modeling, let’s not forget good old Cox & Snell, whose book is still unique and fresh in its over-the-statistician’s-shoulder view of real analysis in action, complexities, compromises and all. Mike Betancourt (Warwick), who, like Bob, came to statistics after training in another field, also came down in favour of Bishop:

Firstly I should note that I hate Elements of Statistical Learning. It’s a cookbook with lots of technical results that apply in unrealistic settings and little intuition that helps in practice. I much prefer Bishop who motivates each algorithm from a generative perspective and then ties that perspective into the examples.

Personally, I looked at both books when I wanted to learn about ML, and chose against Bishop, perhaps because unlike these two, my first degree was in math. Laurent Gatto (Cambridge) suggested some online learning:

I enjoyed the Statistical Learning Stanford Online course [1] and book [2] from the same authors you mentioned. Although I haven’t taken the course myself, I think the set of Data Science Coursera courses from Roger Peng et al. from Johns Hopkins [3] is probably quite good.
[1] http://online.stanford.edu/course/statistical-learning-winter-2014
[2] http://www-bcf.usc.edu/~gareth/ISL/
[3] https://www.coursera.org/specializations/jhu-data-science

You can always spot a true academic by the way they use proper referencing in emails. Or SMS, or Twitter…

The next theme that I got was in favour of getting your hands dirty with real data (which is the sort of thing I had in mind for tinkering late at night when you really should be doing something else). Here’s Laurent again:

I think the most crucial factor to teaching oneself data science (or programming) is a practical use case to guide the student. It’s so easy to get started with a nice resource or book and then get carried away by everyday business. I think a simple enough, yet non-trivial problem to tackle is really helpful to ground the study material in one’s real-life applications.

I think they are absolutely right that just-in-time self-taught programming for a real task and a deadline is very fast and effective. The trick is then keeping up the practice afterwards and polishing the rough edges of programming. And programming in particular is an added layer of difficulty for the novice data scientist (unless you still believe you can get by pointing and clicking in various IBM products which we do not mention on this blog). As statistician Rebecca Killick (Lancaster) put it:

My research is more and more on the borderline between classical statistics and machine learning for which I need good programming skills. I wouldn’t call myself a data scientist but many of my more theoretical colleagues probably consider me to be one. I would contribute the following book: “Machine Learning: An Algorithmic Perspective” by Stephen Marsland, again with the relevant googling of how to do things practically in R (the book gives Python examples). I also learnt much of my Python knowledge from the Appendix (and googling).

Ah yes, Python. That is also very popular in data science circles, probably more among people approaching from a web/computer science angle than a stats angle, and I’ve not got enough brain space to absorb another language, but there’s no denying its popularity, flexibility and power. It’s doubtless faster than R in most settings (though perhaps not judicious use of Rcpp, the ‘seamless’ interface between high-level R and low-level C++, which is my power tool of choice). Here’s Bob Carpenter again:

If you want one recommendation from me for statisticians getting into software, it’s Hunt and Thomas’s book, The Pragmatic Programmer. It’s too high level to teach you to program, but it’s an awesome tutorial on being a solid developer and managing projects of all scales. For domain scientists getting into both, I’d recommend both this and Bishop.

And economist Nick Latimer (Sheffield):

I’ve never really been taught much statistics (aside from little bits on economics courses) or programming. Hence I am not very good at either. However, in teaching myself how to do these things the most useful thing for me was googling Stata error messages and messing around with datasets and code until I got it to do what I wanted it to do (much as you say you did with R). Seeing the code written by other people is also very useful, mainly to show you the many different ways (usually more efficient than mine) to do the same thing.

Likewise biochemist Jon Houseley (Cambridge):

My experience with R books has not been fruitful, and I am also of the Googling “how do I do XXXX in r” school. Most texts on R seem to require more statistical and/or programming knowledge than I possess. However, our bioinformatics unit runs a series of courses for biologists needing to perform basic data analysis in R – the course materials provide step-by-step guides for simple tasks and are freely available here: http://www.bioinformatics.babraham.ac.uk/training.html

Here’s Rasmus Bååth (Lund), a statistician whose hobbies include hard programming challenges like recreating Bayesian software inside a website (for fun):

For the programming part of data science it’s relatively straightforward, there are tons of great blogs (where R-bloggers is the main pusher, http://www.r-bloggers.com/) and great tutorials (if you are completely new to R http://tryr.codeschool.com/ is one of the best!). For the stats part I found it much more difficult to find good resources online, and you’ll easily find lots of conflicting advice (p-value based statistics vs. Bayes comes to mind…). For visualization Cleveland’s old book is a gold mine (http://amzn.com/0963488406 ), and the ggplot2 book (https://github.com/hadley/ggplot2-book) and cookbook (http://www.cookbook-r.com/Graphs/) shows you how to do it in practice. A great source for (practical) statistical theory is also Richard McElreath’s video lectures (https://youtu.be/WFv2vS8ESkk) and upcoming book (http://bit.ly/1NfLlsN).

Bob Carpenter pointed me to a blog post by Peter Norvig (head of research at Google): http://norvig.com/21-days.html One quote from that I’m going to throw into the mix here is about taking time and treating it as a serious life-changing challenge.

The key is deliberative practice: not just doing it again and again, but challenging yourself with a task that is just beyond your current ability, trying it, analyzing your performance while and after doing it, and correcting any mistakes. Then repeat. And repeat again. There appear to be no real shortcuts: even Mozart, who was a musical prodigy at age 4, took 13 more years before he began to produce world-class music.

I really recommend reading this post as it has a lot of wise advice in there, and even if you (dear reader) don’t believe me when I tell you unpalatable facts about learning, you might take it from Norvig!

And here’s Con Ariti (LSHTM, ex-CapitalOne):

I recommend the ‘little’ statistical learning book with the R labs. There is also a good example book by O’Reilly publishing I think is called “Doing data science” that has some examples and is based on a course at NYU. It is good for showing how DS is done in the real world and how much could be learnt from statistics!

Con’s ‘little’ book was Laurent Gatto’s reference 2. This is Hastie & friends’ shorter and less theoretical book ‘Introduction to Statistical Learning’ – I like them both, but don’t imagine that by reading the little one you’ll escape the algebra.

Now for a word from medical statistician Charles Opondo (Oxford):

“Best single resource” – the internet! I think the best way is to start with a personal/work/task related problem that one understands well, and by understanding the complexities and limitations of available tools and solutions then one can begin to understand the subject. I think the internet as a whole is the ‘best single source’ because good books, courses and online resources are always replaced with the next best thing, and there’s always bound to be that single source that does one, just one thing, exceptionally better than any book or course ever would.

to which I replied:

Would you advise a beginner to play rather than agonise over the theoretical foundations then?

and he said:

Absolutely – one sometimes finds, upon deeper exploration, that there is no consensus or clarity on some aspects of foundation, and that it is enough to work with methods and approaches as currently understood (talking for myself and my recent exploration of causal inference).

and I couldn’t resist:

Hmmm yes! Especially frustrating for the novice because the writings of the professors give every impression of being unquestionably the final word on the subject.

Finally, there was something of a defence of statistics. Now, I don’t imagine DS is the new stats, or that stats has had its day, but Royal Statistical Society president Peter Diggle (Lancaster, ex-CSIRO) wrote on “Statistics: a data science for the 21st century” as his presidential address, and noted that stats has some crucially important stuff to offer DS:

we can assert that uncertainty is ubiquitous and that probability is the correct way to deal with uncertainty. We understand the uncertainty in our data by building stochastic models, and in our conclusions by probabilistic inference. And on the principle that prevention is better than cure we also minimize uncertainty by the application of the design principles that Fisher laid down 80 years ago, and by using efficient methods of estimation.

So, in conclusion, it seems there is no silver bullet but rather a selection of different approaches when people offer up materials for learning these skills. Regular readers will know I’m a fan of the American Statistical Association’s GAISE guidelines for teaching stats in a modern, evidence-based way. But even that did not foresee the approach of DS. Basically, if t-tests get mentioned in the first third of any course, video, book or website, then you are looking at a reheated statistics course. The old Snedecor 1930s syllabus just doesn’t work, because so many of the ideas it leaves you with are not going to be priorities in a DS application. How do we tackle that then, to teach statistics rigorously but leave graduates able to flex across into machine learning and programming? Here’s Peter Diggle:

Given a solid mathematical foundation, my suggested list of topics for a Master of Science degree in statistics is
(a) design,
(b) probability and stochastic processes,
(c) likelihood-based inference,
(d) computation, including numerical methods and programming,
(e) communication, including scientific writing for both technical and lay audiences, and
(f) scientific method, and the foundations of at least one substantive area of application.

I love point (e) but I am going to be more radical than Peter. I think Bayes has to be there from the start, along with getting everyone to practise proposing and justifying data-generating processes all the time, not just shoe-horned into (d). Then the exploratory stats and graphics, and any model, works to test and refine that a priori process (and not in a p<0.05 mechanistic way). Here I suggest the interested teacher takes a look at Jim Ridgway’s paper ‘Implications of the Data Revolution for Statistics Education’. Although I think he over-emphasises massive data sets, I like the principles. Others will doubtless disagree and want to teach following a classical model, then expand, but I’m concerned not only with long, luxurious university courses, but also people teaching themselves and needing to start producing results this week, not in three years’ time. Needless to say, there is no ideal material for this, but you may note that I own a web domain called bayescamp.com – now what should I do with that, do you suppose?
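As a rough illustration of what ‘proposing a data-generating process’ and then testing it against data might look like, here is a minimal sketch in Python (chosen only for illustration; R would do just as well). Everything in it is an assumption made up for the example: the waiting-time data, the exponential process, the crude three-point prior on the mean, and the function names. It simulates fake datasets from the proposed process and asks how often their median is at least as large as the observed one, which is a simple prior predictive check rather than anyone’s recommended method.

```python
import random
import statistics


def simulate_waiting_times(n, mean_wait, rng):
    """Draw n waiting times (minutes) from a proposed exponential process."""
    # expovariate takes a rate, so convert the mean to 1 / mean.
    return [rng.expovariate(1.0 / mean_wait) for _ in range(n)]


def predictive_check(observed, n_sims, prior_means, rng):
    """Share of simulated datasets whose median is >= the observed median."""
    obs_median = statistics.median(observed)
    count = 0
    for _ in range(n_sims):
        # Hypothetical, deliberately crude prior: pick one candidate mean at random.
        mean_wait = rng.choice(prior_means)
        fake = simulate_waiting_times(len(observed), mean_wait, rng)
        if statistics.median(fake) >= obs_median:
            count += 1
    return count / n_sims


rng = random.Random(2016)  # fixed seed so the sketch is repeatable
observed = [3.1, 0.4, 7.9, 2.2, 5.0, 1.3, 12.6, 0.9, 4.4, 2.8]  # made-up data
share = predictive_check(observed, 2000, [2.0, 4.0, 8.0], rng)
print(f"Share of simulated medians >= observed: {share:.2f}")
```

A share very close to 0 or 1 would suggest the proposed process rarely produces data like those observed, which is the cue to go back and refine it.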

A final point from me is about mathematics. Nobody raised this (except Peter Diggle in the context of a university degree course), but if you don't have confident reading and writing skills in the highly condensed and abstracted language we call algebra (including matrix algebra), you will find it hard to absorb some ideas. Pseudocode can get you some of the way, but it is probably worth setting aside a few weeks to brush up your math. Gentle's book 'Matrix Algebra' is ideal for this, I think. You will need to carve out and defend serious chunks of uninterrupted, undistracted time. The process will hurt, let's not kid ourselves, but it will pay you back for the rest of your life.

Filed under learning, R

If you’re using Stata and you want to do Bayes, you should be using StataStan

All I’m saying this morning is that Andrew Gelman’s blog from yesterday was about a paper that we wrote together. We sent StataStan in head to head against Stata’s own bayesmh. As co-author Bob Carpenter says in the comments, “a rag-tag bunch of academics mainly working in their spare time up against the product of a professional software company”. Man, it’s a heartwarming story. Offers for film rights should be sent to Robert Grant, Grosvenor Wing, St George’s Hospital, London. Actually, this was a great honour because Stan is such an amazing piece of software, and my co-authors are the real geniuses behind it (not me). I believe it was Method Man who put it best when he said, “we formed like Voltron and yo, he just happened to be the head”.

Source: If you’re using Stata and you want to do Bayes, you should be using StataStan

Filed under Uncategorized

Visualisation of the year 2015

There was never any doubt about this one. It had to be Dear Data. Earlier this year, I posted a six-page handwritten blog post about it, which is not something I do all the time.

Other dataviz bloggers and tweeters with end-of-year lists have cited Dear Data because it’s different, or cute. I think it’s much more important than that. I think the detailed examination of the process, and the fact that the pace required constant innovation and experimentation from creators Stefanie Posavec and Giorgia Lupi, makes it a uniquely detailed exemplar for datavizzers present and future to read, think about and learn from. It doesn’t try to be perfect all the time, and that’s fine too. Many of the notes on production reflect on how things went a bit wrong (or in some cases, spectacularly wrong (and I am reminded not to put bottles of water into rucksacks without thorough checks)).

Over time, you get a real idea of personal style preferences and interests. Lupi is keen on symbols, which is something you see in her work with Studio Accurat too. One of the first pieces of theirs that caught my eye involved elongated triangles. I had been thinking about depicting movement in two dimensions. The obvious choice is arrows, but at a glance they can become terribly confusing as soon as the movement is not smoothly laminar. I then realised the Accurat triangles were much more experimental and not to be taken with such scientific conservatism; they didn’t work for me because I always take Cleveland’s advice: dataviz is a translation of data into vision, and for it to work the viewer has to be able to translate it back to numbers in their head – and, I would add, to do so easily. Posavec’s work is generally more artistically informed and less about data in the journalistic or scientific tradition, yet somehow features more parallel lines and tree structures. I still think Phantom Terrains is an incredible piece of work, partly because it ticks almost all the Robert’s Interests boxes – maybe include some Neolithic monuments next time? – but also for the way it weaves together different types of data and design seamlessly.

Notably, none of the postcards contains a single axis! This is not the place for that sort of thing. Has Dear Data inspired me to do more? Yes, I think so, and even to publish it here and then try to drum up some funding to expand on it. So, expect to see my London noise pollution map and code appearing at some point in 2016. Why? Mostly because it’s a surprising amount of fun to collect data and make pictures from it.

Filed under Visualization

Complex systems reading

Tomorrow I’ll be giving a seminar in our faculty on inference in complex systems (like the health service, or social services, or local government, or society more generally). It’s the latest talk on this subject, which is really gelling now into something of a manifesto. Rick Hood and I intend to send off the paper version before Xmas, so I won’t say more about the substance of it here (and the slides are just a bunch of aide-memoire images), other than to list the references, which contain some of my favourite sources on data+science:

I deliberately omit the methodologically detailed papers from this list, but in the main you should look into Bayesian modelling, generalised coarsening, generalised instrumental variable models, structural equation models, and their various intersections.

Filed under Bayesian, research