Category Archives: learning

A statistician’s journey into deep learning

Last week I went on a training course run by the NVIDIA Deep Learning Institute to learn TensorFlow. Here are my reflections on it. (I’ve gone easy on the hyperlinks, mostly because I’m short of time but also because, you know, there’s Google.)

Firstly, to set the scene very briefly, deep learning means neural networks — highly complex non-linear predictive models — with plenty of “hidden layers”, which make them equivalent to regressions with millions or even billions of parameters. This recent article is a nice starting point.

Only recently have we been able to fit such things, thanks to software (of which TensorFlow is the current people’s favourite) and hardware (particularly GPUs; the course was run by manufacturer NVIDIA). Deep learning is the stuff that looks at pictures and tells you whether it’s a cat or a dog. It also does things like understanding your handwriting or making some up from text, ordering stuff from Amazon at your voice command, telling your self-driving car whether that’s a kid or a plastic bag in the road ahead, classifying images of eye diseases, etc etc. You have to train it on plenty of data, which is computationally intensive, and you can do that in batches (so it is readily parallelisable, hence the GPUs), but then you can just get on and run the new predictions quite quickly, on your mobile phone for example. TensorFlow was made by Google then released as open-source software last year, and since then hundreds of people have contributed tweaks to it. It’s recently gone to version 1.0.

If you’re thinking “but I’m a statistician and I should know about this – why did nobody tell me?”, then you’re right, they sneaked it past you, those damned computer scientists. But you can pick up EoSL (Hastie, Tibshirani & Friedman) or CASI (Efron & Hastie) and get going from there. If you’re thinking “this is not a statistical model, it’s just heuristic data mining”, you’re not entirely correct. There is a loss function and you can make that the likelihood. You can include priors and regularization. But you don’t typically get more than just the point estimates, and the big concern is that you don’t know you’ve reached a global optimum. “Why not just bootstrap it?” Well, partly because of the local optima problem, partly because there is a sort of flipping of equivalent sets of weights (which you will recognise if you’ve ever bootstrapped a principal components analysis), but also because if your big model, with the big data, takes 3 hours to fit even on AWS with a whole stack of powerful GPUs, then you don’t want to do it 1000 times.
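To make the likelihood-as-loss point concrete, here is a tiny base-R sketch (simulated data, nothing to do with the course): the cross-entropy loss that a neural network classifier minimises is exactly the average negative log-likelihood that logistic regression maximises.

# Cross-entropy loss is a negative log-likelihood in disguise
set.seed(42)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 + 1.2 * x))   # simulate a binary outcome

fit <- glm(y ~ x, family = binomial)          # maximum likelihood fit
p <- fitted(fit)                              # fitted probabilities

cross_entropy <- -mean(y * log(p) + (1 - y) * log(1 - p))
neg_loglik <- -as.numeric(logLik(fit)) / length(y)
c(cross_entropy = cross_entropy, neg_loglik = neg_loglik)  # the same number twice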

It’s often hard to know whether your model is any good, beyond the headline of training and test dataset accuracy (the real question is not the average performance but where the problems are and whether they can be fixed). This is like revisiting the venerable (and boring) field of model diagnostic graphics. TensorFlow Playground, on the other hand, is an exemplary methodviz, and there is also TensorBoard, which shows you how the model is doing on headline stats. But with convolutional neural networks, you can do some natural visualisation. Consider the well-trodden MNIST dataset for optical character recognition:

[Image: a sample of handwritten digits from the MNIST dataset]

On the course we did some convolutional neural networks for this, and because the data are images, you can literally look at things like where the filters get activated. Here are 36 filters that the network learned in the first hidden layer:
[Image: 36 filters learned in the first hidden layer]
and how they get activated at different places in one particular number zero:
[Image: activations of those filters at different places in one zero]
And here we’re at the third hidden layer, where some overfitting appears – the filters get set off by the edge of the digit and also inside it, so there’s a shadowing effect. It thinks there are multiple zeros in there. It’s evident that a different approach is needed to get better results. Simply piling in more layers will not help.
[Image: third-hidden-layer activations for the same zero, showing the shadowing effect]

I’m showing you this because it’s a rare example of where visualisation helps you refine the model and also, crucially, understand how it works a little bit better.
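The plotting itself is the easy part. As a minimal sketch in base R (not the course code): suppose you have already exported the first hidden layer’s weights from whatever framework you used into a hypothetical 5 × 5 × 36 array w; then a grid of heatmaps is a few lines.

# Hypothetical first-layer filters: 5 x 5 weights, 36 filters
# (random numbers stand in here for weights exported from the fitted network)
w <- array(rnorm(5 * 5 * 36), dim = c(5, 5, 36))

op <- par(mfrow = c(6, 6), mar = rep(0.5, 4))
for (i in 1:36) {
  image(w[, , i], col = grey.colors(64), axes = FALSE)  # one filter per panel
}
par(op)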

Other data forms are not so easy. If you have masses of continuous independent variables, you can plot them against some smoother of the fitted values, or plot residuals against the predictor, etc – old skool but effective. Masses of categorical independent variables are not so easy (they never were), and if you want to feed in autocorrelated but non-visual data, like sound waves, you will have to take a lot on faith. It would be great to see more work on diagnostic visualisation in this field.
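By old skool I mean something like this base-R sketch, with made-up data: fit a deliberately too-simple model, then plot residuals against the predictor with a smoother running through them.

set.seed(1)
x <- runif(500, 0, 10)
y <- 2 + 0.7 * x + 0.05 * x^2 + rnorm(500)    # mild curvature
fit <- lm(y ~ x)                              # straight-line model misses it

plot(x, resid(fit), pch = 16, col = "grey50",
     xlab = "predictor", ylab = "residual")
lines(lowess(x, resid(fit)), col = "red", lwd = 2)  # smoother reveals the missed curvature
abline(h = 0, lty = 2)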

Another point to bear in mind is that it’s early days. As Aditya Singh wrote in that HBR article above, “If I analogize [sic] it to the personal computer, deep learning is in the green-and-black-DOS-screen stage of its evolution”, which is exactly correct. To run it, you type some stuff in a Jupyter notebook if you’re lucky, or otherwise in a terminal screen. We don’t yet have super-easy off-the-peg models in a gentle GUI, and they will matter not just for dabblers but for future master modellers learning the ropes – consider the case of WinBUGS and how it trained a generation of Bayesian statisticians.

You need cloud GPUs. I was intrigued by GPU computing and CUDA (NVIDIA’s language extending C++ to compile for their own GPU chips) a couple of years ago and bought some kit to play with at home. All that is obsolete now, and you would run your deep learning code in the cloud. One really nice thing about the course was that NVIDIA provided access to their slice of AWS servers and we could play around in that and get some experience of it. It doesn’t have to be expensive; you can bid for unused GPU time. And by the way, if you want to buy a bangin’ desktop computer, let me know. One careful owner.

You need to think about — and try — lots of optimisation algorithms and other tweaks. Don’t believe people who tell you it is more art than science; that’s BS, not DS. You could say the same thing about building multivariable regressions (and it would also be wrong). It’s the equivalent of doctors writing everything in Latin to keep the lucrative trade in-house. Never teach the Wu-Tang style!

It’s hard to teach yourself; I’ve found no single great tutorial or body of example code out there. Get on a course with some tuition, either face-to-face or blended.

Recurrent neural networks, which you can use for time series data, are really hard to get your head around. The various tricks they employ, called things like GRUs and LSTMs, may cause you to give up. But you must persist.

You need a lot of data for deep learning, and it has to be reliably labelled with the dependent variable(s), which is expensive and potentially very time-consuming. If you are fitting millions of weights (parameters), this should come as no surprise. Those convnet filters and their results above are trained on 1000 digits, so only 100 examples of each on average. When you pump it up to all 10,000, you get much clearer distinctions between the level-3 filters that respond to this zero and those that don’t.

The overlap between Bayes and neural networks is not clear (but see Neal & Zhang’s famous NIPS-winning model). On the other hand, there are some more theoretical aspects that make the CS guys sweat but that statisticians will find straightforward, like regularisation, dropout as bagging, convergence metrics, or likelihood as loss function.

Statisticians should get involved with this. You are right to be sceptical, but not to walk away from it. Here are some salient words from Diego Kuonen:
[Image: quote from Diego Kuonen]


Filed under computing, learning, machine learning

I’m writing a dataviz book

Today I am starting work on a major new project, writing a book on data visualisation for the CRC-ASA series on statistical reasoning in science and society. There are several excellent dataviz books out there but I’m excited to be adding something new. This will be a brief, affordable overview that does not assume any previous training in statistics, or design, or coding. A lot of techniques will get described, but rather than just a baffling gallery, I want to make this a tour that shows the reader how to think through the options critically and justify their choices.

[Photo: procrastinating by taking a selfie in my secret hideout]

The series should be a great collection for just this reason. More people than ever before have to work with data, and not all are experts or intend to be. I was inspired by the popularity of short, simple books on various business topics that you see in airport & railway station bookshops, and hope to provide something like that. I picture as my readers the manager in charge of risk analysis at a credit card company, or the one starting up a new modelling department in an insurance company, or the charity boss who wants to know what to ask for from the design team so their publications are more compelling (with apologies to any friends who see their own images there). You won’t see this in bookshops for a little while, but I’ll keep you posted on progress.


Filed under learning, noticeboard, Visualization

A bird’s eye view of statistics in two hours

Next week I am giving a two-hour talk and discussion for Kingston University researchers and doctoral students, with the aim of giving an update on statistics for those who are not active in the field. That’s an interesting and quite challenging mission, not least because it must fit into two hours, with the first hour being an overview for newcomers like PhD students from health and social care disciplines, and the second hour looking at big current topics. I thought I would cover these points in the second half:

  • crisis of replication: what does it mean for researchers, and how is “good practice” likely to change?
  • GAISE, curriculum reform & simulation in teaching
  • data visualization
  • big data
  • machine learning

The first half warrants a revised version of this handout, with the talk then structuring the ideas around three traditions of teaching and learning stats:

  • classical, mathematically grounded stats, exemplified by Snedecor, Fisher, Neyman & Pearson, and many textbooks with either a theoretical or applied focus. Likelihood, and/or adding a prior to get a posterior distribution, are the big concepts here.
  • cookbook, exemplified by many popular textbooks out there, especially if their titles make light of statistics as a ‘hard’ subject (you could count Fisher here as the first evangelical writer in 1925, though it is harsh to put him in the same camp as some of these flimsy contemporary textbooks)
  • reformist, exemplified by Tukey in the 70s but consolidated around George Cobb and Joan Garfield’s work for the American Statistical Association. The only books for this are “Statistics: Unlocking the Power of Data” by the Lock family and “Introduction to Statistical Investigations” by Tintle et al.

It’s worth remembering that there are other great thinkers who accept the role of computational thinking and yet insist that you can’t really do statistics without being skilled in mathematics; David Cox springs to mind.

[Image: Hiroshige’s Eagle over the 100,000 acre plain of statistics. Note the density plot of some big data in the background.]

The topics to interweave with those three traditions are models, sampling distribution versus data distribution, likelihood, significance testing as a historical aid to hand calculation, and Bayesian principles. I’ll put slides on my website when they’re ready.

While I’m on this subject, I’ll tell you about an afternoon meeting at the Royal Statistical Society on 13 October, which I have organised. The topic is making computational thinking part of learning statistics, and we have three great speakers: Helen Drury (Mathematics Mastery) representing the schools perspective, Kari Lock Morgan (Penn State University) representing the university perspective, and Jim Ridgway (University of Durham) considering what the profession should do about the changing face of teaching our subject.


Filed under learning

So you want to be a Data Science superstar

Big house? Five cars? There’s no one universal way to do it, but get a coffee and read on through this bumper post to find your own way with the advice of real experts.

Last summer, Mrs G and I were in that ridiculously long line for the cablecar in San Francisco, like predictable British tourists, and got talking to the guys next to us. One of them, Jason Jackson, was just about to start studies in business including a good dose of quantitative research and data analysis. So, we’ve stayed in touch on Twitter. Recently, he asked me what the single best resource is for getting started in data science, and I found this a surprisingly tough question.

‘Data science’ is a term widely used in business and more computing-oriented circles, while it is not always recognised in slow-moving academia, where ‘statistics’ still holds sway. They are not the same thing. DS is a mix of skills to manipulate, analyse and interpret data, drawn from statistics, computer science and machine learning. It’s hard to be world-class at all of those, but there are probably a few really irritating people like that out there. To be autonomous and not get ripped off as a freelancer or entrepreneur, you should also know how to construct and work with databases and websites, and be able to make some data visualisations. It is probably sensible to devote little, if any, energy to Big Data. I mean, just watch a few YouTube videos about Spark and you’ll be OK.

If you want to study statistics, the route to take and resources to use are well mapped out, but DS is not so clear. And remember that DS is only one step away from BS; there are plenty of websites promising a lot and providing little. Many of the ‘great resources’ you find online turn out to be vacuous efforts to separate you from your do$h, blatant self-promotion, or just badly-explained home-made videos. I thought it would be a nice opportunity to elicit some opinions from people I respect, even if we all end up disagreeing. So, I sent the following to anyone I could think of who would have an interesting view on this, including as far as possible people outside the classical statistics world:

Colleagues & friends,
I am writing a blog post and would love it if you would contribute just a few sentences of your views. I was asked recently what the best single resource is for teaching oneself data science (which I take to be a crossover between computer science / programming skills, classical statistics and machine learning). I am really not sure what the answer is, but I think it is a really important one and worth airing some different views. People trained initially in statistics, like me, are often negative about the concept of data science, but I think this is a mistake and we stand to up our game and learn a lot of cool tricks along the way.
It could be an online course, software to play around with, a book or anything else.
For my suggestion, I am going to lay claim to Hastie, Tibshirani & Friedman’s book “The Elements of Statistical Learning” [EoSL], combined with googling ” in r” and then playing around in R late into the night when you really have other things you should be getting on with.

Why specifically R? Because it has by far the biggest library of packages tackling everything from statistics to machine learning to interfacing with databases to text analysis to you name it. And it’s free.

Let’s start the replies with Bob Carpenter (Columbia), who was not a fan of ‘EoSL’:

I didn’t like Hastie et al.’s book, because I found it nearly impossible to understand from first principles. Now I find it trivially easy, of course, which is probably why they didn’t understand how hard it would be for beginners. More seriously, I would shy away from recommending a pure frequentist approach and recommend something more Bayesian.

On that Bayesian point, I have looked a bit at ‘Bayesian Reasoning and Machine Learning’ by David Barber and like the look of it. I haven’t read it thoroughly though, and I think it would make a better second or third textbook than a first. Bob continued:

For computer scientists getting into stats, I’d recommend Gelman and Hill’s book on multilevel regression. It’s too high level to teach you basic stats and probabilities, but it’s an awesome tutorial on modeling. I liked Bishop’s book [“Pattern Recognition and Machine Learning”] much better than EoSL — but then it’s more algorithm focused and gives a decent intro to probability theory. I’m a computer scientist. But it’s rather incoherent in covering so many different things that aren’t probabilistic (perceptrons, SVMs, etc.)

Well, as I see it, the mixture of probabilistic algorithms and heuristic non-probabilistic ones (particularly around unsupervised learning) is an interesting characteristic of data-science-as-useful-though-incoherent-mashup. And while we’re on the subject of tutorials in modeling, let’s not forget good old Cox & Snell, whose book is still unique and fresh in its over-the-statistician’s-shoulder view of real analysis in action, complexities, compromises and all. Mike Betancourt (Warwick), who, like Bob, came to statistics after training in another field, also came down in favour of Bishop:

Firstly I should note that I hate Elements of Statistical Learning. It’s a cookbook with lots of technical results that apply in unrealistic settings and little intuition that helps in practice. I much prefer Bishop who motivates each algorithm from a generative perspective and then ties that perspective into the examples.

Personally, I looked at both books when I wanted to learn about ML, and chose against Bishop, perhaps because unlike these two, my first degree was in math. Laurent Gatto (Cambridge) suggested some online learning:

I enjoyed the Statistical Learning Stanford Online course [1] and book [2] from the same authors you mentioned. Although I haven’t taken the course myself, I think the set of Data Science Coursera courses from Roger Peng et al. from Johns Hopkins [3] is probably quite good.
[1] http://online.stanford.edu/course/statistical-learning-winter-2014
[2] http://www-bcf.usc.edu/~gareth/ISL/
[3] https://www.coursera.org/specializations/jhu-data-science

You can always spot a true academic by the way they use proper referencing in emails. Or SMS, or Twitter…

The next theme that I got was in favour of getting your hands dirty with real data (which is the sort of thing I had in mind for tinkering late at night when you really should be doing something else). Here’s Laurent again:

I think the most crucial factor to teaching oneself data science (or programming) is a practical use case to guide the student. It’s so easy to get started with a nice resource or book and then get carried away by everyday business. I think a simple enough, yet non-trivial problem to tackle is really helpful to ground the study material in ones real-life applications.

I think they are absolutely right that just-in-time self-taught programming for a real task and a deadline is very fast and effective. The trick is then keeping up the practice afterwards and polishing the rough edges of programming. And programming in particular is an added layer of difficulty for the novice data scientist (unless you still believe you can get by pointing and clicking in various IBM products which we do not mention on this blog). As statistician Rebecca Killick (Lancaster) put it:

My research is more and more on the borderline between classical statistics and machine learning for which I need good programming skills. I wouldn’t call myself a data scientist but many of my more theoretical colleagues probably consider me to be one. I would contribute the following book: “Machine Learning: An Algorithmic Perspective” by Stephen Marsland, again with the relevant googling of how to do things practically in R (the book gives Python examples). I also learnt much of my Python knowledge from the Appendix (and googling).

Ah yes, Python. That is also very popular in data science circles, probably more among people approaching from a web/computer science angle than a stats angle, and I’ve not got enough brain space to absorb another language, but there’s no denying its popularity, flexibility and power. It’s doubtless faster than R in most settings (though perhaps not judicious use of Rcpp, the ‘seamless’ interface between high-level R and low-level C++, which is my power tool of choice). Here’s Bob Carpenter again:

If you want one recommendation from me for statisticians getting into software, it’s Hunt and Thomas’s book, The Pragmatic Programmer. It’s too high level to teach you to program, but it’s an awesome tutorial on being a solid developer and managing projects of all scales. For domain scientists getting into both, I’d recommend both this and Bishop.

And economist Nick Latimer (Sheffield):

I’ve never really been taught much statistics (aside from little bits on economics courses) or programming. Hence I am not very good at either. However, in teaching myself how to do these things the most useful thing for me was googling Stata error messages and messing around with datasets and code until I got it to do what I wanted it to do (much as you say you did with R). Seeing the code written by other people is also very useful, mainly to show you the many different ways (usually more efficient than mine) to do the same thing.

Likewise biochemist Jon Houseley (Cambridge):

My experience with R books has not been fruitful, and I am also of the Googling “how do I do XXXX in r” school. Most texts on R seem to require more statistical and/or programming knowledge than I possess. However, our bioinformatics unit runs a series of courses for biologists needing to perform basic data analysis in R – the course materials provide step-by-step guides for simple tasks and are freely available here: http://www.bioinformatics.babraham.ac.uk/training.html

Here’s Rasmus Bååth (Lund), a statistician whose hobbies include hard programming challenges like recreating Bayesian software inside a website (for fun):

For the programming part of data science it’s relatively straightforward, there are tons of great blogs (where R-bloggers is the main pusher, http://www.r-bloggers.com/) and great tutorials (if you are completely new to R http://tryr.codeschool.com/ is one of the best!). For the stats part I found it much more difficult to find good resources online, and you’ll easily find lots of conflicting advice (p-value based statistics vs. Bayes comes to mind…). For visualization Cleveland’s old book is a gold mine (http://amzn.com/0963488406 ), and the ggplot2 book (https://github.com/hadley/ggplot2-book) and cookbook (http://www.cookbook-r.com/Graphs/) shows you how to do it in practice. A great source for (practical) statistical theory is also Richard McElreath’s video lectures (https://youtu.be/WFv2vS8ESkk) and upcoming book (http://bit.ly/1NfLlsN).

Bob Carpenter pointed me to a blog post by Peter Norvig (head of research at Google): http://norvig.com/21-days.html. One quote from it that I’m going to throw into the mix here is about taking time and treating it as a serious life-changing challenge.

The key is deliberative practice: not just doing it again and again, but challenging yourself with a task that is just beyond your current ability, trying it, analyzing your performance while and after doing it, and correcting any mistakes. Then repeat. And repeat again. There appear to be no real shortcuts: even Mozart, who was a musical prodigy at age 4, took 13 more years before he began to produce world-class music.

I really recommend reading this post as it has a lot of wise advice in there, and even if you (dear reader) don’t believe me when I tell you unpalatable facts about learning, you might take it from Norvig!

And here’s Con Ariti (LSHTM, ex-CapitalOne):

I recommend the ‘little’ statistical learning book with the R labs. There is also a good example book by O’Reilly publishing I think is called “Doing data science” that has some examples and is based on a course at NYU. It is good for showing how DS is done in the real world and how much could be learnt from statistics!

Con’s ‘little’ book was Laurent Gatto’s reference 2. This is Hastie & friends’ shorter and less theoretical book ‘Introduction to Statistical Learning’ – I like them both, but don’t imagine that by reading the little one you’ll escape the algebra.

Now for a word from medical statistician Charles Opondo (Oxford):

“Best single resource” – the internet! I think the best way is to start with a personal/work/task related problem that one understands well, and by understanding the complexities and limitations of available tools and solutions then one can begin to understand the subject. I think the internet as a whole is the ‘best single source’ because good books, courses and online resources are always replaced with the next best thing, and there’s always bound to be that single source that does one, just one thing, exceptionally better than any book or course ever would.

to which I replied:

Would you advise a beginner to play rather than agonise over the theoretical foundations then?

and he said:

Absolutely – one sometimes finds, upon deeper exploration, that there is no consensus or clarity on some aspects of foundation, and that it is enough to work with methods and approaches as currently understood (talking for myself and my recent exploration of causal inference).

and I couldn’t resist:

Hmmm yes! Especially frustrating for the novice because the writings of the professors give every impression of being unquestionably the final word on the subject.

Finally, there was something of a defence of statistics. Now, I don’t imagine DS is the new stats, or that stats has had its day, but Royal Statistical Society president Peter Diggle (Lancaster, ex-CSIRO) wrote on “Statistics: a data science for the 21st century” as his presidential address, and noted that stats has some crucially important stuff to offer DS:

we can assert that uncertainty is ubiquitous and that probability is the correct way to deal with uncertainty. We understand the uncertainty in our data by building stochastic models, and in our conclusions by probabilistic inference. And on the principle that prevention is better than cure we also minimize uncertainty by the application of the design principles that Fisher laid down 80 years ago, and by using efficient methods of estimation.

So, in conclusion, it seems there is no silver bullet but rather a selection of different approaches when people offer up materials for learning these skills. Regular readers will know I’m a fan of the American Statistical Association’s GAISE guidelines for teaching stats in a modern, evidence-based way. But even that did not foresee the approach of DS. Basically, if t-tests get mentioned in the first third of any course, video, book or website, then you are looking at a reheated statistics course. The old Snedecor 1930s syllabus just doesn’t work, because so many of the ideas it leaves you with are not going to be priorities in a DS application. How do we tackle that, then, to teach statistics rigorously but leave graduates able to flex across into machine learning and programming? Here’s Peter Diggle:

Given a solid mathematical foundation, my suggested list of topics for a Master of Science degree in statistics is
(a) design,
(b) probability and stochastic processes,
(c) likelihood-based inference,
(d) computation, including numerical methods and programming,
(e) communication, including scientific writing for both technical and lay audiences, and
(f) scientific method, and the foundations of at least one substantive area of application.

I love point (e) but I am going to be more radical than Peter. I think Bayes has to be there from the start, along with getting everyone to practise proposing and justifying data-generating processes all the time, not just shoe-horned into (d). Then the exploratory stats and graphics, and any model, work to test and refine that a priori process (and not in a p<0.05 mechanistic way). Here I suggest the interested teacher takes a look at Jim Ridgway’s paper ‘Implications of the Data Revolution for Statistics Education’. Although I think he over-emphasises massive data sets, I like the principles. Others will doubtless disagree and want to teach following a classical model, then expand, but I’m concerned not only with long, luxurious university courses, but also people teaching themselves and needing to start producing results this week, not in three years’ time. Needless to say, there is no ideal material for this, but you may note that I own a web domain called bayescamp.com – now what should I do with that, do you suppose?

A final point from me is about mathematics. Nobody raised this (except Peter Diggle in the context of a university degree course), but if you don't have confident reading and writing skills in the highly condensed and abstracted language we call algebra (including matrix algebra), you will find it hard to absorb some ideas. Pseudocode can get you some of the way, but it is probably worth setting aside a few weeks to brush up your math. Gentle's book 'Matrix Algebra' is ideal for this, I think. You will need to carve out and defend serious chunks of uninterrupted, undistracted time. The process will hurt, let's not kid ourselves, but it will pay you back for the rest of your life.


Filed under learning, R

Everything you need to make R Commander locally (packages, dependencies, zip files)

I’ve been installing R Commander on laptops for our students to use in tutorials. It’s tedious to put each one online with my login, download it all, then disable the internet (so they don’t send lewd e-mails to the vice-chancellor from my account, although I could always plead that I had misunderstood the meaning of his job title). I eventually got every package it needed downloaded and I’ve done it all off a USB stick. But I didn’t find a single list of all the Rcmdr dependencies, recursively. Maybe it’s out there but I didn’t find it. So, here it is. You might find it useful.

tcltk2_1.2-11.zip
Rcmdr_2.2-1.zip
RcmdrMisc_1.0-3.zip
readxl_0.1.0.zip
relimp_1.0-4.zip
rgl_0.95.1367.zip
rmarkdown_0.8.1.zip
sem_3.1-6.zip
abind_1.4-3.zip
aplpack_1.3.0.zip
car_2.1-0.zip
colorspace_1.2-6.zip
e1071_1.6-7.zip
effects_3.0-4.zip
foreign_0.8-66.zip
Hmisc_3.17-0.zip
knitr_1.11.zip
lattice_0.20-33.zip
leaps_2.9.zip
lmtest_0.9-34.zip
markdown_0.7.7.zip
MASS_7.3-44.zip
mgcv_1.8-7.zip
multcomp_1.4-1.zip
nlme_3.1-122.zip
quantreg_5.19.zip
pbkrtest_0.4-2.zip
lme4_1.1-10.zip
minqa_1.2.4.zip
nnet_7.3-11.zip
Rcpp_0.12.1.zip
nloptr_1.0.4.zip
SparseM_1.7.zip
MatrixModels_0.4-1.zip
sandwich_2.3-4.zip
Formula_1.2-1.zip
ggplot2_1.0.1.zip
digest_0.6.8.zip
gtable_0.1.2.zip
plyr_1.8.3.zip
proto_0.3-10.zip
reshape2_1.4.1.zip
stringr_1.0.0.zip
stringi_0.5-5.zip
magrittr_1.5.zip
scales_0.3.0.zip
munsell_0.4.2.zip
xtable_1.7-4.zip
rpart_4.1-10.zip
randomForest_4.6-12.zip
mlbench_2.1-1.zip
cluster_2.0.3.zip
survival_2.38-3.zip
praise_1.0.0.zip
crayon_1.3.1.zip
testthat_0.11.0.zip
gridExtra_2.0.0.zip
acepack_1.3-3.3.zip
latticeExtra_0.6-26.zip
RColorBrewer_1.1-2.zip
RODBC_1.3-12.zip
XLConnect_0.2-11.zip
zoo_1.7-12.zip
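
If you would rather generate such a list yourself (for a newer version of Rcmdr, say), here is a sketch using base R and the tools package, run on a machine that is online; the destination folder name is just an example, and packages that only come in via Suggests may still need adding by hand.

# Recursive hard dependencies of Rcmdr, then download Windows binaries for the USB stick
deps <- tools::package_dependencies("Rcmdr", db = available.packages(),
                                    which = c("Depends", "Imports", "LinkingTo"),
                                    recursive = TRUE)[["Rcmdr"]]
pkgs <- c("Rcmdr", deps)

dir.create("rcmdr_offline", showWarnings = FALSE)
download.packages(pkgs, destdir = "rcmdr_offline", type = "win.binary")

# Later, on the offline laptop:
# install.packages(list.files("rcmdr_offline", full.names = TRUE),
#                  repos = NULL, type = "win.binary")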

I suppose this is one of my less engaging posts…


Filed under learning, R

Responses to the BASP psychologists’ p-value ban: the one that got away

When StatsLife collected responses to the BASP p-value ban (see blogs hither and yon), I suggested they contact Ian Hunt, a wise and philosophically minded critical voice in the wilderness of cookbook analysts. I also know that he and I take rather divergent views on deduction, induction and such, but I hold his arguments in the highest respect because they are carefully constructed. Alas, he couldn’t send in a response in time, but here it is, reproduced with his kind permission:

How good are the reasons given by the editors of Basic and Applied Social Psychology (BASP) to ban hypothesis tests and p-values?

I argue that BASP’s reasoning for banning “statistical inference” is weak.

First, they (the editors) offer a non sequitur: ban p-values because “the state of the art remains uncertain” (2015 editorial) and “there exists no inferential statistical procedure that has elicited widespread agreement” (2014 editorial).  I argue inter-subjective (dis)agreement is not decisive.

Secondly, they imply that good inductive inferences require posterior probabilities. This is contentious, especially since both posteriors and p-values are just deductions.

Thirdly, they plead for larger sample sizes “because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem.” This is contradictory: the evidence for this claim is best shown by the statistical inferences being banned.

Fourthly, they correctly assume that with a large enough sample size many “significant effects” (or “null rejections” or low p-values or interesting things or whatever) can be identified by looking at canny descriptive statistics and adroitly drawn charts. But I believe p-values ARE descriptive statistics – with which both frequentists and Bayesians can work.

Finally, BASP “welcomes the submission of null effects”.  But without tests and concomitant power profiles the evidential value of a “null effect” is unclear.

BASP’s editors appear to conclude that modern statistics is inductive and akin to “the art of discovery” (as David Hand puts it). Fair enough. But I conclude that careful deductive inferences, in the form of hypothesis tests with clear premisses and verifiable mathematics, still have a role in discovering interesting things.

Now, it would be unfair of me to say anything more on this here but I believe you can hear Ian talking on this very subject at this year’s RSS conference, which is in Exeter. Personally, I’ve never been to Exeter, and I don’t think this is going to be the year for it either, but as Southern towns go, I suspect it’s neither as depressing as Weymouth nor as humourlessly overrated as Salisbury. (That counts as enthusiasm round here.) I recommend the conference to you. It’s just about optimal size and always interestingly diverse.


Filed under Bayesian, learning

Roman dataviz and inference in complex systems

I’m in Rome at the International Workshop on Computational Economics and Econometrics. I gave a seminar on Monday on the ever-popular subject of data visualization. Slides are here. In a few minutes, I’ll be speaking on Inference in Complex Systems, a topic of interest arising from practical research experience that my colleague Rick Hood and I have had in health and social care research.

Here’s a link to my handout for that: iwcee-handout

In essence, we draw on realist evaluation and mixed-methods research to emphasise understanding the complex system and how the intervention works inside it. Unsurprisingly for regular readers, I try to promote transparency around subjectivities, awareness of philosophy of science, and Bayesian methods.


Filed under Bayesian, healthcare, learning, R, research, Stata, Visualization