Category Archives: R

Dataviz of the week: 12/4/17

This week, a chart with some Bayesian polemic behind it. Alexander Etz put this on Twitter:

[Image: Alexander Etz’s chart, with the bias-adjusted posterior in orange]

He is working on an R package to provide easy Bayesian adjustments for reporting bias, using a method by Guan & Vandekerckhove. Imagine a study reporting three p-values, all just under the threshold of significance, and with small-ish sample sizes. Sound suspicious?

[Image: Jim Taggart: “There’s been a murder”]

Sounds like someone’s been sniffing around after any pattern they could find. Trouble is, if they don’t tell you about the other results they threw away (reporting bias), you don’t know whether to believe them. Or perhaps there are a thousand similar studies, this is the (un)lucky one, and the author did nothing wrong in their own study (publication bias).

Well, you have to make some assumptions to do the adjustment, but at least, being Bayesian, you don’t have to assume one number for the bias: you can have a distribution. Here, the orange distribution is the posterior for the true effect once the bias has been added (in this case, p>0.05 has a 0% chance of getting published, which is not unrealistic in some circles!). This is standard probabilistic stuff, but it doesn’t get done because the programming seems so daunting to a lot of people. The more easy tools – with nice helpful visualisations – the better.
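
To make the idea concrete, here is a minimal toy sketch of the same principle – my own illustration, not Etz’s package or the Guan & Vandekerckhove method itself – with a made-up sample size n, a just-significant observed z-statistic, and a unit-normal prior, on a grid:

# Toy sketch: posterior for a true effect when we assume that only
# significant results (z > 1.96) ever get published. All numbers made up.
delta <- seq(-1, 2, by = 0.01)   # grid of candidate true (standardised) effects
dx <- 0.01
n <- 20                          # hypothetical sample size
z_obs <- 2.1                     # hypothetical observed z, just significant
prior <- dnorm(delta, 0, 1)      # unit-normal prior

# naive likelihood, taking the reported result at face value
lik_naive <- dnorm(z_obs, mean = delta * sqrt(n), sd = 1)

# bias-adjusted likelihood: condition on the result being publishable,
# i.e. divide by P(z > 1.96 | delta)
p_pub <- pnorm(1.96, mean = delta * sqrt(n), sd = 1, lower.tail = FALSE)
lik_adj <- lik_naive / p_pub

post_naive <- prior * lik_naive / sum(prior * lik_naive * dx)
post_adj <- prior * lik_adj / sum(prior * lik_adj * dx)

plot(delta, post_naive, type = "l", xlab = "True effect", ylab = "Density")
lines(delta, post_adj, col = "orange")

The adjusted posterior comes out flatter and pulled back towards the prior: once you accept that only significant results surface, a just-significant one tells you much less than it appears to.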


Filed under Bayesian, R, Visualization

Dataviz of the week: 8/3/17


Corinne Riddell posted this on Twitter. It’s one version of multiple time series that she tried out, one for each US state. It’s not the finished article, but it’s really nice for its combination of that recognisable shape (I suppose if your country has a dull shape like Portugal – no offence – then readers won’t immediately recognise the meaning of the arrangement) and the clean, simple small multiples. Admittedly, the time series have enough signal-to-noise to make this possible, and only a few unusual states; without that it might start to turn to spaghetti. But it’s always worth sketching out options like this to see how they work.

[Image: Corinne Riddell’s small multiples of state time series, arranged in a grid shaped like the USA]

Could the state names get dropped? Probably not, but they could turn into two-letter abbreviations. The whole idea of the small multiple is that it’s not a precise x-y mapping but a general impression, so the y-axis labels could go (just as there are no x-axis labels).

Overall, I really like it, and I’d like to write a UK version of this function to add to my dataviz toolbox.
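
As a taster, here’s a minimal sketch of how such a function could start in base R – the grid positions are made up for illustration and random walks stand in for the real series:

# Toy sketch: small multiples laid out in a rough geographic grid.
# Panel positions are invented; 0 cells in the layout stay empty.
m <- matrix(0, nrow = 3, ncol = 2)
m[1, 2] <- 1  # Scotland
m[2, 1] <- 2  # N. Ireland
m[3, 1] <- 3  # Wales
m[3, 2] <- 4  # England
layout(m)
par(mar = c(1, 1, 2, 1))
set.seed(1)
for (nm in c("Scotland", "N. Ireland", "Wales", "England")) {
  plot(cumsum(rnorm(30)), type = "l", axes = FALSE, ann = FALSE)
  title(main = nm, cex.main = 0.9)
}
layout(1)  # reset the device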


Filed under R, Visualization

So you want to be a Data Science superstar

Big house? Five cars? There’s no one universal way to do it, but get a coffee and read on through this bumper post to find your own way with the advice of real experts.

Last summer, Mrs G and I were in that ridiculously long line for the cable car in San Francisco, like predictable British tourists, and got talking to the guys next to us. One of them, Jason Jackson, was just about to start studies in business, including a good dose of quantitative research and data analysis. So, we’ve stayed in touch on Twitter. Recently, he asked me what the single best resource is for getting started in data science, and I found this a surprisingly tough question.

‘Data science’ is a term widely used in business and more computing-oriented circles, while it is not always recognised in slow-moving academia, where ‘statistics’ still holds sway. They are not the same thing. DS is a mix of skills to manipulate, analyse and interpret data, drawn from statistics, computer science and machine learning. It’s hard to be world-class at all of those, but there are probably a few really irritating people like that out there. To be autonomous and not get ripped off as a freelancer or entrepreneur, you should also know how to construct and work with databases and websites, and be able to make some data visualisations. It is probably sensible to devote little, if any, energy to Big Data. I mean, just watch a few YouTube videos about Spark and you’ll be OK.

If you want to study statistics, the route to take and the resources to use are well mapped out, but for DS it is not so clear. And remember that DS is only one step away from BS; there are plenty of websites promising a lot and providing little. Many of the ‘great resources’ you find online turn out to be vacuous efforts to separate you from your do$h, blatant self-promotion, or just badly explained home-made videos. I thought it would be a nice opportunity to elicit some opinions from people I respect, even if we all end up disagreeing. So, I sent the following to everyone I could think of who might have an interesting view on this, including, as far as possible, people outside the classical statistics world:

Colleagues & friends,
I am writing a blog post and would love it if you would contribute just a few sentences of your views. I was asked recently what the best single resource is for teaching oneself data science (which I take to be a crossover between computer science / programming skills, classical statistics and machine learning). I am really not sure what the answer is, but I think it is a really important one and worth airing some different views. People trained initially in statistics, like me, are often negative about the concept of data science, but I think this is a mistake and we stand to up our game and learn a lot of cool tricks along the way.
It could be an online course, software to play around with, a book or anything else.
For my suggestion, I am going to lay claim to Hastie, Tibshirani & Friedman’s book “The Elements of Statistical Learning” [EoSL], combined with googling “[whatever you want to do] in r” and then playing around in R late into the night when you really have other things you should be getting on with.

Why specifically R? Because it has by far the biggest library of packages tackling everything from statistics to machine learning to interfacing with databases to text analysis to you name it. And it’s free.

Let’s start the replies with Bob Carpenter (Columbia), who was not a fan of ‘EoSL’:

I didn’t like Hastie et al.’s book, because I found it nearly impossible to understand from first principles. Now I find it trivially easy, of course, which is probably why they didn’t understand how hard it would be for beginners. More seriously, I would shy away from recommending a pure frequentist approach and recommend something more Bayesian.

On that Bayesian point, I have looked a bit at ‘Bayesian Reasoning and Machine Learning’ by David Barber and like the look of it. I haven’t read it thoroughly though, and I think it would make a better second or third textbook than a first. Bob continued:

For computer scientists getting into stats, I’d recommend Gelman and Hill’s book on multilevel regression. It’s too high level to teach you basic stats and probabilities, but it’s an awesome tutorial on modeling. I liked Bishop’s book [“Pattern Recognition and Machine Learning”] much better than EoSL — but then it’s more algorithm focused and gives a decent intro to probability theory. I’m a computer scientist. But it’s rather incoherent in covering so many different things that aren’t probabilistic (perceptrons, SVMs, etc.)

Well, as I see it, the mixture of probabilistic algorithms and heuristic non-probabilistic ones (particularly around unsupervised learning) is an interesting characteristic of data-science-as-useful-though-incoherent-mashup. And while we’re on the subject of tutorials in modeling, let’s not forget good old Cox & Snell, whose book is still unique and fresh in its over-the-statistician’s-shoulder view of real analysis in action, complexities, compromises and all. Mike Betancourt (Warwick), who, like Bob, came to statistics after training in another field, also came down in favour of Bishop:

Firstly I should note that I hate Elements of Statistical Learning. It’s a cookbook with lots of technical results that apply in unrealistic settings and little intuition that helps in practice. I much prefer Bishop who motivates each algorithm from a generative perspective and then ties that perspective into the examples.

Personally, I looked at both books when I wanted to learn about ML, and chose against Bishop, perhaps because unlike these two, my first degree was in math. Laurent Gatto (Cambridge) suggested some online learning:

I enjoyed the Statistical Learning Stanford Online course [1] and book [2] from the same authors you mentioned. Although I haven’t taken the course myself, I think the set of Data Science Coursera courses from Roger Peng et al. from Johns Hopkins [3] is probably quite good.
[1] http://online.stanford.edu/course/statistical-learning-winter-2014
[2] http://www-bcf.usc.edu/~gareth/ISL/
[3] https://www.coursera.org/specializations/jhu-data-science

You can always spot a true academic by the way they use proper referencing in emails. Or SMS, or Twitter…

The next theme that I got was in favour of getting your hands dirty with real data (which is the sort of thing I had in mind for tinkering late at night when you really should be doing something else). Here’s Laurent again:

I think the most crucial factor in teaching oneself data science (or programming) is a practical use case to guide the student. It’s so easy to get started with a nice resource or book and then get carried away by everyday business. I think a simple enough, yet non-trivial, problem to tackle is really helpful to ground the study material in one’s real-life applications.

I think they are absolutely right that just-in-time, self-taught programming for a real task and a deadline is very fast and effective. The trick is then keeping up the practice afterwards and polishing off the rough edges of your programming. And programming in particular is an added layer of difficulty for the novice data scientist (unless you still believe you can get by pointing and clicking in various IBM products, which we do not mention on this blog). As statistician Rebecca Killick (Lancaster) put it:

My research is more and more on the borderline between classical statistics and machine learning for which I need good programming skills. I wouldn’t call myself a data scientist but many of my more theoretical colleagues probably consider me to be one. I would contribute the following book: “Machine Learning: An Algorithmic Perspective” by Stephen Marsland, again with the relevant googling of how to do things practically in R (the book gives Python examples). I also learnt much of my Python knowledge from the Appendix (and googling).

Ah yes, Python. That is also very popular in data science circles, probably more among people approaching from a web/computer science angle than a stats angle. I’ve not got enough brain space to absorb another language, but there’s no denying its popularity, flexibility and power. It’s doubtless faster than R in most settings (though perhaps not faster than judicious use of Rcpp, the ‘seamless’ interface between high-level R and low-level C++, which is my power tool of choice).
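
For a flavour of that, here’s a minimal sketch of the Rcpp workflow – a toy sum function of my own, purely to show the mechanics:

library(Rcpp)

# compile a small C++ function and make it callable from R
cppFunction('
double sumC(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i) {
    total += x[i];
  }
  return total;
}')

x <- runif(1e6)
sumC(x)  # essentially matches sum(x), but the loop runs in C++

Here’s Bob Carpenter again: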

If you want one recommendation from me for statisticians getting into software, it’s Hunt and Thomas’s book, The Pragmatic Programmer. It’s too high level to teach you to program, but it’s an awesome tutorial on being a solid developer and managing projects of all scales. For domain scientists getting into both, I’d recommend both this and Bishop.

And economist Nick Latimer (Sheffield):

I’ve never really been taught much statistics (aside from little bits on economics courses) or programming. Hence I am not very good at either. However, in teaching myself how to do these things the most useful thing for me was googling Stata error messages and messing around with datasets and code until I got it to do what I wanted it to do (much as you say you did with R). Seeing the code written by other people is also very useful, mainly to show you the many different ways (usually more efficient than mine) to do the same thing.

Likewise biochemist Jon Houseley (Cambridge):

My experience with R books has not been fruitful, and I am also of the Googling “how do I do XXXX in r” school. Most texts on R seem to require more statistical and/or programming knowledge than I possess. However, our bioinformatics unit runs a series of courses for biologists needing to perform basic data analysis in R – the course materials provide step-by-step guides for simple tasks and are freely available here: http://www.bioinformatics.babraham.ac.uk/training.html

Here’s Rasmus Bååth (Lund), a statistician whose hobbies include hard programming challenges like recreating Bayesian software inside a website (for fun):

For the programming part of data science it’s relatively straightforward, there are tons of great blogs (where R-bloggers is the main pusher, http://www.r-bloggers.com/) and great tutorials (if you are completely new to R http://tryr.codeschool.com/ is one of the best!). For the stats part I found it much more difficult to find good resources online, and you’ll easily find lots of conflicting advice (p-value based statistics vs. Bayes comes to mind…). For visualization Cleveland’s old book is a gold mine (http://amzn.com/0963488406 ), and the ggplot2 book (https://github.com/hadley/ggplot2-book) and cookbook (http://www.cookbook-r.com/Graphs/) shows you how to do it in practice. A great source for (practical) statistical theory is also Richard McElreath’s video lectures (https://youtu.be/WFv2vS8ESkk) and upcoming book (http://bit.ly/1NfLlsN).

Bob Carpenter pointed me to a blog post by Peter Norvig (head of research at Google): http://norvig.com/21-days.html One quote from that I’m going to throw into the mix here is about taking time and treating it as a serious life-changing challenge.

The key is deliberative practice: not just doing it again and again, but challenging yourself with a task that is just beyond your current ability, trying it, analyzing your performance while and after doing it, and correcting any mistakes. Then repeat. And repeat again. There appear to be no real shortcuts: even Mozart, who was a musical prodigy at age 4, took 13 more years before he began to produce world-class music.

I really recommend reading this post as it has a lot of wise advice in there, and even if you (dear reader) don’t believe me when I tell you unpalatable facts about learning, you might take it from Norvig!

And here’s Con Ariti (LSHTM, ex-CapitalOne):

I recommend the ‘little’ statistical learning book with the R labs. There is also a good example book from O’Reilly publishing, I think called “Doing data science”, which has some examples and is based on a course at NYU. It is good for showing how DS is done in the real world and how much could be learnt from statistics!

Con’s ‘little’ book was Laurent Gatto’s reference 2. This is Hastie & friends’ shorter and less theoretical book ‘Introduction to Statistical Learning’ – I like them both, but don’t imagine that by reading the little one you’ll escape the algebra.

Now for a word from medical statistician Charles Opondo (Oxford):

“Best single resource” – the internet! I think the best way is to start with a personal/work/task related problem that one understands well, and by understanding the complexities and limitations of available tools and solutions then one can begin to understand the subject. I think the internet as a whole is the ‘best single source’ because good books, courses and online resources are always replaced with the next best thing, and there’s always bound to be that single source that does one, just one thing, exceptionally better than any book or course ever would.

to which I replied:

Would you advise a beginner to play rather than agonise over the theoretical foundations then?

and he said:

Absolutely – one sometimes finds, upon deeper exploration, that there is no consensus or clarity on some aspects of foundation, and that it is enough to work with methods and approaches as currently understood (talking for myself and my recent exploration of causal inference).

and I couldn’t resist:

Hmmm yes! Especially frustrating for the novice because the writings of the professors give every impression of being unquestionably the final word on the subject.

Finally, there was something of a defence of statistics. Now, I don’t imagine DS is the new stats, or that stats has had its day, but Royal Statistical Society president Peter Diggle (Lancaster, ex-CSIRO) wrote on “Statistics: a data science for the 21st century” as his presidential address, and noted that stats has some crucially important stuff to offer DS:

we can assert that uncertainty is ubiquitous and that probability is the correct way to deal with uncertainty. We understand the uncertainty in our data by building stochastic models, and in our conclusions by probabilistic inference. And on the principle that prevention is better than cure we also minimize uncertainty by the application of the design principles that Fisher laid down 80 years ago, and by using efficient methods of estimation.

So, in conclusion, it seems there is no silver bullet but rather a selection of different approaches when people offer up materials for learning these skills. Regular readers will know I’m a fan of the American Statistical Association’s GAISE guidelines for teaching stats in a modern, evidence-based way. But even that did not foresee the approach of DS. Basically, if t-tests get mentioned in the first third of any course, video, book or website, then you are looking at a reheated statistics course. The old Snedecor 1930s syllabus just doesn’t work, because so many of the ideas it leaves you with are not going to be priorities in a DS application. How do we tackle that, then, to teach statistics rigorously but leave graduates able to flex across into machine learning and programming? Here’s Peter Diggle:

Given a solid mathematical foundation, my suggested list of topics for a Master of Science degree in statistics is
(a) design,
(b) probability and stochastic processes,
(c) likelihood-based inference,
(d) computation, including numerical methods and programming,
(e) communication, including scientific writing for both technical and lay audiences, and
(f) scientific method, and the foundations of at least one substantive area of application.

I love point (e) but I am going to be more radical than Peter. I think Bayes has to be there from the start, along with getting everyone to practise proposing and justifying data-generating processes all the time, not just shoe-horned into (d). Then the exploratory stats and graphics, and any model, work to test and refine that a priori process (and not in a p<0.05 mechanistic way). Here I suggest the interested teacher takes a look at Jim Ridgway’s paper ‘Implications of the Data Revolution for Statistics Education’. Although I think he over-emphasises massive data sets, I like the principles. Others will doubtless disagree and want to teach following a classical model, then expand, but I’m concerned not only with long, luxurious university courses, but also with people teaching themselves and needing to start producing results this week, not in three years’ time. Needless to say, there is no ideal material for this, but you may note that I own a web domain called bayescamp.com – now what should I do with that, do you suppose?

A final point from me is about mathematics. Nobody raised this (except Peter Diggle in the context of a university degree course), but if you don't have confident reading and writing skills in the highly condensed and abstracted language we call algebra (including matrix algebra), you will find it hard to absorb some ideas. Pseudocode can get you some of the way, but it is probably worth setting aside a few weeks to brush up your math. Gentle's book 'Matrix Algebra' is ideal for this, I think. You will need to carve out and defend serious chunks of uninterrupted, undistracted time. The process will hurt, let's not kid ourselves, but it will pay you back for the rest of your life.


Filed under learning, R

Everything you need to make R Commander locally (packages, dependencies, zip files)

I’ve been installing R Commander on laptops for our students to use in tutorials. It’s tedious to put each one online with my login, download it all, then disable the internet (so they don’t send lewd e-mails to the vice-chancellor from my account, although I could always plead that I had misunderstood the meaning of his job title). I eventually downloaded every package it needed and installed it all off a USB stick. But I couldn’t find a single list of all the Rcmdr dependencies, recursively; maybe it’s out there, but I didn’t find it. So, here it is. You might find it useful. (If you want to generate a list like this yourself, there’s a sketch after it.)

tcltk2_1.2-11.zip
Rcmdr_2.2-1.zip
RcmdrMisc_1.0-3.zip
readxl_0.1.0.zip
relimp_1.0-4.zip
rgl_0.95.1367.zip
rmarkdown_0.8.1.zip
sem_3.1-6.zip
abind_1.4-3.zip
aplpack_1.3.0.zip
car_2.1-0.zip
colorspace_1.2-6.zip
e1071_1.6-7.zip
effects_3.0-4.zip
foreign_0.8-66.zip
Hmisc_3.17-0.zip
knitr_1.11.zip
lattice_0.20-33.zip
leaps_2.9.zip
lmtest_0.9-34.zip
markdown_0.7.7.zip
MASS_7.3-44.zip
mgcv_1.8-7.zip
multcomp_1.4-1.zip
nlme_3.1-122.zip
quantreg_5.19.zip
pbkrtest_0.4-2.zip
lme4_1.1-10.zip
minqa_1.2.4.zip
nnet_7.3-11.zip
Rcpp_0.12.1.zip
nloptr_1.0.4.zip
SparseM_1.7.zip
MatrixModels_0.4-1.zip
sandwich_2.3-4.zip
Formula_1.2-1.zip
ggplot2_1.0.1.zip
digest_0.6.8.zip
gtable_0.1.2.zip
plyr_1.8.3.zip
proto_0.3-10.zip
reshape2_1.4.1.zip
stringr_1.0.0.zip
stringi_0.5-5.zip
magrittr_1.5.zip
scales_0.3.0.zip
munsell_0.4.2.zip
xtable_1.7-4.zip
rpart_4.1-10.zip
randomForest_4.6-12.zip
mlbench_2.1-1.zip
cluster_2.0.3.zip
survival_2.38-3.zip
praise_1.0.0.zip
crayon_1.3.1.zip
testthat_0.11.0.zip
gridExtra_2.0.0.zip
acepack_1.3-3.3.zip
latticeExtra_0.6-26.zip
RColorBrewer_1.1-2.zip
RODBC_1.3-12.zip
XLConnect_0.2-11.zip
zoo_1.7-12.zip
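
For what it’s worth, here’s a sketch of how you could generate such a list yourself – assuming you want Windows binaries from CRAN; adjust type for other platforms:

# Sketch: list Rcmdr's dependencies recursively, then download the lot
# for offline installation (Windows binaries assumed)
db <- available.packages()
deps <- tools::package_dependencies("Rcmdr", db = db, recursive = TRUE,
                                    which = c("Depends", "Imports", "LinkingTo"))[["Rcmdr"]]
deps <- setdiff(deps, rownames(installed.packages(priority = "base")))  # drop base packages
dir.create("rcmdr_offline")
download.packages(c("Rcmdr", deps), destdir = "rcmdr_offline",
                  type = "win.binary")
# then, on the offline machine:
# install.packages(list.files("rcmdr_offline", full.names = TRUE),
#                  repos = NULL, type = "win.binary")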

I suppose this is one of my less engaging posts…


Filed under learning, R

Showing a distribution over time: how many summary stats?

I saw this nice graph today on Twitter, by Thomas Forth:

But the more I looked at it, the more I felt it was hard to understand the changes over time across the income distribution from the Gini coefficient and the median alone. People started asking online for other percentiles, so I smoothed each of them from the source data and plotted them side by side:

[Image: uk_income.png – percentiles of UK income over time, coloured by governing party]

Now, this has the advantage of showing exactly where in society the growth or contraction is, but it loses the engaging element of the wandering nation across economic space (cf Booze Space; where do we end up? washed up on the banks of the Walbrook?), which should not be sneezed at. Something engaging matters in dataviz.

Code (as you know, I’m a nuts ‘n’ bolts guy, so don’t go recommending ggplot2 to me):


library(foreign)
library(splines)

bluecol <- "#467db4"
redcol <- "#b44f46"

uk <- read.csv("uk_income.csv")[1:53, 1:22]
uk$Year <- as.numeric(substr(uk$Year, 1, 4))

# smooth each series over time
sm <- apply(uk, 2, function(z) smooth.spline(x = uk$Year, y = z)$y)

png("uk_income.png")
par(yaxs = "i")
# set up the plotting region, then draw one line per percentile (columns 4:22),
# broken into segments coloured by the party in government
plot(uk$Year, sm[, 4], type = "n",
     ylim = c(min(sm[, 4:22] - 1), max(sm[, 4:22] + 60)),
     xlim = c(1960, 2015),
     main = "Percentiles of UK income over time",
     sub = "(Colour indicates governing political party)",
     ylab = "2013 GBP",
     xlab = "Year")
for (i in 4:22) {
  lines(uk$Year[1:3], sm[1:3, i], col = bluecol)     # Macmillan, Douglas-Home
  lines(uk$Year[4:10], sm[4:10, i], col = redcol)    # Wilson I
  lines(uk$Year[11:14], sm[11:14, i], col = bluecol) # Heath
  lines(uk$Year[15:19], sm[15:19, i], col = redcol)  # Wilson II, Callaghan
  lines(uk$Year[20:37], sm[20:37, i], col = bluecol) # Thatcher, Major
  lines(uk$Year[38:50], sm[38:50, i], col = redcol)  # Blair, Brown
  lines(uk$Year[51:53], sm[51:53, i], col = bluecol) # Cam'ron
}
dev.off()

(uk_income.csv is just the trimmed-down source data spreadsheet)


Filed under R, Visualization

St Swithun’s Day simulator

I got a bit bored (sorry Mike), and wrote this. It didn’t take long (I tell you that not so much to cover my backside as to celebrate the majesty of R). First, I estimated the probabilities of a day being rainy if the previous day was dry, and of it being rainy if the previous day was rainy. I couldn’t be bothered with any thinking, so I used ABC (Approximate Bayesian Computation), which is basically an incredibly simple and intuitive way of finding parameters that match data. I started with 16 rain days in July and 15 in August, in my home town of Winchester (which is also Swithun’s hood) from here, and that led to probabilities that I plugged into a simulator (it’s only AR(1), but I’m no meteorologist). That ran 10,000 years of 40-day periods (there’s a run-in of ten days first to reach a stable distribution; it’s basically a Markov chain), and not a single one had rain for 40 days.

It ain’t gon’ rain.

# Estimate Winchester July/August rainy day transition probabilities.
# We'll use Approximate Bayesian Computation.
abciter <- 1000
drytorain <- seq(from = 0.15, to = 0.35, by = 0.01)  # candidate P(rain | dry)
raintorain <- seq(from = 0.5, to = 0.7, by = 0.01)   # candidate P(rain | rain)
ldr <- length(drytorain)
lrr <- length(raintorain)
pb <- txtProgressBar(0, ldr * lrr * abciter, style = 3)
loopcount <- 1
prox <- matrix(NA, nrow = ldr, ncol = lrr)
for (i in 1:ldr) {
  for (j in 1:lrr) {
    trans <- c(drytorain[i], raintorain[j])
    rainydays <- rep(NA, abciter)
    for (k in 1:abciter) {
      setTxtProgressBar(pb, loopcount)
      # ten-day run-in to reach the chain's stable distribution
      runin <- rep(NA, 10)
      runin[1] <- rbinom(1, 1, 0.4)
      for (m in 2:10) {
        runin[m] <- rbinom(1, 1, trans[runin[m - 1] + 1])
      }
      days <- rep(NA, 40)
      days[1] <- rbinom(1, 1, trans[runin[10] + 1])
      for (m in 2:40) {
        days[m] <- rbinom(1, 1, trans[days[m - 1] + 1])
      }
      rainydays[k] <- sum(days)
      loopcount <- loopcount + 1
    }
    # proportion of simulated spells with 15 or 16 rainy days, matching the data
    prox[i, j] <- sum(abs(rainydays - 15.5) < 1) / abciter
  }
}
close(pb)
image(prox)
# I'm going to go with P(rain | dry)=0.32, P(rain | rain)=0.51

# St Swithun's Day simulator
trans <- c(0.32, 0.51)
iter <- 10000
runs <- rep(NA, iter)

pb <- txtProgressBar(0, iter, style = 3)
for (i in 1:iter) {
  setTxtProgressBar(pb, i)
  runin <- rep(NA, 10)
  runin[1] <- rbinom(1, 1, 0.4)
  for (j in 2:10) {
    runin[j] <- rbinom(1, 1, trans[runin[j - 1] + 1])
  }
  days <- rep(NA, 40)
  days[1] <- rbinom(1, 1, trans[runin[10] + 1])
  for (j in 2:40) {
    days[j] <- rbinom(1, 1, trans[days[j - 1] + 1])
  }
  # longest unbroken run of rainy days specifically: taking max(rle(days)$lengths)
  # on its own would also count long dry spells
  r <- rle(days)
  runs[i] <- max(c(0, r$lengths[r$values == 1]))
}
close(pb)
print(paste("There were ", sum(runs == 40),
            " instances of St Swithun's Day coming true, over ",
            iter, " simulated years.", sep = ""))
hist(runs)


Filed under R

Roman dataviz and inference in complex systems

I’m in Rome at the International Workshop on Computational Economics and Econometrics. I gave a seminar on Monday on the ever-popular subject of data visualization. Slides are here. In a few minutes, I’ll be speaking on Inference in Complex Systems, a topic of some interest from practical research experience my colleague Rick Hood and I have had in health and social care research.

Here’s a link to my handout for that: iwcee-handout

In essence, we draw on realist evaluation and mixed-methods research to emphasise understanding the complex system and how the intervention works inside it. Unsurprisingly for regular readers, I try to promote transparency around subjectivities, awareness of philosophy of science, and Bayesian methods.


Filed under Bayesian, healthcare, learning, R, research, Stata, Visualization