Slice bivariate densities, or the Joy Division “waterfall plot”

This has been on my to-do list for a long old time. Lining up slices through a bivariate density seems a much more intuitive way of depicting it than contour plots or some ghastly rotating 3-D thing (urgh). Of course, there is the danger of features being hidden, but you know I’m a semi-transparency nut, so it’s no surprise I think that’s the answer to this too.

[Figure: example slicedens output, sliced bivariate density in white on black]

Here’s an R function for you:

# x, y: data
# slices: number of horizontal slices through the data
# lboost: coefficient to increase the height of the lines
# gboost: coefficient to increase the height of the graph (ylim)
# xinc: horizontal offset for each successive slice
# (typically something like 1/80)
# yinc: vertical offset for each successive slice
# bcol: background color
# fcol: fill color for each slice (polygon)
# lcol: line color for each slice
# lwidth: line width
# extend: Boolean to extend lines to edge of plot area
# densopt: named list of extra arguments passed straight
# to density()
# NB if you want to cycle slice colors through vectors, you
# need to change the function code; it sounds like a
# pretty bad idea to me, but each to their own.
slicedens <- function(x, y, slices=50, lboost=1, gboost=1, xinc=0, yinc=0.01,
                      bcol="black", fcol="black", lcol="white", lwidth=1,
                      extend=FALSE, densopt=NULL) {
  ycut <- min(y) + ((0:slices) * (max(y) - min(y)) / slices)
  height <- gboost * ((slices * yinc) + max(density(x)$y))
  plot(c(min(x), max(x) + ((max(x) - min(x)) / 4)),
       c(0, height),
       xaxt="n", yaxt="n", ylab="", xlab="")
  rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col=bcol)
  for (i in slices:1) {
    miny <- ycut[i]
    maxy <- ycut[i+1]
    gx <- (i-1) * (max(x) - min(x)) * xinc
    gy <- (i-1) * height * yinc
    dd <- do.call(density, append(list(x=x[y>=miny & y<maxy]), densopt))
    polygon(dd$x + gx, lboost*dd$y + gy, col=fcol)
    lines(dd$x + gx, lboost*dd$y + gy, col=lcol, lwd=lwidth)
    if (extend) {
      lines(c(par("usr")[1], dd$x[1] + gx),
            rep(lboost*dd$y[1] + gy, 2), col=lcol)
      lines(c(dd$x[length(dd$x)] + gx, par("usr")[2]),
            rep(lboost*dd$y[length(dd$y)] + gy, 2), col=lcol)
    }
  }
}
# Example 1:
y<-runif(5000,min=-1,max=1)
x<-runif(5000,min=-1,max=1)+rnorm(5000,mean=1/(y+1.1),sd=0.8-(y*0.5))
slicedens(x,y,lboost=0.2,fcol=rgb(0,0,0,200,maxColorValue=255))
# Example 2:
data(iris) # iris is a built-in dataset, not a package
slicedens(x=iris$Sepal.Width,y=iris$Sepal.Length,slices=12,
 lboost=0.02,fcol=rgb(0,0,0,200,maxColorValue=255),
 extend=TRUE,densopt=list(kernel="cosine",adjust=0.5))

Some places call this a waterfall plot. Anyway, the white-on-black color scheme is clearly inspired by the Joy Division album cover. Enjoy.

Edit 9 October 2014: added the “extend” and “densopt” arguments.

3 Comments

Filed under R, Visualization

Transparent hurricane paths in R

Arthur Charpentier has written a really nice blog post about obtaining hurricane tracks and plotting them. He then goes on to do other clever Markov process models, but as a dataviz guy who knows almost nothing about meteorology, I want to suggest just a small improvement to the maps. Almost every graphic with a lot of data benefits from transparency. It’s easy to add in R too, you can just specify a color with a function like rgb(). Here’s the opaque version:

[Figure: hurricane tracks plotted in opaque red]

and here it is with color="red" replaced by color=rgb(255,0,0,18,maxColorValue=255).

[Figure: the same tracks in semi-transparent red]
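If you want to see the effect without downloading the hurricane data, here is a tiny self-contained sketch of the same trick on simulated tracks (purely illustrative, nothing to do with Arthur's code):

# many overlapping lines, each drawn in semi-transparent red
plot(NULL, xlim=c(0,10), ylim=c(-3,12), xlab="", ylab="")
for (i in 1:200) {
  lines(0:10, cumsum(rnorm(11, mean=0.5)),
        col=rgb(255, 0, 0, 18, maxColorValue=255))
}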

The only other thing I did to Arthur’s code was to wrap the crucial TOTTRACK line in a try() function because some of the files came back with error 404, and you don’t want to stop just for a few ol’ hurricanes:

try(TOTTRACK<-rbind(TOTTRACK,extract.track(y)),silent=FALSE)

Longer-term, it would be worth trying to replace the tracks with splines to avoid jumpiness, and it does look like there’s some dirt in the data: what are those tracks going straight from the Azores to Iceland?

1 Comment

Filed under R

Save your simulation study seeds

Here in the Northern hemisphere, gardeners are gathering seeds from their prize-winning vegetables and storing them away for next year’s crop. Today at the 20th London Stata Users’ Group meeting, I learnt a similar trick. It’s strange I never thought of it before; regular readers will know how keen I am on simulation studies and techniques. Sometimes you want to go back and investigate one particular simulated dataset, either because the results were particularly interesting or because your program didn’t behave the way you hoped, and you think there will be a clue on how to debug it in that one weird dataset.

Bill Gould’s sage advice was to gather the seed for the random number generator (RNG) at the beginning of each iteration and store it as a string. The seed initialises the RNG (which is never really random*, just unpredictable) and if you set it, you can reproduce the results. A basic example in Stata:

clear all

local iter=10
set obs 100
gen str64 myseed=""
gen slopes=.
gen x=rnormal(0,1)
gen mu=1+(2*x)
gen y=rnormal(mu,1)

forvalues i=1/`iter' {
 local myseed=c(seed)
 qui replace myseed="`myseed'" in `i'
 qui replace x=rnormal(0,1)
 qui replace mu=1+(2*x)
 qui replace y=rnormal(mu,1)
 qui regress y x
 matrix A=e(b)
 local beta=A[1,1]
 qui replace slopes=`beta' in `i'
}

local revisit=myseed[2]
set seed `revisit'
qui replace x=rnormal(0,1)
qui replace mu=1+(2*x)
qui replace y=rnormal(mu,1)
regress y x
dis "`revisit'" 

In this example, we run a linear regression 10 times and save the slope parameters and the seeds. We can then pick out simulation number 2 and recreate it without retracing any other steps. Nice! Here it is in R (but R RNGs are much more complex and I have to say that the help documentation is far from helpful):

iter <- 10
size <- 100
slopes <- rep(NA, iter)
seeds <- list(NULL)
set.seed(NULL) # make sure .Random.seed exists before we read it
for (i in 1:iter) {
  seeds[[i]] <- .Random.seed # store the full RNG state for this iteration
  x <- rnorm(size, 0, 1)
  y <- rnorm(size, 1+(2*x), 1)
  slopes[i] <- lm(y~x)$coefficients[2]
}
.Random.seed <- seeds[[2]] # restore the state from iteration 2
x <- rnorm(size, 0, 1)
y <- rnorm(size, 1+(2*x), 1)
lm(y~x)
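If juggling .Random.seed feels fragile (it changes with the RNG kind in use), an alternative sketch (my own habit, not from the talk) is to draw one integer seed per iteration up front and hand it to set.seed():

iter <- 10
int.seeds <- sample.int(.Machine$integer.max, iter)
for (i in 1:iter) {
  set.seed(int.seeds[i])
  # ... simulate and fit as before ...
}
set.seed(int.seeds[2]) # recreate iteration 2 exactly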

* – I’m glossing over the philosophical meaning of ‘random’ here

11 Comments

Filed under R, Stata

Should every nonparametric test be accompanied by a bootstrap confidence interval?

Well, duh. Obviously. Because (a) every test should have a CI and (b) bootstrap CIs are just awesome. You can get a CI around almost any statistic, and they account for non-normality and boundaries.

But you might have to be a little careful in the interpretation, because they might not be measuring the same thing as the test.

Take a classic Wilcoxon rank-sum / [Wilcoxon-]Mann-Whitney independent-samples test (don’t you just love those consistent and memorable names?). This ranks all the data and compares them across two groups. Every bit of the distribution is contributing, and there isn’t an intuitive statistic; what you’re testing is the W statistic. Do you know what a W of 65000 looks like? No, neither do I. If there’s a difference somewhere in terms of location, it might come up.

It’s so much simpler for the jolly old t-test. You take means and compare them. You get CIs around those means with a simple formula. And everybody knows what a mean is, even if they don’t really want to grapple with a t-statistic and Satterthwaite’s degrees of freedom.

So, in the Mann-Whitney case, the most sensible measure might be the difference between the medians. There is no formula for a CI for this, though undoubtedly we could get a pretty bad approximation by the usual techniques. So, we reach for the bootstrap. In fact, perhaps we should just be using it all the time…?
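To make that concrete, here is a minimal sketch of a percentile bootstrap CI for the difference in medians, using the boot package on made-up data (all names and numbers are illustrative):

library(boot)
set.seed(2014)
dat <- data.frame(value = c(rnorm(50), rexp(50) - 1),
                  group = rep(c("a", "b"), each = 50))
med.diff <- function(d, i) {
  d2 <- d[i, ]
  median(d2$value[d2$group == "a"]) - median(d2$value[d2$group == "b"])
}
b <- boot(dat, med.diff, R = 2000, strata = dat$group) # resample within groups
boot.ci(b, type = "perc") # percentile CI; BCa is worth a look too
wilcox.test(value ~ group, data = dat)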

So the problem here is that you could have a significant Mann-Whitney but a median-difference CI that crosses zero. Interpreting that is not so easy, and I found one of my students in just that pickle recently. It was my fault really; I’d suggested the bootstrap CI. How could we deal with this situation? At the risk of cliché, it’s not a problem but an opportunity. Because the test and the CI look at the data in slightly different ways, you’re actually getting more insight into the distribution, not less. Consider this situation:

[Figure: hand-drawn sketch of two distributions with the same median but different shapes. Caption: Spend hours making it just so in R? No. Use a pencil.]

Here, the groups have the same median but should get a significant Mann-Whitney result if the sample size is not tiny. You can surely imagine the opposite too, with a bimodal distribution where the median flips from one clump to another through only a tiny movement in the distribution as a whole.
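That first scenario is easy to check in a few lines: simulate two groups with the same median but different shapes and look at the Mann-Whitney result (a sketch with my own choice of distributions):

set.seed(42)
a <- rnorm(500)         # symmetric, median 0
b <- rexp(500) - log(2) # right-skewed, median also 0
median(a); median(b)    # both close to zero
wilcox.test(a, b)       # typically significant at this sample size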

So, in conclusion:

  • my enthusiasm for bootstrapping is undimmed
  • there is still no substitute for drawing lots of graphs to explore your data (and for this, pencils are probably best avoided)

2 Comments

Filed under learning

Measuring nothing and saying the opposite: Stats in the service of the Scottish independence debate

I was half-heartedly considering writing about the ways that GDP can be twisted to back up any argument, when what should come along but this unedifying spectacle.

The Unionists (No campaign) have produced a league table of GDP, showing how far down Scotland would be. So, the argument goes, you should vote for them. This is, however, irrelevant to whether Scots would be better off. The GDP would drop because it would be a small country. GDP is the total sum of economic activities. If you have fewer people, you do less. That doesn’t make you poor, as Monaco or the Vatican City will tell you. If Manhattan declared independence from the rest of the USA, its GDP would go down. If Scotland not only stayed in the UK but convinced North Korea to join too, the UK’s GDP would go up. On this logic, every time a border is removed, people immediately get rich.

Meanwhile, the Separatists (Yes campaign) have produced a league table of GDP per capita, showing how far up Scotland would be. So, the argument goes, you should vote for them. This is, however, irrelevant to whether Scots would be better off, despite being a far less bad guess than GDP in toto. It tells the individual voter (let’s call him Mr Grant) practically nothing about whether the Grants would be better off, because that depends on a million other factors. Small countries with some very wealthy people, relying on foreign investment, will have inflated GDP per capita. It’s just the same thing kids learn in primary school now: when there are outliers, the mean is not so useful.

Anyway, the whole measure is used to form arguments about wellbeing, which is just nonsense. Otherwise all our ex-colonies would be kicking themselves at trading the jackpot of Saxe-Coburg-Gotha rule for silly stuff like identity and self-determination. (Except the USA, Canada, Australia and Ireland, who have higher GDP per capita than us and so are presumably happier. We should try to join them instead, as that will make us happy – though they will feel sad when we arrive.) No wonder there are still plenty of people who, having asked what I do for a living, gleefully say “lies, damned lies…”

PS: This Langtonian doesn’t get a vote because I live in London – that “giant suction machine” – and here’s a great post at Forbes about the UK joining the USA and becoming the lowest of the low. 

1 Comment

Filed under Uncategorized

Beeps and progress alerts to your phone

[Note: this post first went up in April 2014, but today I noticed it was missing. No idea why! Maybe it got taken down because I had an embedded video with copyrighted music, I don't know. Anyway, I copied it back from r-bloggers.com]

[Another note: pingr is now replaced with beepr]

Recently I encountered an R package called pingr, made by Rasmus Bååth (the same guy who did MCMC in a web page, my visualization of 2013). You install it, you type ping(), and it goes ping. Nice.

In fact there are nine built-in pingr noises. It’s more useful than it may seem; I was using it within minutes of reading the blog post because I had a series of Bayesian models running on my laptop while I wrote some stuff on my desktop PC. When the models finished, they went ping, making everything as efficient as possible. It got me thinking about beeping alerts in all sorts of data analysis software.
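Since pingr has been superseded by beepr (see the note above), the same one-liner idea now looks like this; a sketch assuming a current install:

# install.packages("beepr")
library(beepr)
beep()  # the default ping
beep(3) # one of the other built-in sounds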

In Stata, you can just type ‘beep’. Job done. In fact, that locates the system general alert sound (in Windows at least) and plays it. I spent some time extracting data from a primary care database recently, where there were several computers grinding through the big data for different researchers in a windowless room. Every now and then, a lion’s roar would emanate from one of them. I found it a bit disconcerting but played it cool until someone told me they had replaced the Windows alert beep with this .wav file for a laugh.

SPSS used to have sound alerts in the General Options menu, but they were quietly (?) dropped sometime around version 20. The pain was that it was all or nothing: either on, beeping every time any output was added, or off, and there didn’t seem to be a syntax command for beeping. However, there is now one (STATS SOUND) in the extension commands package; it’s not clear whether one has to pay extra for that, and frankly, I’m not going to bother finding out.

When I’m able to glance at the computer regularly, perhaps because I’m eating what passes for lunch in Stats HQ, I particularly like R’s txtProgressBar with style=3. Stata users can easily display dots in a similar fashion, although it’s interesting to look online and see the alternative solutions, such as displaying progress in the window title, which could have advantages in some situations.
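For reference, that progress bar takes only a few lines of base R:

pb <- txtProgressBar(min = 0, max = 1000, style = 3)
for (k in 1:1000) {
  # ... an iteration of real work goes here ...
  setTxtProgressBar(pb, k)
}
close(pb)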

My latest long-running simulation made me try something quite different. I wanted progress reports but I was going to be in another room. If something went wrong, I would go back to the office and try to fix it. On my (Android) cellphone I have an app called Minutes. It’s a basic text editor that syncs very easily to Dropbox. So all I needed to do was have the stats software write periodically to a text file in the Minutes folder, and the update appears on my phone!

[Figure: phone screenshot of the Minutes app showing a progress update]

How 21st-century is that! This is how I’ve done it in R:

library(pingr) # for ping() at the end
progress <- 0
fileConn <- file("C:/Users/Robert/Documents/Dropbox/Apps/Minutes/progress.txt")
for (k in 1:1000) {
  if (floor(k/100) > progress) {
    writeLines(paste("Now on iteration ", k, sep=""), fileConn)
    progress <- floor(k/100)
  }
  # more complicated stuff follows, and then ...
}
close(fileConn)
ping()

and in Stata:

local progress=0
forvalues i=1/1000 {
    if (floor(`i'/100)>`progress') {
        file open minutes using "progress.txt", write text replace
        file seek minutes tof
        file write minutes "Now on iteration `i'"
        file close minutes
        local progress=floor(`i'/100)
    }
    // some complicated time-consuming stuff...
}
beep

Notice how the file is written to the drive each time you call writeLines in R, even without closing the fileConn, but in Stata you have to close the file inside the if branch. Also, R will carry on trying to run commands after an error, so it’ll (probably) still go ping, while Stata will stop and therefore you will hear no beep.

It will get a little more complicated to catch errors, but not much. If your program grinds to an unpleasant halt, your progress.txt file will just be stuck there on the last number, and it could be a while before you get suspicious and go to check. One simple solution is to write all your output to the progress.txt file, but this will slow things down if you can’t avoid (or don’t want to avoid) writing lots of lines to the output; this was the case for my simulation with rstan. You only want one special line written in case of an error that says

I'm afraid I can't do that, Dave.
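In R, that error-catching could be a tryCatch() around the loop, with the handler writing one last line to the same progress file (a sketch; adapt the path and message to taste):

fileConn <- file("progress.txt")
tryCatch({
  for (k in 1:1000) {
    # ... the complicated stuff that might fail ...
  }
}, error = function(e) {
  writeLines(paste("I'm afraid I can't do that, Dave:", conditionMessage(e)),
             fileConn)
})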

You could send an SMS too, or tweet, if you prefer…

Leave a comment

Filed under Uncategorized

Two handy documents for making good UK maps

Everybody loves a good map. Even if you don’t have any reason to make one, your boss will love it when you do, so check this out and get yourself a pay rise (possibly).

First, this set of diagrams via ONS Geographies on Twitter, showing how the different terminologies of UK administrative geography overlap and nest. It looks horrible, like it was made by NSA staffers after their PowerPoint refresher day, but it does the trick, and I haven’t seen this pulled together in one simple way like this before.

Second, this nice presentation from the most recent LondonR meeting, by Simon Hailstone. He shows the value of proper mapping tools inside R with some real-life ambulance service data. Croydon, Romford, Kingston, West End, OK. Heathrow Airport is a bit of a surprise.

Leave a comment

Filed under R, Visualization