Despite winter rain, I was delighted to head uptown last week to Skills Matter on the old Goswell Road for the first ever London Julia meetup. The first thing I learnt was that Julia’s friends are called Julians.

If you don’t know it yet, Julia is a pretty new (v 0.3 is current) programming language for fast numerical computing. Everything is designed from the ground up for speed, by some very clever people. They claim speeds consistently close to compiled machine code, which is generally the upper limit, like the speed of light. But a few facts make it potentially Revolutionary Computing: you don’t have to compile it before running, you can mess about in a command-line interface to learn it, it’s free and open source, you can directly call C functions from inside normal Julia code – and vice versa, and the syntax is LISP-ish and light as eiderdown (there are some nice comparative examples of this on the homepage).

The focus was on getting started, and the room was packed. Personally, I spent some time playing with it last year and then let it lapse, but now with v0.3 out there it seems to be time to get back up to speed.

For stats people, there are a few important packages to install: Distributions, Stats, DataFrames, HypothesisTests, and possibly Optim, MCMC, depending on your own interests. That’s all pretty straightforward, but when you start up Julia or load one of the packages like this:

using(HypothesisTests)

it takes a noticeable while to get ready. This is an artefact of the just-in-time compiler and open source programming. Almost all of the packages and the standard library are written in Julia itself. When you first need it, it gets compiled, and after that it should be superfast. Apparently a package is on the way to supply a pre-compiled standard library, to increase startup speeds.

Here’s a little power simulation I tried out afterwards:

using(HypothesisTests) starttime=time() nsig=0; for (i in 1:100000) xx=140+(15*randn(10)); yy=135+(15*randn(10)); sig= pvalue(EqualVarianceTTest(xx,yy))<0.05 ? 1 : 0; nsig = nsig+sig; end time()-starttime

This does 100,000 simulations of independent-samples t-tests with sample size 10 per group, means 140 and 135, and SD 15, and took 5.05 seconds on a teeny weeny Samsung N110 ‘netbook’ with 1.6GHz Atom CPU and 1GB RAM (not what you would normally use!) once the package was loaded.

In R, you could do this at least two ways. First a supposedly inefficient looped form:

Sys.time() nsig<-0 for (i in 1:100000) { xx<-rnorm(10,mean=140,sd=15) yy<-rnorm(10,mean=135,sd=15) if(t.test(xx,yy)$p.value<0.05) { nsig<-nsig+1 } } Sys.time() print(nsig)

Next, a supposedly more efficient vectorized form:

tp<-function(x) { return(t.test(x[,1],x[,2])$p.value) } Sys.time() nsig<-0 xx<-array(c(rnorm(1000000,mean=140,sd=15), rnorm(1000000,mean=135,sd=15)), dim=c(100000,10,2)) pp<-apply(xx,1,tp) ppsig<-(pp<0.05) table(ppsig) #nsig<-sum(apply(xx,1,tp)<0.05) Sys.time() print(nsig)

In fact, the first version was slightly quicker at 2 minutes 3 seconds, compared to 2 minutes 35. While we’re about it, let’s run it in Stata too:

</pre> clear all timer on 1 set obs 10 local p = 0 gen x=. gen y=. forvalues i=1/1000 { qui replace x=rnormal(140,15) qui replace y=rnormal(135,15) qui ttest x==y, unpaired if r(p)<0.05 local p = `p'+1 } dis `p' timer off 1 timer list <pre>

That took 30 seconds so we’re looking at 50 minutes to do the whole 100,000 simulations, but Stata black belts would complain that the standard language is not the best tool for this sort of heavy duty number-crunching. I asked top clinical trial statistician Dan Bratton for some equivalent code in the highly optimised Mata language:

timer clear 1 timer on 1 mata: reps = 100000 n = (10 \ 10) m = (140 , 135) s = (15 , 15) pass = 0 for (i=1;i<=reps;i++) { X = rnormal(10,1,m,s) mhat = mean(X) v = variance(X) df = n[1]+n[2]-2 t = (mhat[1]-mhat[2])/sqrt((1/n[1]+1/n[2])*((n[1]-1)*v[1,1]+(n[2]-1)*v[2,2])/df) p = 2*ttail(df,t) if (p<0.05) pass = pass+1 } pass/reps end timer off 1 timer list 1

… which clocked in at 7 seconds. I’m not going to try anything more esoteric because I’m interested in the speed for those very pragmatic simulations such as sample size calculations, which the jobbing statistician must do quite often. (Actually, there is an adequate approximation formula for t-tests that means you would never do this simulation.)

That time difference surprised me, to say the least. It means that Julia is an option to take very seriously indeed for heavy-duty statistical calculations. It really isn’t hard to learn. However, I don’t know of any scientific papers published yet that used Julia instead of any more established software. Perhaps the version 0.x would worry editors and reviewers, but surely v1.0 is not far away now.

I’m pretty sure you cannot call C++ from Julia (or not easily). Just C.

Ok – I’ve not tried it.

Hi!

Nice example, but you forgot to mention that

xx=140+(15*randn(10));

is plain ugly compared to

xx<-rnorm(n=10,mean=140,sd=15)

(the latter being self explaining) and that a for-loop and the unnecessary introduction of an unused variable i is far less elegant than beautiful commands like R's "replicate". Also you forgot to mention R's compiler package.

Obviously Julia will be faster for large computations. But then, the most important thing in large computations is to have no errors in your code and you actually compute the right thing. Being able to call the mean "mean" and the standard deviation "sd" in the random number generation call and to call replicates "replicate" will make it easier for scientists to overview their own code. Programming enthusiasts in R will easily outsource their most critical code to C++ (using RCpp) and thereby be no slower than Julia.

So yes: Speed is important, but it is not that important, that you should just omit everything else in any comparison.

Cheers,

Bernhard

I owe you my 5 line version in R, which I consider easy to read to everyone who knows English and the concept of a function. Well maybe you need to explain the rnorm() function. Because it is easy to read, it will be easy to spot errors.

is.significant <- function(){

xs<-rnorm(10,mean=140,sd=15)

ys<-rnorm(10,mean=135,sd=15)

return(t.test(xs,ys)$p.value<0.05)

}

sum(replicate(100000, is.significant()))

By the way, if you call t.test() like this, you get the Welch correction. If you want the variant for equal variances you need to tell the t.test() function. Now compare that to the mata-version above, which programs the t test from scratch. See what I mean?

Greetings,

Bernhard

You can write code a lot like that in Julia. Sometimes you incur a performance hit, sometimes not. Here’s 4 lines (excluding imports and comments).

using Distributions

using HypothesisTests

# Generate random samples on demand

randxs = () -> rand(Normal(140, 15), 10)

randys = ()-> rand(Normal(135, 15), 10)

# Test if difference in sample means is significant.

is_sigdiff = (xs, ys) -> pvalue(EqualVarianceTTest(xs, ys)) < 0.05

# Simulate t-tests

@time n_significant = sum([is_sigdiff(randxs(), randys()) for i = 1:10_000])

I can't confirm if this would be slower on Robert's test hardware, but I doubt it would be by much.

Sorry to make two comments, but here are the timings of the various versions I get on my machine. (2.5GHz i7 Mac)

Robert’s version gives me an elapsed time of 0.32402992248535 seconds.

My version gives me an elapsed time of 0.384465952 seconds.

Bernhard’s version in R gives me an elapsed time of 18.572 seconds.

So there’s almost no cost to making the Julia code somewhat more concise and readable. The differences between my version and Bernhards are (1) keywords for the normal distribution parameters, and (2) replicate() vs. an array comprehension.

I don’t see (1) as a big deal. I think it’s very clear what you’re getting with Normal(135, 15) (though maybe there’s potential for confusion over whether the second argument is sd or var). But note that you get rand(Normal(m, sd), n) as a call to generate random variables, which I’d say is clearer than rnorm(n, m, sd).

And as to (2) I don’t see the array comprehension as any less readable than the replicate function, but it would be trivial to define a replicate function in Julia that wrapped this.

But overall, I think it’s not quite right to say that Julia’s performance comes at the expense of readability or expressiveness. The Julia code to me is quite clear for a 50x speedup.

Thanks Bernhard, you have made some extremely important points there. I like to quote Uwe Ligges in an old mailing list conversation: “RAM is cheap, and thinking hurts.” I can’t remember who first introduced me to that, maybe it’s in The R Inferno.

FWIW apply isn’t a vectorised function, it’s a just a fairly thin wrapper around a for loop, and you don’t usually expect it to have any particularly large impact on performance.

You might want to read “Computing thousands of test statistics simultaneously with R” in http://stat-computing.org/newsletter/issues/scgn-18-1.pdf. It works through efficient ways to solve this problem (and related problems) in R.

I had a quick look at this. Do you know about Rprof()? It’s a profiler that comes with R so you can see where it spends its time. Call Rprof() once before the test. Then Rprof(NULL) afterwards to stop profiling, followed by summaryRprof() to see where it spent its time. I noticed recently it can can even profile line by line, but I haven’t tried that yet.

In this case it reveals that over 90% of the time was spent in t.test() being called repetitively 100,000 times. The particular vectorization attempts you tried didn’t avoid that.

For example, 100,000 calls to rnorm(10) took 2 seconds on my even smaller netbook (800Mhz). One rnorm(1e6) call was quicker at 0.5sec. Those times should compare ok to Julia. So it’s not R as a language per se, it’s that some functions in R (like t.test()) are heavy-weight functions meant to be called once, whose results are quite heavy including information you may not need. Others are lower level and single purpose, more like the Julia approach. In contrast, the Mata example seems to code the calculation directly as t = (mhat[1]-mhat[2])/sqrt((1/n[1]+1/n[2])*((n[1]-1)*v[1,1]+(n[2]-1)*v[2,2])/df), and it could be done that way in R too.

Next step after discovering the culprit was t.test() … I went to Stack Overflow and searched for “[r] faster t.test”, which came up with this :

http://stackoverflow.com/questions/11460680/whats-the-fastest-way-to-apply-t-test-to-each-column-of-a-large-matrix

The accepted answer starts by suggesting multicore but then goes on to suggest a direct method avoiding t.test(). The other answer points to a published article “computing thousands of test statistics simultaneously in R” which may be of interest.

Anyway, that’s the modus operandi for iterating “heavy” stats functions in R. First, find a lighter direct function. If search doesn’t help, ask on Stack Overflow. If they say no direct method exists, then make it yourself, usually by looking at the heavy function and calling the core function directly or extracting out the core parts.

But the first step in any language when you’ve got a speed issue is the mantra profile, profile, profile.

A minor quibble: v0.3 is under development, v0.2 is released and current/stable.

Those papers you wish to see are most certainly on their way. It is my observation that academic folks, especially graduate students, are some of the more engaged and avid Julia users. I personally know some who are intending to publish in the near future. “Soon.”

Your better of using the Distributions package rather than generation your own random variables in Julia. It’s faster and looks nicer. You also don’t need all those parenthesis and semicolons. You only need to use semicolons in Julia if you want multiple statements on the same line.

Here is a slightly cleaned up version:

using HypothesisTests, Distributions

function testperf(n::Integer)

nsig = 0

for i in 1:n

xx = rand(Normal(140, 15), 10)

yy = rand(Normal(135, 15), 10)

nsig += pvalue(EqualVarianceTTest(xx,yy))<0.05 ? 1 : 0

end

end

@time testperf(100_000)

I'm also warpping the computation in a function and using the @time macro. I think it looks cleaner.

I’d be curious to see some benchmark tests between Julia and R with file I/O and string manipulation. As a corpus linguist, I spend nearly all of my time in R reading in (hundreds of) text files with “for” loops and doing something with them:

(1) Searching with regular expressions for specific strings and preparing the lines with matches to be imported in a spreadsheet, with the matching word in their own column and the preceding and following words on their own columns on either side.

(2) Making frequency lists of words, including dispersion measurements of how many different files each word occurs in.

(3) Calculating the frequencies of all attested preceding and following words, given a specific node word.

(4) Going to websites, often getting past a login screen with Curl (or rather RCurl in R) and pulling the HTML code into R and searching for specific strings with XPath syntax (in the R “XML” package) .

(5) And many other corpus linguistics tasks.

Does Julia’s lightning speed with numbers and stats also apply to this type of work?

I haven’t seen any direct comparisons to R, but a comparison put it between Python and Java, with Java at 150ms, Julia at 185ms and Python at 200ms.