Giving R the strengths of Stata


This is not a partisan post that extols the virtues of one software package over another. I love Stata and R and use them both all the time. They each have strengths and weaknesses and if I could only take one to the desert island, I’d find it hard to choose. For me, the greatest unique selling point in Stata is the flexibility of the macros. If I write something like this:

local now=1988

summarize gdp if year==`now'

I will get a summary of GDP in 1988, the same as if I had typed

summarize gdp if year==1988

And I could do the same thing in R (assuming I have installed the package pastecs):

now<-1988

stat.desc(gdp[year==now])

All very nice. But they are doing this in two different ways. R holds all sorts of objects in memory: data frames, vectors, scalars, etc., and accesses any of their contents when you name them. Stata can only have one data file open at a time and stores other stuff in temporary memory as matrices, scalars or these macros, which are set up with the commands local or global. When you give it a name like now, it will by default look for a variable in the data file with that name. So you place the macro name between the backward quote* and the apostrophe to alert it to fetch the contents of the macro, stick them into the command and then interpret the whole command together. That is a very flexible way of working because you can do stuff that most programming languages forbid, like shoving your macro straight into variable names:

summarize gdp`now'

// the same as summarize gdp1988

or into text:

display as result "Summary of world GDP in the year `now':"

or indeed into other macros’ names in a nested way:

local now=1988
local chunk "ow"
summarize gdp if year==`n`chunk''

or even into commands!

local dothis "summa"

`dothis'rize gdp if year==`now'

I believe that is also how Python works, which no doubt helps account for its popularity in heavy number crunching (so I hear – I’ve never gone near it).

Now, the difference between these approaches is not immediately obvious. Because R does not distinguish different classes of object (scalars, matrices, estimation outputs and so on) by how they are named, you can do whatever you like with them (helpful), except simply jam their names into the middle of commands and expect R to replace each name with its contents. That is the strength of Stata's two-stage interpretation. How can we give that strength to R?
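For completeness, the bluntest R analogue of Stata's macro substitution is to build the command as a string and run it through eval(parse(text = ...)). This is a sketch only (the gdp and year vectors are invented for illustration), and it is generally discouraged in everyday R code for the same reasons discussed further down:

```r
# Hypothetical data standing in for the Stata examples
gdp  <- c(100, 110, 120)
year <- c(1987, 1988, 1989)

now <- 1988
# Paste the macro's contents into the command text, then
# parse and evaluate the whole command together, Stata-style
cmd <- paste("summary(gdp[year == ", now, "])", sep = "")
eval(parse(text = cmd))
```

The two-stage interpretation is the same in spirit: first the text is assembled, then the assembled text is interpreted as code.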

A popular question among new useRs is “how do I manipulate the left-hand side of an assignment?”

Here’s the typical scenario: you have a series of analyses and want the results to be saved with names like result1, result2 and so on. Nine times out of ten, R will easily produce what you want as a list or array, but sometimes this collection of distinct objects really is what you need. The problem is, you can’t do things like:

mydata <- matrix(1:12,nrow=3)

paste("columntotal", 1:4, sep="") <- apply(mydata, 2, sum)

And hope it will produce the same as:

columntotal1 <- sum(mydata[, 1])
columntotal2 <- sum(mydata[, 2])
columntotal3 <- sum(mydata[, 3])
columntotal4 <- sum(mydata[, 4])

Instead you need assign()! It’s one of a series of handy R functions that can be crudely described as doing something basic in a flexible way, something which you would normally do with a simple operator such as <- but with more options.

for (i in 1:4) {
  assign(paste("columntotal", i, sep = ""), sum(mydata[, i]))
}

will do exactly what you wanted above.

If you need to fetch a whole bunch of objects by name, mget() is a function that takes a vector of strings and searches for objects with those names. The contents of the objects are returned in a single list, so you can easily work on all the disparate objects with lapply() and the like. Before you mget too carried away with all this fun, take time to read this excellent post, which details the way that R goes looking for objects. It could save you a lot of headaches.
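To make the round trip concrete, here is a small sketch that creates the columntotal objects from above and then fetches them all back with mget():

```r
mydata <- matrix(1:12, nrow = 3)

# Create columntotal1 ... columntotal4 as separate objects
for (i in 1:4) {
  assign(paste("columntotal", i, sep = ""), sum(mydata[, i]))
}

# Fetch all four back by name into a single list
totals <- mget(paste("columntotal", 1:4, sep = ""))

# Now they can be worked on collectively
unlist(totals)  # columntotal1..4: 6 15 24 33
```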

All right, now we know how to mess around with object names. What about functions? do.call() is your friend here. The first argument do.call wants is a string which is the name of a function. The second argument is a list containing your function’s arguments, and it passes them along. You could do crazy stuff like this:

omgitsafunction <- paste("s","um",sep="")

do.call(omgitsafunction,list(mydata))

and it would be the same as:

sum(mydata)

…which raises the possibility of easily making bespoke tables of statistics by just feeding a vector of function names into do.call:

loadsafunctions <- c("sum","mean","sd")

for (i in 1:length(loadsafunctions)) {
  print(do.call(loadsafunctions[i], list(mydata)))
}

or more succinctly:

listafunctions <- as.list(loadsafunctions)

lapply(listafunctions,FUN=do.call,list(mydata))

Another neat feature of Stata is that you can prefix any line of code with capture: and it will absorb error messages and let you proceed. In R you can do this with try(). This is never going to work:

geewhiz<-list(3,10,8,"abc",2,"xyz")

lapply(geewhiz,log)

But maybe you want it to run, skip the strings and give you the numeric results (of course, you could do this by indexing with is.numeric(), but I just want to illustrate a general point, and try() is even more flexible):

lapply(geewhiz,function(x) try(log(x),TRUE))

will work just fine. Note the one-line function declaration inside lapply(), which is there because lapply() wants a function, not the object returned by try(log()).

attach() and (more likely) with() are useful functions if you need to work repetitively on a batch of different data frames. After all, what’s the first thing newbies notice is cool about R? You can open more than one data file at a time. So why not capitalise on that? That takes you into territory that Stata can only reach by repeatedly reading and writing from the hard drive (which will slow you down).
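As a minimal sketch of that multi-data-frame workflow with with() (the data frames and values here are invented for illustration):

```r
# Two separate data frames open at once -- something Stata
# can only mimic by writing to and reading from disk
df1 <- data.frame(year = 1987:1989, gdp = c(100, 110, 120))
df2 <- data.frame(year = 1987:1989, gdp = c(200, 220, 240))

# Run the same calculation inside each data frame in turn
res <- lapply(list(df1, df2),
              function(d) with(d, mean(gdp[year == 1988])))
res  # a list: 110, then 220
```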

subset() is another good one. Really it just does the same as the indexing operator [, but because it's a function, you can link it up with mget() and/or do.call() and get it to work its way through all sorts of subsets of different objects under different conditions, running different analyses on them. Nice!
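One way to sketch that combination is to pass quoted conditions to subset() via do.call(); subset() then evaluates each condition inside the data frame (the data frame and conditions here are invented):

```r
mydf <- data.frame(year = rep(1987:1988, each = 3), gdp = 1:6)

# A list of unevaluated conditions to loop over
conditions <- list(quote(year == 1987), quote(year == 1988))

# do.call() hands each condition to subset(), which evaluates
# it in the context of the data frame
subs <- lapply(conditions,
               function(cond) do.call("subset", list(mydf, cond)))
```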

The final function I want to highlight is substitute(). This allows you to play around with text which needs to be evaluated by another function as if it had been typed in by the user, and yet still have it work:

mydata<-c(1,2,3)
xx<-"my"
substitute(paste(xx,"data",sep=""),list(xx=xx))
eval(substitute(paste(xx,"data",sep=""),list(xx=xx)))
mget(eval(substitute(paste(xx,"data",sep=""),list(xx=xx))))

Pretty cool huh? I hope this opens up some new ideas for the more advanced Stata user who wants to use R functionality. On the other hand, if you use R all the time, perhaps this will encourage you to take Stata seriously too.

* – thanks to my Portuguese students who taught me this is the acento contrario


Filed under R, Stata

8 responses to “Giving R the strengths of Stata”

  1. BarryR

    No no no no no! Don’t use assign like that! 99.99 times out of a hundred it’s a massive red flag telling you you’re doing it wrong! In your first example, creating columntotal1 to columntotal4, what you really should do is create a vector of four values, columntotals, say, and then you can get the individual elements with columntotals[1] to columntotals[4]. Yes, you do have to type some extra square brackets, but the big win is that to get column total ‘i’ you can do columntotals[i], and not something like get(paste("columntotal", i)). This kind of mucking about with names, trying to emulate the ‘macro’ processing features of other systems, is not a good idea and leads to code that is hard to distribute and debug. Really, please, it’s not a good idea!

    • Absolutely – terrible R programming. Very poor. If only everyone was a fluent R programmer, we wouldn’t be offended by this sort of practice. However, they’re not, and we are faced with two options: refuse to deal with anyone who is not skilled at the highest level, or try to bridge the gap and bring new opportunities to people who could do valuable things with data. I guess we’ve both made our choices.

  2. Thanks Robert for your post. I occasionally use Stata and it’s always helpful to connect concepts across languages (even when not up to some people’s standards, pff).

  3. I don’t mean to be snobbish, but I honestly feel beginners should be taught vectorisation and application from the beginning! I don’t think it’s harder, just slightly different. Old dogs like me, poisoned by imperative programming, might be beyond saving but I think the new folks deserve a chance ;)

  4. Jeremy

    Why would you want the totals to be separate objects anyway? If the aim is just to get the sum, mean, and sd then you can do it in one line of code :

    res <- apply(mydata,2,function(x) c(sum(x),mean(x),sd(x)))

    as for the part where you want it per year:

    # make some data with a year column
    myrealdata <- data.frame("Year" = c(rep(1997,5),rep(1998,6),rep(1999,3)),
    "Value1" = c(runif(14,0,100)),
    "Value2" = c(runif(14,0,1)),
    "Value3" = c(runif(14,0,1000)))
    # get the sum, mean and sd
    res1998 <- apply(myrealdata[myrealdata$Year == "1998",2:4],2,function(x) c(sum(x),mean(x),sd(x)))
    row.names(res1998) <- c("sum","mean","sd")

    If you want the sum mean and sd for all years then I would use the data.table package, but it uses its own version of syntax which makes it less straight forward:

    library(data.table)
    myrealdata.dt <- data.table(myrealdata)
    res_all <- data.frame("Stat" = rep(c("sum","mean","sd"),3),
    "Year" = rep(c(1997,1998,1999),3),
    "Value1" = myrealdata.dt[,c(sum(Value1),mean(Value1),sd(Value1)),by=Year][,V1],
    # the second square bracket there tells it which columns to output since it gives a result with the first column being the year and second column being the sum, mean and sd's
    "Value2" = myrealdata.dt[,c(sum(Value2),mean(Value2),sd(Value2)),by=Year][,V1],
    "Value3" = myrealdata.dt[,c(sum(Value3),mean(Value3),sd(Value3)),by=Year][,V1])

    • Thanks Jeremy, good suggestions there. I like the inline triple function. I’ve been in a situation where I had to make multiple objects and it would have been clearer than writing to an array or list and then partitioning that. I can’t remember the circumstances now, but it was possibly something passing objects to other software, maybe JAGS or Stan. Anyway there’s a lot of people out there who know some other data analysis syntax like Stata but want to dip into R, and it will probably put them off to learn data.table and the like on day 1. But of course, if they like R, they should ascend out of the R Inferno as quick as they can.

  5. I’ve used R for 8+ years, and only recently got Stata, and even tend to use it more often on my personal projects than R now. But the whole macro thing is a 1970s concept whose time is past. (Python is a modern programming language, not a scripting/macro language. You may be thinking of older versions of Perl, which did originally use macro-like substitution because Perl began as an improvement on UNIX command shell scripts.)

    When doing straightforward things — when the analysis is on rails — Stata’s a dream. As I said, I actually prefer it. But once you want to go beyond that, things get rapidly complicated. First, you learn about macros and Stata’s programming language (do files), then about the world of hidden variables that are not in your dataset (various return variables, matrices, etc). Then Stata’s newer programming language (mata). Then…

    The whole idea of putting four similar results in four different variables ("result1", "result2", "result3", "result4") is used in Stata (and other older statistics packages) because you have to cram most everything into the equivalent of a single R data frame. It’s not a benefit, it’s a workaround for an inherent restriction in the tool. (If you want to see this issue multiplied, try using IRFs: Stata resorts to storing your results in files in order to handle the collection of data.)

    I don’t want to get lost in the weeds, boosting R and criticizing Stata. In fact, I’ve written a couple of blog postings on “Stata for R users”. I like and use Stata on a weekly basis. Just saying that the whole macro thing is useful in Stata because of the way it’s organized (and because it’s not a programming language), but it’s an awkward and error-prone way of doing business in a programming language, like R.

  6. Nice post! I think what Robert is illustrating here is very unique and important in R. Personally, I learned the caveat about using assign() about a year after learning R, so I always remembered to avoid this practice even though I didn’t really know why, until one day I encountered a setting very similar to the columntotal example Robert is showing in this post. Admittedly R has a very powerful indexing system, and maybe that is BarryR’s point, but I guess it’s not the direction Robert is pointing in anyway.
