Save your simulation study seeds

Here in the Northern hemisphere, gardeners are gathering seeds from their prize-winning vegetables and storing them away for next year’s crop. Today at the 20th London Stata Users’ Group meeting, I learnt a similar trick. It’s strange I never thought of it before; regular readers will know how keen I am on simulation studies and techniques. Sometimes you want to go back and investigate one particular simulated dataset, either because the results were particularly interesting or because your program didn’t behave the way you hoped, and you think there will be a clue to debugging it in that one weird dataset.

Bill Gould’s sage advice was to gather the seed of the random number generator (RNG) at the beginning of each iteration and store it as a string (in Stata, this is really the current RNG state, returned by c(seed)). The seed initialises the RNG (which is never really random*, just unpredictable), and if you set it again later, you can reproduce the same results. A basic example in Stata:

clear all

// number of simulated datasets
local iter=10
// observations per simulated dataset
set obs 100
// storage for the RNG state (stored as a string) and the estimated slope from each run
gen str64 myseed=""
gen slopes=.
// create the variables once so the loop can use -replace-
gen x=rnormal(0,1)
gen mu=1+(2*x)
gen y=rnormal(mu,1)

forvalues i=1/`iter' {
    // record the current RNG state before drawing the data for this run
    local myseed=c(seed)
    qui replace myseed="`myseed'" in `i'
    // simulate the data and fit the model
    qui replace x=rnormal(0,1)
    qui replace mu=1+(2*x)
    qui replace y=rnormal(mu,1)
    qui regress y x
    // store the estimated slope
    matrix A=e(b)
    local beta=A[1,1]
    qui replace slopes=`beta' in `i'
}

// pull out the stored state for run 2, reset the RNG to it, and recreate that dataset
local revisit=myseed[2]
set seed `revisit'
qui replace x=rnormal(0,1)
qui replace mu=1+(2*x)
qui replace y=rnormal(mu,1)
regress y x
dis "`revisit'"

In this example, we run a linear regression 10 times and save the slope parameters and the seeds. We can then pick out simulation number 2 and recreate it without retracing any other steps. Nice! Here it is in R (but R RNGs are much more complex and I have to say that the help documentation is far from helpful):

iter <- 10
size <- 100
slopes <- rep(NA, iter)
seeds <- vector("list", iter)                      # one stored RNG state per iteration
if (!exists(".Random.seed")) invisible(runif(1))   # make sure the RNG state exists
for (i in 1:iter) {
  seeds[[i]] <- .Random.seed                       # save the full RNG state for this run
  x <- rnorm(size, 0, 1)
  y <- rnorm(size, 1 + (2 * x), 1)
  slopes[i] <- lm(y ~ x)$coefficients[2]
}
# restore the state saved at the start of run 2 and recreate that dataset
.Random.seed <- seeds[[2]]
x <- rnorm(size, 0, 1)
y <- rnorm(size, 1 + (2 * x), 1)
lm(y ~ x)

* – I’m glossing over the philosophical meaning of ‘random’ here


11 Comments

  1. What about simply using `set.seed` in R? From the help page: “‘.Random.seed’ […] should not be altered by the user.” and later “‘set.seed’ is the recommended way to specify seeds.” Your example would become:
    iter <- 10
    size <- 100
    slopes <- rep(NA, iter)
    for (i in 1:iter) {
      set.seed(i)
      x <- rnorm(size, 0, 1)
      y <- rnorm(size, 1 + (2 * x), 1)
      slopes[i] <- lm(y ~ x)$coefficients[2]
    }
    slopes
    set.seed(2)
    x <- rnorm(size, 0, 1)
    y <- rnorm(size, 1 + (2 * x), 1)
    lm(y ~ x)
    Hope this helps (and that Markdown formatting gets parsed…)
    Mathieu.

  2. Rather than using .Random.seed and copying to/from that directly, it’s easier to use the set.seed() function in R. And if you want to change the RNG the RNGkind() function can be used. We tried to provide a few accessible worked examples (from an econometric perspective) in the following paper if you’re interested.
    Christian Kleiber, Achim Zeileis (2013). Reproducible Econometric Simulations. Journal of Econometric Methods, 2(1), 89-99.
    http://dx.doi.org/10.1515/jem-2012-0004
    A preprint version is available at
    http://eeecon.uibk.ac.at/~zeileis/papers/Kleiber+Zeileis-2013.pdf
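    A minimal sketch of that suggestion (the generator names are the ones documented in ?RNGkind, and 2014 is just an arbitrary seed):
    ## fix the generators explicitly, then set the seed
    RNGkind(kind = "Mersenne-Twister", normal.kind = "Inversion")
    set.seed(2014)
    draws1 <- rnorm(5)
    ## re-seeding with the same value replays exactly the same draws
    set.seed(2014)
    draws2 <- rnorm(5)
    identical(draws1, draws2)  # TRUE
    ## set.seed() can also pin down the generator kind in the same call
    set.seed(2014, kind = "Mersenne-Twister", normal.kind = "Inversion")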

  3. Set.seed fans, please note I am avoiding re-running all the simulations. Can you provide code that shows how one simulation out of a large iter, say 10^7, can be reproduced using set.seed without running all the preceding simulations?

  4. Robert, either I do not understand your question, or this is exactly what my use of `set.seed` does. In the example above, it allows you to re-run directly the second iteration (`set.seed(2)` in this case, but pick the one you like) without going through the whole set of iterations (10 in this case).
    Mathieu.

  5. If I understand you, Robert, it isn’t that you want to set the seed before you do all of the calculations, it’s that you want to be able to repeat an arbitrary element within the results. If that’s the case, might I suggest the following implementation:
    iter <- 10
    size <- 100
    slopes <- rep(NA, iter)
    ## predefine all the seeds we need
    seeds <- sample(.Machine$integer.max, size = iter, replace = TRUE)
    ## do the calculation
    slopes <- sapply(seeds, function(sd) {
      set.seed(sd)
      x <- rnorm(size, mean = 0, sd = 1)
      y <- rnorm(size, mean = 1 + (2 * x), sd = 1)
      lm(y ~ x)$coefficients[2]
    })
    ## hurry, save the data!
    # save(seeds, slopes, file = "seeds_and_slopes.Rda")
    ## reproduce one of them
    set.seed(seeds[2])
    x <- rnorm(size, mean = 0, sd = 1)
    y <- rnorm(size, mean = 1 + (2 * x), sd = 1)
    lm(y ~ x)$coefficients[2]
    The point of the help page and the previous comments is that you most likely do not want to be messing with `.Random.seed` at all, even if you're just over-writing it with a previous value of itself. First, I tend to believe docs when they say you shouldn't be mucking with the under-workings like that. Second, your implementation scales poorly:
    ## just to demonstrate size, not usability
    object.size(replicate(iter, .Random.seed))
    ## 25240 bytes
    object.size(sample(.Machine$integer.max, size = iter))
    ## 88 bytes
    When you get into the 1e7 range of iter, it's going to get unnecessarily big (~23GB). When using `set.seed()`, the object size grows to ~38MB at 1e7. (Object size is not directly multiplicative. My approach yields 400,040 bytes at 1e5 ints, yours yields 250,400,200 bytes.)

      1. Interesting. Thanks r2evans. I think life is a lot simpler with Stata, which does not (yet) have the Mersenne Twister, so the state of the RNG is far smaller. I am also anxious about writing rather than saving the state (which is not the same thing as the seed), in case something unpredictable from _inside_ the iteration changes it before you get to the crucial bit of the action. That is not the case in our examples here, but it could happen.
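         One way to illustrate that worry in R (a rough sketch; the sample() call is a hypothetical stand-in for anything inside the iteration that consumes random numbers before the draws that matter):
         iter <- 10
         size <- 100
         slopes <- rep(NA, iter)
         states <- vector("list", iter)
         if (!exists(".Random.seed")) invisible(runif(1))  # make sure the RNG state exists
         for (i in 1:iter) {
           junk <- sample(1:5, 1)        # unpredictable extra use of the RNG
           states[[i]] <- .Random.seed   # capture the state right before the crucial draws
           x <- rnorm(size, 0, 1)
           y <- rnorm(size, 1 + (2 * x), 1)
           slopes[i] <- lm(y ~ x)$coefficients[2]
         }
         ## restoring the stored state reproduces run 2 exactly, extra draws and all
         .Random.seed <- states[[2]]
         x <- rnorm(size, 0, 1)
         y <- rnorm(size, 1 + (2 * x), 1)
         lm(y ~ x)$coefficients[2]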

  6. (1) I’m not a pro on random number generation, but the articles I’ve found on the Mersenne Twister characterize it as a better, faster, and more reliable PRNG. It is the default generator for SPSS and SAS, both providing an alternative (that happens to be deprecated). R uses it by default as well, but offers six others, including one that does not pass all tests of the Diehard battery (a test of the quality of a PRNG). Without testing, I am only assuming that STATA’s lack of Mersenne-Twister is not necessarily a good thing, even though it might be providing you a simpler solution.
    (2) With regards to your fear of something unpredictable happening inside the iteration changing your control over the seeds, both of our implementations effect the same level of control. That is, if mine is prone to uncontrolled randomness, so is yours, as they attempt to do the same thing with the same control or lack thereof. I don’t think there’s a risk here, mind you.
    (3) I’m going to alter my initial assertion. The help docs state “It can be saved and restored, but should not be altered by the user.” So, the way you are doing it is just fine, in that you merely store and restore the entire contents as needed, however (in)efficient it might be. Your method does that. The big caveat to doing it this way is that the generator in use must not change. The docs do say that any use of `set.seed` should also set the ‘kind’ and ‘normal.kind’ to ensure true reproducibility; this applies to your use as well.
    (4) HOWEVER, one risk of your implementation is that if the generator being used (default or otherwise) ever changes, your code will become dangerous. The content of `.Random.seed` changes based on the generator in use, as evidenced here:
    RNGkind(kind = "Mersenne-Twister") # the default
    str(.Random.seed)
    ## int [1:626] 403 624 -1039110868 -1445025731 -32189990 1462640019 1157703096 1988987993 -1412241338 905285135 …
    RNGkind(kind = "Wichmann-Hill")
    str(.Random.seed)
    ## int [1:4] 400 5866 29998 3637
    RNGkind(kind = "Mersenne-Twister") # restoring to the default
    Consider the possibility that you execute your code without checking to ensure the RNG ‘kind’ is identical. When you simply “restore” the seeds, you will be restoring an incompatible structure into the generator; at best, the extra (or potentially missing) information will be truncated or assumed away. The only risk with using `set.seed(2)` with a different generator is different results. The only risk with using `set.seed(2, kind = "Mersenne-Twister", normal.kind = "Inversion")` is future unavailability of either the kind or the normal.kind, in which case the call to `set.seed` should issue a Warning or a full Stop.
    Either way, it seems to me that the recommendation of using `set.seed` is partially to prevent you from restoring incompatible structures, and partially to ensure that setting the seed affects all the generators it can. Sounds close to a Best Practice to me.
    BTW: I am really enjoying this discussion, as it is forcing me to research and understand more of the behavior of stochastic elements in my systems.

    1. These are important points – thanks for sharing. I too am learning a lot about RNGs! One excellent addition to the code is the checking of the RNG type. It would be useful to have a function to map the state into a seed (and is this not a one-many mapping?). Regarding Stata (never ‘STATA’ please!), the Mersenne Twister is coming, and that will complicate some of the seed setting but give much longer periods, of course.
      One last point: if the rngkind changes, and I set .Random.seed, things might stop – which would be better than continuing and returning wrong results, as is the case if I set.seed.
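      A rough sketch of that kind of check (just one way to do it; saved_kind and saved_state are illustrative names):
      if (!exists(".Random.seed")) invisible(runif(1))  # make sure the state exists
      saved_kind  <- RNGkind()            # e.g. c("Mersenne-Twister", "Inversion")
      saved_state <- .Random.seed
      x1 <- rnorm(5)
      ## ... later, refuse to restore the state under a different generator
      stopifnot(identical(RNGkind(), saved_kind))
      .Random.seed <- saved_state
      x2 <- rnorm(5)
      identical(x1, x2)                   # TRUE
      Alternatively, RNGkind() could be called with the saved values to put the generator back before restoring the state.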

  7. My understanding is that you do not want to keep resetting the random number seed inside a loop as some of the other comments above suggest. When you do that, the sequence of pseudo-random numbers is much less random than it would otherwise be. Set the seed once at the top of your program, then don’t mess with it.

    1. I can’t find many arguments against using set.seed in a loop. There are, in my mind, three categories of arguments possible against doing it:
      * Mathematically: if setting the seed from within a loop had different results than setting the seed outside the loop, then this would be a problem. Fortunately, setting a seed is universally the same regardless of where you set it, in that the subsequent draws are identical.
      * Code Efficiency: setting the seed does have a performance penalty, as does any other code. On naive benchmarking, a single set.seed() is ~100 times slower than a single random draw, but the rest of the calculation is likely far more impactful. I recognize this is a consideration, but not (per se) a prohibitive problem.
      * Stochastic: this is the most successful argument, but it is conditioned on the needs of the simulations. If you are setting the seeds for convenience (e.g., 1, 2, 3, …), then you are not getting a good random sequence. If you are saving randomly-generated seeds, however, this argument weakens a little.
      The third, the stochastic argument, is the only one in my mind that has traction, but it can be mitigated (though perhaps not completely) by using a randomly-generated sequence of seeds. This allows the ability to arbitrarily return to a specific cycle in the loop without spoiling the analysis with a known starting point chosen in advance. My first example provides this insulation from analyst-made bias.
      @Brian, are there other arguments I’m not considering?
      I did find a discussion at Stata.com (http://www.stata.com/manuals13/rsetseed.pdf) against doing it, but their logic was based on the stochastics of it, and not the mathematics or performance of the loop. I still contend that my solution precludes this problem.
      There is another option that hasn’t been discussed: if we know the number of random numbers generated in each cycle of the loop, then we should be able to reproduce the act of drawing all of the random data without the overhead of the calculation. In this case, set the seed once before the loop. If reproduction is required, then run:
      set.seed(same seed)
      needToReproduce <- 7
       .ignore <- rnorm(n = 2 * size * (needToReproduce - 1)) # burn through the draws made by the earlier cycles (2 * size per cycle)
      x <- rnorm(size, mean = 0, sd = 1)
      y <- rnorm(size, mean = 1 + (2 * x), sd = 1)
      lm(y ~ x)$coefficients[2]
      This method works "better" when the time of calculation is much larger than the time of retrieving random numbers, but would probably not work terribly well in Robert's case where he is working on the order of 1e7 iterations.
      @Brian, as far as "much less random" goes, isn't the point of using set.seed() (or over-writing .Random.seed) to "replay" a sequence of pseudo-random numbers? I'm not arguing that a reproduced sequence of random numbers is truly random, quite the opposite. However, by my example, I'm using a random sequence of numbers for preplanned seeds. If 10 of the 1e7 tests produce something significant, I'll run the tests with a simple metric, filter down to the significant ones, then, using the seed, reproduce the initial conditions and perform more robust analysis of the (in this example) regression. My initial sample of random seeds is not hand-picked.
      Robert’s method seems like a more robust method in that it isn’t writing the seed, merely recording the PRNG state each time. If there were a method for taking this state and restoring it verbatim, it would be more arguably robust. This is possible in Stata by saving "c(seed)" in each loop and then using "set seed" with that stored value to reproduce a specific run. R does not (yet) have this functionality, so the workaround is to (1) over-write .Random.seed despite the developers’ recommendations, or (2) accept that set.seed(xyz) actually uses xyz to consistently generate a seed simultaneously for all of the PRNGs. (BTW: I read somewhere that some PRNGs in R save their state outside of .Random.seed; not a factor if you always use Mersenne-Twister, I think.)
      Tangent: I wonder how many journal articles that set the seed for reproducibility had hand-chosen a seed that provided supporting results vice using a randomly-generated seed. I use Rmarkdown files for many of my reports, so I think I’ll start including this in the setup:
       ```{r seedSetup}
       (seed <- sample(.Machine$integer.max, size = 1))
       set.seed(seed)
       ```
      In this way, I can argue that I did not hand-pick the seed, but it is still displayed in the output for later reproduction. (This is similar to how I recommended doing this in my first example on the 13th.)
