Solving for a seed value in R

Question 1

I'm trying to reproduce a gbm model which was estimated without a set.seed value. To do so I need to determine what seed was used, which I can figure out based on one of the summary metrics from the estimated model (as shown below).

require(MatchIt)
require(gbm)
data("lalonde")
i <- 1
tmp <- list(rel.inf = 0)  # placeholder so the first loop test has a value to compare
while(!(tmp$rel.inf[1] == 82.3429390)){
  gps <- gbm(treat ~ age + educ + nodegree + re74 + re75,
             distribution = "bernoulli",
             data = lalonde, n.trees = 100,
             interaction.depth = 4,
             train.fraction = 0.8, shrinkage = 0.0005,
             set.seed(i))
  tmp <- summary(gps, plotit = FALSE)
  cat(i, "\n")
  i <- i + 1
}

I think it would be very helpful both for this specific use case and for general future reference to know of any more efficient way of carrying this out. A multicore solution might be a good way to go; I'm researching that myself now. Or perhaps there's a way to improve it by using apply?

Question 2

You are forgetting about floating point and rounding errors. Your exit condition should be that you are within some small tolerance of the target. Assuming you read "82.3429390" in a publication, the true value was likely between 82.34293895 and 82.34293905.

Question 3

@flodel I would think so as well, except that I can replicate every number in the article to the 8th (max) decimal place except for the results of this one model.

Question 4

This might not be useful, but why are you using set.seed(i) as an argument to gbm? set.seed returns NULL, so you are essentially passing NULL as an unnamed argument to gbm and potentially messing it up. Should you instead be running set.seed(i) as its own statement, before calling gbm?
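A minimal base-R sketch of the point about the return value (the variable names are illustrative only): set.seed() is called for its side effect on the generator, and the value it hands back is NULL.

```r
# set.seed() is called for its side effect; its (invisible) return value is NULL.
ret <- set.seed(1)
returns_null <- is.null(ret)

# Setting the seed as its own statement right before the random draw
# makes that draw repeatable.
set.seed(1)
first <- runif(3)
set.seed(1)
second <- runif(3)
repeatable <- identical(first, second)

print(returns_null)
print(repeatable)
```

Whether an unnamed set.seed(i) argument is evaluated before gbm draws its random numbers depends on how gbm handles that argument, which this sketch does not show.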

Question 5

Here is a small reproducible example of what I was trying to explain about rounding errors: set.seed(123); x = runif(1); print(x) gives me [1] 0.2875775. But if I now run set.seed(123); runif(1) == 0.2875775, it returns FALSE. What I am saying is that your condition for staying in the loop should be while(abs(tmp$rel.inf[1] - 82.3429390) > eps) for some small eps, probably 5e-8.
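The same comparison written out as a short script, as a sketch under the assumption that the target was read off printed output with seven decimal places (the names printed, eps, exact_match, and tolerant_match are illustrative):

```r
set.seed(123)
x <- runif(1)

printed <- 0.2875775   # the value as it appears in printed output
eps <- 5e-8            # half a unit in the last printed decimal place

# Exact equality fails because x carries more digits than were printed.
exact_match <- (x == printed)

# Comparing within a tolerance accepts any value that rounds to the printed one.
tolerant_match <- abs(x - printed) < eps

print(exact_match)
print(tolerant_match)
```

The choice of eps follows from the number of printed decimals: a value reported to seven places can differ from the stored double by up to 5e-8.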

Question 6

@flodel set.seed() is a valid argument to gbm. I got it from the manual page and when I run it I get the same result every time (and it's in the expected range) but when I don't it varies slightly every time. I tried setting the seed at the top of my script but that didn't work for this.

Question 7

It seems that you are looping through seeds to find the one that causes a randomized procedure's output to match the output from a previous run.

If you had set the random seed immediately before running the randomized procedure and have simply forgotten the seed you used, then this in theory would work; all you need to do is loop through the billion or so possible input seeds until one matches. There's no real way to speed up the process (beyond parallelizing, which would be easy because the problem is embarrassingly parallel). apply is just a wrapper on a loop, so that would not speed up the process.
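A rough sketch of what parallelizing the search could look like with the base parallel package. The helper names stat_for_seed and find_seeds, the tolerance, and the worker count are assumptions for illustration, and the cheap stand-in statistic takes the place of the actual model fit so that the example runs quickly.

```r
library(parallel)

# Hypothetical stand-in for the expensive step: returns one summary statistic
# for a given seed. For the gbm case, the body would set the seed, fit the
# model, and return the statistic being matched.
stat_for_seed <- function(seed) {
  set.seed(seed)
  mean(rnorm(10))
}

# Evaluate every candidate seed on a small cluster of worker processes and
# keep the seeds whose statistic lies within tol of the target.
find_seeds <- function(seeds, target, tol = 5e-8, n_workers = 2) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl))
  stats <- parSapply(cl, seeds, stat_for_seed)
  seeds[abs(stats - target) < tol]
}

# Pretend the statistic came from an unknown seed, then search for it.
target <- stat_for_seed(1234)
hits <- find_seeds(1:2000, target)
print(hits)
```

For a very large range of seeds, it would probably make sense to call find_seeds on one batch at a time (for example, a million seeds per call) so that progress can be saved and the search resumed later.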

Unfortunately, more likely than not you did not set the random seed immediately before running the code. Therefore you would really need to test all the internal states of the pseudorandom number generator (PRNG) that you used to find the one that matches the results. Unfortunately there are intractably many internal states; for instance, the most popular implementation of the Mersenne Twister, which you are likely using, has a period of 2^19937 - 1, meaning it has at least that many possible internal states. Clearly it's impractical to test this many states, so it's probably hopeless to try to match an exact PRNG state if you hadn't set the seed immediately prior to running your randomized procedure.
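To make the gap between seeds and states concrete, the following base-R sketch inspects the generator state that set.seed() initializes. The variable names are illustrative; the generator is set explicitly to the Mersenne Twister, which is R's default.

```r
RNGkind("Mersenne-Twister")   # R's default generator, made explicit here
set.seed(1)

# The full generator state is stored in .Random.seed in the global environment.
state <- get(".Random.seed", envir = globalenv())
state_length <- length(state)   # 626 integers for the Mersenne Twister

# Largest integer that can be passed to set.seed() directly.
max_seed <- .Machine$integer.max

# Restoring a saved state reproduces the following draws exactly,
# which is what a search over states (rather than seeds) would have to hit.
a <- runif(2)
assign(".Random.seed", state, envir = globalenv())
b <- runif(2)
restored <- identical(a, b)

print(state_length)
print(max_seed)
print(restored)
```

A single integer seed can only select a tiny subset of the states that a vector of this length can represent, which is why looping over seeds works only if the original run called set.seed() immediately beforehand.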

Question 8

I didn't run the original model. I'm trying to replicate the results of a journal article. Could you please show the code for parallelization? I'd like to at least try testing 1 through 1 billion, just so that I can show I tried if nothing else. I've heard lots of stories of researchers having figured out other researchers' seed values (I don't know how, maybe they picked memorable values like the year or a birthday). I've managed to test the first 2 million potential seed values since I wrote the question.

josliber · Answer 1 · 2016-06-30 18:50:10Z
