I'm trying to reproduce a gbm
model which was estimated without a set.seed
value. To do so I need to determine what seed was used, which I can figure out based on one of the summary metrics from the estimated model (as shown below).
require(MatchIt)
require(gbm)
data("lalonde")
i <- 1
while(!(tmp$rel.inf[1] == 82.3429390)){
gps <- gbm(treat ~ age + educ + nodegree + re74 + re75,
distribution = "bernoulli",
data = lalonde, n.trees = 100,
interaction.depth = 4,
train.fraction = 0.8, shrinkage=0.0005,
set.seed(i))
tmp <- summary(gps, plotit=F)
cat(i,"\n")
i <- i + 1
}
I think it would be very helpful both for this specific use case and for general future reference to know of any more efficient way of carrying this out. A multicore solution might be a good way to go; I'm researching that myself now. Or perhaps there's a way to improve it by using apply
?
1 Answer 1
It seems that you are looping through seeds to find the one that causes a randomized procedure's output to match the output from a previous run.
If you had set the random seed immediately before running the randomized procedure and have simply forgotten the seed you used, then this in theory would work; all you need to do is loop through the billion or so possible input seeds until one matches. There's no real way to speed up the process (beyond parallelizing, which would be easy because the problem is embarrassingly parallel). apply
is just a wrapper on a loop, so that would not speed up the process.
Unfortunately, more likely than not you did not set the random seed immediately before running the code. Therefore you would really need to test all the internal states of the pseudorandom number generator (PRNG) that you used to find the one that matches the results. Unfortunately there are intractably many internal states; for instance, the most popular implementation of the Mersenne Twister, which you are likely using, has a period of 2^19937 - 1, meaning it has at least that many possible internal states. Clearly it's impractical to test this many states, so it's probably hopeless to try to match an exact PRNG state if you hadn't set the seed immediately prior to running your randomized procedure.
-
\$\begingroup\$ I didn't run the original model. I'm trying to replicate the results of a journal article. Could you please show the code for parallelization? I'd like to at least try testing 1 through 1 billion, just so that I can show I tried if nothing else. I've heard lots of stories of researchers having figured out other researchers' seed values (I don't know how, maybe they picked memorable values like the year or a birthday). I've managed to test the first 2 million potential seed values since I wrote the question. \$\endgroup\$Hack-R– Hack-R2016年06月30日 18:55:21 +00:00Commented Jun 30, 2016 at 18:55
"82.3429390"
in a publication, then it's likely that the true experimental value was between82.34293895
and82.34293905
. \$\endgroup\$set.seed(i)
as an argument togbm
?set.seed
returnsNULL
so you are essentially passingNULL
as an unnamed argument togbm
and potentially messing up with it. Should you instead be runningset.seed(i)
as its own statement, before callinggbm
? \$\endgroup\$set.seed(123); x = runif(1); print(x)
gives me[1] 0.2875775
. But now if I runset.seed(123); runif(1) == 0.2875775
it returnsFALSE
. What I am saying is that your condition for exiting the loop should bewhile(abs(tmp$rel.inf[1] - 82.3429390) > eps)
for some smalleps
, probably 5e-8. \$\endgroup\$set.seed()
is a valid argument togbm
. I got it from the manual page and when I run it I get the same result every time (and it's in the expected range) but when I don't it varies slightly every time. I tried setting the seed at the top of my script but that didn't work for this. \$\endgroup\$