My aim is to simulate the following model by means of a Monte Carlo simulation. I wonder if my R code is correct for generating the data.
Could somebody check?
The model:
$$Y = \sum_{j=1}^{100} (1+(-1)^{j}A_j X_j + B_j \sin(6X_j)) \sum_{j=1}^{50} (1+X_j/50) + \epsilon$$
where
- \$A_1, \dots, A_{100}\$ are i.i.d. \$∼ \text{Unif}([0.6,1])\$
- \$B_1, \dots, B_{100}\$ are i.i.d. \$∼ \text{Unif}([0.8,1.2])\$ and independent of \$A_j\$
- \$X \sim \text{Unif}([0,1])\$ where all components are i.i.d. \$∼ \text{Unif}([0, 1])\$
- \$\epsilon \sim N(0,2)\$ and \$X_j\$ represents the \$j\$th column of the design matrix
You can find the model here, p. 14
This is my code attempt
n_sim <- 10
n_sample <- 200
n_reg <- 100
sd_eps <- sqrt(2)
X <- replicate(n_reg, runif(n_sample, 0,1))
A <- replicate(n_reg, runif(1, 0.6,1))
B <- replicate(n_reg, runif(1, 0.8,1.2))
f_1 <- vector(mode = 'integer', length = n_sample)
f_2 <- vector(mode = 'integer', length = n_sample)
for (d in seq(100)){
part1 <- 1 + (-1)^d*A[d]*X[,d]+B[d]*sin(6*X[,d])
f_1 <- f_1 + part1
}
for (d in seq(50)){
part2 <- 1 + X[,d]/50
f_2 <- f_2 + part2
}
# True DGP Train ----
f_true <- f_1*f_2
y <- replicate(n_sim, f_true) + replicate(n_sim, rnorm(n_sample, 0,sd_eps))
1 Answer 1
The first thing that jumps out from the definition is that, if you have X, A, B and epsilon, you can compute y deterministically. This means you can readily test your implementation. You should always strive to find ways to define pure functions in your R code, and try to use vectorisation instead of for
loops.
Based on your existing code, I'll assume for a given model that X is a matrix (n_sample, 100), A and B are vectors of length 100 and epsilon is a vector of length n_sample.
Based on your implementation, the function would look something like
compute_y <- function(X, A, B, epsilon) {
n_sample <- nrow(X)
# note that your f_[1|2] stored `double`s not `integers`
f_1 <- numeric(n_sample)
f_2 <- numeric(n_sample)
for (d in seq(100)){
part1 <- 1 + (-1)^d*A[d]*X[,d] + B[d]*sin(6*X[,d])
f_1 <- f_1 + part1
}
for (d in seq(50)){
part2 <- 1 + X[,d]/50
f_2 <- f_2 + part2
}
f_1 * f_2 + epsilon
}
But that's a bit scruffy.
The easiest bit to clean up is the bit that defines f_2
:
f_2 <- numeric(n_sample)
for (d in seq(50)) {
part2 <- 1 + X[,d]/50
f_2 <- f_2 + part2
}
Here you're only using the first 50 columns of X
. You could rewrite it as:
f_2 <- numeric(n_sample)
W <- 1 + X[, 1:50]/50
for (d in seq(50)) {
f_2 <- f_2 + W[,d]
}
But in the latter, you're summing along the rows of W
. So you could ditch the for
loop altogether:
W <- 1 + X[, 1:50] / 50
f_2 <- rowSums(W)
This gives us:
compute_y <- function(X, A, B, epsilon) {
n_sample <- nrow(X)
f_1 <- numeric(n_sample)
for (d in seq(100)){
part1 <- 1 + (-1)^d*A[d]*X[,d] + B[d]*sin(6*X[,d])
f_1 <- f_1 + part1
}
f_2 <- rowSums(1 + X[, 1:50] / 50)
f_1 * f_2 + epsilon
}
There is a way to replace the for-loop that computes f_1.
First note you're adding 1 to f_1 one hundred times, so you might as well start with f_1 storing the value 100
f_1 <- rep(100, n_sample)
for (d in seq(100)){
part1 <- (-1)^d*A[d]*X[,d] + B[d]*sin(6*X[,d])
f_1 <- f_1 + part1
}
For speed, I'll just show you how to do it:
tX <- t(X)
a <- colSums(c(-1, 1) * A * tX)
b <- colSums(B * sin(6 * tX))
f_1 <- 100 + a + b
That code would be a bit faster, but I don't think it looks as clean as your definition of f_1.
If you want you can move the code that defines X, A, B, and epsilon into a model-definition function.
X ~ Unif([0,1])
has an exponent of 100 in the PDF version of the model:X ~ Unif([0,1]^100)
. I'm not familiar with that notation. \$\endgroup\$[0,1]^100
notation inX ~ Unif([0, 1]^100)
is just shorthand for a setwise product. You'll probably have seenR^3
as shorthand for the set of 3-dimensional real numbers. It means thatX
is a 100-dimensional vector where each component is uniformly distributed on the set[0, 1]
\$\endgroup\$