1
\$\begingroup\$

My aim is to simulate the following model by means of a Monte Carlo simulation. I wonder if my R code is correct for generating the data.

Could somebody check?

The model:

$$Y = \sum_{j=1}^{100} (1+(-1)^{j}A_j X_j + B_j \sin(6X_j)) \sum_{j=1}^{50} (1+X_j/50) + \epsilon$$

where

  • \$A_1, \dots, A_{100}\$ are i.i.d. \$∼ \text{Unif}([0.6,1])\$
  • \$B_1, \dots, B_{100}\$ are i.i.d. \$∼ \text{Unif}([0.8,1.2])\$ and independent of \$A_j\$
  • \$X \sim \text{Unif}([0,1])\$ where all components are i.i.d. \$∼ \text{Unif}([0, 1])\$
  • \$\epsilon \sim N(0,2)\$ and \$X_j\$ represents the \$j\$th column of the design matrix

You can find the model here, p. 14

This is my code attempt

n_sim <- 10
n_sample <- 200
n_reg <- 100
sd_eps <- sqrt(2)
X <- replicate(n_reg, runif(n_sample, 0,1))
A <- replicate(n_reg, runif(1, 0.6,1))
B <- replicate(n_reg, runif(1, 0.8,1.2))
f_1 <- vector(mode = 'integer', length = n_sample)
f_2 <- vector(mode = 'integer', length = n_sample)
for (d in seq(100)){
 part1 <- 1 + (-1)^d*A[d]*X[,d]+B[d]*sin(6*X[,d])
 f_1 <- f_1 + part1
}
for (d in seq(50)){
 part2 <- 1 + X[,d]/50
 f_2 <- f_2 + part2
}
# True DGP Train ----
f_true <- f_1*f_2
y <- replicate(n_sim, f_true) + replicate(n_sim, rnorm(n_sample, 0,sd_eps))
Null
1,4633 gold badges19 silver badges27 bronze badges
asked Dec 27, 2018 at 15:00
\$\endgroup\$
2
  • \$\begingroup\$ I find this model difficult to interpret. It seems ambiguous w.r.t. how many Y values are generated. Another question that came to mind w.r.t. interpreting the model is that the X ~ Unif([0,1]) has an exponent of 100 in the PDF version of the model: X ~ Unif([0,1]^100). I'm not familiar with that notation. \$\endgroup\$ Commented Dec 28, 2018 at 21:40
  • \$\begingroup\$ The [0,1]^100 notation in X ~ Unif([0, 1]^100) is just shorthand for a setwise product. You'll probably have seen R^3 as shorthand for the set of 3-dimensional real numbers. It means that X is a 100-dimensional vector where each component is uniformly distributed on the set [0, 1] \$\endgroup\$ Commented Jan 8, 2019 at 16:26

1 Answer 1

2
\$\begingroup\$

The first thing that jumps out from the definition is that, if you have X, A, B and epsilon, you can compute y deterministically. This means you can readily test your implementation. You should always strive to find ways to define pure functions in your R code, and try to use vectorisation instead of for loops.

Based on your existing code, I'll assume for a given model that X is a matrix (n_sample, 100), A and B are vectors of length 100 and epsilon is a vector of length n_sample.

Based on your implementation, the function would look something like

compute_y <- function(X, A, B, epsilon) {
 n_sample <- nrow(X)
 # note that your f_[1|2] stored `double`s not `integers`
 f_1 <- numeric(n_sample)
 f_2 <- numeric(n_sample)
 for (d in seq(100)){
 part1 <- 1 + (-1)^d*A[d]*X[,d] + B[d]*sin(6*X[,d])
 f_1 <- f_1 + part1
 }
 for (d in seq(50)){
 part2 <- 1 + X[,d]/50
 f_2 <- f_2 + part2
 }
 f_1 * f_2 + epsilon
}

But that's a bit scruffy.

The easiest bit to clean up is the bit that defines f_2:

f_2 <- numeric(n_sample)
for (d in seq(50)) {
 part2 <- 1 + X[,d]/50
 f_2 <- f_2 + part2
}

Here you're only using the first 50 columns of X. You could rewrite it as:

f_2 <- numeric(n_sample)
W <- 1 + X[, 1:50]/50
for (d in seq(50)) {
 f_2 <- f_2 + W[,d]
}

But in the latter, you're summing along the rows of W. So you could ditch the for loop altogether:

W <- 1 + X[, 1:50] / 50
f_2 <- rowSums(W)

This gives us:

compute_y <- function(X, A, B, epsilon) {
 n_sample <- nrow(X)
 f_1 <- numeric(n_sample)
 for (d in seq(100)){
 part1 <- 1 + (-1)^d*A[d]*X[,d] + B[d]*sin(6*X[,d])
 f_1 <- f_1 + part1
 }
 f_2 <- rowSums(1 + X[, 1:50] / 50)
 f_1 * f_2 + epsilon
}

There is a way to replace the for-loop that computes f_1.

First note you're adding 1 to f_1 one hundred times, so you might as well start with f_1 storing the value 100

f_1 <- rep(100, n_sample)
for (d in seq(100)){
 part1 <- (-1)^d*A[d]*X[,d] + B[d]*sin(6*X[,d])
 f_1 <- f_1 + part1
}

For speed, I'll just show you how to do it:

tX <- t(X)
a <- colSums(c(-1, 1) * A * tX)
b <- colSums(B * sin(6 * tX))
f_1 <- 100 + a + b

That code would be a bit faster, but I don't think it looks as clean as your definition of f_1.

If you want you can move the code that defines X, A, B, and epsilon into a model-definition function.

answered Jan 8, 2019 at 19:10
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.