Simulating data generating process correctly

Question 1

My aim is to simulate the following model by means of a Monte Carlo simulation. I wonder if my R code is correct for generating the data.

Could somebody check?

The model:

$$Y = \sum_{j=1}^{100} (1+(-1)^{j}A_j X_j + B_j \sin(6X_j)) \sum_{j=1}^{50} (1+X_j/50) + \epsilon$$

where

\$A_1, \dots, A_{100}\$ are i.i.d. \$∼ \text{Unif}([0.6,1])\$
\$B_1, \dots, B_{100}\$ are i.i.d. \$∼ \text{Unif}([0.8,1.2])\$ and independent of \$A_j\$
\$X \sim \text{Unif}([0,1])\$ where all components are i.i.d. \$∼ \text{Unif}([0, 1])\$
\$\epsilon \sim N(0,2)\$ and \$X_j\$ represents the \$j\$th column of the design matrix

You can find the model here, p. 14

This is my code attempt

n_sim <- 10
n_sample <- 200
n_reg <- 100
sd_eps <- sqrt(2)
X <- replicate(n_reg, runif(n_sample, 0,1))
A <- replicate(n_reg, runif(1, 0.6,1))
B <- replicate(n_reg, runif(1, 0.8,1.2))
f_1 <- vector(mode = 'integer', length = n_sample)
f_2 <- vector(mode = 'integer', length = n_sample)
for (d in seq(100)){
 part1 <- 1 + (-1)^d*A[d]*X[,d]+B[d]*sin(6*X[,d])
 f_1 <- f_1 + part1
}
for (d in seq(50)){
 part2 <- 1 + X[,d]/50
 f_2 <- f_2 + part2
}
# True DGP Train ----
f_true <- f_1*f_2
y <- replicate(n_sim, f_true) + replicate(n_sim, rnorm(n_sample, 0,sd_eps))

Question 2

I find this model difficult to interpret. It seems ambiguous w.r.t. how many Y values are generated. Another question that came to mind w.r.t. interpreting the model is that the X ~ Unif([0,1]) has an exponent of 100 in the PDF version of the model: X ~ Unif([0,1]^100). I'm not familiar with that notation.

Question 3

The [0,1]^100 notation in X ~ Unif([0, 1]^100) is just shorthand for a setwise product. You'll probably have seen R^3 as shorthand for the set of 3-dimensional real numbers. It means that X is a 100-dimensional vector where each component is uniformly distributed on the set [0, 1]

Question 4

The first thing that jumps out from the definition is that, if you have X, A, B and epsilon, you can compute y deterministically. This means you can readily test your implementation. You should always strive to find ways to define pure functions in your R code, and try to use vectorisation instead of for loops.

Based on your existing code, I'll assume for a given model that X is a matrix (n_sample, 100), A and B are vectors of length 100 and epsilon is a vector of length n_sample.

Based on your implementation, the function would look something like

compute_y <- function(X, A, B, epsilon) {
 n_sample <- nrow(X)
 # note that your f_[1|2] stored `double`s not `integers`
 f_1 <- numeric(n_sample)
 f_2 <- numeric(n_sample)
 for (d in seq(100)){
 part1 <- 1 + (-1)^d*A[d]*X[,d] + B[d]*sin(6*X[,d])
 f_1 <- f_1 + part1
 }
 for (d in seq(50)){
 part2 <- 1 + X[,d]/50
 f_2 <- f_2 + part2
 }
 f_1 * f_2 + epsilon
}

But that's a bit scruffy.

The easiest bit to clean up is the bit that defines f_2:

f_2 <- numeric(n_sample)
for (d in seq(50)) {
 part2 <- 1 + X[,d]/50
 f_2 <- f_2 + part2
}

Here you're only using the first 50 columns of X. You could rewrite it as:

f_2 <- numeric(n_sample)
W <- 1 + X[, 1:50]/50
for (d in seq(50)) {
 f_2 <- f_2 + W[,d]
}

But in the latter, you're summing along the rows of W. So you could ditch the for loop altogether:

W <- 1 + X[, 1:50] / 50
f_2 <- rowSums(W)

This gives us:

compute_y <- function(X, A, B, epsilon) {
 n_sample <- nrow(X)
 f_1 <- numeric(n_sample)
 for (d in seq(100)){
 part1 <- 1 + (-1)^d*A[d]*X[,d] + B[d]*sin(6*X[,d])
 f_1 <- f_1 + part1
 }
 f_2 <- rowSums(1 + X[, 1:50] / 50)
 f_1 * f_2 + epsilon
}

There is a way to replace the for-loop that computes f_1.

First note you're adding 1 to f_1 one hundred times, so you might as well start with f_1 storing the value 100

f_1 <- rep(100, n_sample)
for (d in seq(100)){
 part1 <- (-1)^d*A[d]*X[,d] + B[d]*sin(6*X[,d])
 f_1 <- f_1 + part1
}

For speed, I'll just show you how to do it:

tX <- t(X)
a <- colSums(c(-1, 1) * A * tX)
b <- colSums(B * sin(6 * tX))
f_1 <- 100 + a + b

That code would be a bit faster, but I don't think it looks as clean as your definition of f_1.

If you want you can move the code that defines X, A, B, and epsilon into a model-definition function.

Russ Hyde Russ Hyde 3002 silver badges8 bronze badges · Answer 1 · 2019-01-08 19:10:10Z

The first thing that jumps out from the definition is that, if you have X, A, B and epsilon, you can compute y deterministically. This means you can readily test your implementation. You should always strive to find ways to define pure functions in your R code, and try to use vectorisation instead of for loops.

Based on your existing code, I'll assume for a given model that X is a matrix (n_sample, 100), A and B are vectors of length 100 and epsilon is a vector of length n_sample.

Based on your implementation, the function would look something like

compute_y <- function(X, A, B, epsilon) {
 n_sample <- nrow(X)
 # note that your f_[1|2] stored `double`s not `integers`
 f_1 <- numeric(n_sample)
 f_2 <- numeric(n_sample)
 for (d in seq(100)){
 part1 <- 1 + (-1)^d*A[d]*X[,d] + B[d]*sin(6*X[,d])
 f_1 <- f_1 + part1
 }
 for (d in seq(50)){
 part2 <- 1 + X[,d]/50
 f_2 <- f_2 + part2
 }
 f_1 * f_2 + epsilon
}

But that's a bit scruffy.

The easiest bit to clean up is the bit that defines f_2:

f_2 <- numeric(n_sample)
for (d in seq(50)) {
 part2 <- 1 + X[,d]/50
 f_2 <- f_2 + part2
}

Here you're only using the first 50 columns of X. You could rewrite it as:

f_2 <- numeric(n_sample)
W <- 1 + X[, 1:50]/50
for (d in seq(50)) {
 f_2 <- f_2 + W[,d]
}

But in the latter, you're summing along the rows of W. So you could ditch the for loop altogether:

W <- 1 + X[, 1:50] / 50
f_2 <- rowSums(W)

This gives us:

compute_y <- function(X, A, B, epsilon) {
 n_sample <- nrow(X)
 f_1 <- numeric(n_sample)
 for (d in seq(100)){
 part1 <- 1 + (-1)^d*A[d]*X[,d] + B[d]*sin(6*X[,d])
 f_1 <- f_1 + part1
 }
 f_2 <- rowSums(1 + X[, 1:50] / 50)
 f_1 * f_2 + epsilon
}

There is a way to replace the for-loop that computes f_1.

First note you're adding 1 to f_1 one hundred times, so you might as well start with f_1 storing the value 100

f_1 <- rep(100, n_sample)
for (d in seq(100)){
 part1 <- (-1)^d*A[d]*X[,d] + B[d]*sin(6*X[,d])
 f_1 <- f_1 + part1
}

For speed, I'll just show you how to do it:

tX <- t(X)
a <- colSums(c(-1, 1) * A * tX)
b <- colSums(B * sin(6 * tX))
f_1 <- 100 + a + b

That code would be a bit faster, but I don't think it looks as clean as your definition of f_1.

If you want you can move the code that defines X, A, B, and epsilon into a model-definition function.

Stack Exchange Network

Simulating data generating process correctly

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Simulating data generating process correctly

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions