Monday, January 4, 2010

Example 7.20: Simulate categorical data

Both SAS and R provide built-in means of simulating categorical data (see section 1.10.4). It is also straightforward to write code that does this directly. In this entry, we show how to do it once, drawing 10,000 observations from four categories with probabilities .1, .2, .3, and .4. In a future entry, we'll demonstrate writing a SAS Macro (section A.8.1) and a function in R (section B.5.2) to do it repeatedly.

SAS


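data test;
  p1 = .1; p2 = .2; p3 = .3;  /* category 4 gets the remaining .4 */
  do i = 1 to 10000;
    x = uniform(0);  /* uniform(0,1) draw; a seed of 0 uses the system clock */
    /* each true comparison adds 1, so mycat counts the cut points passed */
    mycat = (x ge 0) + (x gt p1) + (x gt p1 + p2)
          + (x gt p1 + p2 + p3);
    output;
  end;
run;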


Here the parenthetical logical tests in the mycat = line resolve to 1 if the test is true and 0 otherwise, as discussed in section 1.4.9.
The (x ge 0) makes the categories range from 1 to 4, rather than 0 to 3.
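
To see the arithmetic concretely, here is a minimal sketch that applies the same sum of logical tests to four hand-picked values. It is written in R rather than SAS so it can be run interactively; the test values and the names thresholds and mycat_demo are illustrative assumptions, not part of the original example.

```r
# Cumulative cut points p1, p1 + p2, p1 + p2 + p3 from the SAS code above
p <- c(.1, .2, .3)
thresholds <- cumsum(p)

# Four hand-picked values, one falling in each category (illustrative only)
x <- c(0.05, 0.25, 0.50, 0.90)

# Each logical comparison is coerced to 0 or 1 when summed
mycat_demo <- (x >= 0) + (x > thresholds[1]) + (x > thresholds[2]) +
  (x > thresholds[3])
print(mycat_demo)
```

Each value passes one more cut point than the one before it, so the four values land in categories 1, 2, 3, and 4 respectively.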

The results can be assessed using proc freq:


proc freq data=test; tables mycat; run;

                                   Cumulative   Cumulative
mycat    Frequency    Percent       Frequency      Percent
----------------------------------------------------------
    1          947       9.47             947         9.47
    2         2061      20.61            3008        30.08
    3         3039      30.39            6047        60.47
    4         3953      39.53           10000       100.00




R

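In contrast, the R syntax to get the results is rather dense. The loop starts at i = 0 so that the first pass compares x with sum(p[0:0]), which is 0; this plays the role of the (x ge 0) test in the SAS code. Each later pass adds 1 for every x at or above the next cumulative probability.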


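p <- c(.1, .2, .3)       # probabilities of categories 1-3; category 4 gets .4
x <- runif(10000)        # uniform(0,1) draws
mycat <- numeric(10000)  # running count of cut points reached
for (i in 0:length(p)) {
  # p[0:i] is empty when i = 0, so the first cut point is 0
  mycat <- mycat + (x >= sum(p[0:i]))
}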


We can display the results using the summary() function.


summary(factor(mycat))
   1    2    3    4
 990 2047 2978 3985
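
The loop can also be avoided. As a sketch under the same setup, findInterval() counts how many cumulative cut points each draw has reached, so adding 1 should reproduce the loop's categories. The seed and the names cuts, mycat_loop, and mycat_fi are assumptions added for illustration.

```r
set.seed(720)            # arbitrary seed, only to make the sketch reproducible
p <- c(.1, .2, .3)
x <- runif(10000)

# Loop version from the entry, kept for comparison
mycat_loop <- numeric(length(x))
for (i in 0:length(p)) {
  mycat_loop <- mycat_loop + (x >= sum(p[0:i]))
}

# Loop-free version: number of cut points at or below x, plus 1
cuts <- cumsum(p)
mycat_fi <- findInterval(x, cuts) + 1

# Both approaches should assign every draw to the same category
print(all(mycat_fi == mycat_loop))

# Observed proportions next to the target probabilities
print(rbind(observed = prop.table(table(mycat_fi)),
            target   = c(p, 1 - sum(p))))
```

With 10,000 draws, the observed proportions should sit close to the targets, much as in the summary() output above.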

11 comments:

Douglas Rivers said...

Or, you could just use

mycat <- cut(runif(10000), c(0, 0.1, 0.3, 0.6, 1), labels=FALSE)

January 4, 2010 at 10:53 PM
Ken Kleinman said...

Thanks, Douglas! Much better.

It looks like if I omit the labels=FALSE, the factor labels are very useful, too.

> mycat <- cut(runif(10000), c(0, 0.1, 0.3, 0.6, 1))

> summary(mycat)
  (0,0.1] (0.1,0.3] (0.3,0.6]   (0.6,1]
      987      1993      3047      3973

January 5, 2010 at 8:43 AM
Unknown said...

sample() may be a better function to simulate categorical data:

> mycat <- sample(1:4, 10000, replace=TRUE, prob=c(.1,.2,.3,.4))
> table(mycat)

1 2 3 4
1012 2074 2924 3990

January 8, 2010 at 4:36 AM
Anonymous said...

Hello,

How could I simulate data from a multinomial logit model depending on a metric variable?

April 19, 2011 at 10:21 AM
Ken Kleinman said...

I'm not sure what you're asking. You can simulate data from a multinomial logistic model using a process similar to what we show for logistic regression in this entry: http://sas-and-r.blogspot.com/2009/06/example-72-simulate-data-from-logistic.html. What do you mean by a "metric" variable, though?

April 19, 2011 at 12:36 PM
burakaydin said...

Hello,
Can I simulate variables with a known Pearson covariance matrix?
I need to simulate categorical, continuous, and binary variables based on the Pearson covariance matrix. Thanks.

March 30, 2012 at 5:48 PM
Ken Kleinman said...

In example 6.3 in our book, we show correlated binary variables, based on Lipsitz et al, Stats in Med 1990, 9:1517-1525. You'll find many cites if you search with "simulate correlated" as your base.

March 30, 2012 at 8:24 PM
burakaydin said...

Thanks for the response.
There is an R package called "bindata". It performs almost perfectly at creating correlated binary variables with known marginal probabilities and correlations.
What I need is the simulation of correlated continuous and categorical variables using a single multivariate distribution.

March 30, 2012 at 9:27 PM
Ken Kleinman said...

Good to know about that one, thanks. I don't know of a technique to do what you need, offhand. A brief search turned up this thread: http://stats.stackexchange.com/questions/22856/how-to-generate-correlated-test-data-that-has-bernoulli-categorical-and-contin where copulas are suggested. And also this paper: http://www.springerlink.com/content/011x633m554u843g/. Let me know what you end up doing.

March 30, 2012 at 10:20 PM
Nick Horton said...

There is a literature that might be relevant. A starting point might be Cox, D. R. and Wermuth, N. (1992). Response models for mixed binary and quantitative variables. Biometrika, 79, 441-461. They propose a flexible multivariate distribution which might be useful.

March 31, 2012 at 9:31 AM
Anonymous said...

What is the variance of the error term when a multinomial logit is simulated in this way?

May 13, 2013 at 11:04 AM
