Showing posts with label R function. Show all posts
Showing posts with label R function. Show all posts

Wednesday, January 22, 2014

Example 2014.2: Block randomization

This week I had to block-randomize some units. This is ordinarily the sort of thing I would do in SAS, just because it would be faster for me. But I had already started work on the project R, using knitr/LaTeX to make a PDF, so it made sense to continue the work in R.

R
As is my standard practice now in both languages, I set thing up to make it easy to create a function later. I do this by creating variables with the ingredients to begin with, then call them as variables, rather than as values, in my code. In the example, I assume 40 assignments are required, with a block size of 6.
I generate the blocks themselves with the rep() function, calculating the number of blocks needed to ensure at least N items will be generated. Then I make a data frame with the block numbers and a random variate, as well as the original order of the envelopes. The only possibly confusing part of the sequence is the use of the order() function. What it returns is a vector of integer values with the row numbers of the original data set sorted by the listed variables. So the expression a1[order(a1$block,a1$rand),] translates to "from the a1 data frame, give me the rows ordered by sorting the rand variable within the block variable, and all columns." I assign the arms in a systematic way to the randomly ordered units, then resort them back into their original order.
seed=42
blocksize = 6
N = 40
set.seed(seed)
block = rep(1:ceiling(N/blocksize), each = blocksize)
a1 = data.frame(block, rand=runif(length(block)), envelope= 1: length(block))
a2 = a1[order(a1$block,a1$rand),]
a2$arm = rep(c("Arm 1", "Arm 2"),times = length(block)/2)
assign = a2[order(a2$envelope),]
> head(assign,12)
 block rand envelope arm
1 1 0.76450776 1 Arm 1
2 1 0.62361346 2 Arm 2
3 1 0.14844661 3 Arm 2
4 1 0.08026447 4 Arm 1
5 1 0.46406955 5 Arm 1
6 1 0.77936816 6 Arm 2
7 2 0.73352796 7 Arm 2
8 2 0.81723044 8 Arm 1
9 2 0.17016248 9 Arm 2
10 2 0.94472033 10 Arm 2
11 2 0.29362384 11 Arm 1
12 2 0.14907205 12 Arm 1
It's trivial to convert this to a function-- all I have to do is omit the lines where I assign values to the seed, sample size, and block size, and make the same names into parameters of the function.
blockrand = function(seed,blocksize,N){
 set.seed(seed)
 block = rep(1:ceiling(N/blocksize), each = blocksize)
 a1 = data.frame(block, rand=runif(length(block)), envelope= 1: length(block))
 a2 = a1[order(a1$block,a1$rand),]
 a2$arm = rep(c("Arm 1", "Arm 2"),times = length(block)/2)
 assign = a2[order(a2$envelope),]
 return(assign)
}


SAS
This job is also pretty simple in SAS. I use the do loop, twice, to produce the blocks and items (or units) within block, sssign the arm systematically, and generate the random variate which will provide the sort order within block. Then sort on the random order within block, and use the "Obs" (observation number) that's printed with the data as the envelope number.
%let N = 40;
%let blocksize = 6;
%let seed = 42;
data blocks;
call streaminit(&seed);
do block = 1 to ceil(&N/&blocksize);
 do item = 1 to &blocksize;
 if item le &blocksize/2 then arm="Arm 1";
 else arm="Arm 2";
 rand = rand('UNIFORM');
 output;
 end;
end;
run;
proc sort data = blocks; by block rand; run;
proc print data = blocks (obs = 12) obs="Envelope"; run;
Envelope block item arm rand
 1 1 3 Arm 1 0.13661
 2 1 1 Arm 1 0.51339
 3 1 5 Arm 2 0.72828
 4 1 2 Arm 1 0.74696
 5 1 4 Arm 2 0.75284
 6 1 6 Arm 2 0.90095
 7 2 2 Arm 1 0.04539
 8 2 6 Arm 2 0.15949
 9 2 4 Arm 2 0.21871
 10 2 1 Arm 1 0.66036
 11 2 5 Arm 2 0.85673
 12 2 3 Arm 1 0.98189
It's also fairly trivial to make this into a macro in SAS.
%macro blockrand(N, blocksize, seed); 
data blocks;
call streaminit(&seed);
do block = 1 to ceil(&N/&blocksize);
 do item = 1 to &blocksize;
 if item le &blocksize/2 then arm="Arm 1";
 else arm="Arm 2";
 rand = rand('UNIFORM');
 output;
 end;
end;
run;
proc sort data = blocks; by block rand; run;
%mend blockrand;


An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Wednesday, January 20, 2010

Example 7.23: the Monty Hall problem

The Monty Hall problem illustrates a simple setting where intuition often leads to a solution different from formal reasoning. The situation is based on the game show Let's Make a Deal. First, Monty puts a prize behind one of three doors. Then the player chooses a door. Next, (without moving the prize) Monty opens an unselected door, revealing that the prize is not behind it. The player may then switch to the other non-selected door. Should the player switch?

Many people see that there are now two doors to choose between, and feel that since Monty can always open a non-prize door, there's still equal probability for each door. If that were the case, the player might as well keep the original door.

This approach is so attractive that when Marilyn Vos Savant asserted that the player should switch (in her Parade magazine column), there were reportedly 10,000 letters asserting she was wrong. There was also an article published in The American Statistician. As reported on the Wikipedia page, out of 228 subjects in one study, only 13% chose to switch (Granberg and Brown, Personality and Social Psychology Bulletin 21(7): 711-729). In her book, The Power of Logical Thinking, vos Savant quotes cognitive psychologist Massimo Piattelli-Palmarini as saying "... no other statistical puzzle comes so close to fooling all the people all the time" and "that even Nobel physicists systematically give the wrong answer, and that they insist on it, and they are ready to berate in print those who propose the right answer."

One correct intuitive route is to observe that Monty's door is fixed. The probability that the player has the right door is 1/3 before Monty opens the non-prize door, and remains 1/3 after that door is open. This means that the probability the prize is behind one of the other doors is 2/3, both before and after Monty opens the non-prize door. After Monty opens the non-prize door, the player gets a 2/3 chance of winning by switching to the remaining door. If the player wants to win, they should switch doors.

The excellent Wikipedia entry referenced above provides additional intuitive tools, as well as variants and history.

One way to prove to yourself that switch is best is through simulation. In fact, even deciding how to code the problem may be enough to convince yourself to switch.

SAS
In SAS, we assign the prize to a door, then make an initial guess. If the guess was right, Monty can open either door. We'll switch to the other door. Rather than have Monty choose a door, we'll choose one, under the assumption that Monty opened the other one. We do this with a do until loop (section 1.11.1). If our initial guess was wrong, Monty will open the only remaining non-prize door, and when we switch we'll be choosing the prize door.



data mh;
do i = 1 to 100000;
prize = rand("TABLE",.333,.333);
* Monty puts the prize behind a random door;
initial_guess = rand("TABLE",.333,.333);
* We make a random initial guess;
* if the initial guess is right, Monte can
open either of the others;
* which means that player can switch to either
of the other doors;
if initial_guess eq prize then do;
new_guess = initial_guess;
* choose a door until it's different from
the initial guess-- that's
the door we switch to;
do until (new_guess ne initial_guess);
new_guess = rand("TABLE",.333,.333);
end;
end;
* If the initial guess was wrong, Monte can rule
out only one of the other doors;
* which means that we must switch to the right door;
if initial_guess ne prize then new_guess = prize;
output;
end;
run;


To see what happened, we'll summarize the resulting data set. If our initial guess was right, we'll win by keeping it. If switching leads to a win, the new guess is where the prize is. We create new variables to indicate winning using a logical test (as in section 1.4.9).


data mh2;
set mh;
win_by_keep = (initial_guess eq prize);
win_by_switch = (new_guess eq prize);
run;

proc means data = mh2 mean;
var win_by_keep win_by_switch;
run;



The MEANS Procedure

Variable Mean
-----------------------------
win_by_keep 0.3332700
win_by_switch 0.6667300
-----------------------------


Note that these two values sum to exactly one. If we chose the prize initially, we always win by keeping it, and if we did not choose the prize initially, we always win by switching. In any event, the simulation supports the conclusion that we should switch.



R

In R, we write two functions. In one, Monty opens a door, choosing at random among the non-chosen doors if the initial choice was correct, or choosing the one non-selected non-prize door if the initial choice was wrong. The other function returns the door chosen by swapping. We use the sample() function (section 1.5.2) to randomly pick one value. We then use these functions on each trial with the apply() statement (section B.5.3).


numsim = 100000
doors = 1:3
opendoor = function(x) {
if (x[1]==x[2])
return(sample(doors[-c(x[1])], 1))
else return(doors[-c(x[1],x[2])])
}
swapdoor = function(x) { return(doors[-c(x[1], x[2])]) }
winner = sample(doors, numsim, replace=TRUE)
choice = sample(doors, numsim, replace=TRUE)
open = apply(cbind(winner, choice), 1, opendoor)
newchoice = apply(cbind(open, choice), 1, swapdoor)


To display the results, we use the cat() statement to show how text and variables can be integrated.


> cat("without switching, won ",
+ round(sum(winner==choice)/numsim*100,1),"
+ percent of the time.\n", sep="")
without switching, won 33.2 percent of the time.
> cat("with switching, won ",
+ round(sum(winner==newchoice)/numsim*100,1),"
+ percent of the time.\n", sep="")
with switching, won 66.8 percent of the time.

Thursday, September 17, 2009

Example 7.12: Calculate and plot a running average

The Law of Large Numbers concerns the stability of the mean, as sample sizes increase. This is an important topic in mathematical statistics. The convergence (or lack thereof, for certain distributions) can easily be visualized in SAS and R (see also Horton, Qian and Brown, 2004).

Assume that X1, X2, ..., Xn are independent and identically distributed realizations from some distribution with center mu. We define x-bar(k) as the average of the first k observations. Recall that because the expectation of a Cauchy random variable is undefined (Romano and Siegel, 1986) the sample average does not converge to the population parameter.

SAS
In SAS, we begin by writing a macro (section A.8) to generate the random variables and calculate the running average.

The macro runs a data step to generate the data and calculate the average after each value is generated using the sum function (section 1.8.1). The macro accepts the name of a data set to hold the variates and their running average, the number of random variates to generate, and the argument(s) to the rand function (section 1.10) so that variates from many distributions can be generated. For the purposes of making a dramatic example, we set a fixed random number seed (section 1.10.9).


%macro runave(outdata=, n=, distparms=) ;
data &outdata;
call streaminit(1984);
do i = 1 to &n;
x = rand &distparms;
runtot = sum(x, runtot);
avex = runtot/i;
output;
end;
run;
%mend runave;


In this example, we compare the Cauchy distribution to a t distribution with 4 degrees of freedom (for which the average does converge to the mean) for 1000 observations. To do this, we call the macro twice.


%runave(outdata=cauchy, n=1000, distparms= ("CAUCHY"));
%runave(outdata=t4, n=1000, distparms= ('T',4));


To plot the two running averages on a single figure, we combine the two data sets, using the in= data set option to identify which data set each observation came from. Then we plot the two curves using the a*b=c syntax for proc gplot (section 5.2.2). To connect the dots and choose line styles and colors, we use the symbol statement (sections 5.3.9-11), and a title statement (section 5.2.9) to add a title. SAS generates a legend by default with the a*b=c syntax.


data c_t4;
set cauchy t4 (in=t4);
if t4 then dist="t with 4 df";
else dist="Cauchy";
run;

symbol1 i=j c=black l=1 w=2;
symbol2 i=j c=black l=2 w=2;

title"Running average of two distributions";
proc gplot data=c_t4;
plot avex * i = dist / vref=0;
run;
quit;




R
Within R, we define a function (section B.5.2) to calculate the running average for a given vector, again allowing for variates from many distributions to be generated.


runave <- function(n, gendist, ...) {
x <- gendist(n, ...)
avex <- numeric(n)
for (k in 1:n) {
avex[k] <- mean(x[1:k])
}
return(data.frame(x, avex))
}



Next, we generate the data, using our new function. To make sure we have a nice example, we first set a fixed seed (section 1.10.9).


vals <- 1000
set.seed(1984)
cauchy <- runave(vals, rcauchy)
t4 <- runave(vals, rt, 4)



Finally, we're ready to plot. We begin by making an empty plot with the correct x and y limits, using the type="n" specification (section 5.1.1). We then plot the running average using the lines function (section 5.2.1) and varying the line style (section 5.3.9) and thickness (section 5.3.11) with the lty and lwd specification. Finally we add a title (section 5.2.9) and a legend (section 5.2.14).


plot(c(cauchy$avex, t4$avex), xlim=c(1, vals), type="n")
lines(1:vals, cauchy$avex, lty=1, lwd=2)
lines(1:vals, t4$avex, lty=2, lwd=2)
abline(0, 0)
title("Running average of two distributions")
legend(vals*.6, -1, legend=c("Cauchy", "t with 4 df"),
lwd=2, lty=c(1,2))


Wednesday, July 15, 2009

Example 7.5: Replicating a prettier jittered scatterplot

The scatterplot in section 7.4 is a plot we could use repeatedly. We demonstrate how to create a macro (SAS, section A.8) and a function (R, section B.5) to do it more easily.

SAS

%macro logiplot(x=x, y=y, data=, jitterwidth=.05, smooth=50);
data lp1;
set &data;
if &y eq 0 or &y eq 1;
jitter = uniform(0) * &jitterwidth;
if &y eq 1 then yplot = &y - jitter;
else if &y eq 0 then yplot = &y + jitter;
run;

axis1 minor=none label = ("&x");
axis2 minor=none label = (angle = 270 rotate = 90 "&y");
symbol1 i=sm&smooth.s v=none c=blue;
symbol2 i=none v=dot h=.2 c=blue;
proc gplot data=lp1;
plot (&y yplot) * &x / overlay haxis=axis1 vaxis=axis2;
run;
quit;


R

logiplot <- function(x, y, jitamount=.01, smoothspan=2/3,
xlabel="X label", ylabel="Y label") {
jittery <- jitter(y, amount=jitamount/2)
correction <- ifelse(y==0, jitamount/2, -jitamount/2)
jittery <- jittery + correction
plot(x, y, type="n", xlab=xlabel, ylab=ylabel)
points(x, jittery, pch=20, col="blue")
lines(lowess(x, y, f=smoothspan), col="blue")
}


We’ll load the example data set from the book via the web (section 1.1.6), then make a plot of the real data.

SAS

filename myurl
url 'http://www.math.smith.edu/sasr/datasets/help.csv' lrecl=704;

proc import
datafile=myurl
out=ds dbms=dlm;
delimiter=',';
getnames=yes;
run;

%logiplot(x = mcs, y = homeless, data=ds, smooth=40);


R

ds <- read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
logiplot(ds$mcs, ds$homeless, jitamount=.05, smoothspan=.3,
xlabel="mcs", ylabel="homeless")


The resulting plots are quite similar, but still differ with respect to the smoother and the exact jitter applied to each point.



(Click for a bigger picture.)
Subscribe to: Comments (Atom)

AltStyle によって変換されたページ (->オリジナル) /