Showing posts with label looping. Show all posts
Showing posts with label looping. Show all posts

Monday, April 30, 2012

Example 9.29: the perils of for loops

A recent exchange on the R-sig-teaching list featured a discussion of how best to teach new students R. The initial post included an exercise to write a function, that given a n, will draw n rows of a triangle made up of "*", noting that for a beginner, this may require two for loops. For example, in pseudo-code:

for i = 1 to n
for j = 1 to i
print "*"

Unfortunately, as several folks (including Richard M. Heiberger and R. Michael Weylandt) noted, for loops in general are not the best way to take full advantage of R. In this entry, we review two solutions they proposed which fit within the R philosophy.

Richard's solution uses the outer() function to generate a 5x5 matrix of logical values indicating whether the column number is bigger than the row number. Next the ifelse() function is used to replace TRUE with *.

> ifelse(outer(1:5, 1:5, `>=`), "*", " ")
[,1] [,2] [,3] [,4] [,5]
[1,] "*" " " " " " " " "
[2,] "*" "*" " " " " " "
[3,] "*" "*" "*" " " " "
[4,] "*" "*" "*" "*" " "
[5,] "*" "*" "*" "*" "*"

Michael's solution uses the lapply() function to call a function repeatedly for different values of n. This returns a list rather than a matrix, but accomplishes the same task.

> lapply(1:5, function(x) cat(rep("*", x), "\n"))
*
* *
* * *
* * * *
* * * * *

While this exercise is of little practical value, it does illustrate some important points, and provides a far more efficient as well as elegant way of accomplishing the tasks. For those interested in more, another resource is the R Inferno project of Patric Burns.

SAS
We demonstrate a SAS data step solution mainly to call out some useful features and cautions. In all likelihood a proc iml matrix-based solution would be more elegant;

data test;
array star [5] $ star1 - star5;
do i = 1 to 5;
star[i] = "*";
output;
end;
run;

proc print noobs; var star1 - star5; run;

star1 star2 star3 star4 star5

*
* *
* * *
* * * *
* * * * *

In particular, note the $ in the array statement, which allows the variables to contain characters; by default variables created by an array statement are numeric. In addition, note the reference to a sequentially suffixed list of variables using the single hyphen shortcut; this would help in generalizing to n rows. Finally, note that we were able to avoid a second do loop (SAS' primary iterative looping syntax) mainly by luck-- the most recently generated value of a variable is saved by default. This can cause trouble, in general, but here it keeps all the previous "*"s when moving on to the next row.




An unrelated note about aggregatorsWe love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers and PROC-X with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Wednesday, January 11, 2012

Example 9.19: Demonstrating the central limit theorem


A colleague recently asked "why should the average get closer to the mean when we increase the sample size?" We should interpret this question as asking why the standard error of the mean gets smaller as n increases. The central limit theorem shows that (under certain conditions, of course) the standard error must do this, and that the mean approaches a normal distribution. But the question was why does it? This seems so natural that it may have gone unquestioned in the past.

The best simple rationale may be that there are more ways to get middle values than extreme values--for example, the mean of a die roll (uniform discrete distribution on 1, 2, ..., 6) is 3.5. With one die, you're equally likely to get an "average" of 3 or of 1. But with two dice there are five ways to get an average of 3, and only one way to get an average of 1. You're 5 times more likely to get the value that's closer to the mean than the one that's further away.

Here's a simple graphic to show that the standard error decreases with increasing n.


SAS
We begin by simulating some data-- normal, here, but of course that doesn't matter (assuming that the standard deviation exists for whatever distribution we pick and the sample size is appropriately large). Rather than simulate separate samples with n = 1 ... k, it's easier to add a random variate to a series and keep a running tally of the mean, which is easy with a little algebra. This approach also allows tracking the progress of the mean of each series, which could also be useful.


%let nsamp = 100;
data normal;
do sample = 1 to &nsamp;
meanx = 0;
do obs = 1 to &nsamp;
x = normal(0);
meanx = ((meanx * (obs -1)) + x)/obs;
output;
end;
end;
run;

We can now plot the means vs. the number of observations.

symbol1 i = none v = dot h = .2;
proc gplot data = normal;
plot meanx * obs;
run;
quit;

symbol1 i=join v=none r=&nsamp;
proc gplot data=normal;
plot meanx * obs = sample / nolegend;
run; quit;

The graphic resulting from the first proc gplot is shown above, and demonstrates both how quickly the variability of the estimate of the mean decreases when n is small, and how little it changes when n is larger. A plot showing the means for each sequence converging can be generated with the second block of code. Note the use of the global macro variable nsamp assigned using the %let statement (section A.8.2).

R
We'll also generate sequences of variates in R. We'll do this by putting the random variates in a matrix, and treating each row as a sequence. We'll use the apply() function (sections 1.10.6 and B.5.3) to treat each row of the matrix separately.

numsim = 100
matx = matrix(rnorm(numsim^2), nrow=numsim)

runavg = function(x) { cumsum(x)/(1:length(x)) }
ramatx = t(apply(matx, 1, runavg))

The simple function runavg() calculates the running average of a vector and returns the a vector of equal length. By using it as the function in apply() we can get the running average of each row. The result must be transposed (with the t() function, section 1.9.2) to keep the sequences in rows. To plot the values, we'll use the type="n" option to plot(), specifying the first column of the running total as the y variable. While it's possible that the running average will surpass the average when n=1, we ignore that case in this simple demonstration.

plot(x=1:numsim, y = ramatx[,1], type="n",
xlab="number of observations", ylab="running mean")
rapoints = function(x) points(x~seq(1:length(x)), pch=20, cex=0.2)
apply(ramatx,1,rapoints)

plot(x=1:numsim, y = ramatx[,1], type="n",
xlab="number of observations", ylab="running mean")
ralines = function(x) lines(x~seq(1:length(x)))
apply(ramatx, 1, ralines)

Here we define another simple function to plot the values in a vector against the place number, then again use the apply() function to plot each row as a vector. The first set of code generates a plot resembling the SAS graphic presented above. The second set of code will connect the values in each sequence, with results shown below.

Wednesday, January 20, 2010

Example 7.23: the Monty Hall problem

The Monty Hall problem illustrates a simple setting where intuition often leads to a solution different from formal reasoning. The situation is based on the game show Let's Make a Deal. First, Monty puts a prize behind one of three doors. Then the player chooses a door. Next, (without moving the prize) Monty opens an unselected door, revealing that the prize is not behind it. The player may then switch to the other non-selected door. Should the player switch?

Many people see that there are now two doors to choose between, and feel that since Monty can always open a non-prize door, there's still equal probability for each door. If that were the case, the player might as well keep the original door.

This approach is so attractive that when Marilyn Vos Savant asserted that the player should switch (in her Parade magazine column), there were reportedly 10,000 letters asserting she was wrong. There was also an article published in The American Statistician. As reported on the Wikipedia page, out of 228 subjects in one study, only 13% chose to switch (Granberg and Brown, Personality and Social Psychology Bulletin 21(7): 711-729). In her book, The Power of Logical Thinking, vos Savant quotes cognitive psychologist Massimo Piattelli-Palmarini as saying "... no other statistical puzzle comes so close to fooling all the people all the time" and "that even Nobel physicists systematically give the wrong answer, and that they insist on it, and they are ready to berate in print those who propose the right answer."

One correct intuitive route is to observe that Monty's door is fixed. The probability that the player has the right door is 1/3 before Monty opens the non-prize door, and remains 1/3 after that door is open. This means that the probability the prize is behind one of the other doors is 2/3, both before and after Monty opens the non-prize door. After Monty opens the non-prize door, the player gets a 2/3 chance of winning by switching to the remaining door. If the player wants to win, they should switch doors.

The excellent Wikipedia entry referenced above provides additional intuitive tools, as well as variants and history.

One way to prove to yourself that switch is best is through simulation. In fact, even deciding how to code the problem may be enough to convince yourself to switch.

SAS
In SAS, we assign the prize to a door, then make an initial guess. If the guess was right, Monty can open either door. We'll switch to the other door. Rather than have Monty choose a door, we'll choose one, under the assumption that Monty opened the other one. We do this with a do until loop (section 1.11.1). If our initial guess was wrong, Monty will open the only remaining non-prize door, and when we switch we'll be choosing the prize door.



data mh;
do i = 1 to 100000;
prize = rand("TABLE",.333,.333);
* Monty puts the prize behind a random door;
initial_guess = rand("TABLE",.333,.333);
* We make a random initial guess;
* if the initial guess is right, Monte can
open either of the others;
* which means that player can switch to either
of the other doors;
if initial_guess eq prize then do;
new_guess = initial_guess;
* choose a door until it's different from
the initial guess-- that's
the door we switch to;
do until (new_guess ne initial_guess);
new_guess = rand("TABLE",.333,.333);
end;
end;
* If the initial guess was wrong, Monte can rule
out only one of the other doors;
* which means that we must switch to the right door;
if initial_guess ne prize then new_guess = prize;
output;
end;
run;


To see what happened, we'll summarize the resulting data set. If our initial guess was right, we'll win by keeping it. If switching leads to a win, the new guess is where the prize is. We create new variables to indicate winning using a logical test (as in section 1.4.9).


data mh2;
set mh;
win_by_keep = (initial_guess eq prize);
win_by_switch = (new_guess eq prize);
run;

proc means data = mh2 mean;
var win_by_keep win_by_switch;
run;



The MEANS Procedure

Variable Mean
-----------------------------
win_by_keep 0.3332700
win_by_switch 0.6667300
-----------------------------


Note that these two values sum to exactly one. If we chose the prize initially, we always win by keeping it, and if we did not choose the prize initially, we always win by switching. In any event, the simulation supports the conclusion that we should switch.



R

In R, we write two functions. In one, Monty opens a door, choosing at random among the non-chosen doors if the initial choice was correct, or choosing the one non-selected non-prize door if the initial choice was wrong. The other function returns the door chosen by swapping. We use the sample() function (section 1.5.2) to randomly pick one value. We then use these functions on each trial with the apply() statement (section B.5.3).


numsim = 100000
doors = 1:3
opendoor = function(x) {
if (x[1]==x[2])
return(sample(doors[-c(x[1])], 1))
else return(doors[-c(x[1],x[2])])
}
swapdoor = function(x) { return(doors[-c(x[1], x[2])]) }
winner = sample(doors, numsim, replace=TRUE)
choice = sample(doors, numsim, replace=TRUE)
open = apply(cbind(winner, choice), 1, opendoor)
newchoice = apply(cbind(open, choice), 1, swapdoor)


To display the results, we use the cat() statement to show how text and variables can be integrated.


> cat("without switching, won ",
+ round(sum(winner==choice)/numsim*100,1),"
+ percent of the time.\n", sep="")
without switching, won 33.2 percent of the time.
> cat("with switching, won ",
+ round(sum(winner==newchoice)/numsim*100,1),"
+ percent of the time.\n", sep="")
with switching, won 66.8 percent of the time.

Saturday, July 25, 2009

Example 7.7: Tabulate binomial probabilities

Suppose we wanted to assess the probability P(X=x) for a binomial random variate with n = 10 and with p = .81, .84, ..., .99. This could be helpful, for example, in various game settings.

In SAS, we find the probability that X=x using differences in the CDF calculated via the cdf function (1.10.1). We loop through the various binomial probabilities and values of x using the do ... end structure (section 1.11.1). Finally, we store the probabilities in implicitly named variables via an array statement (section 1.11.5).

SAS

data table (drop=j);
array probs [11] prob0 prob1 - prob10;
do p = .81 to .99 by .03;
do j = 0 to 10;
if j eq 1 then probs[j+1] = cdf("BINOMIAL", 0, p, 10);
else probs[j+1] = cdf("BINOMIAL", j, p, 10)
- cdf("BINOMIAL", j-1, p, 10);
end;
output;
end;
run;


The results are printed with two decimal places using the format statement (section 1.2.4). The options statement (section A.4) uses the ls option to narrow the output width to be compatible with Blogger.


options ls=64;
proc print data=table noobs;
var p prob0 prob1 - prob10;
format _numeric_ 3.2;
run;


And the results are:

p
p p p p p p p p p p r
r r r r r r r r r r o
o o o o o o o o o o b
b b b b b b b b b b 1
p 0 1 2 3 4 5 6 7 8 9 0
.81 .00 .00 .00 .00 .00 .02 .08 .19 .30 .29 .12
.84 .00 .00 .00 .00 .00 .01 .05 .15 .29 .33 .17
.87 .00 .00 .00 .00 .00 .00 .03 .10 .25 .37 .25
.90 .00 .00 .00 .00 .00 .00 .01 .06 .19 .39 .35
.93 .00 .00 .00 .00 .00 .00 .00 .02 .12 .36 .48
.96 .00 .00 .00 .00 .00 .00 .00 .01 .05 .28 .66
.99 .00 .00 .00 .00 .00 .00 .00 .00 .00 .09 .90


R
In R we start by making a vector of the binomial probabilities, using the : operator (section 1.11.3) to generate a sequence of integers. After creating a matrix (section B.4.4) to hold the table results, we loop (section 1.11.1) through the binomial probabilities, calling the dbinom() function (section 1.1) to find the probability that X=x. This calculation is nested within the round() function (section 1.2.5) to reduce the digits displayed. Finally, we glue the vector of binomial probabilities to the results.


p <- .78 + (3 * 1:7)/100
allprobs <- matrix(nrow=length(p), ncol=11)
for (i in 1:length(p)) {
allprobs[i,] <- round(dbinom(0:10, 10, p[i]),2)
}
table <- cbind(p, allprobs)
table


With the result:

p
[1,] 0.81 0 0 0 0 0 0.02 0.08 0.19 0.30 0.29 0.12
[2,] 0.84 0 0 0 0 0 0.01 0.05 0.15 0.29 0.33 0.17
[3,] 0.87 0 0 0 0 0 0.00 0.03 0.10 0.25 0.37 0.25
[4,] 0.90 0 0 0 0 0 0.00 0.01 0.06 0.19 0.39 0.35
[5,] 0.93 0 0 0 0 0 0.00 0.00 0.02 0.12 0.36 0.48
[6,] 0.96 0 0 0 0 0 0.00 0.00 0.01 0.05 0.28 0.66
[7,] 0.99 0 0 0 0 0 0.00 0.00 0.00 0.00 0.09 0.90


As with the example of jittered scatterplots for dichotomous y by continuous x, (Example 7.3, Example 7.4, and Example 7.5) we will revisit this example for improvement in later entries.

Friday, June 12, 2009

Example 7.1: Create a Fibonacci sequence

The Fibonacci numbers have many mathematical relationships and have been discovered repeatedly in nature. They are constructed as the sum of the previous two values, initialized with the values 1 and 1.

A pdf of this example is available here.

SAS
In SAS, we use the lag function (section 1.4.17, p. 22) to retrieve the last value.

data fibo;
do i = 1 to 10;
fib = sum(fib, lag(fib));
if i eq 1 then fib = 1;
output;
end;
run;
proc print data=fibo;
run;

This generates the following output:

Obs i fib
1 1 1
2 2 1
3 3 2
4 4 3
5 5 5
6 6 8
7 7 13
8 8 21
9 9 34
10 10 55

R
In R we can loop over an array to perform the same job.

len = 10
fibvals = numeric(len)
fibvals[1] = 1
fibvals[2] = 1
for (i in 3:len) {
fibvals[i] = fibvals[i-1] + fibvals[i-2]
}

This generates the following output:

> fibvals
[1] 1 1 2 3 5 8 13 21 34 55
Subscribe to: Comments (Atom)

AltStyle によって変換されたページ (->オリジナル) /