SAS and R: matrices

Catalogs of posts

Showing posts with label matrices. Show all posts

Monday, May 17, 2010

Example 7.37: calculation of Hotelling's T^2

Hotelling's T^2 is a multivariate statistic used to compare two groups, where multiple outcomes are observed for each subject.

Here we demonstrate how to calculate Hotelling's T^2 using R and SAS, and test the code using a simulation study then apply it to data from the HELP study.

R

We utilize an approach suggested by Peter Mandeville in a posting to the R-help mailing list.


hotelling = function(y1, y2) {
 # check the appropriate dimension
 k = ncol(y1)
 k2 = ncol(y2)
 if (k2!=k) 
 stop("input matrices must have the same number of columns: y1 has ",
 k, " while y2 has ", k2)
 
 # calculate sample size and observed means
 n1 = nrow(y1)
 n2 = nrow(y2) 
 ybar1= apply(y1, 2, mean); ybar2= apply(y2, 2, mean)
 diffbar = ybar1-ybar2
 
 # calculate the variance of the difference in means
 v = ((n1-1)*var(y1)+ (n2-1)*var(y2)) /(n1+n2-2)
 
 # calculate the test statistic and associated quantities
 t2 = n1*n2*diffbar%*%solve(v)%*%diffbar/(n1+n2)
 f = (n1+n2-k-1)*t2/((n1+n2-2)*k)
 pvalue = 1-pf(f, k, n1+n2-k-1)
 
 # return the list of results
 return(list(pvalue=pvalue, f=f, t2=t2, diff=diffbar))
}

We begin by ensuring that the two input matrices have the same number of columns, then calculate quantities needed for the calculation. The function returns a list with a number of named objects.

To test the function, we ran a simulation study to ensure that under the null hypothesis the function rejected approximately 5% of the time.


# do a simple simulation study to ensure Type I error rate is preserved
library(MASS)
numoutcomes = 5
mu1 = rep(0, numoutcomes); mu2 = rep(0, numoutcomes)
rho = .4
# compound symmetry covariance matrix
Sigma = rho*matrix(rep(1, numoutcomes^2), numoutcomes) + 
 diag(rep(1-rho, numoutcomes)) 
numsim = 10000
reject = numeric(numsim) 
for (i in 1:numsim) {
 grp1 = mvrnorm(100, mu1, Sigma)
 grp2 = mvrnorm(150, mu2, Sigma)
 reject[i] = hotelling(grp1, grp2)$pvalue<0.05
}

This yields appropriate Type I error rate (just under 5%).


> table(reject)
reject
 0 1 
9537 463

We can also undertake a comparison of non-homeless and homeless subjects in the HELP dataset, comparing individual observations on the PCS (physical component score) and MCS (mental component scores) of the SF-36 and the CESD measure of depressive symptoms.


ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
smallds = with(ds, cbind(pcs, mcs, cesd))
grp1 = subset(smallds, ds$homeless==0)
grp2 = subset(smallds, ds$homeless==1)
res = hotelling(grp1, grp2)

This saves the results in the object res, which we can display.


> names(res)
[1] "pvalue" "f" "t2" "diff" 
> res
$pvalue
 [,1]
[1,] 0.1082217

$f
 [,1]
[1,] 2.035024

$t2
 [,1]
[1,] 6.132267

$diff
 pcs mcs cesd 
 2.064049 1.755975 -2.183760

While there are differences in the mean PCS, MCS, and CESD scores between
the homeless and non-homeless subjects, we do not have sufficient evidence to reject the null hypothesis that they have equivalent means back in the populations (p=0.11).

SAS

In SAS, we can easily make the two-group multivariate comparison using the MANOVA statement in proc glm. This hides the actual value of the statistic, but generates the desired p-value. As usual, we use the ODS system to reduce SAS output.


ods select multstat;
proc glm data="c:\book\help";
 class homeless;
 model pcs mcs cesd = homeless;
 manova h=homeless;
run;
quit;

SAS prints four valid tests, which result in equal values in this case.


 MANOVA Test Criteria and Exact F Statistics for
 the Hypothesis of No Overall HOMELESS Effect
 H = Type III SSCP Matrix for HOMELESS
 E = Error SSCP Matrix

 S=1 M=0.5 N=223.5

 Statistic Value F Value Num DF Den DF Pr> F

 Wilks' Lambda 0.98658536 2.04 3 449 0.1082
 Pillai's Trace 0.01341464 2.04 3 449 0.1082
 Hotelling-Lawley Trace 0.01359704 2.04 3 449 0.1082
 Roy's Greatest Root 0.01359704 2.04 3 449 0.1082

Here we also demonstrate using SAS proc iml (section 1.9) to calculate the statistic and perform the test from scratch. This roughly parallels the R development above, though we omit the test for equal number of variables.


*get the help data;
data help;
set "c:\book\help";
run;

proc iml;
* build a function;
start hotelling;
 n1 = nrow(h1);
 n0 = nrow(h0);
 onesum = j(n1,1,1); *make an n1*1 matrix of 1s;
 zerosum = j(n0,1,1); 
 meanh1 = h1`*onesum/n1;
 meanh0 = h0`*zerosum/n0;
 i1 = i(n1); * make identity matrix;
 i0 = i(n0); 
 vc1 = h1`*(i1-onesum*onesum`/n1)*h1/(n1-1.0);
 * make variance-covariance matrix;
 vc0 = h0`*(i0-zerosum*zerosum`/n0)*h0/(n0-1.0);
 pooledvc = ((n1-1.0)*vc1+(n0-1.0)*vc0)/(n1+n0-2.0);
 meandiff = meanh1 - meanh0;
 t2 = meandiff`* inv(pooledvc) *meandiff * (n1 * n0)/(n1 + n0);
 nvars = ncol(h1);
 f=((n1+n0-nvars-1)/(nvars*(n1+n0-2)))*t2;
 df2=n1+n0-nvars-1;
 p=1-probf(f,nvars,df2);
 fcrit = finv(.95, nvars, df2);
 print meandiff pooledvc t2 f nvars df2 p fcrit;
finish;

*use the function with the HELP data;
use help;
 read all var {cesd pcs mcs} where (homeless = 1) into h1;
 read all var {cesd pcs mcs} where (homeless = 0) into h0;
 run hotelling;
quit;

Here's the output:


 meandiff pooledvc t2 f nvars

 2.1837595 155.76862 -38.46634 -108.8555 6.1322669 2.0350243 3
 -2.064049 -38.46634 115.50213 14.423897
 -1.755975 -108.8555 14.423897 164.44444


 df2 p fcrit

 449 0.1082217 2.6247689

Posted by Nick Horton

Email This BlogThis! Share to X Share to Facebook Share to Pinterest

Saturday, July 25, 2009

Example 7.7: Tabulate binomial probabilities

[フレーム]

Suppose we wanted to assess the probability P(X=x) for a binomial random variate with n = 10 and with p = .81, .84, ..., .99. This could be helpful, for example, in various game settings.

In SAS, we find the probability that X=x using differences in the CDF calculated via the cdf function (1.10.1). We loop through the various binomial probabilities and values of x using the do ... end structure (section 1.11.1). Finally, we store the probabilities in implicitly named variables via an array statement (section 1.11.5).

SAS


data table (drop=j);
array probs [11] prob0 prob1 - prob10;
do p = .81 to .99 by .03;
 do j = 0 to 10;
 if j eq 1 then probs[j+1] = cdf("BINOMIAL", 0, p, 10);
 else probs[j+1] = cdf("BINOMIAL", j, p, 10) 
 - cdf("BINOMIAL", j-1, p, 10);
 end;
 output;
 end;
run;

The results are printed with two decimal places using the format statement (section 1.2.4). The options statement (section A.4) uses the ls option to narrow the output width to be compatible with Blogger.


options ls=64;
proc print data=table noobs;
 var p prob0 prob1 - prob10;
 format _numeric_ 3.2; 
run;

And the results are:


 p
 p p p p p p p p p p r
 r r r r r r r r r r o
 o o o o o o o o o o b
 b b b b b b b b b b 1
 p 0 1 2 3 4 5 6 7 8 9 0
.81 .00 .00 .00 .00 .00 .02 .08 .19 .30 .29 .12
.84 .00 .00 .00 .00 .00 .01 .05 .15 .29 .33 .17
.87 .00 .00 .00 .00 .00 .00 .03 .10 .25 .37 .25
.90 .00 .00 .00 .00 .00 .00 .01 .06 .19 .39 .35
.93 .00 .00 .00 .00 .00 .00 .00 .02 .12 .36 .48
.96 .00 .00 .00 .00 .00 .00 .00 .01 .05 .28 .66
.99 .00 .00 .00 .00 .00 .00 .00 .00 .00 .09 .90

R
In R we start by making a vector of the binomial probabilities, using the : operator (section 1.11.3) to generate a sequence of integers. After creating a matrix (section B.4.4) to hold the table results, we loop (section 1.11.1) through the binomial probabilities, calling the dbinom() function (section 1.1) to find the probability that X=x. This calculation is nested within the round() function (section 1.2.5) to reduce the digits displayed. Finally, we glue the vector of binomial probabilities to the results.


p <- .78 + (3 * 1:7)/100
allprobs <- matrix(nrow=length(p), ncol=11)
for (i in 1:length(p)) {
 allprobs[i,] <- round(dbinom(0:10, 10, p[i]),2)
}
table <- cbind(p, allprobs)
table

With the result:


 p 
[1,] 0.81 0 0 0 0 0 0.02 0.08 0.19 0.30 0.29 0.12
[2,] 0.84 0 0 0 0 0 0.01 0.05 0.15 0.29 0.33 0.17
[3,] 0.87 0 0 0 0 0 0.00 0.03 0.10 0.25 0.37 0.25
[4,] 0.90 0 0 0 0 0 0.00 0.01 0.06 0.19 0.39 0.35
[5,] 0.93 0 0 0 0 0 0.00 0.00 0.02 0.12 0.36 0.48
[6,] 0.96 0 0 0 0 0 0.00 0.00 0.01 0.05 0.28 0.66
[7,] 0.99 0 0 0 0 0 0.00 0.00 0.00 0.00 0.09 0.90

As with the example of jittered scatterplots for dichotomous y by continuous x, (Example 7.3, Example 7.4, and Example 7.5) we will revisit this example for improvement in later entries.

Posted by Ken Kleinman

Email This BlogThis! Share to X Share to Facebook Share to Pinterest

Labels: data step, formatted output, looping, matrices, probability distributions, sequences 2 comments

Subscribe to SAS and R!

RSS: Or: Get SAS and R by Email

Search the SAS and R Blog

The book (second edition, 2014)

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

Buy a book

SAS and R: Data Management, Statistical Analysis, and Graphics, Second Edition

Using SAS for Data Management, Statistical Analysis, and Graphics

Using R for Data Management, Statistical Analysis, and Graphics

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

Sidebar list of all entries.

SAS and R

Catalogs of posts

Monday, May 17, 2010

Example 7.37: calculation of Hotelling's T^2

Saturday, July 25, 2009

Example 7.7: Tabulate binomial probabilities

About SAS and R

Topics discussed