SAS and R: correlated data models

Catalogs of posts

Showing posts with label correlated data models. Show all posts

Monday, May 17, 2010

Example 7.37: calculation of Hotelling's T^2

Hotelling's T^2 is a multivariate statistic used to compare two groups, where multiple outcomes are observed for each subject.

Here we demonstrate how to calculate Hotelling's T^2 using R and SAS, and test the code using a simulation study then apply it to data from the HELP study.

R

We utilize an approach suggested by Peter Mandeville in a posting to the R-help mailing list.


hotelling = function(y1, y2) {
 # check the appropriate dimension
 k = ncol(y1)
 k2 = ncol(y2)
 if (k2!=k) 
 stop("input matrices must have the same number of columns: y1 has ",
 k, " while y2 has ", k2)
 
 # calculate sample size and observed means
 n1 = nrow(y1)
 n2 = nrow(y2) 
 ybar1= apply(y1, 2, mean); ybar2= apply(y2, 2, mean)
 diffbar = ybar1-ybar2
 
 # calculate the variance of the difference in means
 v = ((n1-1)*var(y1)+ (n2-1)*var(y2)) /(n1+n2-2)
 
 # calculate the test statistic and associated quantities
 t2 = n1*n2*diffbar%*%solve(v)%*%diffbar/(n1+n2)
 f = (n1+n2-k-1)*t2/((n1+n2-2)*k)
 pvalue = 1-pf(f, k, n1+n2-k-1)
 
 # return the list of results
 return(list(pvalue=pvalue, f=f, t2=t2, diff=diffbar))
}

We begin by ensuring that the two input matrices have the same number of columns, then calculate quantities needed for the calculation. The function returns a list with a number of named objects.

To test the function, we ran a simulation study to ensure that under the null hypothesis the function rejected approximately 5% of the time.


# do a simple simulation study to ensure Type I error rate is preserved
library(MASS)
numoutcomes = 5
mu1 = rep(0, numoutcomes); mu2 = rep(0, numoutcomes)
rho = .4
# compound symmetry covariance matrix
Sigma = rho*matrix(rep(1, numoutcomes^2), numoutcomes) + 
 diag(rep(1-rho, numoutcomes)) 
numsim = 10000
reject = numeric(numsim) 
for (i in 1:numsim) {
 grp1 = mvrnorm(100, mu1, Sigma)
 grp2 = mvrnorm(150, mu2, Sigma)
 reject[i] = hotelling(grp1, grp2)$pvalue<0.05
}

This yields appropriate Type I error rate (just under 5%).


> table(reject)
reject
 0 1 
9537 463

We can also undertake a comparison of non-homeless and homeless subjects in the HELP dataset, comparing individual observations on the PCS (physical component score) and MCS (mental component scores) of the SF-36 and the CESD measure of depressive symptoms.


ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
smallds = with(ds, cbind(pcs, mcs, cesd))
grp1 = subset(smallds, ds$homeless==0)
grp2 = subset(smallds, ds$homeless==1)
res = hotelling(grp1, grp2)

This saves the results in the object res, which we can display.


> names(res)
[1] "pvalue" "f" "t2" "diff" 
> res
$pvalue
 [,1]
[1,] 0.1082217

$f
 [,1]
[1,] 2.035024

$t2
 [,1]
[1,] 6.132267

$diff
 pcs mcs cesd 
 2.064049 1.755975 -2.183760

While there are differences in the mean PCS, MCS, and CESD scores between
the homeless and non-homeless subjects, we do not have sufficient evidence to reject the null hypothesis that they have equivalent means back in the populations (p=0.11).

SAS

In SAS, we can easily make the two-group multivariate comparison using the MANOVA statement in proc glm. This hides the actual value of the statistic, but generates the desired p-value. As usual, we use the ODS system to reduce SAS output.


ods select multstat;
proc glm data="c:\book\help";
 class homeless;
 model pcs mcs cesd = homeless;
 manova h=homeless;
run;
quit;

SAS prints four valid tests, which result in equal values in this case.


 MANOVA Test Criteria and Exact F Statistics for
 the Hypothesis of No Overall HOMELESS Effect
 H = Type III SSCP Matrix for HOMELESS
 E = Error SSCP Matrix

 S=1 M=0.5 N=223.5

 Statistic Value F Value Num DF Den DF Pr> F

 Wilks' Lambda 0.98658536 2.04 3 449 0.1082
 Pillai's Trace 0.01341464 2.04 3 449 0.1082
 Hotelling-Lawley Trace 0.01359704 2.04 3 449 0.1082
 Roy's Greatest Root 0.01359704 2.04 3 449 0.1082

Here we also demonstrate using SAS proc iml (section 1.9) to calculate the statistic and perform the test from scratch. This roughly parallels the R development above, though we omit the test for equal number of variables.


*get the help data;
data help;
set "c:\book\help";
run;

proc iml;
* build a function;
start hotelling;
 n1 = nrow(h1);
 n0 = nrow(h0);
 onesum = j(n1,1,1); *make an n1*1 matrix of 1s;
 zerosum = j(n0,1,1); 
 meanh1 = h1`*onesum/n1;
 meanh0 = h0`*zerosum/n0;
 i1 = i(n1); * make identity matrix;
 i0 = i(n0); 
 vc1 = h1`*(i1-onesum*onesum`/n1)*h1/(n1-1.0);
 * make variance-covariance matrix;
 vc0 = h0`*(i0-zerosum*zerosum`/n0)*h0/(n0-1.0);
 pooledvc = ((n1-1.0)*vc1+(n0-1.0)*vc0)/(n1+n0-2.0);
 meandiff = meanh1 - meanh0;
 t2 = meandiff`* inv(pooledvc) *meandiff * (n1 * n0)/(n1 + n0);
 nvars = ncol(h1);
 f=((n1+n0-nvars-1)/(nvars*(n1+n0-2)))*t2;
 df2=n1+n0-nvars-1;
 p=1-probf(f,nvars,df2);
 fcrit = finv(.95, nvars, df2);
 print meandiff pooledvc t2 f nvars df2 p fcrit;
finish;

*use the function with the HELP data;
use help;
 read all var {cesd pcs mcs} where (homeless = 1) into h1;
 read all var {cesd pcs mcs} where (homeless = 0) into h0;
 run hotelling;
quit;

Here's the output:


 meandiff pooledvc t2 f nvars

 2.1837595 155.76862 -38.46634 -108.8555 6.1322669 2.0350243 3
 -2.064049 -38.46634 115.50213 14.423897
 -1.755975 -108.8555 14.423897 164.44444


 df2 p fcrit

 449 0.1082217 2.6247689

Posted by Nick Horton

Email This BlogThis! Share to X Share to Facebook Share to Pinterest

Saturday, July 18, 2009

Book excerpts now posted

[フレーム]

We've posted excerpts from the book on the book website. The excerpts include Chapter 3 (regression and ANOVA) in its entirety. This demonstrates how the entries (the generic descriptions of software functions) and the worked examples reinforce each other.

We've also posted selected entries from each of the other chapters, describing probability distributions and random variables, two sample comparisons (T-tests, Wilcoxon rank-sum test, Kolmogorov-Smirnov test, median test), linear mixed models for correlated data (including general correlation, random intercepts and slopes, and more complex structures), graphics (scatterplots, histograms, saving plots as PDF), and power calculations (analytic and Monte Carlo).

Posted by Nick Horton

Email This BlogThis! Share to X Share to Facebook Share to Pinterest

Subscribe to SAS and R!

RSS: Or: Get SAS and R by Email

Search the SAS and R Blog

The book (second edition, 2014)

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

Buy a book

SAS and R: Data Management, Statistical Analysis, and Graphics, Second Edition

Using SAS for Data Management, Statistical Analysis, and Graphics

Using R for Data Management, Statistical Analysis, and Graphics

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

Sidebar list of all entries.

SAS and R

Catalogs of posts

Monday, May 17, 2010

Example 7.37: calculation of Hotelling's T^2

Saturday, July 18, 2009

Book excerpts now posted

About SAS and R

Topics discussed