Showing posts with label parameterization. Show all posts
Showing posts with label parameterization. Show all posts

Tuesday, September 30, 2014

Example 2014.11: Contrasts the basic way for R

As we discuss in section 6.1.4 of the second edition, R and SAS handle categorical variables and their parameterization in models quite differently. SAS treats them on a procedure-by-procedure basis, which leads to some odd differences in capabilities and default parameterizations. For example, in the logistic procedure, the default is effect cell coding, while in the genmod procedure-- which also fits logistic regression-- the default is reference cell coding. Meanwhile, many procedures can only accommodate reference cell coding.

In R, in contrast, categorical variables can be designated as "factors" and parameterization stored an attribute of the factor.

In section 6.1.4, we demonstrate how the parameterization of a factor can be easily changed on the fly, in R, in lm(),glm(), and aov, using the contrasts= option in those functions. Here we show how to set the attribute more generally, for use in functions that don't accept the option. This post was inspired by a question from Julia Kuder, of Brigham and Women's Hospital.

SAS
We begin by simulating censored survival data as in Example 7.30. We'll also export the data to use in R.
data simcox;
 beta1 = 2;
 lambdat = 0.002; *baseline hazard;
 lambdac = 0.004; *censoring hazard;
 do i = 1 to 10000;
 x1 = rantbl(0, .25, .25,.25);
 linpred = exp(-beta1*(x1 eq 4));
 t = rand("WEIBULL", 1, lambdaT * linpred);
 * time of event;
 c = rand("WEIBULL", 1, lambdaC);
 * time of censoring;
 time = min(t, c); * which came first?;
 censored = (c lt t);
 output;
 end;
run;
proc export data=simcox replace
outfile="c:/temp/simcox.csv"
dbms=csv;
run;

Now we'll fit the data in SAS, using effect coding.
proc phreg data=simcox;
class x1 (param=effect);
model time*censored(0)= x1 ;
run;

We reproduce the rather unexciting results here for comparison with R.
 Parameter Standard 
 Parameter DF Estimate Error 
 x1 1 1 -0.02698 0.03471
 x1 2 1 -0.01211 0.03437
 x1 3 1 -0.05940 0.03458


R
In R we read the data in, then use the C() function to assign the contr.sum contrast to a version of the x1 variable that we save as a factor. Once that is done, we can fit the proportional hazards regression with the desired contrast.
simcox<- read.csv("c:/temp/simcox.csv") sc2 = transform(simcox, x1.eff = C(as.factor(x1), contr.sum(4))) effmodel <- coxph(Surv(time, censored)~ x1.eff,data= sc2) summary(effmodel) 
We excerpt the relevant output to demonstrate equivalence with SAS.
 coef exp(coef) se(coef) 
x1.eff1 -0.02698 0.97339 0.03471 
x1.eff2 -0.01211 0.98797 0.03437 
x1.eff3 -0.05940 0.94233 0.03458
An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Thursday, March 22, 2012

Example 9.24: Changing the parameterization for categorical predictors

In our book, we discuss the important question of how to assign different parameterizations to categorical variables when fitting models (section 3.1.3). We show code in R for use in the lm() function, as follows:

lm(y ~ x, contrasts=list(x,"contr.treatment")

This works great in lm() and some other functions, notably glm(). But for functions from contributed packages, the contrasts option may not work.

Here we show a more generic approach to setting contrasts in R, using Firth logistic regression, which is discussed in Example 8.15, to demonstrate. This approach is also shown in passing in section 3.7.5.

R
We'll simulate a simple data set for logistic regression, then examine the results of the default parameterization.

n = 100
j = rep(c(0,1,2), each = n)
linpred = -2.5 + j
y = (runif(n*3) < exp(linpred)/(1 + exp(linpred)) )
library(logistf)
flrfactor = logistf(y ~ as.factor(j))
summary(flrfactor)
coef se(coef) Chisq p
(Intercept) -2.1539746 0.3276441 Inf 0.000000e+00
as.factor(j)1 0.3679788 0.4343756 0.7331622 3.918601e-01
as.factor(j)2 1.7936917 0.3855682 26.2224650 3.042623e-07

To see what R is doing, use the contrasts() function:

> contrasts(as.factor(j))
1 2
0 0 0
1 1 0
2 0 1

R made indicator ("dummy") variables for two of the three levels, so that the estimated coefficients are the log relative odds for these levels vs. the omitted level. This is the "contr.treatment" structure (default for unordered factors). The defaults can be changed with options("contrasts"), but this is a sensible one.

But what if we wanted to assess whether a linear effect was plausible, independent of any quadratic effect? For glm() objects we could examine the anova() between the model with the linear term and the model with the linear and quadratic terms. Or, we could use the syntax shown in the introduction, but with "contr.poly" in place of "contr.treatment". The latter approach may be preferable, and for the logistf() function (and likely many other contributed functions) the contrasts = option does not work. In those cases, use the contrasts function:

jfactor = as.factor(j)
contrasts(jfactor) = contr.poly(3)
flrfc = logistf(y ~ jfactor)
summary(flrfc)
coef se(coef) Chisq p
(Intercept) -1.4334177 0.1598591 Inf 0.000000e+00
jfactor.L 1.2683316 0.2726379 26.222465 3.042623e-07
jfactor.Q 0.4318181 0.2810660 2.472087 1.158840e-01

Not surprisingly, there is no need for a quadratic term, after the linear trend is accounted for. The canned contrasts available in R are somewhat limited--effect cell coding is not included, for example. You can assign contrasts(x) a matrix you write manually in such cases.

SAS
In SAS, the class statement for the logistic procedure allows many parametrizations, including "orthpoly", which matches the "contr.poly" contrast from R. However, most modeling procedures do not have this flexibility, and you would have to generate your contrasts manually in those cases, typically by creating new variables with the appropriate contrast values. Here we show the reference cell coding that is the default in R. Perversely, it is not the the default in proc logistic despite it being the only option in most procedures. On the other hand, it does allow the user to specify the reference category.

data test;
do i = 1 to 300;
j = (i gt 100) + (i gt 200);
linpred = -2.5 + j;
y = (uniform(0) lt exp(linpred)/(1 + exp(linpred)) );
output;
end;
run;

title "Reference cell";
proc logistic data = test;
class j (param=ref ref='0');
model y(event='1') = j / firth clparm = pl;
run;

title "Polynomials";
proc logistic data = test;
class j (param=orthpoly);
model y(event='1') = j;
run;

With the results:

Reference cell
Standard Wald
Parameter DF Estimate Error Chi-Square Pr> ChiSq
Intercept 1 -2.6110 0.1252 434.6071 <.0001
j 1 1 1.2078 0.1483 66.3056 <.0001
j 2 1 2.2060 0.1409 245.1215 <.0001


Polynomials
Standard Wald
Parameter DF Estimate Error Chi-Square Pr> ChiSq
Intercept 1 -1.4761 0.0540 746.6063 <.0001
j OPOLY1 1 0.9032 0.0577 245.3952 <.0001
j OPOLY2 1 -0.0502 0.0501 1.0029 0.3166


An unrelated note about aggregators
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers and PROC-X with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Tuesday, September 21, 2010

Example 8.6: Changing the reference category for categorical variables

How can we change the reference category for a categorical variable? This question comes up often in a consulting practice.

When including categorical covariates in regression models, there is a question of how to incorporate the categories. One simple method is to generate indicator variables, sometimes called dummy variables. We go into some detail about the parameterization of categorical covariates in the SAS and R book, section 3.1.3.

In the indicator variable approach, new dichotomous variables are generated for all but one of the categories; these have a value of 1 if the subject is in the category and 0 otherwise. SAS and R each have simple ways to do this without explicitly creating new variables. In SAS, many procedures accept a class statement, while in R a variable can be defined as a factor, for example by using as.factor.

Let's consider a simple example with the following display of a categorical variable and the resulting indicators.

id catvar indA indB indC
1 A 1 0 0
2 B 0 1 0
3 D 0 0 0
4 C 0 0 1
5 B 0 1 0
6 D 0 0 0
7 A 1 0 0

When we fit the model, the parameter associated with the indA variable is an estimate of the difference between categories A and D. But what if we want the difference between A and C? Well, we can take out our calculators, but we'd also like the standard error of that estimated difference. One way to do this is to change the reference category, and that is what we'll explore today. In a future entry, we'll demonstrate how to calculate arbitrary comparisons, or contrasts, without refitting the model. That method is likely superior to the one shown here, but as consulting statisticians, the question "how do I change the reference category" is one we often answer.

SAS

For procs logistic, genmod, phreg, and surveylogistic, you can use the ref= option, as follows:

proc logistic data=ds;
class classvar (param=ref ref="name-of-ref-group");
model y = classvar;
run;

Unfortunately, changing the reference in SAS is awkward for other procedures. The SAS default is to make the last category the referent, when last is determined by ordering the characters. To change this, use the order option, frequently an option to the class statement but sometimes an option to the proc statement. If the desired referent is the first category, you can make it the referent by sorting on the variable in descending order and then using the order=data option:

proc sort data=ds; by descending classvar; run;

proc glm data=ds order=data;
class classvar;
model y = classvar;
run;

If your desired reference category is lexicographically in the middle of the list, your best bet is to re-code the categories. My colleague Sheryl Rifas-Shiman renames the labels as, e.g., "a. blue", "b. other", "c. brown". Then sort on the new variable and use the order=data approach. You might also get lucky by sorting on some other variable in the data set and using order=data.

As an example, we consider the simple analysis of covariance discussed in section 3.7.2. The default reference cell for substance is heroin. We can replace this with alcohol using the sorting approach.

proc import datafile='c:/book/help.dta' out=help_a dbms=dta;
run;

proc sort data=help_a; by descending substance; run;

proc glm data=help_a order=data;
class substance;
model i1 = age substance age * substance / solution;
run; quit;
Standard
Parameter Estimate Error t Value Pr> |t|

Intercept 7.913018261 B 6.79251599 1.16 0.2447
AGE 0.557076729 B 0.17437966 3.19 0.0015
SUBSTANCE heroin -2.600851794 B 9.66168958 -0.27 0.7879
SUBSTANCE cocaine 7.853879213 B 10.16492979 0.77 0.4401
SUBSTANCE alcohol 0.000000000 B . . .
AGE*SUBSTANCE heroin -0.450423400 B 0.26525085 -1.70 0.0902
AGE*SUBSTANCE cocaine -0.662468379 B 0.27702051 -2.39 0.0172
AGE*SUBSTANCE alcohol 0.000000000 B . . .

Note that SAS creates the levels for the interaction based on the same implied indicator variables.

R

In R there are several options for changing the reference cell. The simplest of these may be the relevel() function. The two arguments are the factor name and the desired reference category. The as.factor() function can be nested within relevel() if necessary.

> ds = read.csv("http://www.math.smith.edu/sasr/datasets/help.csv")
> lm3 = lm(i1 ~ relevel(substance, "alcohol") * age, data=ds)
> summary(lm3)
Call:
lm(formula = i1 ~ relevel(substance, "alcohol") * age, data=ds)

Residuals:
Min 1Q Median 3Q Max
-34.653 -9.625 -4.832 5.576 102.891

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.9130 6.7925 1.165 0.2447
relevel(substance, "alcohol")cocaine 7.8539 10.1649 0.773 0.4401
relevel(substance, "alcohol")heroin -2.6009 9.6617 -0.269 0.7879
age 0.5571 0.1744 3.195 0.0015 **
relevel(substance, "alcohol")cocaine:age -0.6625 0.2770 -2.391 0.0172 *
relevel(substance, "alcohol")heroin:age -0.4504 0.2653 -1.698 0.0902 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.7 on 447 degrees of freedom
Multiple R-squared: 0.2268, Adjusted R-squared: 0.2181
F-statistic: 26.22 on 5 and 447 DF, p-value: < 2.2e-16
Subscribe to: Comments (Atom)

AltStyle によって変換されたページ (->オリジナル) /