SAS and R: table()

Catalogs of posts

Showing posts with label table(). Show all posts

Monday, October 8, 2012

Example 10.5: Convert a character-valued categorical variable to numeric

In some settings it may be necessary to recode a categorical variable with character values into a variable with numeric values. For example, the matching macro we discussed in example 7.35 will only match on numeric variables. One way to convert character variables to numeric values is to determine which values exist, then write a possibly long series of conditional tests to assign numbers to the values. Surely there's a better way?

SAS
In SAS, Rick Wicklin offers an IML solution and links to a macro with the same function. But if you're not an IML coder, and you don't want to investigate a macro solution, it's simple enough to do with data steps. We'll begin by making some fake data.

data test;
 do i = 1 to 100;
 cat = "meow";
 if i gt 30 then cat = "Purr";
 if i gt 70 then cat = "Hiss";
 output;
 end;
run;

To make the new variable, we'll just sort (section 1.5.6) the data on the categorical variable we want to convert, then use the set ds; by x; syntax to keep track of when a new value is encountered in the data. It's hard to believe that we've never demonstrated this useful syntax before-- perhaps we just can't find it today. The set ds; by x; syntax makes new temporary variables first.x and last.x that are equal to 1 for the first and last observations of each new level of x, respectively, and 0 otherwise. When we find a new value, we'll increase a counter by 1; the counter is our new numeric-valued variable.

proc sort data = test; by cat; run;
data catize;
set test;
by cat;
retain catnum 0;
if first.cat then catnum = catnum + 1;
run;
/* check the result */
proc freq data = catize;
tables cat * catnum;
run;

The table also shows the recoding values.

 Table of cat by catnum
 cat catnum
 Frequency|
 Percent |
 Row Pct |
 Col Pct | 1| 2| 3| Total
 ---------+--------+--------+--------+
 Hiss | 30 | 0 | 0 | 30
 | 30.00 | 0.00 | 0.00 | 30.00
 | 100.00 | 0.00 | 0.00 |
 | 100.00 | 0.00 | 0.00 |
 ---------+--------+--------+--------+
 Purr | 0 | 40 | 0 | 40
 | 0.00 | 40.00 | 0.00 | 40.00
 | 0.00 | 100.00 | 0.00 |
 | 0.00 | 100.00 | 0.00 |
 ---------+--------+--------+--------+
 meow | 0 | 0 | 30 | 30
 | 0.00 | 0.00 | 30.00 | 30.00
 | 0.00 | 0.00 | 100.00 |
 | 0.00 | 0.00 | 100.00 |
 ---------+--------+--------+--------+
 Total 30 40 30 100
 30.00 40.00 30.00 100.00

R
We begin by making the data. To convert to numbers, we use the labels option to the factor() function, feeding it the sequences of numbers between 1 and however many different values there are. Note that we find this using the factor() function again. There's probably a better way of doing this, but it's a little bit amusing to code it this way. Then we have numbers, but they're store as a factor. We can get them out with a call to as.numeric().

cat = c(rep("meow",30),rep("Hiss",30), rep("Purr", 40))
catn1 = factor(cat, labels=(1:length(levels(factor(cat)))))
catn = as.numeric(catn1)
table(catn,cat)
 cat
catn Hiss meow Purr
 1 30 0 0
 2 0 30 0
 3 0 0 40

There's a warning in the documentation for factor() that the values are assigned in location-specific fashion, so the table should be used to establish how the codes were assigned. For the record, the use cases for this kind of recoding in R may be more strained than the SAS example given above.

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Posted by Ken Kleinman

Email This BlogThis! Share to X Share to Facebook Share to Pinterest

Labels: as.numeric(), convert categorical class to numeric, factor(), proc freq, set ds; by x;, table() 3 comments

Monday, January 31, 2011

Example 8.23: Expanding latent class model results

[フレーム]

In Example 8.21 we described how to fit a latent class model to data from the HELP dataset using SAS and R (using poLCA() , and then followed up in example 8.22 using randomLCA() . In both entries, we classified subjects based on their observed (manifest) status on the following variables (on street or in shelter in past 180 days [homeless], CESD scores above 20, received substance abuse treatment [satreat], or linked to primary care [linkstatus]). We arbitrarily specify a three class solution.

In this example, we write a function to augment the default output of randomLCA() to make it easier for the analyst to interpret the results.

R

We begin by reading in the data.


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
attach(ds)
library(randomLCA)

We will write a function wrapper for randomLCA that does some additional work in a generic fashion. This will allow easier estimation of other models. We annotate the function to explain what we're doing. The resulting objects are outcomep, which contains the outcome probabilities, and classp, with the class probabilities.


runlca = function(df, nclass=2, names=c("item"), verbose=FALSE) {
 nvars = dim(df)[2]

 # create a list of names for the items
 if (length(names)==1) { names = rep(names, nvars) }

 # include only complete cases
 bigtable = table(na.omit(df))
 allpatterns = as.data.frame(ftable(bigtable))
 # keep only the patterns that occur
 nonzeropatterns = allpatterns[allpatterns$Freq> 0,]

 # fit the model
 results = randomLCA(nonzeropatterns[,1:nvars],
 nonzeropatterns$Freq, nclass=nclass, calcSE=FALSE)

 # display available sample size
 cat("nobs=", results$nobs, "\n")
 oldopt = options(digits=2)
 if (verbose==TRUE) { # display patterns
 whichclass = apply(results$classprob, 1, which.max) 
 nonzeropatterns$class = whichclass
 print(nonzeropatterns[order(whichclass),])
 }
 print(summary(results))
 resvals = cbind(results$outcomep, results$classp)

 # label the margins with our desired variable names 
 # (plus class probability)
 colnames(resvals) = c(names, "classprob")
 # annotate standard output with rounded values
 print(round(resvals, 2))
 options(oldopt)
 return(results)
}

Now let's apply the function. We start by creating a dichotomous variable with high scores on the CESD, and put this together as part of a dataframe to be given as input to the function. Then we call the runlca() function. By specifying the verbose option the code displays each of the patterns, sorted by which class it is in (based on the highest predicted probability).


cesdcut = ifelse(cesd>20, 1, 0)

smallds = data.frame(homeless, cesdcut, satreat, linkstatus)
results = runlca(smallds, nclass=3, 
 names=c("homeless", "cesd", "satreat", "linkstatus"), 
 verbose=TRUE)

This generates the following output:


nobs= 431 
 homeless cesdcut satreat linkstatus Freq class
5 0 0 1 0 16 1
7 0 1 1 0 33 1

6 1 0 1 0 4 2
8 1 1 1 0 37 2
13 0 0 1 1 1 2
14 1 0 1 1 4 2
15 0 1 1 1 9 2
16 1 1 1 1 23 2

1 0 0 0 0 17 3
2 1 0 0 0 15 3
3 0 1 0 0 82 3
4 1 1 0 0 64 3
9 0 0 0 1 10 3
10 1 0 0 1 9 3
11 0 1 0 1 62 3
12 1 1 0 1 45 3
 Classes AIC BIC logLik
 3 2093 2150 -1032
Class probabilities 
Class 1 Class 2 Class 3 
 0.07846 0.21621 0.70534 
Outcome probabilities 
 homeless cesd satreat linkstatus classprob
[1,] 0.00 0.58 1 0.00 0.08
[2,] 0.73 0.88 1 0.40 0.22
[3,] 0.44 0.83 0 0.41 0.71

The results are equivalent to the results from the prior example, but the predicted classes are listed, and the class probabilities (and proportion endorsing the item) are more clearly discernible. It might be useful in a later iteration of the function to add some blank lines and the proportion of the seeds that resulted in the maximum likelihood.

Posted by Nick Horton

Email This BlogThis! Share to X Share to Facebook Share to Pinterest

Labels: as.data.frame(), colnames(), dim(), HELP data set, latent class analysis, manifest variable, options(), proc lca, randomLCA package, table(), unobserved class 1 comments

Subscribe to SAS and R!

RSS: Or: Get SAS and R by Email

Search the SAS and R Blog

The book (second edition, 2014)

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

Buy a book

SAS and R: Data Management, Statistical Analysis, and Graphics, Second Edition

Using SAS for Data Management, Statistical Analysis, and Graphics

Using R for Data Management, Statistical Analysis, and Graphics

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

Sidebar list of all entries.

SAS and R

Catalogs of posts

Monday, October 8, 2012

Example 10.5: Convert a character-valued categorical variable to numeric

Monday, January 31, 2011

Example 8.23: Expanding latent class model results

About SAS and R

Topics discussed