Showing posts with label lattice library. Show all posts
Showing posts with label lattice library. Show all posts
Monday, July 16, 2012
Example 9.38: dynamite plots, revisited
[フレーム]
Dynamite plots are a somewhat pejorative term for a graphical display where the height of a bar indicates the mean, and the vertical line on top of it represents the standard deviation (or standard error). These displays are commonly found in many scientific disciplines, as a way of communicating group differences in means.
Many find these displays troubling. One post entitled them unmitigated evil.
The Vanderbilt University Department of Biostatistics has a formal policy discouraing use of these plots, stating that:
Dynamite plots often hide important information. This is particularly true of small or skewed data sets. Researchers are highly discouraged from using them, and department members have the option to decline participation in papers in which the lead author requires the use of these plots.
Despite the limitations of the display, we believe that there may be times when the display is helpful as a way to compare groups, assuming distributions that are approximately normal. Samuel Brown also described creation of these figures, as a way to encourage computing in R. We previously demonstrated how to create them in SAS and R, and today discuss code created by Randall Pruim to demonstrate how such graphics can be created using lattice graphics within the mosaic package.
R
library(mosaic)
dynamitePlot <- function(height, error, names = as.character(1:length(height)),
significance = NA, ylim = c(0, maxLim), ...) {
if (missing(error)) { error = 0 }
maxLim <- 1.2* max(mapply(sum, height, error))
mError <- min(c(error, na.rm=TRUE))
barchart(height ~ names, ylim=ylim, panel=function(x,y,...) {
panel.barchart(x,y,...)
grid.polyline(c(x,x), c(y, y+error), id=rep(x,2), default.units='native',
arrow=arrow(angle=45, length=unit(mError, 'native')))
grid.polyline(c(x,x), c(y, y-error), id=rep(x,2), default.units='native',
arrow=arrow(angle=45, length=unit(mError, 'native')))
grid.text(x=x, y=y + error + .05*maxLim, label=significance,
default.units='native')
}, ...)
}
Much of the code involves setting up the appropriate axis limits, then drawing the lines and adding the text. We can call this new function with an artificial example with 4 groups:
Values <- c(1,2,5,4)
Errors <- c(0.25, 0.5, 0.33, 0.12)
Names <- paste("Trial", 1:4)
Sig <- c("a", "a", "b", "b")
dynamitePlot(Values, Errors, names=Names, significance=Sig)
We still don't recommend frequent use of these plots (as other displays may be better (e.g. dotplots for very small sample sizes or violin plots), but having the capability to generate dynamite plots within the lattice framework can be handy.
An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
Thursday, October 13, 2011
Example 9.9: Simplifying R using the mosaic package (part 1)
[フレーム]
While both SAS and R are powerful systems for statistical analysis, they can be frustrating to new users or those learning statistics for the first time.
R
The mosaic package is designed to help simplify the interface for such new users, while allowing them to undertake sophisticated analyses.
As an example of how the package simplifies life for the novice user, consider calculating summary statistics and displaying a densityplot for the CESD (measure of depressive symptom) scores by substance abuse group in the HELP dataset. Doing this in R without the package would require mastering a package such as plyr to replicate results by substance or a typing-intensive use of syntax to select rows corresponding to each substance.
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(mosaic)
options(digits=3)
After loading the data and the package, and setting the number of digits to a more reasonable default, we can call the mean() function to easily calculate this statistic (denoted by S in the result) for each of the three substance abuse groups alcohol, cocaine or heroin.
> mean(cesd ~ substance, data=ds) substance S N Missing 1 alcohol 34.4 177 0 2 cocaine 29.4 152 0 3 heroin 34.9 124 0
Similar results are seen when we calculate the standard deviations per group:
> sd(cesd ~ substance, data=ds) substance S N Missing 1 alcohol 12.1 177 0 2 cocaine 13.4 152 0 3 heroin 11.2 124 0
Another function can calculate a raft of summary statistics for each group that are nicely formatted.
UPDATE FROM 12/14/2013: favstats(cesd ~ substance, data=ds) is an even better way to do this...
> summary(cesd ~ substance, data=ds, fun=favstats) cesd N=453 +---------+-------+---+----+---+-------+---+----+-----+----+---+--------+ | | |N |min |Q1 |median |Q3 |max |mean |sd |n |missing | +---------+-------+---+----+---+-------+---+----+-----+----+---+--------+ |substance|alcohol|177|4 |26 |36 |42 |58 |34.4 |12.1|177|0 | | |cocaine|152|1 |19 |30 |39 |60 |29.4 |13.4|152|0 | | |heroin |124|4 |28 |35 |43 |56 |34.9 |11.2|124|0 | +---------+-------+---+----+---+-------+---+----+-----+----+---+--------+ |Overall | |453|1 |25 |34 |41 |60 |32.8 |12.5|453|0 | +---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
These commands allow quick review of the data to ensure, for example, that assumptions of equal variance are justified, or that coding errors or missing values haven't crept in.
A graphical depiction using a set of densityplots (shown above) can be created using the command:
densityplot(~ cesd, group=substance, data=ds, auto.key=TRUE)
SAS
We're unaware of any similar program that attempts to simplify SAS syntax for educational use. To replicate the above results, we would use the means and sgpanel procedures.
data ds; set "C:\book\help.sas7bdat"; run; options ls=80; proc means data=ds fw=4 min q1 median q3 max mean std nmiss n; class substance; var cesd; run; Analysis Variable : CESD N Lower Upper Std SUBSTANCE Obs Min Quartile Median Quartile Max Mean Dev ------------------------------------------------------------------ alcohol 177 4.00 26.0 36.0 42.0 58.0 34.4 12.1 cocaine 152 1.00 19.0 30.0 39.0 60.0 29.4 13.4 heroin 124 4.00 28.0 35.0 43.0 56.0 34.9 11.2 ------------------------------------------------------------------ N N SUBSTANCE Obs Miss N --------------------------- alcohol 177 0 177 cocaine 152 0 152 heroin 124 0 124 ---------------------------
After reading the data in, the meansprocedure can produce any of the desired statistics (plus may others) directly. To replicate the mosaic package in printing a single statistic, list only that statistic in the proc means statement. Note that the overall statistic in the R table is not included. To replicate that row, you would re-run the above code, omitting the class statement.
To the best of our knowledge, there still does not exist an easy way to plot multiple densities in a single SAS plot. In example 2.6.4 we show how it can be done using proc kde, saving the density estimates, and plotting separately. (Code for this is included at the book web site.) But in the interest of simple code, we show a simpler method using proc sgpanel. The result, show below, is less useful than the R plot from the the mosaic package, but still gets the point across.
proc sgpanel data=ds; panelby substance / columns=1; density cesd / type=kernel; run;
Monday, June 13, 2011
Example 8.40: Side-by-side histograms
[フレーム]
It's often useful to compare histograms for some key variable, stratified by levels of some other variable. There are several ways to display something like this. The simplest may be to plot the two histograms in separate panels.
SAS
In SAS, the most direct and generalizable approach is through the sgpanel procedure.
proc sgpanel data = 'c:\book\help.sas7bdat';
panelby female;
histogram cesd;
run;
The results are shown above.
R
In R, the lattice package provides a similarly direct approach.
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
ds$gender = ifelse(ds$female==1, "female", "male")
library(lattice)
histogram(~ cesd | gender, data=ds)
The results are shown below.
Saturday, August 8, 2009
Example 7.9: Get data from SAS into R
[フレーム]
Some people use both SAS and R in their daily work. They might be more familiar with SAS as a tool for manipulating data and R preferable for plotting purposes. While our goal in the book is to enable people to avoid having to switch back and forth, the following example shows how to move data from SAS into R. Our use of Stata format as an interchange mechanism is perhaps unorthodox, but eminently workable. Other file formats (see section 1.2.2, creating files for use by other packages) can also be specified.Suppose we wanted to plot CESD over time for each individual. While we show how to do this sort of thing in SAS (see section 5.6.2), it’s hard to do without SAS version 9.2. Instead, we recall it’s easy using the lattice library in R (see section 5.2.2). But we need a ”long” data set with a row for each time point. This is the sort of data management one might prefer to do in SAS.
SAS
First, we read the data from a data set stored in SAS format. Then we use proc transpose (section 1.5.3) to get the data into the required shape. Finally, we save the ”long” data set in Stata format (1.2.2).
libname k "c:\book";
proc transpose data=k.help out=ds1;
by id;
var cesd1 cesd2 cesd3 cesd4;
run;
proc print data = ds1 (obs = 5); run;
Obs ID _NAME_ _LABEL_ COL1
1 1 CESD1 1 cesd 7
2 1 CESD2 2 cesd .
3 1 CESD3 3 cesd 8
4 1 CESD4 4 cesd 5
5 2 CESD1 1 cesd 11
proc export data=ds1 (rename=(_label_=timec1))
outfile = "c:\book\helpfromsas.dta" dbms=dta;
run;
R
In R, we first read the file from Stata format (1.1.5), then attach() it (section 1.3.1) for ease of typing. Then we check the first few lines of data using the head() function (section 1.13.1). Noting that the time variable we brought from SAS is a character string, we convert it to a numeric variable using as.numeric() (section 1.4.2) and substr() (section 1.4.3). After loading the lattice library, we display the series for each subject. In this, we use the syntax for subsetting observations (1.5.1) to keep only the first 20 observations and the as.factor() function (1.4.10) to improve the labels in the output.
> library(foreign)
> xsas <- read.dta("C:\\book\\helpfromsas.dta")
> head(xsas)
id _name_ timec1 col1
1 1 CESD1 1 cesd 7
2 1 CESD2 2 cesd NA
3 1 CESD3 3 cesd 8
4 1 CESD4 4 cesd 5
5 2 CESD1 1 cesd 11
6 2 CESD2 2 cesd NA
> attach(xsas)
> time <- as.numeric(substr(timec1, 1, 1))
> library(lattice)
> xyplot(col1[id < 21] ~ time[id < 21]|as.factor(id[id < 21]))
In the next entry, we'll reverse this process to manipulate data in R and produce the plot in SAS.
Subscribe to:
Comments (Atom)