SAS and R: boxplot

Catalogs of posts

Showing posts with label boxplot. Show all posts

Tuesday, December 6, 2011

Example 9.17: (much) better pairs plots

Pairs plots (section 5.1.17) are a useful way of displaying the pairwise relations between variables in a dataset. But the default display is unsatisfactory when the variables aren't all continuous. In this entry, we discuss ways to improve these displays that have been proposed by John Emerson, Walton Green, Barret Schloerke, Dianne Cook, Heike Hofmann, and Hadley Wickham in a manuscript under review entitled The Generalized Pairs Plot. http://www.blogger.com/img/blank.gif

Implementations of the methods in the paper are available in the gpairs and GGally packages; here we use the latter, which is based on the grammar of graphics and the ggplot2 package. This is an R-only entry: we are unaware of efforts to replicate this approach in SAS.

New users may find it easier to break process down into steps, rather than to do everything at once, as the R language allows. One way to do that is to make a smaller version of a dataset, with just the analysis variables included. here we use the HELP data set and choose two categorical variables (gender and housing status) and two continuous ones (the number of drinks per day and a measure of depressive symptoms). Once this new subset is created, the call to ggpairs() is straightforward.

R


library(GGally)
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
ds$sex = as.factor(ifelse(ds$female==1, "female", "male"))
ds$housing = as.factor(ifelse(ds$homeless==1, "homeless", "housed"))
smallds = subset(ds, select=c("housing", "sex", "i1", "cesd"))
ggpairs(smallds, diag=list(continuous="density", discrete="bar"), axisLabels="show")

For users more comfortable with R, the ggpairs function allows you to select variables to include, via its columns option. The following line produces a plot identical to the above, without the subset().


ggpairs(ds, columns=c("housing", "sex", "i1", "cesd"),
 diag=list(continuous="density", discrete="bar"), axisLabels="show")

Various options are available for the diagonal elements of the plot matrix, and the off-diagonals can be controlled with upper and lower options. The examples(ggpairs) command is very helpful for visualizing some of the possibilities.

Posted by Nick Horton

Email This BlogThis! Share to X Share to Facebook Share to Pinterest

Labels: boxplot, generalized pairs plots, GGally package, ggplot2 package, grammar of graphics, Hadley Wickham, HELP data set, John Emerson, mosaic plot, pairs plots, scatterplot 4 comments

Tuesday, October 26, 2010

Example 8.11: violin plots

[フレーム]

We've continued to get useful feedback and ideas from our posts on the combination dotplot/boxplot and other ways to craft similar displays.

Another notion is the violin plot, which combines a boxplot and a (doubled) kernel density plot. While the basic notion of the violin plot does not include the individual points, such a display has virtues, particularly when comparing multiple groups and with large datasets. For teaching purposes, dots representing the data points could be added in. More details on the plot can be found in: Hintze, J. L. and R. D. Nelson (1998). Violin plots: a box plot-density trace synergism. The American Statistician, 52(2):181-4.

R

In R, the vioplot package includes the vioplot() function, which generated the plot at the top of this entry.


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
female = subset(ds, female==1)
library(vioplot)
with(female, vioplot(pcs[homeless==0], pcs[homeless==1],
 horizontal=TRUE, names=c("non-homeless", "homeless"), 
 col = "lightblue"))

SAS

We've neglected SAS in the discussion of teaching graphics. Mimicking the tailored appearance of Wild's approach to the dotplot-boxplot would require at least several hours, while even the shorter code suggested by commenters would be difficult. For the most part this reflects the modular nature of R. However violin plots are similar enough to stacked kernel density estimates, that we show them here in order to demonstrate the code.


proc sgpanel data="C:\book\help";
 where female eq 1;
 panelby homeless / columns=1 ;
 density pcs / scale=percent type=kernel ;
run;

The output lacks the graphic depiction of central tendency, and does not double the density, but it does highlight similarities and differences between the categories.

Posted by Nick Horton

Email This BlogThis! Share to X Share to Facebook Share to Pinterest

Labels: boxplot, dotplot/boxplot, graphics, HELP data set 0 comments

Tuesday, October 19, 2010

Example 8.10: Combination dotplot/boxplot (teaching graphic in honor of World Statistics Day)

[フレーム]

In honor of World Statistics Day and the read paper that my co-authors Chris Wild, Maxine Pfannkuch, Matt Regan, and I are presenting at the Royal Statistical Society today, we present the R code to generate a combination dotplot/boxplot that is useful for students first learning statistics. One of the over-riding themes of the paper is that introductory students (be they in upper elementary or early university) should keep their eyes on the data.

When describing distributions, students often are drawn to the most visually obvious aspects: median, quartiles and extremes. These are the ingredients of the basic boxplot, which is often introduced as a graphical display in introductory courses. Students are taught to calculate the quartiles, and this becomes one of the components of a boxplot.

One limitation of the boxplot is that it loses the individual data points. Given a dotplot (see here for an example of one in Fathom), it is very easy to guesstimate and draw a boxplot by hand. Wild et al. argue that doing so is probably the best way of gaining an appreciation of just what a boxplot actually is. The boxplot provides a natural bridge between operating entirely in terms of what is seen in graphics to reasoning using summary statistics. Retaining the dots in the combination dotplot/boxplot provides a reminder that the box plot is just summarizing the raw data, thus preserving a connection to more concrete foundations.

Certainly, such a plot has a number of limitations. It breaks down when there are a large number of observations, and has redundancy not suitable for publication. But as a way to motivate informal inference, it has potential value. It's particularly useful when combined in an animation using multiple samples from a known population, as demonstrated here (scroll through the file quickly).

In addition, the code, drafted by Chris and his colleagues Steve Taylor and Dineika Chandrananda at the University of Auckland, demonstrates a number of useful and interesting techniques.

R

Because of the length of the code, we'll focus just on the boxpoints3() function shown below. The remaining support code must be run first (and can be found at http://www.math.smith.edu/sasr/examples/wild-helper.R). The original boxpoints() function has a large number of options and configuration settings, which the function below sets at the default values.


# Create a plot from two variables, one continuous and one categorical.
# For each level of the categorical variable (grps) a stacked dot plot
# and a boxplot summary are created. Derived from code written by 
# Christopher Wild, Dineika Chandrananda and Steve Taylor
#
boxpoints3 = function(x,grps,varnames1,varnames2,labeltext)
{
 observed = (1:length(x))[!is.na(x) & !is.na(grps)]
 x = x[observed]
 grps = as.factor(grps[observed])
 ngrps = length(levels(grps))
 
 # begin section to align titles and labels
 xlims = range(x) + c(-0.2,0.1)*(max(x)-min(x))
 top = 1.1
 bottom = -0.2
 plot(xlims, c(bottom,top), type="n", xlab="", ylab="", axes=F)
 yvals = ((1:ngrps)-0.7)/ngrps
 text(mean(xlims), top, paste(varnames1, "by", varnames2, sep=" "), 
 cex=1, font=2)
 text(xlims[1],top-0.05,varnames2, cex=1, adj=0)
 addaxis(xlims, ylev=0, tickheight=0.03, textdispl=0.07, nticks=5,
 axiscex=1)
 text(mean(xlims), -0.2, varnames1, adj=0.5, cex=1)
 # end section for titles and labels

 for (i in 1:ngrps) {
 xi = x[grps==levels(grps)[i]]
 text(xlims[1], yvals[i]+0.2/ngrps,
 substr(as.character(levels(grps)[i]), 1, 12), adj=0, cex=1)
 prettyrange = range(pretty(xi))
 if (min(diff(sort(unique(xi))))>= diff(prettyrange)/75)
 xbins = xi # They are sufficiently well spaced.
 else {
 xbins = round(75 * (xi-prettyrange[1])/diff(prettyrange))
 }
 addpoints(xi, yval=yvals[i], vmax=0.62/ngrps, vadd=0.075/ngrps,
 xbin=xbins, ptcex=1, ptcol='grey50', ptlwd=2)
 bxplt = fivenum(xi)
 if(length(xi) != 0)
 addbox(x5=bxplt, yval=yvals[i], hbxwdth=0.2/ngrps,
 boxcol="black", medcol="black", whiskercol="black", 
 boxfill=NA, boxlwidth=2)
 }
}

Most of the function tends to issues of housekeeping, in particular aligning titles and labels. Once this is done, the support functions addpoints() and addbox() functions are called with the appropriate arguments. We can call the function using data from female subjects at baseline of the HELP study, comparing PCS (physical component scores) for homeless and non-homeless subjects.


source("http://www.math.smith.edu/sasr/examples/wild-helper.R")
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
female = subset(ds, female==1)
with(female,boxpoints3(pcs, homeless, "PCS", "Homeless"))

We see that the homeless subjects have lower PCS scores than the non-homeless (though the homeless group also has the highest score of either group in this sample).

Posted by Nick Horton

Email This BlogThis! Share to X Share to Facebook Share to Pinterest

Labels: boxplot, Chris Wild, dotplot, informal inference, Matt Regan, Maxine Pfannkuch, Royal Statistical Society, statistics education, teaching statistics, University of Auckland, World Statistics Day 10 comments

Subscribe to SAS and R!

RSS: Or: Get SAS and R by Email

Search the SAS and R Blog

The book (second edition, 2014)

Reviews (from the first edition)

"By placing the R and SAS solutions together and by covering a vast array of tasks in one book, Kleinman and Horton have added surprising value and searchability to the information in their book. … a home run, and it is a book I am grateful to have sitting, dust-free, on my shelf."
—Robert Alan Greevy, Jr, Teaching of Statistics in the Health Sciences

"I use SAS and R on a daily basis. Each has strengths and weaknesses, and using both of them gives the advantage of being able to do almost anything when it comes to data manipulation, analysis, and graphics. If you use both SAS and R on a regular basis, get this book. If you know one of the packages and are learning the other, you may need more than this book, but get this book, too. "

Charles Heckler, University of Rochester, Technometrics

"Excellent cross-referencing to other topics and end-of-chapter worked examples on the ‘Health evaluation and linkage to primary care’ data set are given with each topic. … users who are proficient in either of the software packages but with the need to use the other will find this book useful."
—Frances Denny, Journal of the Royal Statistical Society, Series A

Buy a book

SAS and R: Data Management, Statistical Analysis, and Graphics, Second Edition

Using SAS for Data Management, Statistical Analysis, and Graphics

Using R for Data Management, Statistical Analysis, and Graphics

About the authors

Nicholas Horton is a Professor of Statistics at Amherst College. He is a biostatistician with expertise in missing data methods, longitudinal regression, statistical computing and statistical education. Nick's home page; Nick's Google Scholar author page

Ken Kleinman is an Associate Professor with the Department of Biostatistics and Epidemiology at the University of Massachusetts, Amherst. He is a consulting biostatistician with expertise in group-randomized trials and disease surveillance; he also offers R training courses. Ken's home page; Ken's Google Scholar author page.

Sidebar list of all entries.

SAS and R

Catalogs of posts

Tuesday, December 6, 2011

Example 9.17: (much) better pairs plots

Tuesday, October 26, 2010

Example 8.11: violin plots

Tuesday, October 19, 2010

Example 8.10: Combination dotplot/boxplot (teaching graphic in honor of World Statistics Day)

About SAS and R

Topics discussed