Showing posts with label plot colors. Show all posts
Showing posts with label plot colors. Show all posts
Tuesday, March 15, 2011
Example 8.30: Compare Poisson and negative binomial count models
[フレーム]
How similar can a negative binomial distribution get to a Poisson distribution?
When confronted with modeling count data, our first instinct is to use Poisson regression. But in practice, count data is often overdispersed. We can fit the overdispersion in the Poisson (Section 4.1) using quasi-likelihood methods, but a better alternative might be to use a negative binomial regression (section 4.1.5). Nick has a paper exploring these models (and others) in an application.
One concern about this is how well the negative binomial might approximate the Poisson, if in fact a Poisson obtains.
We present here a function and a macro to explore how similar the negative binomial can get to the Poisson, if we keep the means of the distributions equal. But before doing so, it will be helpful to review their definitions:
The Poisson is defined as P(Y=y | l) = [e^(-l)l^y]/y!
and the negative binomial as: P(X=x | n,p) = [(n + x + 1)! / (x!)(n+1)!] p^n (1-p)^x
In the Poisson, the mean is l, while the negative binomial counts the number of failures x before n successes, where the probability of success is p. The mean of X is np/(1-p). There are several characterizations of the negative binomial.
R
In R, the pnbinom() function (section 1.10) can be called either with the parameters n and p given above, or by specifying the mean mu and a dispersion parameter (denoted size), where mu = np/(1-p) as above. It's convenient to parameterize via the mean, to keep the negative binomial mean equal to the Poisson mean.
Our function will accept a series of integers and a mean value as input, and plot the Poisson cumulative probabilities and the negative binomial cumulative probabilities for three values of n. We make use of the type="n" option in the plot() function (section 5.1.1) and add the negative binomial values with the lines() function (section 5.2.1).
poissonvsnb = function(values,mean) {
probs = ppois(values,mean)
plot(y=probs, x=values, type="n", ylim=c(0,1))
lines(y=probs, x=values, col="red")
readline("Poisson shown. Press Enter to continue...")
nbprobs1 = pnbinom(values, mu=mean, size=1)
nbprobs5 = pnbinom(values, mu=mean, size=5)
nbprobs40 = pnbinom(values, mu=mean, size=40)
lines(y=nbprobs1, x=values, col="black")
lines(y=nbprobs5, x=values, col="blue")
lines(y=nbprobs40, x=values, col="green")
}
poissonvsnb(0:10,1)
The result is shown above. The red line representing the Poisson is completely overplotted by the negative binomial with size=40. This can be seen when running live, due to the readline() statement, which waits for input before continuing.
SAS
In SAS, the cdf function (section 1.10) does not have the flexibility of parameterizing directly via the mean. To add to the confusion, SAS uses another characterization of the negative binomial, which counts the number of successes x before n failures with the effect that the mean is now n(1-p)/p. Thus is we want to hold the mean constant, we need to solve for p and find probabilities from the distribution where p = n/(n + mu).
To make this process a little less cumbersome to type, we'll also demonstrate the use of proc fcmp, which allows you to compile functions that can be used in data steps and some other procedures. In general, it works as you might hope, with a function statement and a return statement. The only hassle is telling SAS where to store the functions and where to find them when they're needed.
proc fcmp outlib=sasuser.funcs.test;
function poismean_nb(mean, size);
return(size/(mean+size));
endsub;
run;
options cmplib = sasuser.funcs;
run;
Now we're ready to write a macro to replicate the R function. Note how the new function is nested within the call to the cdf function, with the appropriate size parameter. The overlay option allows plotting several y values on the same x axis; the r option to the symbol statement (section 5.1.19) keeps the symbol in effect for several y values. SAS generates a legend easily; this allows us to see the (mostly overplotted) Poisson. Using readline() to pause the output (as in R) is not available.
As a suggestion about how to write macros in SAS, I left this one a little messy. I first wrote the code to make the plot once, with the number of X values and the mean specified in the code with fixed values. This makes two extra lines of code, but when I converted to a macro, I only needed to change the fixed values to the macro parameters. For elegance, I would omit the first two lines and replace the later occurrences of n and mean with the macro parameters.
%macro nbptest(maxn, mean);
data nbp;
n = &maxn;
mean = &mean;
do i = 0 to n;
probpois = cdf("POISSON", i, mean);
probnb1 = CDF("NEGBINOMIAL", i, poismean_nb(mean, 1), 1);
probnb5 = CDF("NEGBINOMIAL", i, poismean_nb(mean, 5), 5);
probnb40 = CDF("NEGBINOMIAL", i, poismean_nb(mean, 40), 40);
output;
end;
run;
axis1 order = (0 to 1 by .2) minor=none ;
symbol1 v=none i=j r=4;
proc gplot data=nbp;
plot (probpois probnb1 probnb5 probnb40)*i /
overlay vaxis=axis1 legend;
run; quit;
%mend;
%nbptest(10,2);
The results are shown below. The negative binomial approaches the Poisson very closely as size increases, holding the mean constant.
Monday, November 8, 2010
Example 8.13: Bike ride plot, part 2
[フレーム]
Before explaining how to make and interpret the plot above, Nick and I want to make a plea for questions--it's hard to come up with useful questions to explore each week!
As shown in Example 8.12, data from the Cyclemeter app can be used to make interesting plots of a bike ride. But in the previous application, questions remain. Why did my speed vary so much, for example? And why do some sections of the route appear to be very straight while others are relatively bumpy? To investigate these questions, we'll add some more data into the plot, using color. First, I'll shade the line according to my elevation-- my speed is probably faster when I'm going down hill. Then I'll add the points where the GPS connected; this may explain the straight sections. Doing all of this will require more complicated coding tasks.
(Data were read in previously.)
SAS
My initial thought was that I would use proc univariate to generate a histogram (section 5.1.4) and generate the color for the line from the histogram categories. But it turns out that you can't force the histogram to have your desired number of categories. (This is true for the R hist() function as well.) So I'm going to find the minimum and maximum elevation using proc summary (section 2.1), saving the results in a data set. Then I'll add those min and max values to each line of the data, using the tip found here. Finally, I'll make my own categories for elevation by looping through the category boundaries until I find one bigger than my observation.
proc summary data=ride;
var elevation__feet_;
output out=minmax max=maxelev min=minelev;;
run;
data aride;
set ride;
setme = 1;
set minmax point=setme;
colorindex = 0;
range = maxelev-minelev;
do i = 0 to 5;
if elevation__feet_ ge (minelev + (i * range/6) )
then colorindex = i + 1;
end;
run;
To make the plot, I'm going to use the annotate facility, as before. However, the annotate %line macro won't work, for reasons I won't go into. So I need to make an annotate data set directly. Briefly, the way annotate data sets work is that they have a pre-specified variable list in which only certain values are allowed. Some of these are discussed in section 5.1, but in addition to these, below we use the function variable; when function="MOVE" the conceptual plotting pen moves without drawing. When function="DRAW" a line is plotted from the last location to the current one. Locations are determined by x and y variables. The line and size variables affect the line style and thickness.
The other interesting bit of the code below is the creation and use of a temporary array (section 1.11.5) for the colors. I choose the color from this array by indexing on the colorindex variable created above.
For help on the annotate facility, see Contents; SAS Products; SAS/GRAPH; The Annotate Facility. Colors are discussed in section 5.3.11, but we use them in a slightly different way, here. For help with color specification, see Contents; SAS Products; SAS/GRAPH; Concepts; Colors.
data annoride2;
set aride;
if elevation__feet_ ne .;
retain xsys '2' ysys '2' hsys '6';
array mycolor [6] $ _temporary_
("CX252525" "CX636363" "CX969696" "CXBDBDBD" "CXD9D9D9" "CXF7F7F7");
x = longitude;
y = latitude;
/* If this is the first observation, we need to move the pen to
the first point. Otherwise we draw a line */
if _n_ ne 1 then function = "DRAW";
else function = "MOVE";
color = mycolor[colorindex];
line = 1;
size = speed__miles_h_ * 2;
output;
run;
Finally, we can plot the figure. I use the symbol statement (section 5.2.2) to plot nice dots where the GPS connected, and the axis statement (section 5.3.7) to remove the values of latitude and longitude, which don't interest me much. Analysis of the graph I leave for the R version.
axis1 major=none minor=none value=none;
symbol1 i=none v=dot c=black h=.2;
proc gplot data = ride;
plot latitude * longitude /
vaxis=axis1 haxis=axis1 annotate=annoride2;
run;
quit;
The resulting plot is shown above.
R
In R, I'm going to modify my vectorized version of the lines() function to additionally assign colors to each line. As in the SAS example, I will use categories of elevation to do this. Assigning the elevations to categories is a little trickier if I want to avoid for() loops. To do it, I will first find the category boundaries. I'll then use the which() function (as in section 1.13.2) to figure out the category boundaries smaller than the observation, and the max() function (section 1.8.1) to select the largest of these. Unfortunately, I also need the sapply() function (section B.5.3) so that I can repeat this process for each elevation and return a vector. I've annotated below to explain how this works. Finally, I use the RColorBrewer package to generate a vector of colors. (I also used it to find the colors I put in the SAS code.)
veclines2 = function(x, y, z, c) {
require(RColorBrewer)
breaks = min(c) + (0:5 * (max(c) - min(c))/6)
# The sapply() function applies the function named in the
# second argument to each element of the vector named in the
# first argument. Since I don't need this function again,
# I write it out in within the sapply() function call and don't
# even have to name it.
ccat = sapply(c, function (r) {max(which(r>= breaks))} )
mypalette = brewer.pal(6,"Greys")
for (i in 1:(length(x)-1)) {
lines(x[i:(i+1)], y[i:(i+1)], lwd=z[i], col=mypalette[7 - ccat[i]])
}
}
Reader JanOosting pointed out that the segments() function will do exactly what my for-looped lines() function does. The segments() function requires "to" and "from" x and y locations. Here's a version of the above function that uses it.
bikeseg = function(x,y,z,c) {
l=length(x)
require(RColorBrewer)
mypalette <-brewer.pal(6,"greys")
breaks = min(c) + (0:5 * (max(c) - min(c))/6)
ccat = sapply(c,function (r) {max(which(r>= breaks))})
segments(x0 = x[1:(l-1)], y0 = y[1:(l-1)],
x1 = x[2:l], y1 = y[2:l],lwd=z,col=mypalette[7-ccat])
}
Then I call the function, after creating an empty plot and making the RColorBrewer library available. Finally, I add the points and some street names (see sections 5.2.3 and 5.2.11), to aid in interpretation. The way R draws the names live, as you enter commands, makes it much easier to place and rotate the names than in SAS, where you would have to re-make the annotate data set and run it to see the results.
plot(Longitude, Latitude, type="n")
library(RColorBrewer)
bikeseg(Longitude, Latitude, Speed..miles.h./2, Elevation..feet.)
points(Longitude, Latitude, cex=2, pch='.')
text(-72.495, 42.34, "Station Road")
text(-72.5035, 42.33, "Middle Street", srt=90)
text(-72.5035, 42.36, "Bike Path", srt=-39)
text(-72.54, 42.36, "Bike Path", srt=15)
text(-72.55, 42.347, "Maple", srt=107)
text(-72.54, 42.338, "Pomeroy")
text(-72.515, 42.331, "Potwine")
In the final image, thicker lines show greater speed, darker colors indicate lower elevations, and the dots are where the GPS connects. My best speed is achieved along Station Road, which is a long, straight drop. There aren't enough categories of color to explain most of the other variations in speed, although you can see me get slower as I near Middle Street along Potwine. The dots are fairly evenly spaced except along the bike path, where there is a lot of tree cover in some spots. This causes the long straight pieces near the top of the plot. Actually, since the phone is also sitting in my pocket as I ride, so I'm more pleased than disappointed with the perfomance of the iPhone and Cyclemeter.
Saturday, March 27, 2010
Example 7.29: Bubble plots colored by a fourth variable
[フレーム]
In Example 7.28, we generated a bubble plot showing the relationship among CESD, age, and number of drinks, for women. An anonymous commenter asked whether it would be possible to color the circles according to gender. In the comments, we showed simple code for this in R and hinted at a SAS solution for two colors. Here we show in detail what the SAS code would look like, and revisit the R code.SAS
For SAS, we have to make two separate variables-- one with the CESD for the females, and another for the males. For the other gender, these gender-specific variables will have missing values. We'll do this using conditioning (section 1.11.2).
libname k "c:\book";
data twocolors;
set k.help;
if female eq 1 then femalecesd = cesd;
else malecesd = cesd;
run;
Now we can use the bubble2 statement (close kin of the plot2 statement, section 5.1.2) to add both gender-specific variables to the plot. While we're at it, we relabel the x-axis to no longer be gender specific and specify that the right y-axis is not to be labeled.
proc gplot data = twocolors;
bubble malecesd*age=i1 / bscale = radius bsize=200
bcolor = blue bfill = solid;
bubble2 femalecesd*age=i1 / bscale = radius bsize = 200
bcolor = pink bfill = solid noaxis;
label malecesd="CESD";
run;
As in the previous bubble plot example, the scale is manipulated arbitrarily so that the SAS and R figures are similar.
We're somewhat fortunate here that the range of the two gendered CESD scores are similar
R
In the comments for Example 7.28, we suggested the following simple R code.
load(url("http://www.math.smith.edu/sasr/datasets/savedfile"))
femalealc = subset(ds, female==1 & substance=="alcohol")
malealc = subset(ds, female==0 & substance=="alcohol")
with(malealc, symbols(age, cesd, circles=i1,
inches=1/5, bg="blue"))
with(femalealc, symbols(age, cesd, circles=i1,
inches=1/5, bg="pink", add=TRUE))
While this does generate a plot, it could be misleading, in that the scale of the circle sizes is relative to the largest value within each symbols() call. While this could be desirable, it's more likely that we'd like a single scale for the circles. R code for this can be made in a single statement:
load(url("http://www.math.smith.edu/sasr/datasets/savedfile"))
attach(ds)
symbols(age, cesd, circles=i1,inches=1/5,
bg=ifelse(female==1,"pink","blue"))
Here the ifelse() function (section 1.11.2) generates a different circle fill color depending on the value of female.
The resulting plots are shown below.
Labels:
3D plots,
circles,
plot colors
6
comments
Tuesday, October 13, 2009
Example 7.15: A more complex sales graphic
[フレーム]
The plot of Amazon sales rank over time generated in example 7.14 leaves questions. From a software perspective, we'd like to make the plot prettier, while we can embellish the plot to inform our interpretation about how the rank is calculated.For the latter purpose, we'll create an indicator of whether the rank was recorded in nighttime (eastern US time) or not. Then we'll color the nighttime ranks differently than the daytime ranks.
SAS
In SAS, we use the timepart function to extract the time of day from the salestime variable which holds the date and time. This is a value measured in seconds since midnight, and we use some conditional logic (section 1.4.11) to identify hours before 8 AM or after 6 PM.
data sales2;
set sales;
if timepart(salestime) lt (8 * 60 * 60) or
timepart(salestime) gt (18 * 60 * 60) then night=1;
else night = 0;
run;
Then we can make the plot. We use the axis statement (sections 5.3.7 and 5.3.8) to specify the axis ranges, rotate some labels and headers, and remove the minor tick marks. Note that since the x-axis is a date-time variable, we have to specify the axis range using date-time data, here read in using formats (section A.6.4) and request labels every three days by requesting an interval of three days' worth of seconds. We also use the symbol statement (section 5.2.2, 5.3.11) to specify shapes and colors for the plotted points.
axis1 order = ("09AUG2009/12:00:00"dt to
"27AUG2009/12:00:00"dt by 259200)
minor = none;
axis2 order=(30000 to 290000 by 130000) label=(angle=90)
value=(angle=90) minor=none;
symbol1 i=none v=dot c=red h=.3;
symbol2 i=none v=dot c=black h=.3;
Finally, we request the plot using proc gplot. The a*b=c syntax (as in section 5.6.2) will result in different symbols for each value of the new night variable, and the symbols we just defined will be used. The haxis and vaxis options are used to associate each axis with the axis definitions specified in the axis statements.
proc gplot data=sales2;
plot rank*salestime=night / haxis=axis1 vaxis=axis2;
format salestime dtdate5.;
run; quit;
R
In R, we make a new variable reflecting the date-time at the midnight before we started collecting data. We then coerce the time values to numeric values using the as.numeric() function (section 1.4.2), while subtracting that midnight value. Next, we mod by 24 (using the %% operator, section B.4.3) and lastly round to the integer value (section 1.8.4) to get the hour of measurement. There's probably a more elegant way of doing this in R, but this works.
midnight <- as.POSIXlt("2009-08-09 00:00:00 EDT")
timeofday <- round(as.numeric(timeval-midnight)%%24,0)
Next, we prepare for making a nighttime indicator by intializing a vector with 0. Then we assign a value of 1 when the corresponding element of the hour of measurement vector has a value in the correct range.
night <- rep(0,length(timeofday))
night[timeofday < 8 | timeofday> 18] <- 1
Finally, we're ready to make the plot. We begin by setting up the axes, using the type="n" option (section 5.1.1) to prevent any data being plotted. Next, we plot the nighttime ranks by conditioning the plot vector on the value of the nighttime indicator vector; we then repeact for the daytime values, additionally specifying a color for these points. Lastly, we add a legend to the plot (section 5.2.14).
plot(timeval, rank, type="n")
points(timeval[night==1], rank[night==1], pch=20)
points(timeval[night==0], rank[night==0], pch=20,
col="red")
legend(as.POSIXlt("2009-08-22 00:00:00 EDT"), 250000,
legend=c("day", "night"), col=c("black", "red"),
pch=c(20, 20))
Interpretation: It appears that Amazon's ranking function adjusts for the pre-dawn hours, most likely to reflect a general lack of activity. (Note that these ranks are from Amazon.com. In all likelihood, Amazon.co.uk and other local Amazon sites adjust for local time similarly.) Perhaps some recency in sales allows a decline in rank for some books during these hours? In addition, we see that most sales of this book, as inferred from the discontinuous drops (improvement) in rank, tend to happen near the beginning of the day, or mid-day, rather than at night.
Subscribe to:
Comments (Atom)