Calculating standard stats

The four sets in the Anscombe quartet have nearly identical summary statistics despite having very different shapes, which is why the numbers below should be read alongside the scatterplots.

Anscombe %>%
  group_by(set) %>%
  summarize(N = n(), mean(x), sd(x), mean(y), sd(y), cor(x, y))

Correlation between different variables

# Correlation for all baseball players
mlbBat10 %>%
  summarize(N = n(), r = cor(OBP, SLG))

# Correlation for all players with at least 200 ABs
mlbBat10 %>%
  filter(AB >= 200) %>%
  summarize(N = n(), r = cor(OBP, SLG))

# Correlation of body dimensions
bdims %>%
  group_by(sex) %>%
  summarize(N = n(), r = cor(hgt, wgt))

# Correlation among mammals, with and without log
mammals %>%
  summarize(N = n(),
            r = cor(BodyWt, BrainWt),
            r_log = cor(log(BodyWt), log(BrainWt)))

Spurious correlations

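The noise data frame used below is provided by the exercise: independent random x and y values split into groups indexed by z. A sketch of how such data could be simulated (the number of groups, group size, and seed are assumptions, not the exercise's actual values):

# Sketch: independent noise in 20 assumed groups of 50 points each
set.seed(123)
noise <- data.frame(x = rnorm(1000),
                    y = rnorm(1000),
                    z = rep(1:20, each = 50))
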
# Create faceted scatterplot
ggplot(data = noise, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ z)

# Compute correlations for each dataset
noise_summary <- noise %>%
  group_by(z) %>%
  summarize(N = n(), spurious_cor = cor(x, y))

# Isolate sets with correlations above 0.2 in absolute strength
noise_summary %>%
  filter(abs(spurious_cor) > 0.2)

Visualization of linear models

# Scatterplot with regression line
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Estimate optimal value of my_slope
add_line(my_slope = 1)

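add_line() here is a helper supplied by the course environment, not a ggplot2 function. A rough sketch of what such a helper might do, purely as an assumption about its behavior (drawing the bdims scatterplot with a line of the chosen slope through the point of means):

# Hypothetical sketch of an add_line()-style helper (not the course's actual code)
add_line <- function(my_slope) {
  ggplot(data = bdims, aes(x = hgt, y = wgt)) +
    geom_point() +
    geom_abline(slope = my_slope,
                intercept = mean(bdims$wgt) - my_slope * mean(bdims$hgt))
}
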
Fitting a linear model "by hand"

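The mutate() step below applies the standard least squares formulas for a simple regression, which express the slope and intercept in terms of the correlation r and the summary statistics of the two variables:

\hat{\beta}_1 = r \cdot \frac{s_y}{s_x}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

Here s_y and s_x correspond to the sd_wgt and sd_hgt columns of bdims_summary, and \bar{y}, \bar{x} to mean_wgt and mean_hgt.
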
# Print bdims_summary
bdims_summary

# Add slope and intercept
bdims_summary %>%
  mutate(slope = r * sd_wgt / sd_hgt,
         intercept = mean_wgt - slope * mean_hgt)

Regression to the mean

Create a scatterplot of the height of men as a function of their father's height. Add the simple linear regression line and a diagonal line (with slope equal to 1 and intercept equal to 0) to the plot.
Create a scatterplot of the height of women as a function of their mother's height. Add the simple linear regression line and a diagonal line to the plot.

# Height of children vs. height of father
ggplot(data = Galton_men, aes(x = father, y = height)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  geom_smooth(method = "lm", se = FALSE)

# Height of children vs. height of mother
ggplot(data = Galton_women, aes(x = mother, y = height)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  geom_smooth(method = "lm", se = FALSE)

In both plots the regression line is flatter than the diagonal: children of unusually tall or short parents tend to be closer to average height, which is what "regression to the mean" describes.

Interpreting regression models

Fitting simple linear models

Using the bdims dataset, create a linear model for the weight of people as a function of their height.
Using the mlbBat10 dataset, create a linear model for SLG as a function of OBP.
Using the mammals dataset, create a linear model for the body weight of mammals as a function of their brain weight, after taking the natural log of both variables.

# Linear model for weight as a function of height
lm(wgt ~ hgt, data = bdims)

# Linear model for SLG as a function of OBP
lm(SLG ~ OBP, data = mlbBat10)

# Log-linear model for body weight as a function of brain weight
lm(log(BodyWt) ~ log(BrainWt), data = mammals)

The lm summary output

We have already created the mod object, a linear model for the weight of individuals as a function of their height, using the bdims dataset and the code

mod <- lm(wgt ~ hgt, data = bdims)

Now, you will:
Use coef() to display the coefficients of mod.
Use summary() to display the full regression output of mod.

# Show the coefficients
coef(mod)

# Show the full output
summary(mod)

Fitted values and residuals

Once you have fit a regression model, you are often interested in the fitted values (\hat{y}_i) and the residuals (e_i), where i indexes the observations. Recall that:

e_i = y_i - \hat{y}_i

The least squares fitting procedure guarantees that the mean of the residuals is zero (n.b., numerical instability may result in the computed values not being exactly zero). At the same time, the mean of the fitted values must equal the mean of the response variable.
In this exercise, we will confirm these two mathematical facts by accessing the fitted values and residuals with the fitted.values() and residuals() functions, respectively, for the following model:

mod <- lm(wgt ~ hgt, data = bdims)

mod (defined above) is available in your workspace.
Confirm that the mean of the body weights equals the mean of the fitted values of mod.
Compute the mean of the residuals of mod.

# Mean of weights equal to mean of fitted values?
mean(bdims$wgt) == mean(fitted.values(mod))

# Mean of the residuals
mean(residuals(mod))

Tidying your linear model

As you fit a regression model, there are some quantities (e.g. R^2) that apply to the model as a whole, while others apply to each observation (e.g. \hat{y}_i). If there are several of these per-observation quantities, it is sometimes convenient to attach them to the original data as new variables.
The augment() function from the broom package does exactly this. It takes a model object as an argument and returns a data frame that contains the data on which the model was fit, along with several quantities specific to the regression model, including the fitted values, residuals, leverage scores, and standardized residuals.
The same linear model from the last exercise, mod, is available in your workspace.
Load the broom package.
Create a new data frame called bdims_tidy that is the augmentation of the mod linear model.
View the bdims_tidy data frame using glimpse().

# Load broom
library(broom)

# Create bdims_tidy
bdims_tidy <- augment(mod)

# Glimpse the resulting data frame
glimpse(bdims_tidy)

Making predictions

The fitted.values() function or the augment()-ed data frame provides us with the fitted values for the observations that were in the original data. However, once we have fit the model, we may want to compute expected values for observations that were not present in the data on which the model was fit. These types of predictions are called out-of-sample.
The ben data frame contains a height and weight observation for one person. The mod object contains the fitted model for weight as a function of height for the observations in the bdims dataset. We can use the predict() function to generate expected values for the weight of new individuals. We must pass the data frame of new observations through the newdata argument.
The same linear model, mod, is defined in your workspace.
Print ben to the console.
Use predict() with the newdata argument to compute the expected weight of the individual in the ben data frame.
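ben is supplied by the exercise. For a self-contained run, a hypothetical stand-in with the same column names as bdims could look like this (the values below are made up, not the exercise's actual observation):

# Hypothetical one-row data frame standing in for ben (made-up values)
ben <- data.frame(wgt = 75, hgt = 180)
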
# Print ben
ben

# Predict the weight of ben
predict(mod, newdata = ben)

Adding a regression line to a plot manually

The geom_smooth() function makes it easy to add a simple linear regression line to a scatterplot of the corresponding variables. In fact, more complicated regression models can also be visualized in the data space with geom_smooth(). However, there may still be times when we want to add a regression line to a scatterplot manually. To do this, we use the geom_abline() function, which takes slope and intercept arguments. Naturally, we have to compute those values ahead of time, but we already saw how to do this (e.g. using coef()).
The coefs data frame contains the model estimates retrieved from coef(). Passing this to geom_abline() as the data argument will enable you to draw a straight line on your plot.

Use geom_abline() to add a line defined in the coefs data frame to a scatterplot of weight vs. height for individuals in the bdims dataset.

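coefs is provided in the exercise; one way such a one-row data frame could be built from mod yourself (a sketch, assuming mod is the wgt ~ hgt model fit above):

# Sketch: one-row data frame of coefficients, keeping the original names
coefs <- as.data.frame(as.list(coef(mod)), check.names = FALSE)
coefs
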
# Add the line to the scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
  geom_point() +
  geom_abline(data = coefs,
              aes(intercept = `(Intercept)`, slope = hgt),
              color = "dodgerblue")

Model Fit

Standard error of residuals

One way to assess the strength of fit is to consider how far off the model is for a typical case. That is, for some observations, the fitted value will be very close to the actual value, while for others it will not. The magnitude of a typical residual can give us a sense of generally how close our estimates are.
However, recall that some of the residuals are positive, while others are negative. In fact, it is guaranteed by the least squares fitting procedure that the mean of the residuals is zero. Thus, it makes more sense to compute the square root of the mean squared residual, or root mean squared error (RMSE). R calls this quantity the residual standard error.
To make this estimate unbiased, you have to divide the sum of the squared residuals by the degrees of freedom in the model. Thus,

RMSE = \sqrt{\frac{\sum_i e_i^2}{d.f.}} = \sqrt{\frac{SSE}{d.f.}}

You can recover the residuals from mod with residuals(), and the degrees of freedom with df.residual().

View a summary() of mod.
Compute the mean of the residuals() and verify that it is approximately zero.
Use residuals() and df.residual() to compute the root mean squared error (RMSE), a.k.a. residual standard error.

# View summary of model
summary(mod)

# Compute the mean of the residuals
mean(residuals(mod))

# Compute RMSE
sqrt(sum(residuals(mod)^2) / df.residual(mod))
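As a cross-check, the hand-computed RMSE should match the residual standard error that summary() reports; for an lm fit it can also be pulled out of the summary object directly:

# Residual standard error stored in the summary object (should match the manual RMSE)
summary(mod)$sigma
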
Assessing simple linear model fit

Recall that the coefficient of determination (R^2) can be computed as

R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{Var(e)}{Var(y)},

where e is the vector of residuals and y is the response variable. This gives us the interpretation of R^2 as the percentage of the variability in the response that is explained by the model, since the residuals are the part of that variability that remains unexplained by the model.
The bdims_tidy data frame is the result of augment()-ing the linear model mod for wgt as a function of hgt fit to the bdims data.
Use the summary() function to view the full results of mod.
Use the bdims_tidy data frame to compute the R^2 of mod manually using the formula above, by computing the ratio of the variance of the residuals to the variance of the response variable.

# View model summary
summary(mod)

# Compute R-squared
bdims_tidy %>%
  summarize(var_y = var(wgt), var_e = var(.resid)) %>%
  mutate(R_squared = 1 - var_e / var_y)
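The manually computed value should agree with the "Multiple R-squared" line of the summary output, which is also stored on the summary object:

# R-squared stored in the summary object (should match the manual computation)
summary(mod)$r.squared
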
Interpretation of R^2

The R^2 reported for the regression model for poverty rate of U.S. counties in terms of high school graduation rate is 0.464.

lm(formula = poverty ~ hs_grad, data = countyComplete) %>%
  summary()

How should this result be interpreted?
46.4% of the variability in poverty rate among U.S. counties can be explained by high school graduation rate.

Linear vs. average

The R^2 gives us a numerical measurement of the strength of fit relative to a null model based on the average of the response variable:

\hat{y}_{null} = \bar{y}

This model has an R^2 of zero because SSE = SST. That is, since the fitted values (\hat{y}_{null}) are all equal to the average (\bar{y}), the residual for each observation is the distance between that observation and the mean of the response. Since we can always fit the null model, it serves as a baseline against which all other models will be compared.
In the graphic, we visualize the residuals for the null model (mod_null at left) vs. the simple linear regression model (mod_hgt at right) with height as a single explanatory variable. Try to convince yourself that, if you squared the lengths of the grey arrows on the left and summed them up, you would get a larger value than if you performed the same operation on the grey arrows on the right.
It may be useful to preview these augment()-ed data frames with glimpse():

glimpse(mod_null)
glimpse(mod_hgt)

Compute the sum of the squared residuals (SSE) for the null model mod_null.
Compute the sum of the squared residuals (SSE) for the regression model mod_hgt.
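mod_null and mod_hgt are supplied already augment()-ed; a sketch of how they could be constructed (assuming bdims is available and broom is loaded):

# Sketch: intercept-only (null) model and the height model, both augment()-ed
mod_null <- augment(lm(wgt ~ 1, data = bdims))
mod_hgt  <- augment(lm(wgt ~ hgt, data = bdims))
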
# Compute SSE for null model
mod_null %>%
  summarize(SSE = sum(.resid^2))

# Compute SSE for regression model
mod_hgt %>%
  summarize(SSE = sum(.resid^2))

Leverage

The leverage of an observation in a regression model is defined entirely in terms of the distance of that observation from the mean of the explanatory variable. That is, observations close to the mean of the explanatory variable have low leverage, while observations far from the mean of the explanatory variable have high leverage. Points of high leverage may or may not be influential.
The augment() function from the broom package will add the leverage scores (.hat) to a model data frame.

Use augment() to list the top 6 observations by their leverage scores, in descending order.

# Rank points of high leverage
mod %>%
  augment() %>%
  arrange(desc(.hat)) %>%
  head()
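For intuition, in a simple regression the .hat values reported by augment() can be reproduced by hand from the distance of each x value to its mean (a sketch, assuming mod has a single explanatory variable):

# Leverage by hand: h_i = 1/n + (x_i - mean(x))^2 / sum((x_j - mean(x))^2)
x <- model.frame(mod)[[2]]    # the explanatory variable used to fit mod
h_manual <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(hatvalues(mod)), h_manual)
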

Influence

As noted previously, observations of high leverage may or may not be influential. The influence of an observation depends not only on its leverage, but also on the magnitude of its residual. Recall that while leverage only takes into account the explanatory variable (x), the residual depends on the response variable (y) and the fitted value (\hat{y}).
Influential points are likely to have high leverage and deviate from the general relationship between the two variables. We measure influence using Cook's distance, which incorporates both the leverage and residual of each observation.

Use augment() to list the top 6 observations by their Cook's distance (.cooksd), in descending order.

# Rank influential points
mod %>%
  augment() %>%
  arrange(desc(.cooksd)) %>%
  head()
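The .cooksd column that augment() adds corresponds to base R's cooks.distance(), so the same ranking can be cross-checked without broom:

# Cross-check the ranking with base R's Cook's distance
head(sort(cooks.distance(mod), decreasing = TRUE))
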

Removing outliers

Observations can be outliers for a number of different reasons. Statisticians must always be careful, and more importantly transparent, when dealing with outliers. Sometimes, a better model fit can be achieved by simply removing outliers and re-fitting the model. However, one must have strong justification for doing this. A desire to have a higher R^2 is not a good enough reason!
In the mlbBat10 data, the outlier with an OBP of 0.550 is Bobby Scales, an infielder who had four hits in 13 at-bats for the Chicago Cubs. Scales also walked seven times, resulting in his unusually high OBP. The justification for removing Scales here is weak. While his performance was unusual, there is nothing to suggest that it is not a valid data point, nor is there a good reason to think that somehow we will learn more about Major League Baseball players by excluding him.
Nevertheless, we can demonstrate how removing him will affect our model.

Instructions
Use filter() to create a subset of mlbBat10 called nontrivial_players consisting of only those players with at least 10 at-bats and an OBP below 0.500.
Fit the linear model for SLG as a function of OBP for the nontrivial_players. Save the result as mod_cleaner.
View the summary() of the new model and compare the slope and R^2 to those of mod, the original model fit to the data on all players.
Visualize the new model with ggplot() and the appropriate geom_*() functions.

# Create nontrivial_players
nontrivial_players <- mlbBat10 %>%
  filter(AB >= 10, OBP < 0.5)

# Fit model to new data
mod_cleaner <- lm(SLG ~ OBP, data = nontrivial_players)

# View model summary
summary(mod_cleaner)

# Visualize new model
ggplot(data = nontrivial_players, aes(x = OBP, y = SLG)) +
  geom_point() +
  geom_smooth(method = "lm")

High leverage points

Not all points of high leverage are influential. While the high leverage observation corresponding to Bobby Scales in the previous exercise is influential, the three observations for players with OBP and SLG values of 0 are not influential.
This is because they happen to lie right near the regression line anyway. Thus, while their extremely low OBP gives them the power to exert influence over the slope of the regression line, their low SLG prevents them from using it.

Instructions
The linear model, mod, is available in your workspace. Use a combination of augment(), arrange() with two arguments, and head() to find the top 6 observations with the highest leverage but the lowest Cook's distance.

# Rank high leverage points
mod %>%
  augment() %>%
  arrange(desc(.hat), .cooksd) %>%
  head()