Commit e6fb0df

author
Jae Yeon Kim
committed
Change example from merge to bind
1 parent 00b04c4 commit e6fb0df

File tree

8 files changed

+179
-262
lines changed

‎lecture_notes/01_why_map.Rmd‎

Lines changed: 69 additions & 58 deletions
Original file line numberDiff line numberDiff line change
@@ -13,76 +13,79 @@ output:
1313
toc: yes
1414
---
1515

16-
# Setup
16+
# Setup
1717

1818
```{r}
1919
# Install packages
2020
if (!require("pacman")) install.packages("pacman")
21+
2122
pacman::p_load(tidyverse, # tidyverse pkgs including purrr
2223
tictoc, # performance test
2324
broom, # tidy modeling
2425
patchwork) # arranging ggplots
26+
2527
```
2628

27-
# Objectives
29+
# Objectives
30+
31+
- How to use `purrr` to automate workflow in a cleaner, faster, and more extendable way
2832

29-
- How to use `purrr` to automate workflow in a cleaner, faster, and more extendable way
33+
# Copy-and-paste programming
3034

31-
# Copy-and-paste programming
35+
> Copy-and-paste programming, sometimes referred to as just pasting, is the production of highly repetitive computer programming code, as produced by copy and paste operations. It is primarily a pejorative term; those who use the term are often implying a lack of programming competence. It may also be the result of technology limitations (e.g., an insufficiently expressive development environment) as subroutines or libraries would normally be used instead. However, there are occasions when copy-and-paste programming is considered acceptable or necessary, such as for boilerplate, loop unrolling (when not supported automatically by the compiler), or certain programming idioms, and it is supported by some source code editors in the form of snippets. - Wikipedia
3236
33-
> Copy-and-paste programming, sometimes referred to as just pasting, is the production of highly repetitive computer programming code, as produced by copy and paste operations. It is primarily a pejorative term; those who use the term are often implying a lack of programming competence. It may also be the result of technology limitations (e.g., an insufficiently expressive development environment) as subroutines or libraries would normally be used instead. However, there are occasions when copy-and-paste programming is considered acceptable or necessary, such as for boilerplate, loop unrolling (when not supported automatically by the compiler), or certain programming idioms, and it is supported by some source code editors in the form of snippets. - Wikipedia
37+
- The following exercise was inspired by [Wickham's example](http://adv-r.had.co.nz/Functional-programming.html).
3438

35-
- The following exercise was inspired by [Wickham's example](http://adv-r.had.co.nz/Functional-programming.html).
39+
- Let's imagine `df` is a survey dataset.
3640

37-
- Let's imagine `df` is a survey dataset.
41+
- a, b, c, d = Survey questions
3842

39-
- a, b, c, d = Survey respondents
43+
- -99: non-responses
44+
45+
- Your goal: replace -99 with NA
4046

41-
- -99: non-responses
42-
43-
- Your goal: replace -99 with NA
44-
4547
```{r}
4648
# Data
4749
df <- tibble("a" = -99,
4850
"b" = -99,
4951
"c" = -99,
5052
"d" = -99)
51-
53+
54+
```
55+
56+
57+
```{r}
5258
# Copy and paste
5359
df$a[df$a == -99] <- NA
5460
df$b[df$b == -99] <- NA
5561
df$c[df$c == -99] <- NA
5662
df$d[df$d == -99] <- NA
5763
58-
df
59-
6064
```
6165

62-
- **Challenge 1**. Explain why this solution is not very efficient. (e.g., If `df$a[df$a == -99] <- NA` has an error, how are you going to fix it?) A solution that cannot be automated is not scalable.
66+
- **Challenge 1**. Explain why this solution is not very efficient. (e.g., If `df$a[df$a == -99] <- NA` has an error, how are you going to fix it?) A solution that cannot be automated is not scalable.
6367

64-
# Using a function
68+
# Using a function
6569

66-
- Let's recall what a function is in R: input + computation + output
70+
- Let's recall what a function is in R: input + computation + output
6771

68-
- If you write a function, you gain efficiency because you don't need to copy and paste the computation part.
72+
- If you write a function, you gain efficiency because you don't need to copy and paste the computation part.
6973

70-
`
71-
function(input){
72-
73-
computation
74-
75-
return(output)
74+
\` function(input){
7675

77-
}
78-
`
76+
computation
77+
78+
return(output)
79+
80+
} \`
7981

8082
```{r}
8183
8284
# Function
8385
fix_missing <- function(x) {
8486
x[x == -99] <- NA
85-
x
87+
# An explicit return() makes the output obvious
88+
return(x)
8689
}
8790
8891
# Apply function to each column (vector)
@@ -91,34 +94,30 @@ df$b <- fix_missing(df$b)
9194
df$c <- fix_missing(df$c)
9295
df$d <- fix_missing(df$d)
9396
94-
df
95-
9697
```
9798

98-
- **Challenge 2** Why is using a function more efficient than copying and pasting everything? Can you think of a way to automate the process?
99+
- **Challenge 2** Why is using a function more efficient than copying and pasting everything? Can you think of a way to automate the process?
99100

100-
- Many options for automation in R: `for loop`, `apply` family, etc.
101+
- Many options for automation in R: `for loop`, `apply` family, etc.
101102

102-
- Here's a tidy solution that comes from the `purrr` package.
102+
- Here's a tidy solution that comes from the `purrr` package.
103104

104-
- The power and joy of a one-liner.
105+
- The power and joy of a one-liner.
105106

106107
```{r}
107108
108-
df <- purrr::map_df(df, fix_missing)
109-
110-
df
109+
purrr::map_df(df[,column], fix_missing)
111110
112111
```
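
The automation options mentioned above (a `for` loop, the `apply` family, and `purrr`) can be compared side by side. The following is a minimal sketch, not part of the original notes: it assumes a small made-up tibble with two rows, redefines `fix_missing()` so the block is self-contained, maps over every column, and assumes the tidyverse packages are installed (`map_df()` relies on `dplyr` to build the data frame).

```r
library(purrr)
library(tibble)

# Same helper as in the notes: replace the -99 code with NA
fix_missing <- function(x) {
  x[x == -99] <- NA
  return(x)
}

# Made-up survey data for illustration (assumption, not the notes' df)
df <- tibble(a = c(1, -99), b = c(-99, 2), c = c(3, 4), d = c(-99, -99))

# Option 1: for loop over column names
df_loop <- df
for (col in names(df_loop)) {
  df_loop[[col]] <- fix_missing(df_loop[[col]])
}

# Option 2: base R lapply(); assigning into df_lapply[] keeps the data frame shape
df_lapply <- df
df_lapply[] <- lapply(df_lapply, fix_missing)

# Option 3: purrr one-liner
df_map <- map_df(df, fix_missing)
```

All three objects hold the same values; the difference is how much bookkeeping the reader has to follow.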
113112

114-
`map()` is a [higher-order function](https://en.wikipedia.org/wiki/Map_(higher-order_function)) that applies a given function to each element of a list/vector.
113+
`map()` is a [higher-order function](https://en.wikipedia.org/wiki/Map_(higher-order_function)) that applies a given function to each element of a list/vector.
115114

116115
![This is how map() works. It's easier to understand with a picture.](https://d33wubrfki0l68.cloudfront.net/f0494d020aa517ae7b1011cea4c4a9f21702df8b/2577b/diagrams/functionals/map.png)
117116

118117
- Input: Takes a vector/list.
119-
118+
120119
- Computation: Calls the function once for each element of the vector
121-
120+
122121
- Output: Returns a list or whatever data format you prefer (e.g., the `_df` helper returns a data frame)
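
The input/computation/output description can be made concrete with the typed variants of `map()`. This is a small sketch with a made-up list `x` (an assumption for illustration); it shows that the suffix (`_dbl`, `_chr`) determines the type of the output.

```r
library(purrr)

# Made-up input: a named list of two integer vectors
x <- list(a = 1:3, b = 4:6)

# map() always returns a list, one element per input element
out_list <- map(x, mean)

# map_dbl() returns a double vector and errors if a result is not a single number
out_dbl <- map_dbl(x, mean)

# map_chr() returns a character vector
out_chr <- map_chr(x, function(v) paste(v, collapse = "-"))
```

Choosing the typed variant up front documents what the output should be and fails early when a result does not fit.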
123122

124123
**Challenge 3** If you run the code below, what's going to be the data type of the output?
@@ -128,13 +127,16 @@ df
128127
map_chr(df, fix_missing)
129128
130129
```
131-
- Why is `map()` a good alternative to a `for` loop? (For more information, watch Hadley Wickham's talk titled ["The Joy of Functional Programming (for Data Science)"](https://www.youtube.com/watch?v=bzUmK0Y07ck&ab_channel=AssociationforComputingMachinery%28ACM%29).)
132130

133-
```{r}
131+
- Why is `map()` a good alternative to a `for` loop? (For more information, watch Hadley Wickham's talk titled ["The Joy of Functional Programming (for Data Science)"](https://www.youtube.com/watch?v=bzUmK0Y07ck&ab_channel=AssociationforComputingMachinery%28ACM%29).)
134132

133+
```{r}
135134
# Built-in data
136135
data("airquality")
137136
137+
```
138+
139+
```{r}
138140
# 0.029 sec elapsed
139141
tic()
140142
@@ -147,8 +149,11 @@ for (i in seq_along(airquality)) { # Sequence variable
147149
}
148150
149151
toc()
152+
```
150153

151-
# 0.004 sec elapsed
154+
155+
```{r}
156+
# 0.011 sec elapsed
152157
153158
tic()
154159
@@ -157,18 +162,18 @@ out1 <- airquality %>% map_dbl(mean, na.rm = TRUE)
157162
toc()
158163
159164
```
160-
- In short, `map()` is more readable, faster in the timing above, and easily extensible to other data science tasks (e.g., wrangling, modeling, and visualization) using `%>%`.
161165

162-
- Final point: Why not base R `apply` family?
166+
- In short, `map()` is more readable, faster in the timing above, and easily extensible to other data science tasks (e.g., wrangling, modeling, and visualization) using `%>%`.
167+
168+
- Final point: Why not base R `apply` family?
163169

164-
Short answer: `purrr::map()` is simpler to write. For instance,
170+
Short answer: `purrr::map()` is simpler to write. For instance,
165171

166172
`map_dbl(x, mean, na.rm = TRUE)` = `vapply(x, mean, na.rm = TRUE, FUN.VALUE = double(1))`
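
To check that the two calls are interchangeable, the sketch below runs both on the built-in `airquality` data used earlier. The names `out_map` and `out_vapply` are illustrative, not from the notes.

```r
library(purrr)

data("airquality")

# purrr: the _dbl suffix states the expected output type
out_map <- map_dbl(airquality, mean, na.rm = TRUE)

# base R: the same guarantee requires spelling out FUN.VALUE
out_vapply <- vapply(airquality, mean, na.rm = TRUE, FUN.VALUE = double(1))

# Both are named double vectors with one mean per column
all.equal(out_map, out_vapply)
```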
167173

168174
**Additional tips**
169175

170-
Performance testing (profiling) is an important part of programming. The `tictoc` package (`tic()` and `toc()`) measures how long it takes to run the target code once. If you want a more robust measure of timing as well as information on memory (**speed** and **space** both matter for performance testing), consider using the [`bench` package](https://github.com/r-lib/bench), which is designed for high-precision timing of R expressions.
171-
176+
Performance testing (profiling) is an important part of programming. The `tictoc` package (`tic()` and `toc()`) measures how long it takes to run the target code once. If you want a more robust measure of timing as well as information on memory (**speed** and **space** both matter for performance testing), consider using the [`bench` package](https://github.com/r-lib/bench), which is designed for high-precision timing of R expressions.
172177

173178
```{r}
174179
map_mark <- bench::mark(
@@ -180,11 +185,11 @@ map_mark <- bench::mark(
180185
map_mark
181186
```
182187

183-
# Applications
188+
# Applications
184189

185-
1. Many models
190+
1. Many models
186191

187-
- One popular application of `map()` is to run regression models (or whatever model you want to run) on list-columns. No more copying and pasting for running many regression models on subgroups!
192+
- One popular application of `map()` is to run regression models (or whatever model you want to run) on list-columns. No more copying and pasting for running many regression models on subgroups!
188193

189194
```{r eval = FALSE}
190195
# Have you ever tried this?
@@ -195,21 +200,28 @@ lm_D <- lm(y ~ x, subset(data, subgroup == "group_D"))
195200
lm_E <- lm(y ~ x, subset(data, subgroup == "group_E"))
196201
```
197202

198-
- For more information on this technique, read the Many Models subchapter of the [R for Data Science](https://r4ds.had.co.nz/many-models.html#creating-list-columns).
203+
- For more information on this technique, read the Many Models subchapter of the [R for Data Science](https://r4ds.had.co.nz/many-models.html#creating-list-columns).
199204

200205
```{r}
201206
# Function
202207
lm_model <- function(df) {
203208
lm(Temp ~ Ozone, data = df)
204209
}
210+
```
211+
205212

213+
```{r}
206214
# Map
207215
models <- airquality %>%
216+
# Determines group variable
208217
group_by(Month) %>%
209218
nest() %>% # Create list-columns
210219
mutate(ols = map(data, lm_model)) # Map
211-
models$ols[1]
212220
221+
```
222+
223+
224+
```{r}
213225
# Add tidying
214226
tidy_lm_model <- purrr::compose( # compose multiple functions
215227
broom::tidy, # convert lm objects into tidy tibbles
@@ -224,11 +236,11 @@ tidied_models <- airquality %>%
224236
tidied_models$ols[1]
225237
```
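
The same many-models idea also works without list-columns. The sketch below is an alternative, not the approach used in the notes: it assumes only `purrr` and base R's `split()`, fits the same `Temp ~ Ozone` model to each month of `airquality`, and extracts the slope. The names `by_month` and `slopes` are illustrative.

```r
library(purrr)

data("airquality")

# One data frame per month (names are "5" through "9")
by_month <- split(airquality, airquality$Month)

# Fit the same model to every subgroup; lm() drops rows with missing Ozone
models_by_month <- map(by_month, function(d) lm(Temp ~ Ozone, data = d))

# Pull one number out of each model: the Ozone coefficient
slopes <- map_dbl(models_by_month, function(m) coef(m)[["Ozone"]])
slopes
```

The nested-tibble version keeps data and models together in one table, which is convenient for further wrangling; the `split()` version is lighter when only a vector of estimates is needed.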
226238

227-
2. Simulations
239+
2. Simulations
228240

229-
A good friend of the `map()` function is the `rerun()` function. This combination is really useful for simulations. Consider the following example.
241+
A good friend of the `map()` function is the `rerun()` function. This combination is really useful for simulations. Consider the following example.
230242

231-
*Base R approach
243+
- Base R approach
232244

233245
```{r}
234246
@@ -250,7 +262,7 @@ qplot(y_means) +
250262
geom_vline(xintercept = 500, linetype = "dotted", color = "red")
251263
```
252264

253-
* rerun() + map()
265+
- rerun() + map()
254266

255267
```{r}
256268
@@ -266,4 +278,3 @@ y_means_tidy <- map_dbl(y_tidy, mean)
266278
(qplot(y_means_tidy) +
267279
geom_vline(xintercept = 500, linetype = "dotted", color = "red"))
268280
```
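
`rerun()` is deprecated as of purrr 1.0.0, so on a recent installation the same simulation pattern may be written with `map()` over a sequence. The sketch below makes that substitution. The number of simulations, sample size, and standard deviation are assumed values for illustration (the mean of 500 matches the reference line in the plots above), not necessarily the ones used in the notes.

```r
library(purrr)

set.seed(1234)

n_sims <- 100  # assumed number of simulated samples

# Draw n_sims independent samples; the index i is only a counter
y_sims <- map(seq_len(n_sims), function(i) rnorm(n = 50, mean = 500, sd = 100))

# One mean per simulated sample
y_sim_means <- map_dbl(y_sims, mean)

summary(y_sim_means)
```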
269-

‎lecture_notes/02_more_inputs.Rmd‎

Lines changed: 13 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -51,12 +51,12 @@ pacman::p_load(tidyverse, # tidyverse pkgs including purrr
5151
```{r}
5252
paste("University = Berkeley | Department = CS")
5353
```
54+
5455
# For loop
5556

5657
- A slightly more efficient way: using a for loop.
5758

5859
- Think about which part of the statement is constant and which part varies ( = parameters).
59-
6060
- Do we need a placeholder? No. We don't need a placeholder because we don't store the result of iterations.
6161

6262
**Challenge 1**: How many parameters do you need to solve the problem below?
@@ -65,11 +65,11 @@ paste("University = Berkeley | Department = CS")
6565

6666
```{r}
6767
68-
# Outer loop
68+
# Outer loop for univ variable
6969
7070
for (univ in c("Berkeley", "Stanford")) {
7171
72-
# Inner loop
72+
# Inner loop for dept variable
7373
for (dept in c("waterbenders", "earthbenders", "firebenders", "airbenders")) {
7474
7575
print(paste("University = ", univ, "|", "Department = ", dept))
@@ -79,6 +79,7 @@ for (univ in c("Berkeley", "Stanford")) {
7979
}
8080
8181
```
82+
8283
- This is not bad, but ... n arguments -> n nested for loops. As the scale of your problem grows, your code gets really complicated.
8384

8485
> To become significantly more reliable, code must become more transparent. In particular, nested conditions and loops must be viewed with great suspicion. Complicated control flows confuse programmers. Messy code often hides bugs. — [Bjarne Stroustrup](https://en.wikipedia.org/wiki/Bjarne_Stroustrup)
@@ -90,30 +91,37 @@ for (univ in c("Berkeley", "Stanford")) {
9091
**Challenge 2** Why are we using `rep()` to create input vectors? For instance, for `univ_list` why not just use `c("Berkeley", "Stanford")`?
9192

9293
```{r}
93-
9494
# Inputs (remember the length of these inputs should be identical)
9595
univ_list <- rep(c("Berkeley", "Stanford"), each = 4) # each = 4 pairs every university with all four departments
9696
9797
dept_list <- rep(c("waterbenders", "earthbenders", "firebenders", "airbenders"),2)
98+
```
99+
98100

101+
```{r}
99102
# Function
100103
print_lists <- function(univ, dept){
101104
102105
print(paste("University = ", univ, "|", "Department = ", dept))
103106
104107
}
108+
```
109+
105110

111+
```{r}
106112
# Test
107113
print_lists(univ_list[1], dept_list[1])
108114
109115
```
116+
110117
- Step 2: Using `map2()` or `pmap()`
111118

112119
![](https://dcl-prog.stanford.edu/images/map2.png)
113120
```{r}
114121
115122
# 2 arguments
116-
map2_output <- map2(univ_list, dept_list, print_lists)
123+
map2_output <- map2(univ_list, dept_list,
124+
print_lists)
117125
118126
```
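
`pmap()` generalizes `map2()` to any number of inputs. The following is a minimal sketch under two assumptions: a hypothetical helper `make_label()` that returns the string instead of printing it, and base R's `expand.grid()` to build every university-department pair so the inputs have matching lengths by construction.

```r
library(purrr)

# Every combination of the two inputs (2 x 4 = 8 rows)
inputs <- expand.grid(
  univ = c("Berkeley", "Stanford"),
  dept = c("waterbenders", "earthbenders", "firebenders", "airbenders"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: argument names match the column names of `inputs`
make_label <- function(univ, dept) {
  paste("University =", univ, "|", "Department =", dept)
}

# pmap_chr() calls make_label() once per row and returns a character vector
labels <- pmap_chr(inputs, make_label)
labels
```

Adding a third parameter only requires another column in `inputs` and another argument in the helper, with no extra level of nesting.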
119127
