I'm working on a script where I want to create one dataframe from a series of files with annual baseball data in them.
I'm relatively new to R, and I feel like the way I wrote read_season_range() is probably more like writing C# in R than doing it the idiomatic R way. Is there a cleaner way to do this?
read_season <- function(yearID) {
  season = read.csv(paste("../retrosheetData/gamelog/GL", yearID, ".TXT", sep=""))
  glheaders = read.csv("../retrosheetData/gamelog/game_log_header.csv")
  names(season) = names(glheaders)
  return(season)
}

read_season_range <- function(year_range) {
  seasons = read_season(year_range[1])
  for(y in year_range) {
    if(y == year_range[1])
      next
    s = read_season(y)
    seasons = rbind(seasons, s)
  }
  return(seasons)
}

sixties = read_season_range(1960:1969)
2 Answers
The for loop can be replaced with an lapply call.
Using base R functions, your read_season_range example is equivalent to the following one-liner:
sixties <- do.call(rbind, lapply(1960:1969, read_season))
or to wrap it in a function:
read_season_range <- function(year_range) {
  do.call(rbind, lapply(year_range, read_season))
}
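To see what the lapply / do.call(rbind, ...) combination does without needing the Retrosheet files, here is a minimal sketch. The function fake_read_season is a hypothetical stand-in for read_season that builds a tiny data frame instead of reading from disk; the column names are made up for illustration.

```r
# Hypothetical stand-in for read_season(): returns a small data frame
# instead of reading a Retrosheet game log from disk.
fake_read_season <- function(yearID) {
  data.frame(year = rep(yearID, 2), game = 1:2)
}

# lapply() returns a list holding one data frame per year ...
season_list <- lapply(1960:1962, fake_read_season)

# ... and do.call() passes the list elements as the arguments of rbind(),
# i.e. rbind(season_list[[1]], season_list[[2]], season_list[[3]]).
seasons <- do.call(rbind, season_list)

nrow(seasons)         # 3 years x 2 rows each
unique(seasons$year)  # the years that were stacked
```

Because the list is built first and bound once, there is no need to special-case the first year the way the original loop does.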
Other solutions:
data.table::rbindlist(lapply(1960:1969, read_season))
dplyr::bind_rows(lapply(1960:1969, read_season))
plyr::rbind.fill(lapply(1960:1969, read_season))
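The fastest are rbindlist from the data.table package and bind_rows from dplyr. In the benchmark below both are several times faster than do.call(rbind, ...) and plyr::rbind.fill, and they fall in the same cld group, so this run does not separate the two.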
Comparison of the performance:
LDF <- list(
  data.frame(V1 = runif(1000), V2 = sample(LETTERS, 1000, replace = TRUE)),
  data.frame(V1 = runif(1000), V2 = sample(LETTERS, 1000, replace = TRUE)),
  data.frame(V1 = runif(1000), V2 = sample(LETTERS, 1000, replace = TRUE)),
  data.frame(V1 = runif(1000), V2 = sample(LETTERS, 1000, replace = TRUE)))

microbenchmark::microbenchmark(
  do.call(rbind, LDF),
  plyr::rbind.fill(LDF),
  dplyr::bind_rows(LDF),
  data.table::rbindlist(LDF))
#> Unit: microseconds
#>                        expr     min       lq      mean   median       uq      max neval cld
#>         do.call(rbind, LDF) 822.387 908.9395 1008.4699 949.0085 987.5020 2800.581   100   b
#>       plyr::rbind.fill(LDF) 751.549 837.6055  960.6077 867.8145 932.1825 2639.683   100   b
#>       dplyr::bind_rows(LDF) 165.354 196.5525  218.4784 214.0425 236.4690  400.057   100  a
#>  data.table::rbindlist(LDF) 214.878 250.4435  278.0317 270.5885 295.2610  438.430   100  a
= can be replaced by <-, which is often considered good style, and you might consider reading glheaders outside of the function, as it is not supposed to change. Also, you could write the for loop as for (y in year_range[-1]) so you can skip the nasty if.

<- is preferable to =, but I haven't seen it explained why. I find it annoying to type the arrow (two characters and the shift key) when my fingers are already trained to hit the equals key from every other programming language. Admittedly, that's not a really good reason...

= is easy to confuse with the comparison operator ==. Also, <- and = are evaluated differently when used in the argument of a function. Most editors for R offer some kind of shortcut to type <-; maybe you can check the help.
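The points raised in these comments can be illustrated with a small sketch. It uses mean() only as a convenient function with an argument named x; the variable names m1, m2, and y are arbitrary and not from the original code.

```r
# Inside a function call, `=` names an argument; it does not create a
# variable in the calling environment.
m1 <- mean(x = c(1, 2, 3))
exists("x", inherits = FALSE)  # FALSE in a fresh session with no `x` defined

# `<-` inside a call performs an assignment and passes the value
# positionally, so `y` exists in the calling environment afterwards.
m2 <- mean(y <- c(1, 2, 3))
exists("y")                    # TRUE

# Negative indexing drops elements: year_range[-1] is every year but the
# first, which is what makes the `if ... next` in the loop unnecessary.
year_range <- 1960:1963
year_range[-1]
```

At the top level of a script the two operators behave the same for plain assignment, which is why the choice there is mostly a matter of style.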