I'm working on a script that builds one data frame from a series of files, each containing one year of baseball data.

I'm relatively new to R, and I suspect that the way I wrote read_season_range() is more like writing C# in R than idiomatic R. Is there a cleaner way to do this?

read_season <- function(yearID) {
  # Note: read.csv() defaults to header = TRUE, so the first line of each file is
  # treated as column names; pass header = FALSE if the files have no header row.
  season = read.csv(paste("../retrosheetData/gamelog/GL", yearID, ".TXT", sep=""))
  glheaders = read.csv("../retrosheetData/gamelog/game_log_header.csv")
  names(season) = names(glheaders)
  return(season)
}
read_season_range <- function(year_range) {
  seasons = read_season(year_range[1])
  for(y in year_range) {
    if(y == year_range[1])
      next
    s = read_season(y)
    seasons = rbind(seasons, s)
  }
  return(seasons)
}
sixties = read_season_range(1960:1969)

asked Dec 9, 2015 at 14:22
  • Just small notes: many of your = can be replaced by <-, which is often considered good style, and you might consider reading glheaders outside of the function, as it is not supposed to change. Also, you could write the for loop as for (y in year_range[-1]) so you can skip the nasty if. Commented Dec 10, 2015 at 9:26
  • I've seen it written in other places that <- is preferable to =, but I haven't seen it explained why. I find it annoying to type the arrow (two characters and the shift key) when my fingers are already trained to hit the equals key from every other programming language. Admittedly, that's not a really good reason... Commented Dec 10, 2015 at 13:38
  • Of course, it's not an error, just a matter of which style guideline you follow (e.g. this one). I think it originates from the fact that in some statistical languages = is the comparison operator; also, <- and = are evaluated differently when they appear in the argument of a function call. Most editors for R offer some kind of shortcut to type <-; maybe you can check the help. Commented Dec 10, 2015 at 15:42
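The remark in the last comment about = and <- behaving differently inside a function call can be seen in a few lines of base R. This is only a minimal sketch: average_of, values and scores are hypothetical names made up for the illustration, not part of the question's code.

```r
# Hypothetical helper, used only to have an argument name to play with.
average_of <- function(values) mean(values)

# Inside a call, `=` matches a value to the argument named `values`.
# No variable called `values` is created in the calling environment.
m1 <- average_of(values = c(1, 5, 9))
values_exists <- exists("values", inherits = FALSE)

# Inside a call, `<-` is an ordinary assignment: `scores` is created in the
# calling environment and its value is passed as the first argument.
m2 <- average_of(scores <- c(1, 5, 9))
scores_exists <- exists("scores", inherits = FALSE)

print(c(values_exists = values_exists, scores_exists = scores_exists))
```

At the top level of a script the two operators both assign, so the choice there is mostly a matter of style; the difference only shows up inside calls like the ones above.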

2 Answers

The for loop can be replaced with an lapply() call: lapply() returns a list with one data frame per year, and do.call(rbind, ...) then binds that list into a single data frame. Using only base R functions, your read_season_range example is equivalent to the following one-liner:

sixties <- do.call(rbind, lapply(1960:1969, read_season))

Or, to wrap it in a function:

read_season_range <- function(year_range) {
  do.call(rbind, lapply(year_range, read_season))
}

answered Dec 9, 2015 at 20:23
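The lapply/do.call pattern from the answer above can be tried without the Retrosheet files. The sketch below is an assumption-laden stand-in: it writes three tiny header-less files to a temporary directory, and the column names, team codes and the extra dir and col_names parameters are all hypothetical. It also passes the column names in once instead of re-reading a header file for every season, as suggested in the comments on the question.

```r
# Stand-in data: three tiny header-less "season" files in a temp directory.
# File contents and column names are made up, not the Retrosheet layout.
data_dir <- tempfile("gamelog")
dir.create(data_dir)
col_names <- c("date", "home", "visitor")
for (year in 1960:1962) {
  games <- data.frame(
    date = paste0(year, c("0412", "0413")),
    home = c("NYA", "BOS"),
    visitor = c("BOS", "NYA")
  )
  write.table(games, file.path(data_dir, paste0("GL", year, ".TXT")),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

# Read one season; header = FALSE because the stand-in files have no header row.
read_season <- function(year, dir, col_names) {
  path <- file.path(dir, paste0("GL", year, ".TXT"))
  read.csv(path, header = FALSE, col.names = col_names,
           stringsAsFactors = FALSE)
}

# lapply() builds a list of data frames; do.call(rbind, ...) binds them once.
read_season_range <- function(years, dir, col_names) {
  seasons <- lapply(years, read_season, dir = dir, col_names = col_names)
  do.call(rbind, seasons)
}

early_sixties <- read_season_range(1960:1962, data_dir, col_names)
nrow(early_sixties)  # 6 rows: 2 games in each of the 3 files
```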
Other solutions:

  • data.table::rbindlist(lapply(1960:1969, read_season))
  • dplyr::bind_rows(lapply(1960:1969, read_season))
  • plyr::rbind.fill(lapply(1960:1969, read_season))

The fastest are bind_rows from the dplyr package and rbindlist from the data.table package. In the benchmark below, both are roughly three to four times faster (by median) than do.call(rbind, ...) and plyr::rbind.fill, and the two are not distinguishable from each other (both fall in group a of the cld column).

Comparison of the performance:

LDF <- list(
  data.frame(V1 = runif(1000), V2 = sample(LETTERS, 1000, replace = TRUE)),
  data.frame(V1 = runif(1000), V2 = sample(LETTERS, 1000, replace = TRUE)),
  data.frame(V1 = runif(1000), V2 = sample(LETTERS, 1000, replace = TRUE)),
  data.frame(V1 = runif(1000), V2 = sample(LETTERS, 1000, replace = TRUE)))
microbenchmark::microbenchmark(
  do.call(rbind, LDF),
  plyr::rbind.fill(LDF),
  dplyr::bind_rows(LDF),
  data.table::rbindlist(LDF))
#> Unit: microseconds
#>                        expr     min       lq      mean   median       uq      max neval cld
#>         do.call(rbind, LDF) 822.387 908.9395 1008.4699 949.0085 987.5020 2800.581   100   b
#>       plyr::rbind.fill(LDF) 751.549 837.6055  960.6077 867.8145 932.1825 2639.683   100   b
#>       dplyr::bind_rows(LDF) 165.354 196.5525  218.4784 214.0425 236.4690  400.057   100  a
#>  data.table::rbindlist(LDF) 214.878 250.4435  278.0317 270.5885 295.2610  438.430   100  a

answered Jan 8, 2016 at 20:25
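The benchmark above binds only four data frames. The original loop in the question has an extra cost that grows with the number of files, because each rbind() inside the loop copies everything accumulated so far. A base-R-only sketch of that comparison follows; the data are made up, grow_in_loop and bind_once are hypothetical names, and the timings will differ from machine to machine, so none are shown.

```r
set.seed(1)
# 200 made-up data frames of 50 rows each, standing in for many season files.
pieces <- lapply(1:200, function(i) {
  data.frame(id = i, value = runif(50), stringsAsFactors = FALSE)
})

# Loop style from the question: grow the result one piece at a time.
grow_in_loop <- function(pieces) {
  out <- pieces[[1]]
  for (piece in pieces[-1]) {
    out <- rbind(out, piece)  # copies everything accumulated so far
  }
  out
}

# One-shot style from the first answer: bind the whole list in one call.
bind_once <- function(pieces) do.call(rbind, pieces)

loop_time <- system.time(looped <- grow_in_loop(pieces))
once_time <- system.time(bound <- bind_once(pieces))

# Both approaches should produce the same data frame.
same_result <- isTRUE(all.equal(looped, bound))
print(same_result)
print(c(loop = loop_time[["elapsed"]], once = once_time[["elapsed"]]))
```

For a more careful measurement, the same two functions could be passed to microbenchmark::microbenchmark() as in the answer above.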
