Merge together multiple dataframes

Question 1

I'm working on a script where I want to create one dataframe from a series of files with annual baseball data in them.

I'm relatively new to R, and I feel like the way I wrote read_season_range() is probably more like writing C# in R than doing it the idiomatic R way. Is there a cleaner way to do this?

read_season <- function(yearID) {
  season = read.csv(paste("../retrosheetData/gamelog/GL", yearID, ".TXT", sep=""))
  glheaders = read.csv("../retrosheetData/gamelog/game_log_header.csv")
  names(season) = names(glheaders)
  return(season)
}

read_season_range <- function(year_range) {
  seasons = read_season(year_range[1])
  for(y in year_range) {
    if(y == year_range[1])
      next
    s = read_season(y)
    seasons = rbind(seasons, s)
  }
  return(seasons)
}

sixties = read_season_range(1960:1969)

Question 2

Just a few small notes: many of your = can be replaced by <-, which is often considered good style, and you might consider reading glheaders outside of the function, as it is not supposed to change. Also, you could write the for loop as for (y in year_range[-1]) so you can skip the nasty if.
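The loop change suggested in this comment could look roughly like the sketch below. So that it runs without the Retrosheet files, read_season is replaced here by a hypothetical stand-in that returns one small data frame per year; only the loop over year_range[-1] is the point.

```r
# Stand-in for the real reader so the sketch runs without the data files
# (assumption: the real function returns one data frame per year).
read_season <- function(yearID) data.frame(year = yearID, games = 2L)

read_season_range <- function(year_range) {
  seasons <- read_season(year_range[1])
  # year_range[-1] drops the first element, so no if/next is needed.
  # For a single year it is empty and the loop body never runs.
  for (y in year_range[-1]) {
    seasons <- rbind(seasons, read_season(y))
  }
  seasons
}

sixties <- read_season_range(1960:1969)
nrow(sixties)
```

This still grows the data frame one rbind at a time, so it only removes the if, not the loop itself.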

Question 3

I've seen it written other places that <- is preferable to = but I haven't seen it explained why. I find it annoying to type the arrow (two characters and the shift key) when my fingers are already trained to hit the equals key from every other programming language. Admittedly, that's not a really good reason...

Question 4

Of course, it's not an error, just a matter of which style guideline you follow. I think it originates from the fact that in some statistical languages = is the comparison operator. Also, <- and = are evaluated differently inside the argument list of a function call: = names an argument, while <- performs an assignment. Most editors for R offer some kind of shortcut to type <-; maybe you can check the help.
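The difference inside a function call can be seen in a small sketch. The helper half is hypothetical and exists only for this illustration; the example assumes a fresh session in which no variable called value exists yet.

```r
# `half` is a hypothetical helper used only for this illustration.
half <- function(value) value / 2

# `=` inside a call matches the argument named `value`;
# no variable is created in the calling environment.
a <- half(value = 10)
exists_after_equals <- exists("value", inherits = FALSE)

# `<-` inside a call is an ordinary assignment: it creates `value`
# in the calling environment and passes the result positionally.
b <- half(value <- 10)
exists_after_arrow <- exists("value", inherits = FALSE)

print(c(equals = exists_after_equals, arrow = exists_after_arrow))
```

Both calls return the same result; the only difference is the side effect of the second one, which leaves a variable value behind in the caller.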

Question 5

The for loop can be replaced with an lapply call. Using base R functions, your read_season_range example is equivalent to the following one-liner:

sixties <- do.call(rbind, lapply(1960:1969, read_season))

or to wrap it in a function:

read_season_range <- function(year_range) {
 do.call(rbind, lapply(year_range, read_season))
}
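To make the two steps of that one-liner visible, here is a minimal sketch that splits it up. The read_season defined here is a hypothetical stand-in (an assumption, so the example runs without the data files), not the function from the question.

```r
# Stand-in for the real reader (assumption): one small data frame per year.
read_season <- function(yearID) data.frame(year = yearID, games = 2L)

# lapply returns a list with one data frame per year.
season_list <- lapply(1960:1962, read_season)

# do.call passes the list elements as separate arguments to rbind ...
combined <- do.call(rbind, season_list)

# ... so it does the same as writing the call out by hand.
by_hand <- rbind(season_list[[1]], season_list[[2]], season_list[[3]])

identical(combined, by_hand)
```

The advantage of do.call is that the number of data frames does not have to be known when the code is written.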

Question 6

Other solutions:

data.table::rbindlist(lapply(1960:1969, read_season))
dplyr::bind_rows(lapply(1960:1969, read_season))
plyr::rbind.fill(lapply(1960:1969, read_season))

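In the benchmark below, dplyr::bind_rows and data.table::rbindlist are the fastest, several times faster than do.call(rbind, ...) and plyr::rbind.fill by median time. The difference between the two is not significant in this run (both fall into group a of the cld column).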

Comparison of the performance:

LDF <- list(
  data.frame(V1 = runif(1000), V2 = sample(LETTERS, 1000, replace = TRUE)),
  data.frame(V1 = runif(1000), V2 = sample(LETTERS, 1000, replace = TRUE)),
  data.frame(V1 = runif(1000), V2 = sample(LETTERS, 1000, replace = TRUE)),
  data.frame(V1 = runif(1000), V2 = sample(LETTERS, 1000, replace = TRUE)))
microbenchmark::microbenchmark(
  do.call(rbind, LDF),
  plyr::rbind.fill(LDF),
  dplyr::bind_rows(LDF),
  data.table::rbindlist(LDF))
#> Unit: microseconds
#>                        expr     min       lq      mean   median       uq      max neval cld
#>         do.call(rbind, LDF) 822.387 908.9395 1008.4699 949.0085 987.5020 2800.581   100   b
#>       plyr::rbind.fill(LDF) 751.549 837.6055  960.6077 867.8145 932.1825 2639.683   100   b
#>       dplyr::bind_rows(LDF) 165.354 196.5525  218.4784 214.0425 236.4690  400.057   100  a
#>  data.table::rbindlist(LDF) 214.878 250.4435  278.0317 270.5885 295.2610  438.430   100  a

Accepted answer: the lapply / do.call answer above, by rcs, 2015-12-09 20:23:56Z
