I'm trying to read in a lot of data as efficiently as possible; the data are in about 1,400 CSV files spread across 6 ZIP files. The CSV files all contain similar time-series data with the same columns, except that some CSV files have an extra column ("yield_c" in this example). I need to import all of the data and would prefer not to unzip the ZIP files. I've had success with two approaches, but both have problems; we'll call them A and B. Approach A runs much faster than B, but it adds each CSV header as a new data row and (likely as a result) reads every column as character. Approach B processes the full dataset correctly but is abominably slow.
Any tips on coding this for better efficiency and a correct data pull? I think I am missing something simple about how fread processes the cmd input, but I can't figure out how to make approach A work correctly without it pulling in the header rows every time.
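For reference, peeking at the raw stream that approach A hands to fread shows the repeated header lines (a quick diagnostic sketch; it assumes the sample archives created below, e.g. temp/zip1.zip, and an external unzip on the PATH):
# unzip -p streams every CSV in the archive back to back, so each member's
# header line reappears mid-stream and ends up being read as a data row
stream <- system("unzip -p temp/zip1.zip", intern = TRUE)
grep("date", stream)   # one header line per CSV inside the zip, not just line 1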
Here is a sample dataset of "only" 240 CSV files split among 6 ZIP files, along with my two data-processor functions, datprocessorA() and datprocessorB(). Currently (for me) approach A takes a few seconds and approach B takes about 2.5 minutes.
Apologies for any formatting or etiquette issues, I don't post here as I can usually find a solution. Similarly, apologies if this is a dupe but I've scoured the forums and haven't found anything to help this specific case; plenty of related issues but not this one in particular.
library(data.table)
library(magrittr)   # needed for the %>% pipe used in datprocessorB()
# Create sample CSV files and zip them in a folder
tempath <- "temp"
dir.create(file.path(tempath), showWarnings = FALSE)
setwd(tempath)
dfs <- list()
num_dfs <- 240
for (i in 1:num_dfs) {
  data <- data.frame(date = seq(as.Date("1980-10-01"), as.Date("2020-09-01"), by = "month"),
                     Subset = paste0("Subset", i),
                     yield_a = sample(0:20, 480, replace = TRUE),
                     yield_b = sample(0:5, 480, replace = TRUE))
  df_name <- paste0("df", i)
  dfs[[df_name]] <- data
}
# randomly (ish) add "yield_c" column to a subset of dataframes
seqz <- sort(unique(c(1, sample(seq(1, num_dfs, by = 5), size = num_dfs %/% 3, replace = TRUE))))
for (j in seqz) {
  dfs[[j]]$yield_c <- sample(1:10, 480, replace = TRUE)
}
# write all to csv, zip up, delete csvs
for (k in seq_along(dfs)) {
  file_name <- paste0("file", k, ".csv")                    # create a unique filename
  write.csv(dfs[[k]], file = file_name, row.names = FALSE)  # save to CSV
}
zipcount <- 6
for (l in 1:zipcount) {
  zip(paste0("zip", l, ".zip"),
      paste0("file", seq(num_dfs/zipcount*(l-1) + 1, num_dfs/zipcount*l, 1), ".csv"))
}
file.remove(list.files(pattern = "\\.csv$", full.names = TRUE))
# import
# return to parent directory to more closely imitate my scenario
setwd("..")
#get path for all zip files
zips <- list.files(path = tempath, full.names = TRUE)
# first function to import all csvs into R as a table
datprocessorA <- function(zipdir) {
  connz <- paste("unzip -p", zipdir)
  datz <- rbindlist(lapply(connz, function(x) fread(x, sep = ",", header = TRUE,
                                                    stringsAsFactors = FALSE, fill = Inf)),
                    use.names = TRUE, fill = TRUE)
  datz
}
# alternative function to import all csvs into R as a table
datprocessorB <- function(zipdir) {
  datz <- lapply(setNames(nm = unzip(zipdir, list = TRUE)$Name),
                 function(fn) fread(cmd = paste("unzip -p -a", zipdir, fn), header = TRUE,
                                    stringsAsFactors = FALSE, fill = Inf)) %>%
    rbindlist(use.names = TRUE, fill = TRUE)
  datz
}
# for loop to process import function; choose datprocessorA or datprocessorB
funx <- datprocessorA
begin <- Sys.time()
for (i in seq_along(zips)) {
  zipdir <- zips[i]
  if (i == 1) {
    import <- funx(zipdir)
  } else {
    dat2 <- funx(zipdir)
    import <- rbind(import, dat2, fill = TRUE)
    rm(dat2)
  }
}
end <- Sys.time()
print(paste0("Approach A length: ",round(end-begin), " seconds"))
funx <- datprocessorB
begin <- Sys.time()
for (i in seq_along(zips)) {
  zipdir <- zips[i]
  if (i == 1) {
    import <- funx(zipdir)
  } else {
    dat2 <- funx(zipdir)
    import <- rbind(import, dat2, fill = TRUE)
    rm(dat2)
  }
}
end <- Sys.time()
print(paste0("Approach B length: ",round(end-begin), " minutes"))
3 Answers
The following function seems to do what you want and is fast.
It reads all files in the zips passed in a vector zip_vector, with no need for a for loop.
datprocessorC <- function(zip_vector) {
  f <- function(zipfile) {
    fls <- unzip(zipfile, list = TRUE)$Name
    lapply(fls, \(f) fread(unzip(zipfile, files = f))) |> rbindlist(fill = TRUE)
  }
  lapply(zip_vector, f) |> rbindlist(fill = TRUE)
}
# run on the zip files vector returned by list.files()
importC <- datprocessorC(zips)
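Note that unzip(zipfile, files = f) extracts each CSV into the working directory before fread reads it. If those scratch copies are unwanted, they can be removed afterwards (a minimal sketch, reusing the same pattern-based cleanup as in the setup code above; it assumes no other .csv files you want to keep are sitting there):
# clean up the CSVs that unzip() dropped into the working directory
file.remove(list.files(pattern = "\\.csv$"))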
Edit
Here is another function that reads the files with unz()/readr::read_csv() without unzipping to disk; no file1.csv, file2.csv, etc. are created.
This function takes about twice the time of datprocessorC to extract the data.
datprocessorD <- function(zip_vector) {
  f <- function(zipfile) {
    fls <- unzip(zipfile, list = TRUE)$Name
    out <- vector("list", length(fls))
    for (i in seq_along(fls)) {
      tmp <- unz(zipfile, filename = fls[i])
      out[[i]] <- readr::read_csv(file = tmp, show_col_types = FALSE)
    }
    data.table::rbindlist(out, fill = TRUE)
  }
  lapply(zip_vector, f) |> data.table::rbindlist(fill = TRUE)
}
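Usage mirrors datprocessorC (importD is just an illustrative name):
importD <- datprocessorD(zips)
summary(importD$yield_c)   # NA wherever a file had no yield_c column, thanks to fill = TRUE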
As noted in the comments, unz() only extracts one file within the .zip at a time, which is why the inner for loop is needed.
Sorry, no fread(), but DuckDB combined with the zipfs community extension is also reasonably fast, even though the zipfs devs warn that it is more about convenience than high performance (see Performance considerations).
And globbing for both zip files and archive content is indeed nice.
DuckDB through duckplyr, timing stops after full materialization:
library(duckplyr)
db_exec("INSTALL zipfs FROM community")
db_exec("LOAD zipfs")
tictoc::tic()
import_dd <-
  read_csv_duckdb(
    path = "zip://temp/*.zip/*.csv",
    options = list(union_by_name = TRUE)
  ) |>
  data.table::setDT()
tictoc::toc()
#> 2.71 sec elapsed
import_dd
#>               date    Subset yield_a yield_b yield_c
#>             <Date>    <char>   <num>   <num>   <num>
#>      1: 1980-10-01   Subset1       4       2       3
#>      2: 1980-11-01   Subset1       8       1       3
#>      3: 1980-12-01   Subset1       3       1       9
#>      4: 1981-01-01   Subset1       1       1       2
#>      5: 1981-02-01   Subset1      17       0       9
#>     ---
#> 115196: 2020-05-01 Subset240      18       0      NA
#> 115197: 2020-06-01 Subset240       3       3      NA
#> 115198: 2020-07-01 Subset240       5       3      NA
#> 115199: 2020-08-01 Subset240      15       0      NA
#> 115200: 2020-09-01 Subset240       6       4      NA
This was with default DuckDB settings; it sets the thread count to the number of CPU cores, 8 in my case (Core i7).
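If you want to benchmark other levels of parallelism, the thread count can be changed before reading (a sketch; 4 is an arbitrary value, and threads is DuckDB's standard setting, issued here through the same db_exec() used above):
db_exec("SET threads TO 4")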
Using the external tar command (which also unzips and is included with Windows 11 and most Linux distributions), iterate over the zip files, running tar and excluding the header rows, and then read the result in with fread. This is for Windows 11; on Linux, use the commented-out lines instead of the two lines just before them.
library(data.table)
p <- proc.time()
col.names <- c("date", "Subset", "yield_a", "yield_b", "yield_c")
tar <- Sys.which("tar")[[1]]
fmt <- paste(tar, "-xOf %s | findstr /b /v .date")
# unzip <- Sys.which("unzip")[[1]]
# fmt <- paste(unzip, "-p %s | grep -v date") # uncomment on Linux
res <- Sys.glob("*.zip") |>
  lapply(\(x) fread(cmd = sprintf(fmt, x), col.names = col.names, fill = TRUE)) |>
  rbindlist()
proc.time() - p
##    user  system elapsed
##    0.02    0.14    0.85
Old
Using the external 7z program, this takes 7 seconds on a Windows 11 system (i7 processor). It does not create any intermediate files. It first reads in and concatenates all the files in each zip file, then splits out the header lines, keeping only the longest (which it puts up front), and finally reads the resulting text as a CSV.
library(magrittr)
p <- proc.time()
dat <- zips %>%
  lapply(function(file) shell(paste("7z x -so", file), intern = TRUE)) %>%
  unlist %>%
  split(!grepl("date", .)) %>%
  within(`FALSE` <- `FALSE`[which.max(nchar(`FALSE`))]) %>%
  unlist %>%
  read.csv(text = .)
proc.time() - p
##    user  system elapsed
##    0.87    4.80    7.02
Using gawk, we can process this in under a second on my PC. This is for Windows, but it would not be hard to modify it for Linux.
dir /b *.zip | gawk -e "{ system(\"7z.exe x -so \" $0) }" > all.csv
findstr /b .date all.csv | sort /r | gawk "NR==1" > final.csv
findstr /b "[0-9]" all.csv >> final.csv