
I want to import and edit a large amount of data. I have multiple files containing time-series data sets that are chronologically ordered. I want to open and edit them so that I end up with a final dataset in which the files are cleaned up and ordered.

The code looks like this, but it is taking some time to load the data:

setwd("C:/Users/D60378/Desktop/DATA")
path_data <- 'test'
files_data <- list.files(path_data)
length(files_data)
for (i in 1:length(files_data)) {
  # use intermediary path if nested dir
  tempPath <- file.path(getwd(), path_data, files_data[i])
  # name of the dataset: file name without extension, '-' replaced by '_'
  name <- gsub('-', '_', substr(files_data[i], start = 1, stop = nchar(files_data[i]) - 4))
  print(name)
  # data rows start after line 5
  df <- read.csv(tempPath, skip = 5, header = FALSE, sep = ';', dec = ',')
  # column names are on line 5
  df_names <- read.csv(tempPath, skip = 4, nrows = 1, header = FALSE, sep = ';', dec = ',')
  dfnames_test_MS7 <- c()
  for (j in 1:length(df_names)) {
    print(as.character(df_names[[j]]))
    dfnames_test_MS7[j] <- as.character(df_names[[j]])
  }
  dfnames_test_MS7[1] <- "DateTime"
  for (j in 1:length(dfnames_test_MS7)) {
    dfnames_test_MS7[j] <- gsub(' ', '_', dfnames_test_MS7[j])
  }
  dfnames_test_MS7
  names(df) <- dfnames_test_MS7
  assign(name, df)
}
minem
asked Jul 27, 2017 at 12:45
  • How many csv files do you have? How large are they, how many time series do they contain, and how long are they? Commented Jul 31, 2017 at 8:10

1 Answer


1) Firstly, you could use fread from data.table to speed up reading of the .csv files (a minimal sketch follows this list).

2) It looks like you do not need the two inner loops; you could do that with vectorization (see the sketch further below).
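
For point 1), a minimal sketch of what the fread call could look like, assuming the same file layout as in the question (data starting after line 5, ';' separator, ',' decimal mark; the path is just a placeholder):

library(data.table)

tempPath <- file.path(path_data, files_data[1])  # any one of your files

# reads the same rows as the read.csv call in the question
df <- fread(tempPath, skip = 5, header = FALSE, sep = ";", dec = ",")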

If you could provide some example data for df and df_names (using dput), then I could write the necessary code and test the timings...
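
In the meantime, a rough sketch of what the vectorized replacement for the two inner loops could look like, assuming df_names is the one-row data frame read in the question:

# turn the one-row data frame of names into a character vector
dfnames_test_MS7 <- vapply(df_names, as.character, character(1))
dfnames_test_MS7[1] <- "DateTime"
# gsub is already vectorized over its input, so no loop is needed
dfnames_test_MS7 <- gsub(' ', '_', dfnames_test_MS7, fixed = TRUE)
names(df) <- dfnames_test_MS7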


Update:

It looks like, for smaller files (200 columns and 2000 rows), read_csv2 from readr is faster than data.table's fread:

Unit: milliseconds
                                                                expr      min       lq     mean   median       uq      max neval cld
 read.csv(tempPath, skip = 5, header = FALSE, sep = ";", dec = ",") 475.7547 480.2411 488.7772 484.3644 487.7255 515.8005     5   c
                                       read_csv2(tempPath, skip = 4) 179.2461 181.9832 182.2904 182.3569 182.4955 185.3702     5  a
                                fread(tempPath, skip = 4, dec = ",") 463.4811 468.1232 470.4556 468.3920 469.5664 482.7155     5  b

You should test it yourself and see whether there is a significant difference.
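
For reference, a sketch of how such a comparison could be reproduced on one of your own files (the skip values follow the file layout from the question):

library(microbenchmark)
library(readr)
library(data.table)

tempPath <- files_data[1]  # path to one of your csv files

microbenchmark(
  read.csv(tempPath, skip = 5, header = FALSE, sep = ";", dec = ","),
  read_csv2(tempPath, skip = 4),
  fread(tempPath, skip = 4, dec = ","),
  times = 5
)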

The final code could look something like this:

files_data <- list.files(path_data, full.names = T)
files_data
files_data <- grep(".csv", files_data, value = T, fixed = T)
length(files_data)
files_data # paths to files

myAproach <- function(i) {
  require(readr)
  tempPath <- files_data[i]
  name <- basename(tempPath)
  name <- gsub('-', '_',
               substr(name, start = 0, stop = nchar(name) - 4),
               fixed = T)
  print(name)
  df <- read_csv2(tempPath, skip = 4)
  dfnames_test_MS7 <- colnames(df)
  dfnames_test_MS7[1] <- "DateTime"
  dfnames_test_MS7 <- gsub('V', 'x', dfnames_test_MS7, fixed = T) # fixed = T for speed
  colnames(df) <- dfnames_test_MS7
  df
}

resDflist <- lapply(1:length(files_data), myAproach)

resDflist is a list of data.frames. In my opinion, it is easier to work with lists than to assign the data.frames to the global environment.
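
For example, you could name the list elements after their source files and, if all files share the same columns, stack them into one chronologically ordered table (rbindlist and setorder are data.table functions; this assumes the DateTime column is parsed consistently across files):

library(data.table)

# name each data.frame after its source file, extension stripped
names(resDflist) <- gsub('-', '_',
                         tools::file_path_sans_ext(basename(files_data)),
                         fixed = TRUE)

# bind into a single table and order by time
resDf <- rbindlist(resDflist, idcol = "source")
setorder(resDf, DateTime)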

answered Jul 31, 2017 at 7:13
  • +1, but a couple of comments (in the spirit of improving coding standards): you should make myAproach take a filename as input, so it does not rely on an object defined outside its scope (files_data). Prefer TRUE to T; the latter is frowned upon because it can be overwritten (T <- FALSE). Commented Aug 1, 2017 at 23:03
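
In that spirit, a sketch of what the refactored helper could look like (same logic as the answer's function, just taking the file path as an argument and spelling out TRUE):

myAproach <- function(tempPath) {
  require(readr)
  name <- basename(tempPath)
  name <- gsub('-', '_', substr(name, 1, nchar(name) - 4), fixed = TRUE)
  print(name)
  df <- read_csv2(tempPath, skip = 4)
  dfnames_test_MS7 <- colnames(df)
  dfnames_test_MS7[1] <- "DateTime"
  dfnames_test_MS7 <- gsub('V', 'x', dfnames_test_MS7, fixed = TRUE)
  colnames(df) <- dfnames_test_MS7
  df
}

resDflist <- lapply(files_data, myAproach)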
