
I want to import and edit a large amount of data. I have multiple files containing time-series data sets that are chronologically ordered. I want to open and edit them so that I end up with a final dataset in which the files are cleaned up and ordered.

The code looks like this, but it is taking some time to load the data:

setwd("C:/Users/D60378/Desktop/DATA")
path_data <- 'test'
files_data <- list.files(path_data)
length(files_data)
for (i in 1:length(files_data)) {
  # use intermediary path if nested dir
  tempPath <- file.path(getwd(), path_data, files_data[i])
  # name of the dataset: file name without extension, '-' replaced by '_'
  name <- gsub('-', '_', substr(files_data[i], start = 1, stop = nchar(files_data[i]) - 4))
  print(name)
  # data rows start after line 5
  df <- read.csv(tempPath, skip = 5, header = FALSE, sep = ';', dec = ',')
  # column names are on line 5
  df_names <- read.csv(tempPath, skip = 4, nrows = 1, header = FALSE, sep = ';', dec = ',')
  dfnames_test_MS7 <- c()
  for (j in 1:length(df_names)) {
    print(as.character(df_names[[j]]))
    dfnames_test_MS7[j] <- as.character(df_names[[j]])
  }
  dfnames_test_MS7[1] <- "DateTime"
  for (j in 1:length(dfnames_test_MS7)) {
    dfnames_test_MS7[j] <- gsub(' ', '_', dfnames_test_MS7[j])
  }
  dfnames_test_MS7
  names(df) <- dfnames_test_MS7
  assign(name, df)
}
minem
asked Jul 27, 2017 at 12:45
  • How many csv files do you have? How large are they, how many time series do they contain, and how long are they? Commented Jul 31, 2017 at 8:10

1 Answer


1) Firstly, you could use fread from data.table to speed up reading of the .csv files (a minimal sketch follows this list).

2) It looks like you do not need the two inner loops; you could do that with vectorization (see the sketch further below).
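
For point 1), a minimal sketch of what the fread call could look like, assuming the same file layout as in the question (data starting after line 5, ';' separator, ',' decimal mark; the path is just a placeholder):

library(data.table)

tempPath <- file.path(path_data, files_data[1])  # any one of your files

# reads the same rows as the read.csv call in the question
df <- fread(tempPath, skip = 5, header = FALSE, sep = ";", dec = ",")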

If you could provide some example data for df and df_names (using dput), then I could write the necessary code and test the timings...
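
In the meantime, a rough sketch of what the vectorized replacement for the two inner loops could look like, assuming df_names is the one-row data frame read in the question:

# turn the one-row data frame of names into a character vector
dfnames_test_MS7 <- vapply(df_names, as.character, character(1))
dfnames_test_MS7[1] <- "DateTime"
# gsub is already vectorized over its input, so no loop is needed
dfnames_test_MS7 <- gsub(' ', '_', dfnames_test_MS7, fixed = TRUE)
names(df) <- dfnames_test_MS7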


Update:

It looks like, for smaller files (200 columns and 2000 rows), read_csv2 from readr is faster than data.table's fread:

Unit: milliseconds
                                                                expr      min       lq     mean   median       uq      max neval cld
 read.csv(tempPath, skip = 5, header = FALSE, sep = ";", dec = ",") 475.7547 480.2411 488.7772 484.3644 487.7255 515.8005     5   c
                                       read_csv2(tempPath, skip = 4) 179.2461 181.9832 182.2904 182.3569 182.4955 185.3702     5  a
                                fread(tempPath, skip = 4, dec = ",") 463.4811 468.1232 470.4556 468.3920 469.5664 482.7155     5  b

You should test it yourself and see whether there is a significant difference.
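
For reference, a sketch of how such a comparison could be reproduced on one of your own files (the skip values follow the file layout from the question):

library(microbenchmark)
library(readr)
library(data.table)

tempPath <- files_data[1]  # path to one of your csv files

microbenchmark(
  read.csv(tempPath, skip = 5, header = FALSE, sep = ";", dec = ","),
  read_csv2(tempPath, skip = 4),
  fread(tempPath, skip = 4, dec = ","),
  times = 5
)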

The final code could look something like this:

files_data <- list.files(path_data, full.names = T)
files_data
files_data <- grep(".csv", files_data, value = T, fixed = T)
length(files_data)
files_data # paths to files

myAproach <- function(i) {
  require(readr)
  tempPath <- files_data[i]
  name <- basename(tempPath)
  name <- gsub('-', '_',
               substr(name, start = 0, stop = nchar(name) - 4),
               fixed = T)
  print(name)
  df <- read_csv2(tempPath, skip = 4)
  dfnames_test_MS7 <- colnames(df)
  dfnames_test_MS7[1] <- "DateTime"
  dfnames_test_MS7 <- gsub('V', 'x', dfnames_test_MS7, fixed = T) # fixed = T for speed
  colnames(df) <- dfnames_test_MS7
  df
}

resDflist <- lapply(1:length(files_data), myAproach)

resDflist is a list of data.frames. In my opinion, it is easier to work with lists than to assign the data.frames to the global environment.
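
For example, you could name the list elements after their source files and, if all files share the same columns, stack them into one chronologically ordered table (rbindlist and setorder are data.table functions; this assumes the DateTime column is parsed consistently across files):

library(data.table)

# name each data.frame after its source file, extension stripped
names(resDflist) <- gsub('-', '_',
                         tools::file_path_sans_ext(basename(files_data)),
                         fixed = TRUE)

# bind into a single table and order by time
resDf <- rbindlist(resDflist, idcol = "source")
setorder(resDf, DateTime)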

answered Jul 31, 2017 at 7:13
  • +1, but a couple of comments (in the spirit of improving coding standards): you should make myAproach take a filename as input, so it does not rely on an object defined outside its scope (files_data). Prefer TRUE to T; the latter is frowned upon because it can be overwritten (T <- FALSE). Commented Aug 1, 2017 at 23:03
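
In that spirit, a sketch of what the refactored helper could look like (same logic as the answer's function, just taking the file path as an argument and spelling out TRUE):

myAproach <- function(tempPath) {
  require(readr)
  name <- basename(tempPath)
  name <- gsub('-', '_', substr(name, 1, nchar(name) - 4), fixed = TRUE)
  print(name)
  df <- read_csv2(tempPath, skip = 4)
  dfnames_test_MS7 <- colnames(df)
  dfnames_test_MS7[1] <- "DateTime"
  dfnames_test_MS7 <- gsub('V', 'x', dfnames_test_MS7, fixed = TRUE)
  colnames(df) <- dfnames_test_MS7
  df
}

resDflist <- lapply(files_data, myAproach)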
