I've got a huge dataset in R which contains (among other things) 2 columns holding a term code: "Term_code" and "Term_code1".
"Term_code" is the mostly-correct column, which usually contains a 6-digit code for year and academic term:
> head(data$Term_code)
[1] 201230 201230 201230 201230 201230 201230
However, over the years that this field was populated, it was occasionally left blank or populated with only the last 2 digits:
> head(data$Term_code[nchar(data$Term_code) < 6], 100)
[1] NA NA NA 70 NA NA 10 NA NA 30 NA 30 NA 40 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[35] NA 10 NA 30 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[69] NA NA NA 50 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
A SAS analyst did his best to construct another field, "Term_code1", with the preferred 6-digit version of the code where it's missing or truncated in "Term_code".
Now I need to combine these 2 columns into a single, maximally-complete column vector within the dataframe.
I did this using a for loop, which works on a small scale, but is slow at best with Big Data and at worst breaks by hitting the max memory size (even when that value is maxed out).
This is how I wrote it:
for(i in 1:nrow(data)){
  if(is.na(data$Term_code[i]) && !(is.na(data$Term_code1[i]))){
    data$Term_code[i] <- data$Term_code1[i]
  }
} # Took a very long time to run, but improves data a lot
for(i in 1:nrow(data)){
  if(nchar(data$Term_code[i]) < (nchar(data$Term_code1[i]))){
    data$Term_code[i] <- data$Term_code1[i]
  }
} # Broke after hours due to memory constraint
How should I have written it?
While I can't provide the actual data for reproducibility, if the description of these 2 vectors is insufficient, perhaps this approximates it closely enough for testing purposes:
a <- data.frame(matrix(nrow = 999999, ncol = 2))
a[, 1] <- runif(min = 201210, max = 201560, n = 999999)
a[,1] <- round(a[,1])
b <- sample(a[,1], 1000)
a[a[,1] %in% b, 1] <- NA
b <- sample(a[,1], 250)
a[a[,1] %in% b, 1] <- substr(b, 5, 6)
a[,2] <- runif(min = 201210, max = 201560, n = 999999)
b <- sample(a[,2], 150)
a[a[,2] %in% b, 2] <- NA
1 Answer
R is well-suited for vector operations. It's not well-suited for loops, which you should avoid as much as possible.
Consider this condition in your code:
if(is.na(data$Term_code[i]) && !(is.na(data$Term_code1[i]))){ data$Term_code[i] <- data$Term_code1[i] }
What if we remove the second condition after the &&:
if(is.na(data$Term_code[i])){ data$Term_code[i] <- data$Term_code1[i] }
The outcome won't change, right? If Term_code1 is not NA, then we'll get its value; if it's NA, well, we already had NA, so it makes no difference.
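As a quick check of that claim, here is a minimal sketch that runs both versions of the loop on a toy pair of vectors and compares the results. The vectors `term_code` and `term_code1` are hypothetical stand-ins for the two columns, and storing them as character is an assumption.

```r
# Hypothetical toy vectors covering all four NA combinations
term_code  <- c("201230", NA, "201310", NA)
term_code1 <- c("201230", "201240", NA, NA)

# Loop with both conditions, as in the question
with_both <- term_code
for (i in seq_along(with_both)) {
  if (is.na(with_both[i]) && !is.na(term_code1[i])) {
    with_both[i] <- term_code1[i]
  }
}

# Loop with only the first condition
with_one <- term_code
for (i in seq_along(with_one)) {
  if (is.na(with_one[i])) {
    with_one[i] <- term_code1[i]
  }
}

# Both loops should produce the same vector
print(identical(with_both, with_one))
```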
The good thing about this transformation is that now we can convert this directly to a vector operation without a loop:
no.term.code <- is.na(data$Term_code)
data$Term_code[no.term.code] <- data$Term_code1[no.term.code]
That is, set the values of Term_code to the values of Term_code1 where Term_code is NA.
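To see the vectorized assignment in action, here is a small sketch on a toy data frame. The values are made up for illustration, and keeping the codes as character strings is an assumption about how the real columns are stored.

```r
# Toy data (hypothetical values), stored as character
data <- data.frame(
  Term_code  = c("201230", NA, "201310", NA),
  Term_code1 = c("201230", "201240", NA, NA),
  stringsAsFactors = FALSE
)

# Logical mask of the rows where Term_code is missing
no.term.code <- is.na(data$Term_code)

# Copy Term_code1 into those rows in one vectorized assignment
data$Term_code[no.term.code] <- data$Term_code1[no.term.code]

print(data$Term_code)
```

Rows where both columns are NA simply stay NA, which matches the behaviour of the loop.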
+1 for the vectorization, that's the way. Though your solution needs improvement to cover all the cases from the OP's code. I feel ifelse is a good tool to introduce here since we are clearly choosing between two vectors. I am thinking data$Term_code <- with(data, ifelse(is.na(Term_code) | nchar(Term_code) < nchar(Term_code1), Term_code1, Term_code)). In plain English, that's "where Term_code is NA or it has fewer characters than Term_code1, use Term_code1, otherwise use Term_code". - flodel, Jun 12, 2015 at 23:21

Nice improvement! +1 - Hack-R, Jun 13, 2015 at 0:36

@flodel looks like you covered the harder part of the OP's code. You should make that an answer, and frankly, it should get accepted too, instead of mine. Do it man, fair and square. - RoboSanta, Jun 13, 2015 at 5:28
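The ifelse idea from the comments can be sketched on a toy data frame as below. The values are hypothetical and the columns are assumed to be character. The extra `!is.na(Term_code1)` guard is an addition that is not in the comment: it keeps the existing Term_code when Term_code1 has nothing to offer, because comparing against the `nchar()` of a missing value may otherwise yield NA.

```r
# Toy data (hypothetical values): complete, missing, and truncated codes
data <- data.frame(
  Term_code  = c("201230", NA, "30", "201310", "40"),
  Term_code1 = c("201230", "201240", "201330", NA, NA),
  stringsAsFactors = FALSE
)

# Use Term_code1 where it exists and Term_code is NA or shorter than it
use.code1 <- with(data,
  !is.na(Term_code1) &
    (is.na(Term_code) | nchar(Term_code) < nchar(Term_code1)))

# Choose element-wise between the two columns
data$Term_code <- ifelse(use.code1, data$Term_code1, data$Term_code)

print(data$Term_code)
```

This replaces both loops from the question with a single pass over the columns; truncated codes with no 6-digit counterpart (the last row here) are left as they are.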