Scaling up the intelligent concatenation of 2 columns in R

Question 1

I've got a huge dataset in R which contains (among other things) 2 columns holding a term code: "Term_code" and "Term_code1".

"Term_code" is the mostly-correct column, which usually contains a 6-digit code for year and academic term:

> head(data$Term_code)
[1] 201230 201230 201230 201230 201230 201230

However, over the years that this field was populated, it was occasionally left blank or populated with only the last 2 digits:

> head(data$Term_code[nchar(data$Term_code) < 6], 100)
 [1] NA NA NA 70 NA NA 10 NA NA 30 NA 30 NA 40 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [35] NA 10 NA 30 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [69] NA NA NA 50 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

A SAS analyst did his best to construct another field, "Term_code1", with the preferred 6-digit version of the code where it's missing or truncated in "Term_code".

Now I need to combine these 2 columns into a single, maximally-complete column vector within the dataframe.

I did this using a for loop, which works on a small scale, but is slow at best with Big Data and at worst breaks by hitting the max memory size (even when that value is maxed out).

This is how I wrote it:

for(i in 1:nrow(data)){
  if(is.na(data$Term_code[i]) && !(is.na(data$Term_code1[i]))){
    data$Term_code[i] <- data$Term_code1[i]
  }
} # Took a very long time to run, but improves data a lot
for(i in 1:nrow(data)){
  if(nchar(data$Term_code[i]) < (nchar(data$Term_code1[i]))){
    data$Term_code[i] <- data$Term_code1[i]
  }
} # Broke after hours due to memory constraint

How should I have written it?

While I can't provide the actual data for reproducibility, if the description of these 2 vectors is insufficient, perhaps this approximates it closely enough for testing purposes:

a <- data.frame(matrix(nrow = 999999, ncol = 2))
a[, 1] <- runif(min = 201210, max = 201560, n = 999999)
a[,1] <- round(a[,1])
b <- sample(a[,1], 1000)
a[a[,1] %in% b, 1] <- NA
b <- sample(a[,1], 250)
a[a[,1] %in% b, 1] <- substr(b, 5, 6)
a[,2] <- runif(min = 201210, max = 201560, n = 999999)
b <- sample(a[,2], 150)
a[a[,2] %in% b, 2] <- NA

Answer

R is well-suited for vector operations. It's not well-suited for loops, which you should avoid as much as possible.

Consider this condition in your code:

if(is.na(data$Term_code[i]) && !(is.na(data$Term_code1[i]))){
 data$Term_code[i] <- data$Term_code1[i]
}

What if we remove the second condition, the one after the &&?

if(is.na(data$Term_code[i])){
 data$Term_code[i] <- data$Term_code1[i]
}

The outcome won't change, right? If Term_code1 is not NA, we take its value. If it is NA, Term_code was already NA, so overwriting it with NA makes no difference.
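To make that claim concrete, here is a minimal sketch that runs both versions of the loop on a few made-up values and compares the results. The vectors term_code and term_code1 and their contents are invented for the illustration, and storing the codes as character is an assumption about the real data.

```r
# Toy vectors (invented values) covering the NA combinations that matter.
term_code  <- c("201230", NA, NA, "30")
term_code1 <- c("201230", "201310", NA, NA)

# Loop with both conditions, as in the question.
x <- term_code
for (i in seq_along(x)) {
  if (is.na(x[i]) && !is.na(term_code1[i])) {
    x[i] <- term_code1[i]
  }
}

# Loop with only the first condition.
y <- term_code
for (i in seq_along(y)) {
  if (is.na(y[i])) {
    y[i] <- term_code1[i]
  }
}

# Both loops leave the same result.
identical(x, y)
```

The third element is NA in both inputs, and it ends up NA either way, which is the case the simplification relies on.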

The good thing about this transformation is that now we can convert this directly to a vector operation without a loop:

no.term.code <- is.na(data$Term_code)
data$Term_code[no.term.code] <- data$Term_code1[no.term.code]

That is, set the values of Term_code to the values of Term_code1 where Term_code is NA.
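As a quick illustration, here is a small sketch of the same two lines applied to a made-up five-row data frame. Only the column names come from the question; the values are invented, and the codes are assumed to be stored as character.

```r
# Toy data: values are invented; column names follow the question.
data <- data.frame(
  Term_code  = c("201230", NA, "30", NA, "201410"),
  Term_code1 = c("201230", "201310", "201330", NA, "201410"),
  stringsAsFactors = FALSE
)

# Logical index of the rows where Term_code is missing.
no.term.code <- is.na(data$Term_code)

# Fill only those rows from Term_code1 in one vectorized assignment.
data$Term_code[no.term.code] <- data$Term_code1[no.term.code]

data$Term_code
# "201230" "201310" "30" NA "201410"
```

Note that the truncated value "30" in the third row is left alone: this step only handles the missing codes, not the 2-digit ones.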

Comment 1

+1 for the vectorization, that's the way. Though your solution needs improvement to cover all the cases from the OP's code. I feel ifelse is a good tool to introduce here, since we are clearly choosing between two vectors. I am thinking data$Term_code <- with(data, ifelse(is.na(Term_code) | nchar(Term_code) < nchar(Term_code1), Term_code1, Term_code)). In plain English: "where Term_code is NA or has fewer characters than Term_code1, use Term_code1; otherwise use Term_code".
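A runnable sketch of that idea on made-up data might look like the following. One hedge: as far as I know, what nchar() returns for a missing value depends on the R version and the column type, so this variant adds an explicit !is.na(Term_code1) guard (my addition, not part of the suggestion above) to keep the test from ever being NA. The values and the helper name use.code1 are invented, and the codes are assumed to be character.

```r
# Toy data: values are invented; column names follow the question.
data <- data.frame(
  Term_code  = c("201230", NA, "30", NA, "201410"),
  Term_code1 = c("201230", "201310", "201330", NA, NA),
  stringsAsFactors = FALSE
)

# Use Term_code1 only where it is present AND Term_code is missing or shorter.
# The !is.na(Term_code1) guard keeps the test from evaluating to NA.
use.code1 <- with(data,
  !is.na(Term_code1) &
    (is.na(Term_code) | nchar(Term_code) < nchar(Term_code1))
)

# Choose element-wise between the two columns.
data$Term_code <- ifelse(use.code1, data$Term_code1, data$Term_code)

data$Term_code
# "201230" "201310" "201330" NA "201410"
```

The last row shows why the guard matters: Term_code is valid and Term_code1 is missing, so the existing value is kept rather than being replaced.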

Comment 2

Nice improvement! +1

Comment 3

@flodel looks like you covered the harder part of the OP's code. You should make that an answer, and frankly, it should get accepted too, instead of mine. Do it man, fair and square.

RoboSanta · Accepted Answer · 2015-06-12 20:43:43Z
