Speed up sorting algorithm in R: make one column "smaller" than the other

Question 1

Consider the following sorting algorithm:

df <- data.frame(food_1 = c("APPLE 1534", "PEAR 2525", "BANANA 3045", "WATERMELON 5000"),
                 food_2 = c("ORANGE 2035", "BROCCOLI 5000", "BLUEBERRY 2000", "TOMATO 3000"),
                 stringsAsFactors = FALSE)
# Sorting
for (i in 1:nrow(df)){
  foods <- sort(c(df$food_1[i], df$food_2[i]))
  df$food_1[i] <- foods[1]
  df$food_2[i] <- foods[2]
}

I have used the code above on data frames with 250,000+ rows, and I'm not sure how to make it more efficient.

Question 2

The only other option I can think of is transposing and sorting with apply, which is still just a loop.
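For reference, the apply-based variant mentioned here could look roughly like the sketch below. It reuses the toy data from the question; the intermediate name `sorted` is only illustrative. As noted, apply still calls sort once per row, so it is not expected to be much faster than the explicit loop.

```r
# Toy data from the question
df <- data.frame(food_1 = c("APPLE 1534", "PEAR 2525", "BANANA 3045", "WATERMELON 5000"),
                 food_2 = c("ORANGE 2035", "BROCCOLI 5000", "BLUEBERRY 2000", "TOMATO 3000"),
                 stringsAsFactors = FALSE)

# apply() over rows returns each sorted pair as a column, so transpose back
sorted <- t(apply(as.matrix(df), 1, sort))

# unname() drops any names carried over from the matrix
df$food_1 <- unname(sorted[, 1])
df$food_2 <- unname(sorted[, 2])
```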

Question 3

Can you confirm that you are only looking to sort two columns as in your example, and that the answer won't need to generalize to more (>2) columns?

Question 4

@flodel Eventually I will need to, but for now, let's just focus on the two-column case. If it generalizes to more than 2, that would be a bonus at this point. I will eventually need to extend to a 4-column case, but no more than that.

Question 5

I would use the vectorized functions pmin and pmax to compute the two vectors of minimum and maximum values respectively:

f1 <- pmin(df$food_1, df$food_2)
f2 <- pmax(df$food_1, df$food_2)
df$food_1 <- f1
df$food_2 <- f2

If you want, you can do it all in one statement:

df[c('food_1', 'food_2')] <- list(pmin(df$food_1, df$food_2),
                                  pmax(df$food_1, df$food_2))
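Before switching, it may be worth checking on a small sample that the pmin/pmax version gives the same result as the original loop. The sketch below does that on 1,000 random rows; the seed, the sample size, and the names `small`, `by_loop` and `by_pminmax` are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 1000
fruits <- c("APPLE 1534", "PEAR 2525", "BANANA 3045", "WATERMELON 5000")
others <- c("ORANGE 2035", "BROCCOLI 5000", "BLUEBERRY 2000", "TOMATO 3000")
small <- data.frame(food_1 = sample(fruits, n, replace = TRUE),
                    food_2 = sample(others, n, replace = TRUE),
                    stringsAsFactors = FALSE)

# Reference: the row-by-row loop from the question
by_loop <- small
for (i in seq_len(nrow(by_loop))) {
  foods <- sort(c(by_loop$food_1[i], by_loop$food_2[i]))
  by_loop$food_1[i] <- foods[1]
  by_loop$food_2[i] <- foods[2]
}

# Vectorized: element-wise minimum and maximum of the two columns
by_pminmax <- small
by_pminmax[c("food_1", "food_2")] <- list(pmin(small$food_1, small$food_2),
                                          pmax(small$food_1, small$food_2))

# Compare the two results
identical(by_loop, by_pminmax)
```

If the two approaches agree, the final call returns TRUE.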

Another vectorized approach could use ifelse:

f1 <- ifelse(df$food_1 < df$food_2, df$food_1, df$food_2)
f2 <- ifelse(df$food_1 < df$food_2, df$food_2, df$food_1)
df$food_1 <- f1
df$food_2 <- f2

Testing on a large data.frame of 250k rows like you mentioned:

n <- 250000
df <- data.frame(food_1 = sample(c("APPLE 1534", "PEAR 2525",
                                   "BANANA 3045", "WATERMELON 5000"), n, replace = TRUE),
                 food_2 = sample(c("ORANGE 2035", "BROCCOLI 5000",
                                   "BLUEBERRY 2000", "TOMATO 3000"), n, replace = TRUE),
                 stringsAsFactors = FALSE)

both approaches are quite fast, e.g.:

system.time({
  df[c('food_1', 'food_2')] <- list(pmin(df$food_1, df$food_2),
                                    pmax(df$food_1, df$food_2))
})
#    user  system elapsed
#   0.150   0.001   0.151

while Andreas' solution takes ~10 seconds and yours would take over 30 minutes if I extrapolate correctly.
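The comments mention an eventual 4-column case. One way the pmin/pmax idea might extend to it is a fixed sequence of compare-and-swap steps (a sorting network), so that every step stays vectorized over rows. This is only a sketch: `row_sort4` and `cswap` are hypothetical helper names, not part of base R, and the function assumes exactly four columns of a mutually comparable type.

```r
# Hypothetical helper: sort the values within each row across four columns,
# using only element-wise pmin()/pmax() compare-and-swap steps.
row_sort4 <- function(df, cols) {
  stopifnot(length(cols) == 4)
  v <- unname(as.list(df[cols]))
  # Compare-and-swap: afterwards v[[i]] <= v[[j]] element-wise
  cswap <- function(v, i, j) {
    lo <- pmin(v[[i]], v[[j]])
    hi <- pmax(v[[i]], v[[j]])
    v[[i]] <- lo
    v[[j]] <- hi
    v
  }
  # Five-comparator sorting network for four inputs
  for (p in list(c(1, 2), c(3, 4), c(1, 3), c(2, 4), c(2, 3))) {
    v <- cswap(v, p[1], p[2])
  }
  df[cols] <- v
  df
}

# Usage on made-up numeric data
set.seed(42)
x <- data.frame(a = runif(5), b = runif(5), c = runif(5), d = runif(5))
sorted4 <- row_sort4(x, c("a", "b", "c", "d"))
```

The network uses five pmin/pmax pairs regardless of the number of rows, which keeps the work vectorized. For more than a handful of columns, a general row-wise sort would likely be easier to maintain.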

Question 6

Your solution does the job, but it is often better to vectorise your code. See the following example: first get the ordering of all elements in columns a and b, then use it to rearrange the elements in the data.frame.

library(tictoc) # to measure the run time
df <- data.frame(a = runif(10000),
                 b = runif(10000))
# your solution: sort each row inside a loop
tic()
df.loop <- df
for (i in 1:nrow(df.loop)){
  df.loop[i, ] <- sort(df.loop[i, ])
}
toc()
# sort (order) only once
tic()
# positions of the a and b values in the concatenated vector c(a, b)
index.a <- 1:nrow(df)
index.b <- (nrow(df) + 1) : (2*nrow(df))
a.b.ordered <- order(c(df[, 1], df[, 2]))
# TRUE where b ranks before a, i.e. the row has to be swapped
b.smaller.a <- match(index.b, a.b.ordered) < match(index.a, a.b.ordered)
df.index <- df
df.index[b.smaller.a, 1] <- df[b.smaller.a, 2]
df.index[b.smaller.a, 2] <- df[b.smaller.a, 1]
toc()
identical(df.loop, df.index)
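For two columns, the swap mask computed above via order() and match() can arguably be obtained more directly by comparing the columns element-wise. The following is a sketch of that simplification on the same kind of random data; `swap` and `df.swap` are illustrative names, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed for a reproducible toy example
df <- data.frame(a = runif(10000), b = runif(10000))

# Rows where the second column is smaller; which() drops any NA comparisons
swap <- which(df$b < df$a)

# Exchange the two values only in those rows
df.swap <- df
df.swap$a[swap] <- df$b[swap]
df.swap$b[swap] <- df$a[swap]

# Check that every row now satisfies a <= b
all(df.swap$a <= df.swap$b)
```

This avoids the global order() call entirely, since only the pairwise comparison within each row matters.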

Question 7

Please write something more about your solution. Posting code-only answers is off-topic. There needs to be a review too.

Accepted answer (the pmin/pmax answer above): flodel · 2018-10-19 23:46:58Z

