\$\begingroup\$

Consider the following sorting algorithm:

df <- data.frame(food_1 = c("APPLE 1534", "PEAR 2525", "BANANA 3045", "WATERMELON 5000"),
                 food_2 = c("ORANGE 2035", "BROCCOLI 5000", "BLUEBERRY 2000", "TOMATO 3000"),
                 stringsAsFactors = FALSE)
# Sorting
for (i in 1:nrow(df)){
  foods <- sort(c(df$food_1[i], df$food_2[i]))
  df$food_1[i] <- foods[1]
  df$food_2[i] <- foods[2]
}

I have used the code above on data frames with more than 250,000 rows, and I'm not sure how to make it more efficient.

asked Oct 19, 2018 at 16:49
\$\endgroup\$
  • \$\begingroup\$ The other option I can think of would be transposing and sorting with an apply, which is just a loop. \$\endgroup\$ Commented Oct 19, 2018 at 18:48
  • \$\begingroup\$ Can you confirm that you are only looking to sort two columns as in your example, and that the answer won't need to generalize to more (>2) columns? \$\endgroup\$ Commented Oct 19, 2018 at 23:25
  • \$\begingroup\$ @flodel Eventually I will need to, but for now, let's just focus on the two-column case. If it generalizes to more than 2, that would be a bonus at this point. I will eventually need to extend to a 4-column case, but no more than that. \$\endgroup\$ Commented Oct 20, 2018 at 1:55

2 Answers

\$\begingroup\$

I would use the vectorized functions pmin and pmax to compute the two vectors of minimum and maximum values respectively:

f1 <- pmin(df$food_1, df$food_2)
f2 <- pmax(df$food_1, df$food_2)
df$food_1 <- f1
df$food_2 <- f2

If you want, you can do it all in one statement:

df[c('food_1', 'food_2')] <- list(pmin(df$food_1, df$food_2),
                                  pmax(df$food_1, df$food_2))
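
As a quick sanity check, here is a minimal sketch that compares the pmin/pmax version with the loop from the question on the four-row example. The names `df_loop` and `df_vec` are introduced here for illustration only.

```r
# Four-row example data from the question
df <- data.frame(food_1 = c("APPLE 1534", "PEAR 2525", "BANANA 3045", "WATERMELON 5000"),
                 food_2 = c("ORANGE 2035", "BROCCOLI 5000", "BLUEBERRY 2000", "TOMATO 3000"),
                 stringsAsFactors = FALSE)

# Row-by-row loop, as in the question (df_loop is an illustrative name)
df_loop <- df
for (i in seq_len(nrow(df_loop))) {
  foods <- sort(c(df_loop$food_1[i], df_loop$food_2[i]))
  df_loop$food_1[i] <- foods[1]
  df_loop$food_2[i] <- foods[2]
}

# Vectorized version: element-wise minimum and maximum of the two columns
df_vec <- df
df_vec[c("food_1", "food_2")] <- list(pmin(df$food_1, df$food_2),
                                      pmax(df$food_1, df$food_2))

# Both results should be the same data frame
print(identical(df_loop, df_vec))
```

Both `sort` and `pmin`/`pmax` compare strings, so the two versions should agree on which string of each pair comes first.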

Another vectorized approach could use ifelse:

f1 <- ifelse(df$food_1 < df$food_2, df$food_1, df$food_2)
f2 <- ifelse(df$food_1 < df$food_2, df$food_2, df$food_1)
df$food_1 <- f1
df$food_2 <- f2

Testing on a large data.frame of 250k rows like you mentioned:

n <- 250000
df <- data.frame(food_1 = sample(c("APPLE 1534", "PEAR 2525",
                                   "BANANA 3045", "WATERMELON 5000"), n, replace = TRUE),
                 food_2 = sample(c("ORANGE 2035", "BROCCOLI 5000",
                                   "BLUEBERRY 2000", "TOMATO 3000"), n, replace = TRUE),
                 stringsAsFactors = FALSE)

both approaches are quite fast, e.g.:

system.time({
  df[c('food_1', 'food_2')] <- list(pmin(df$food_1, df$food_2),
                                    pmax(df$food_1, df$food_2))
})
#  user  system elapsed
# 0.150   0.001   0.151

while Andreas' solution takes ~10 seconds and yours would take over 30 minutes if I extrapolate correctly.
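
pmin and pmax only give the smallest and largest value per row, so they do not extend directly to the four-column case mentioned in the comments. One possible way to sort every row without a loop, offered here as a sketch rather than a benchmarked solution, is to order all cells by row index first and by value second. The data frame `df4`, its column names, and the variable names below are assumptions made for illustration.

```r
# Hypothetical four-column data; names and sizes are made up for illustration
set.seed(1)
foods <- c("APPLE 1534", "PEAR 2525", "BANANA 3045", "WATERMELON 5000",
           "ORANGE 2035", "BROCCOLI 5000", "BLUEBERRY 2000", "TOMATO 3000")
cols <- c("food_1", "food_2", "food_3", "food_4")
df4 <- data.frame(food_1 = sample(foods, 10, replace = TRUE),
                  food_2 = sample(foods, 10, replace = TRUE),
                  food_3 = sample(foods, 10, replace = TRUE),
                  food_4 = sample(foods, 10, replace = TRUE),
                  stringsAsFactors = FALSE)

m <- as.matrix(df4[cols])

# Order all cells by row index, then by value within each row
o <- order(row(m), m)

# m[o] lists row 1 sorted, then row 2 sorted, ...; refill the matrix row by row
sorted <- matrix(m[o], nrow = nrow(m), byrow = TRUE)

# Write the sorted values back into the original columns
df4[cols] <- as.data.frame(sorted, stringsAsFactors = FALSE)
```

This performs a single `order` call over all cells instead of one `sort` per row. Whether it is fast enough on 250k rows would need to be timed on the real data.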

answered Oct 19, 2018 at 23:46
\$\endgroup\$
\$\begingroup\$

Your solution does the job, but it is often better to vectorise your code. See the following example: first compute the ordering of all elements of columns a and b combined, then use it to find the rows where b is smaller than a and swap the two elements in those rows of the data.frame.

library(tictoc) # provides tic()/toc() to measure the run time
df <- data.frame(a = runif(10000),
                 b = runif(10000))
# your solution: sort each row inside a loop
tic()
df.loop <- df
for (i in 1:nrow(df.loop)){
  df.loop[i, ] <- sort(df.loop[i, ])
}
toc()
# sort (order) only once
tic()
index.a <- 1:nrow(df)                     # positions of column a in c(a, b)
index.b <- (nrow(df) + 1) : (2*nrow(df))  # positions of column b in c(a, b)
a.b.ordered <- order(c(df[, 1], df[, 2]))
# TRUE where b ranks before a, i.e. b < a, so the row has to be swapped
b.less.a <- match(index.b, a.b.ordered) < match(index.a, a.b.ordered)
df.index <- df
df.index[b.less.a, 1] <- df[b.less.a, 2]
df.index[b.less.a, 2] <- df[b.less.a, 1]
toc()
identical(df.loop, df.index)
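
For two numeric columns, the rows to swap can also be found by comparing the columns directly, without building the global ordering. This is a different and simpler technique than the `order`/`match` approach above, shown as a minimal sketch; `swap` and `df.swap` are names introduced here for illustration.

```r
set.seed(42)
df <- data.frame(a = runif(1000),
                 b = runif(1000))

# TRUE for the rows where b is smaller than a and the values must be swapped
swap <- df$b < df$a

df.swap <- df
# Assign columns (b, a) of the affected rows into columns (a, b)
df.swap[swap, c("a", "b")] <- df[swap, c("b", "a")]

# After the swap, a should never exceed b
print(all(df.swap$a <= df.swap$b))
```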
answered Oct 19, 2018 at 18:59
\$\endgroup\$
  • \$\begingroup\$ Please write something more about your solution. Posting code-only answers is off-topic. There needs to be a review too. \$\endgroup\$ Commented Oct 19, 2018 at 19:08
