Consider the following sorting algorithm:
df <- data.frame(food_1 = c("APPLE 1534", "PEAR 2525", "BANANA 3045", "WATERMELON 5000"),
                 food_2 = c("ORANGE 2035", "BROCCOLI 5000", "BLUEBERRY 2000", "TOMATO 3000"),
                 stringsAsFactors = FALSE)
# Sorting
for (i in 1:nrow(df)){
  foods <- sort(c(df$food_1[i], df$food_2[i]))
  df$food_1[i] <- foods[1]
  df$food_2[i] <- foods[2]
}
I've used the code above on data frames with 250,000+ rows, and I'm not sure how to make it more efficient.
- The other option I can think of would be transposing and sorting with an apply, which is just a loop. – Anonymous coward, Oct 19, 2018 at 18:48
- Can you confirm that you are only looking to sort two columns as in your example, and that the answer won't need to generalize to more (>2) columns? – flodel, Oct 19, 2018 at 23:25
- @flodel Eventually I will need to, but for now, let's just focus on the two-column case. If it generalizes to more than 2, that would be a bonus at this point. I will eventually need to extend to a 4-column case, but no more than that. – Clarinetist, Oct 20, 2018 at 1:55
2 Answers
I would use the vectorized functions pmin and pmax to compute the two vectors of minimum and maximum values respectively:
f1 <- pmin(df$food_1, df$food_2)
f2 <- pmax(df$food_1, df$food_2)
df$food_1 <- f1
df$food_2 <- f2
If you want, you can do it all in one statement:
df[c('food_1', 'food_2')] <- list(pmin(df$food_1, df$food_2),
                                  pmax(df$food_1, df$food_2))
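As a quick sanity check, the sketch below compares the one-statement version against the row-by-row loop from the question on the original four-row example. The names df_loop and df_vec are only illustrative and do not appear in the question.

```r
# Small example data, copied from the question
df <- data.frame(food_1 = c("APPLE 1534", "PEAR 2525", "BANANA 3045", "WATERMELON 5000"),
                 food_2 = c("ORANGE 2035", "BROCCOLI 5000", "BLUEBERRY 2000", "TOMATO 3000"),
                 stringsAsFactors = FALSE)

# Reference result: the row-by-row loop from the question
df_loop <- df
for (i in seq_len(nrow(df_loop))) {
  foods <- sort(c(df_loop$food_1[i], df_loop$food_2[i]))
  df_loop$food_1[i] <- foods[1]
  df_loop$food_2[i] <- foods[2]
}

# Vectorized result: element-wise minimum and maximum of the two columns
df_vec <- df
df_vec[c("food_1", "food_2")] <- list(pmin(df$food_1, df$food_2),
                                      pmax(df$food_1, df$food_2))

# Compare the two results column by column
all.equal(df_loop, df_vec)
```

Both sort and pmin/pmax compare strings using the current locale's collation, so the two approaches should agree as long as they run in the same session.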
Another vectorized approach could use ifelse:
f1 <- ifelse(df$food_1 < df$food_2, df$food_1, df$food_2)
f2 <- ifelse(df$food_1 < df$food_2, df$food_2, df$food_1)
df$food_1 <- f1
df$food_2 <- f2
Testing on a large data.frame of 250k rows like you mentioned:
n <- 250000
df <- data.frame(food_1 = sample(c("APPLE 1534", "PEAR 2525",
                                   "BANANA 3045", "WATERMELON 5000"), n, replace = TRUE),
                 food_2 = sample(c("ORANGE 2035", "BROCCOLI 5000",
                                   "BLUEBERRY 2000", "TOMATO 3000"), n, replace = TRUE),
                 stringsAsFactors = FALSE)
both approaches are quite fast, e.g.:
system.time({
  df[c('food_1', 'food_2')] <- list(pmin(df$food_1, df$food_2),
                                    pmax(df$food_1, df$food_2))
})
#    user  system elapsed
#   0.150   0.001   0.151
while Andreas' solution takes ~10 seconds and yours would take over 30 minutes if I extrapolate correctly.
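For the four-column case raised in the comments, pmin and pmax alone no longer suffice. One possible sketch, assuming all columns to be sorted share a type, orders every cell by row index and then by value, and refills the columns row by row. The helper name sort_within_rows and the example columns x1 to x4 are hypothetical and not part of either answer.

```r
# Hypothetical helper (not from the original answers): sort the values within
# each row across the given columns, without looping over rows.
# Assumes the selected columns all have the same type (e.g. all character).
sort_within_rows <- function(df, cols) {
  m <- as.matrix(df[cols])
  # Order all cells by row index first, then by value within each row
  ord <- order(as.vector(row(m)), as.vector(m))
  # m[ord] lists row 1's sorted values, then row 2's, and so on,
  # so filling a matrix by row restores the original shape
  sorted <- matrix(m[ord], nrow = nrow(m), byrow = TRUE)
  df[cols] <- as.data.frame(sorted, stringsAsFactors = FALSE)
  df
}

# Made-up four-column example
df4 <- data.frame(x1 = c("D", "B"), x2 = c("C", "A"),
                  x3 = c("B", "D"), x4 = c("A", "C"),
                  stringsAsFactors = FALSE)
res4 <- sort_within_rows(df4, c("x1", "x2", "x3", "x4"))
res4
```

This does a single sort over all cells instead of one sort per row. For exactly two columns the pmin/pmax version is simpler and likely faster, so this sketch is only worth considering once more columns are involved.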
Your solution does the job, but it is often better to vectorise your code. See the following example. First get the ordering of all elements in columns a and b, and then use it to rearrange the elements in the data.frame.
library(tictoc) #to get the run time
df <- data.frame(a = runif(10000),
b = runif(10000))
# your solution
tic()
df.loop <- df
for (i in 1:nrow(df.loop)){
  df.loop[i, ] <- sort(df.loop[i, ])
}
toc()
#sort (order) only once
tic()
index.a <- 1:nrow(df)
index.b <- (nrow(df) + 1) : (2*nrow(df))
a.b.ordered <- order(c(df[, 1], df[, 2]))
# TRUE where b ranks before a in the combined ordering, i.e. b is the smaller value
b.less.a <- match(index.b, a.b.ordered) < match(index.a, a.b.ordered)
df.index <- df
df.index[b.less.a, 1] <- df[b.less.a, 2]
df.index[b.less.a, 2] <- df[b.less.a, 1]
toc()
identical(df.loop, df.index)
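The same swap can probably be expressed with a direct element-wise comparison instead of order and match. The sketch below is not the answerer's code; the names swap and df.swap and the seed are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
df <- data.frame(a = runif(10000),
                 b = runif(10000))

# Rows where the second column holds the smaller value need swapping
swap <- df$b < df$a

# Swap the two values in exactly those rows, reading from the untouched df
df.swap <- df
df.swap$a[swap] <- df$b[swap]
df.swap$b[swap] <- df$a[swap]

# After the swap, column a should never exceed column b
all(df.swap$a <= df.swap$b)
```

Comparing the columns directly avoids sorting all 2n values just to find out which rows are out of order.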
- Please write something more about your solution. Posting code-only answers is off-topic. There needs to be a review too. – t3chb0t, Oct 19, 2018 at 19:08