I have a data frame with several columns. One of them is a user ID column, in which the same ID can appear several times.
What I want to do is remove the first occurrence of each ID. For instance, given:
1,2,3,4,3,4,2,1,3,4,6,7,7
I would like to have an output like this:
3,4,2,1,3,4,7
Here is what I have done:
# flag every occurrence of a user ID after its first one
dup <- duplicated(results$user)
# build another data frame: whenever dup is TRUE, append that row to it
results1 <- NULL
for(i in 1:length(results$user)){
  if (dup[i] == TRUE) {
    rbind(results1, results[i,]) -> results1
  }
}
Since I'm more used to thinking in Python, I have a feeling this is a very ugly solution for R. I would like some feedback, as well as some pointers on how to improve this piece of code.
2 Answers
Here's a more efficient solution:
# an example data frame
results <- data.frame(user = c(1,2,3,4,3,4,2,1,3,4,6,7,7), a = 1)
# the solution
results[duplicated(results$user), ]
How it works: duplicated returns a logical vector indicating, for each value of results$user, whether that value was also present at a preceding position in the vector.
This logical index is used to choose the appropriate rows of the original data frame. This is achieved by using this vector as the first argument for [ and using an empty second argument (to select all columns).
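To make the explanation above concrete, here is a minimal sketch on a bare vector rather than a data frame. It reuses the IDs from the question; the variable names user and dup are only illustrative.

```r
# The user IDs from the question
user <- c(1, 2, 3, 4, 3, 4, 2, 1, 3, 4, 6, 7, 7)

# TRUE wherever the value has already appeared earlier in the vector
dup <- duplicated(user)
print(dup)

# Logical indexing keeps only the positions marked TRUE,
# i.e. everything except the first occurrence of each ID
print(user[dup])
# [1] 3 4 2 1 3 4 7
```

Indexing a data frame with results[dup, ] applies the same logical vector to the rows instead of to vector elements.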
The result:
   user a
5     3 1
6     4 1
7     2 1
8     1 1
9     3 1
10    4 1
13    7 1
You're right! It is better. With R I have some tendency to do more complicated stuff. Thank you for your response. – psoares, Jan 11, 2013 at 11:24
Well, after reading up on this, I've come to the conclusion that I could eliminate several lines and do this instead:
rbind(results1, results[dup,]) -> results1
It is much quicker and seems more efficient.
However, any suggestions or recommendations are welcome :)
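As a rough sketch of one further simplification: because results1 starts out as NULL, the rbind call has nothing to append to, so assigning the subset directly should give the same rows. The name results2 is hypothetical and only used for the comparison; the example data frame is the one from the other answer.

```r
# Example data frame (taken from the other answer)
results <- data.frame(user = c(1, 2, 3, 4, 3, 4, 2, 1, 3, 4, 6, 7, 7), a = 1)
dup <- duplicated(results$user)

# Version from this post: bind the subset onto an empty (NULL) object
results1 <- NULL
results1 <- rbind(results1, results[dup, ])

# Direct assignment of the same subset, without rbind()
results2 <- results[dup, ]

# Both versions contain the same user IDs in the same order
print(all(results1$user == results2$user))
```

If several subsets really do need to be stacked, rbind is still the tool for that; it is only redundant here because there is a single subset.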