Code for finding repeated entries with different data

Question 1

project is the data frame. HOUSE.NO is a column of type character, and NO..OF.FAMILY.MEMBER is a column of type integer. My aim is to find the house numbers that are repeated, check whether the number of family members reported for each repeated house matches, and identify the sets for which it doesn't.

x<-1
matr<-NULL
matr2<-NULL
matr3<-NULL
r<-NULL
index<-NULL
repeat{
  y<-project$HOUSE.NO[-x]==project$HOUSE.NO[x]
  if (any(y)){
    r<-which(grepl(project$HOUSE.NO[x],project$HOUSE.NO))
    if(length(r)==2){
      check<-project$NO..OF.FAMILY.MEMBER[r[1]]!=project$NO..OF.FAMILY.MEMBER[r[2]]
      if(check){matr<-c(matr,r)}
    }
    if (length(r)==3){
      check2<-length(levels(factor(project$NO..OF.FAMILY.MEMBER[c(r[1],r[2],r[3])])))>1
      if(check2){matr2<-c(matr2,r)}
    }
    if (length(r)==4){
      check3<-length(levels(factor(project$NO..OF.FAMILY.MEMBER[c(r[1],r[2],r[3],r[4])])))>1
      if(check3){
        matr3<-c(matr3,r)}}
    if (length(r)>4&project$HOUSE.NO[x]!=""){index<-c(index,r)
    }
  }
  x<-x+1
  if(x>392){
    m1<-matrix(matr, ncol=2, byrow = TRUE)
    m2<-matrix(matr2, ncol=3, byrow = TRUE)
    m3<-matrix(matr3, ncol=4, byrow=TRUE)
    break
  }
}

The extra condition when computing index avoids a false entry when HOUSE.NO is "", which is the case for 3 entries in my data frame. There are 393 entries, hence the final check on x before break.

The concerns are:

I am an absolute beginner in R, and the functions used here are almost all I know.
When the same house number is repeated more than twice, this code only checks whether the entire set has the same number of family members. I couldn't find the row indices of only the entries that mismatch. Currently, the output includes the entire set.
Do let me know tips on how to make this simpler. As it stands, I find this code quite complicated.

(Let me know if more details specific to the data frame/variables I am working with are needed, or if the question is not suited to the site.)

ADDENDUM

   HOUSE.NO NO..OF.FAMILY.MEMBER
1    14/274                    6
2    14/259                    6
3    14/217                    5
4    14/258                    4
5    14/306                    5
6    14/300                    8
7     14/96                    4
8    14/166                    4
9     14/69                    5
10    14/68                    2

The expected output is just the row numbers/house.no. that fulfill the aforementioned criteria. Currently, the matrix outputs are as below. The same set is repeated in the matrix (twice in m1, thrice in m2, etc.).

> m1
 [,1] [,2]
 [1,] 20 380
 [2,] 36 68
 [3,] 37 340
 [4,] 64 191
 [5,] 36 68
 [6,] 72 329
 [7,] 88 218
 [8,] 103 199
 [9,] 111 278
[10,] 125 214
[11,] 135 387
[12,] 149 196
[13,] 64 191
[14,] 149 196
[15,] 103 199
[16,] 125 214
[17,] 215 320
[18,] 88 218
[19,] 248 317
[20,] 111 278
[21,] 310 350
[22,] 248 317
[23,] 319 324
[24,] 215 320
[25,] 319 324
[26,] 72 329
[27,] 37 340
[28,] 310 350
[29,] 20 380
[30,] 135 387
> m2
 [,1] [,2] [,3]
 [1,] 43 258 354
 [2,] 65 219 269
 [3,] 169 322 323
 [4,] 65 219 269
 [5,] 43 258 354
 [6,] 65 219 269
 [7,] 169 322 323
 [8,] 169 322 323
 [9,] 43 258 354
> m3
 [,1] [,2] [,3] [,4]
 [1,] 2 84 211 347
 [2,] 2 84 211 347
 [3,] 99 100 101 363
 [4,] 99 100 101 363
 [5,] 99 100 101 363
 [6,] 180 185 260 263
 [7,] 180 185 260 263
 [8,] 2 84 211 347
 [9,] 180 185 260 263
[10,] 180 185 260 263
[11,] 2 84 211 347
[12,] 99 100 101 363

Question 2

Can you give a sample of the data together with the expected output?

Question 3

@ZahiroMor Do you want a dput of the concerned variables? I don't think it is relevant. All I want the code to do is carry out the following objective: find the house.no values that are repeated but with different no.of.family.members.

Question 4

Yes... please dput... it'll be faster than English communication :)

Question 5

@ZahiroMor Added.

Question 6

Your code uses several constructs that have a bit of smell in R.

Foremost is how you write the loop. An immediate improvement would be to replace the repeat with for (i in seq_len(nrow(project))); the hard-coded 392 especially reeks.
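
As a rough sketch of that loop rewrite (not the asker's exact logic), the version below iterates with seq_len(nrow(project)) so no row count is hard-coded. The toy project data frame and the name mismatched_rows are made up for illustration. It also compares house numbers with == rather than grepl(), since grepl() treats its first argument as a regular expression and matches substrings as well; that swap is a choice made for this sketch.

```r
# Toy data standing in for the real `project` data frame (values are invented).
project <- data.frame(
  HOUSE.NO = c("14/274", "14/259", "14/274", "14/259", "14/217", "", ""),
  NO..OF.FAMILY.MEMBER = c(6L, 6L, 5L, 6L, 5L, 3L, 4L),
  stringsAsFactors = FALSE
)

mismatched_rows <- integer(0)  # hypothetical name: rows whose house number repeats with differing counts

for (i in seq_len(nrow(project))) {
  house <- project$HOUSE.NO[i]
  if (house == "") next                            # skip blank house numbers
  same_house <- which(project$HOUSE.NO == house)   # exact match, not a regex
  members <- project$NO..OF.FAMILY.MEMBER[same_house]
  # Keep the set only if the house repeats and the counts are not all equal.
  if (length(same_house) > 1 && length(unique(members)) > 1) {
    mismatched_rows <- c(mismatched_rows, same_house)
  }
}

# Each set is found once per member, so drop the repeats.
mismatched_rows <- sort(unique(mismatched_rows))
print(mismatched_rows)
```

Because one loop handles groups of any size, separate branches for sets of 2, 3 and 4 rows are not needed. The sketch assumes HOUSE.NO contains no NA values; those would need an extra is.na() check.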

Also, you don't need to initialize variables that you will not use outside the loop; doing so just prevents them from being cleaned up after the loop.

A more R-like way would be to use higher-level verbs that operate on the whole table, such as those provided by dplyr. Supposing each row has something like an id column, you would write something like

left_join(project, project, by = "HOUSE.NO") %>% filter(id.x < id.y)

Such commands are usually easier to read and usually much faster than looping.
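
As a minimal sketch of the whole-table style applied to the original goal, the example below uses a grouped filter rather than a self-join. It assumes dplyr is installed; the toy project data frame and the added row column are invented for illustration, and this is only one possible way to express the check.

```r
library(dplyr)

# Toy data standing in for the real `project` data frame (values are invented).
project <- data.frame(
  HOUSE.NO = c("14/274", "14/259", "14/274", "14/259", "14/217", "", ""),
  NO..OF.FAMILY.MEMBER = c(6L, 6L, 5L, 6L, 5L, 3L, 4L),
  stringsAsFactors = FALSE
)

mismatched <- project %>%
  mutate(row = row_number()) %>%                     # remember the original row index
  filter(HOUSE.NO != "") %>%                         # ignore blank house numbers
  group_by(HOUSE.NO) %>%
  filter(n_distinct(NO..OF.FAMILY.MEMBER) > 1) %>%   # keep houses reported with differing counts
  ungroup()

print(mismatched)
```

A house that appears only once has a single distinct count, so it is dropped by the same filter without a separate test for repetition. Each mismatching set also appears exactly once in the result, instead of once per member.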

bdecaf · Answer 1 · 2016-04-24 11:12:32Z
