1
\$\begingroup\$

0 down vote favorite

I have two data frame loc_df and and city_df (city and country) now loc_df has 5 column but considering only 2 here (Organization.Location.1 and Organization.Location.2) with 35000 row and city_df has 2 column (city and country) with 1000 rows. Now I am taking one value from city cloumn and matching with organisation column using grepl (for text matching ) and for loop(for iteration). I also have to maintain a index that's why I am using for loop. But this is taking huge amount of time.

I am trying to replace each city, state, province name to their country name in organization columns.

Please help me to optimize this code. I am very new to R.

for(k in 1:2){
 if(k==1){
 for (i in 1:nrow(city_df)) {
 x1 <- paste(" ", city_df$City[i], sep = "")
 x2 <- paste(" ", city_df$City[i], " ", sep = "")
 x3 <- paste(city_df$City[i], " ", sep = "")
 # print(x1)
 for (j in 1:nrow(loc_df)) {
 #print(loc_df$Organization.Location.1[j])
 if (grepl(x1, loc_df$Organization.Location.1[j]) |
 grepl(x2, loc_df$Organization.Location.1[j]) |
 grepl(x3, loc_df$Organization.Location.1[j])) {
 loc_df$org_new1[j] <- city_df$Country[i]
 break
 }
 }
 }
 }
 if(k==2){
 for (i in 1:nrow(city_df)) {
 x1 <- paste(" ", city_df$City[i], sep = "")
 x2 <- paste(" ", city_df$City[i], " ", sep = "")
 x3 <- paste(city_df$City[i], " ", sep = "")
 for (j in 1:nrow(loc_df)) {
 if (grepl(x1, loc_df$Organization.Location.2[j]) |
 grepl(x2, loc_df$Organization.Location.2[j]) |
 grepl(x3, loc_df$Organization.Location.3[j])) {
 loc_df$org_new1[j] <- city_df$Country[i]
 break
 }
 }
 }
 }
}

this is sample data I have generated using dput of city_df

structure(list(City = c("zug", "canton of zug", "zimbabwe", 
 "zigong chengdu", "zhuhai guangdong china", "zaragoza spain"), Country = c("switzerland", 
 "switzerland", "zimbabwe", "china", "china", "spain"
 )), .Names = c("City", "Country"), row.names = c(NA, 6L), class = "data.frame")

sample of loc_df

structure(list(Organization.Location.1 = c("zug switzerland", 
"zug canton of zug switzerland", "zimbabwe", "zigong chengdu pr china", 
"zhuhai guangdong china", "zaragoza spain"), Organization.Location.2 = c("", 
"san francisco bay area", "london canada area", "beijing city china", 
"greater atlanta area", "paris area france")), .Names = c("Organization.Location.1", 
"Organization.Location.2"), row.names = c(NA, 6L), class = "data.frame")
asked Mar 22, 2018 at 11:58
\$\endgroup\$
4
  • \$\begingroup\$ you have not supplied case in your data when there is a match \$\endgroup\$ Commented Mar 22, 2018 at 12:15
  • \$\begingroup\$ @minem loc_df$org_new1[j] <- city_df$Country[i] this line of code is supplying data when there is a match, And it present in above code too \$\endgroup\$ Commented Mar 22, 2018 at 12:32
  • \$\begingroup\$ You have not supplied in your example data a valid example when the conditiona are met \$\endgroup\$ Commented Mar 22, 2018 at 12:34
  • \$\begingroup\$ @minem sorry sir, I have updated the question now. \$\endgroup\$ Commented Mar 22, 2018 at 12:41

1 Answer 1

1
\$\begingroup\$

You can try something like this:

# function for string preperation:
preperString <- function(x) {
 require(stringr)
 x <- str_to_lower(x)
 x <- str_trim(x)
 x
}
setDT(loc_df) # convert data.frames to data.table
setDT(city_df)
loc_df <- loc_df[, lapply(.SD, preperString)] # apply string preperation to all columns of loc_df
city_df[, City := preperString(City)]
loc_df <- merge(loc_df, city_df, by.x = 'Organization.Location.1',
 by.y = 'City', all.x = T, sort = F)
loc_df <- merge(loc_df, city_df, by.x = 'Organization.Location.2',
 by.y = 'City', all.x = T, sort = F)
loc_df
# Organization.Location.2 Organization.Location.1 Country.x Country.y
# 1: zug switzerland NA NA
# 2: san francisco bay area zug canton of zug switzerland NA NA
# 3: london canada area zimbabwe zimbabwe NA
# 4: beijing city china zigong chengdu pr china NA NA
# 5: greater atlanta area zhuhai guangdong china china NA
# 6: paris area france zaragoza spain spain NA
# and then you can write rule tu create org_new1, for example:
loc_df[, org_new1 := Country.x]
loc_df[is.na(org_new1), org_new1 := Country.y]
loc_df
# Organization.Location.2 Organization.Location.1 Country.x Country.y org_new1
# 1: zug switzerland NA NA NA
# 2: san francisco bay area zug canton of zug switzerland NA NA NA
# 3: london canada area zimbabwe zimbabwe NA zimbabwe
# 4: beijing city china zigong chengdu pr china NA NA NA
# 5: greater atlanta area zhuhai guangdong china china NA china
# 6: paris area france zaragoza spain spain NA spain
answered Mar 22, 2018 at 12:44
\$\endgroup\$
8
  • \$\begingroup\$ thank you for your answer but when I am trying to run your code I am getting output like this Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 526312 rows; more than 47285 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. \$\endgroup\$ Commented Mar 22, 2018 at 13:28
  • \$\begingroup\$ @GirijeshSingh As I do not see your data it is hard to help you, but the error message could be a start. There is stated: ''Check for duplicate key values'', maybe try to do that and remove the duplicates? \$\endgroup\$ Commented Mar 22, 2018 at 13:32
  • \$\begingroup\$ here is the sample of data github.com/girijesh18/dataset \$\endgroup\$ Commented Mar 22, 2018 at 13:48
  • \$\begingroup\$ please help me to figure it out \$\endgroup\$ Commented Mar 22, 2018 at 13:48
  • \$\begingroup\$ city_df <- city_df[City != '']; city_df <- unique(city_df) \$\endgroup\$ Commented Mar 22, 2018 at 13:58

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.