0 down vote favorite
I have two data frame loc_df and and city_df (city and country) now loc_df has 5 column but considering only 2 here (Organization.Location.1 and Organization.Location.2) with 35000 row and city_df has 2 column (city and country) with 1000 rows. Now I am taking one value from city cloumn and matching with organisation column using grepl (for text matching ) and for loop(for iteration). I also have to maintain a index that's why I am using for loop. But this is taking huge amount of time.
I am trying to replace each city, state, province name to their country name in organization columns.
Please help me to optimize this code. I am very new to R.
for(k in 1:2){
if(k==1){
for (i in 1:nrow(city_df)) {
x1 <- paste(" ", city_df$City[i], sep = "")
x2 <- paste(" ", city_df$City[i], " ", sep = "")
x3 <- paste(city_df$City[i], " ", sep = "")
# print(x1)
for (j in 1:nrow(loc_df)) {
#print(loc_df$Organization.Location.1[j])
if (grepl(x1, loc_df$Organization.Location.1[j]) |
grepl(x2, loc_df$Organization.Location.1[j]) |
grepl(x3, loc_df$Organization.Location.1[j])) {
loc_df$org_new1[j] <- city_df$Country[i]
break
}
}
}
}
if(k==2){
for (i in 1:nrow(city_df)) {
x1 <- paste(" ", city_df$City[i], sep = "")
x2 <- paste(" ", city_df$City[i], " ", sep = "")
x3 <- paste(city_df$City[i], " ", sep = "")
for (j in 1:nrow(loc_df)) {
if (grepl(x1, loc_df$Organization.Location.2[j]) |
grepl(x2, loc_df$Organization.Location.2[j]) |
grepl(x3, loc_df$Organization.Location.3[j])) {
loc_df$org_new1[j] <- city_df$Country[i]
break
}
}
}
}
}
this is sample data I have generated using dput of city_df
structure(list(City = c("zug", "canton of zug", "zimbabwe",
"zigong chengdu", "zhuhai guangdong china", "zaragoza spain"), Country = c("switzerland",
"switzerland", "zimbabwe", "china", "china", "spain"
)), .Names = c("City", "Country"), row.names = c(NA, 6L), class = "data.frame")
sample of loc_df
structure(list(Organization.Location.1 = c("zug switzerland",
"zug canton of zug switzerland", "zimbabwe", "zigong chengdu pr china",
"zhuhai guangdong china", "zaragoza spain"), Organization.Location.2 = c("",
"san francisco bay area", "london canada area", "beijing city china",
"greater atlanta area", "paris area france")), .Names = c("Organization.Location.1",
"Organization.Location.2"), row.names = c(NA, 6L), class = "data.frame")
-
\$\begingroup\$ you have not supplied case in your data when there is a match \$\endgroup\$minem– minem2018年03月22日 12:15:45 +00:00Commented Mar 22, 2018 at 12:15
-
\$\begingroup\$ @minem loc_df$org_new1[j] <- city_df$Country[i] this line of code is supplying data when there is a match, And it present in above code too \$\endgroup\$Girijesh Singh– Girijesh Singh2018年03月22日 12:32:37 +00:00Commented Mar 22, 2018 at 12:32
-
\$\begingroup\$ You have not supplied in your example data a valid example when the conditiona are met \$\endgroup\$minem– minem2018年03月22日 12:34:04 +00:00Commented Mar 22, 2018 at 12:34
-
\$\begingroup\$ @minem sorry sir, I have updated the question now. \$\endgroup\$Girijesh Singh– Girijesh Singh2018年03月22日 12:41:44 +00:00Commented Mar 22, 2018 at 12:41
1 Answer 1
You can try something like this:
# function for string preperation:
preperString <- function(x) {
require(stringr)
x <- str_to_lower(x)
x <- str_trim(x)
x
}
setDT(loc_df) # convert data.frames to data.table
setDT(city_df)
loc_df <- loc_df[, lapply(.SD, preperString)] # apply string preperation to all columns of loc_df
city_df[, City := preperString(City)]
loc_df <- merge(loc_df, city_df, by.x = 'Organization.Location.1',
by.y = 'City', all.x = T, sort = F)
loc_df <- merge(loc_df, city_df, by.x = 'Organization.Location.2',
by.y = 'City', all.x = T, sort = F)
loc_df
# Organization.Location.2 Organization.Location.1 Country.x Country.y
# 1: zug switzerland NA NA
# 2: san francisco bay area zug canton of zug switzerland NA NA
# 3: london canada area zimbabwe zimbabwe NA
# 4: beijing city china zigong chengdu pr china NA NA
# 5: greater atlanta area zhuhai guangdong china china NA
# 6: paris area france zaragoza spain spain NA
# and then you can write rule tu create org_new1, for example:
loc_df[, org_new1 := Country.x]
loc_df[is.na(org_new1), org_new1 := Country.y]
loc_df
# Organization.Location.2 Organization.Location.1 Country.x Country.y org_new1
# 1: zug switzerland NA NA NA
# 2: san francisco bay area zug canton of zug switzerland NA NA NA
# 3: london canada area zimbabwe zimbabwe NA zimbabwe
# 4: beijing city china zigong chengdu pr china NA NA NA
# 5: greater atlanta area zhuhai guangdong china china NA china
# 6: paris area france zaragoza spain spain NA spain
-
\$\begingroup\$ thank you for your answer but when I am trying to run your code I am getting output like this Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 526312 rows; more than 47285 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. \$\endgroup\$Girijesh Singh– Girijesh Singh2018年03月22日 13:28:57 +00:00Commented Mar 22, 2018 at 13:28
-
\$\begingroup\$ @GirijeshSingh As I do not see your data it is hard to help you, but the error message could be a start. There is stated: ''Check for duplicate key values'', maybe try to do that and remove the duplicates? \$\endgroup\$minem– minem2018年03月22日 13:32:00 +00:00Commented Mar 22, 2018 at 13:32
-
\$\begingroup\$ here is the sample of data github.com/girijesh18/dataset \$\endgroup\$Girijesh Singh– Girijesh Singh2018年03月22日 13:48:09 +00:00Commented Mar 22, 2018 at 13:48
-
\$\begingroup\$ please help me to figure it out \$\endgroup\$Girijesh Singh– Girijesh Singh2018年03月22日 13:48:42 +00:00Commented Mar 22, 2018 at 13:48
-
\$\begingroup\$ city_df <- city_df[City != '']; city_df <- unique(city_df) \$\endgroup\$minem– minem2018年03月22日 13:58:53 +00:00Commented Mar 22, 2018 at 13:58