I'm new to R and have created a dictionary function. A user enters a phrase into a text field; the server reads it, looks for a match in a back-end data table and, if there is one, outputs the abbreviation of the phrase.
This dictionary data table does not have a 1-to-1 mapping. The dictionary has `logical_word` and `abbreviation` columns; here's a small sample:
238 abstract ABST NA NA NA
239 abstraction ABSTN NA NA NA
383 aggregate consumer economic ACE NA NA NA
876 business-to-consumer B2C NA NA NA
1309 consumer CNSMR NA NA NA
1310 Consumer Credit Counseling Service CCCS NA NA NA
1311 Consumer Indebtedness Index CII NA NA NA
1312 Consumer Lending CNL NA NA NA
1313 consumer loan application CLAPP NA NA NA
1314 consumer operating as business CSOAB NA NA NA
1315 Consumer Price Index
5582 time zone TZN NA NA NA
6041 Wholesale Consumer Information System WCIS NA NA NA
6119 zone ZONE NA NA NA
6121 ZORRO ZR NA NA NA
Suppose a user inputs "Consumer Credit Counseling Service zorro zone". The program converts both the input and the dictionary DT to upper case.
I split the user input on whitespace into a vector, then run a loop and grep for the first word (`CONSUMER`) to see if there's a match in the dictionary. If there is a match, I keep going and grep for two words (`CONSUMER CREDIT`). I keep adding words until the grep returns no match or exactly one match; in that case I pull the abbreviation value using `merge()` and the phrase (`searchStringDT[i]`).
When looking for `"CONSUMER CREDIT COUNSELING SERVICE ZORRO"` the search result comes back empty, so I run a `merge()` on the last iterated phrase, `"CONSUMER CREDIT COUNSELING SERVICE"`, and the dictionary and get back the abbreviation `"CCCS"`.
Then I set a flag (`reset <- "TRUE"`) and run the loop against `"ZORRO"` to see if there's a match; since there is, I keep going and grep `"ZORRO ZONE"`. No value is returned, so I run a `merge()` on `"ZORRO"` and the dictionary and get back `"ZR"`.
Then I run the loop on `"ZONE"`, and since it's the last value, I run the `merge()` and get back `"ZONE"`.
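The walk-through above can be sketched as a greedy lookup that grows the phrase one word at a time and remembers the last exact match. This is only a minimal sketch under assumptions, not the code under review: `abbreviate_phrase` and the tiny `dict` data frame are hypothetical, the dictionary is assumed to be upper-cased already, and words with no entry are passed through unchanged.

```r
# Hypothetical mini-dictionary, already upper-cased.
dict <- data.frame(
  logical_word = c("CONSUMER", "CONSUMER CREDIT COUNSELING SERVICE",
                   "TIME ZONE", "ZONE", "ZORRO"),
  abbreviation = c("CNSMR", "CCCS", "TZN", "ZONE", "ZR"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: greedy longest-match lookup, word by word.
abbreviate_phrase <- function(input, dict) {
  words <- strsplit(toupper(trimws(input)), "\\s+")[[1]]
  out <- character(0)
  i <- 1
  while (i <= length(words)) {
    hit <- NA_character_
    hit_end <- i
    for (j in i:length(words)) {
      candidate <- paste(words[i:j], collapse = " ")
      # Remember the most recent exact match.
      k <- match(candidate, dict$logical_word)
      if (!is.na(k)) {
        hit <- dict$abbreviation[k]
        hit_end <- j
      }
      # Stop growing once no longer entry starts with "candidate ".
      if (!any(startsWith(dict$logical_word, paste0(candidate, " ")))) break
    }
    # Words without any dictionary entry are kept as they are.
    out <- c(out, if (is.na(hit)) words[i] else hit)
    i <- hit_end + 1
  }
  out
}

abbreviate_phrase("Consumer Credit Counseling Service zorro zone", dict)
# expected: "CCCS" "ZR" "ZONE"
```

Comparing with `match()` and `startsWith()` instead of a regular expression sidesteps the question of special characters in the user input.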
I have a couple of questions:
Can someone recommend different approaches to improve speed/accuracy and make my code cleaner? Right now, it seems like a mess IMO.
If not, can someone recommend a way to fix my current approach, since this doesn't completely work.
In the grep I added `"^"` to make sure that I only return values that start with the user input, but when I tested with `"abstract"` the grep returned both `"abstract"` and `"abstraction"`, since both match the pattern. My logic relies on exactly one value being returned, so that I can pull the abbreviation using `merge()`. If I add `"$"` to the grep, my code breaks, since `"CONSUMER CREDIT"` doesn't have an exact match. Is there some other condition I can add to check whether the dictionary contains a phrase? If I search `"CONSUMER CREDIT"` I should get back `"CONSUMER CREDIT COUNSELING SERVICE"`, and if I search `"ABSTRACT"` I should get only `"ABSTRACT"` and not `"ABSTRACTION"`.
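The two failure modes can be reproduced in isolation. This is a small sketch on a hypothetical, already upper-cased sample of the `logical_word` column, not on the real dictionary:

```r
# Hypothetical upper-cased sample of the dictionary column.
logical_word <- c("ABSTRACT", "ABSTRACTION", "CONSUMER",
                  "CONSUMER CREDIT COUNSELING SERVICE")

# Prefix-only pattern: the longer word ABSTRACTION matches as well.
prefix_hits <- grep("^ABSTRACT", logical_word, value = TRUE)
print(prefix_hits)

# Fully anchored pattern: no entry is exactly "CONSUMER CREDIT".
anchored_hits <- grep("^CONSUMER CREDIT$", logical_word, value = TRUE)
print(anchored_hits)
```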
The code:
```r
searchDictionary <- function(userInput=NULL,dictionary=NULL){
  userInput <- data.table(logical_word=userInput)
  listTest <- data.frame(matrix(ncol = 5, nrow = nrow(userInput)))
  names(listTest) <- c("logical_word","abbreviation","V3","V4","V5")
  searchString <- ""
  reset <- "FALSE"
  searchStringDT <- data.table(matrix(ncol = 1, nrow = nrow(userInput)))
  names(searchStringDT) <- c("logical_word")
  i <- 1
  while(i <= nrow(userInput)){
    if(i==1){
      searchString <- userInput[i]
      searchStringDT$logical_word[i] <- unlist(searchString)
    }
    else if(reset == "TRUE"){
      searchString <- userInput[i]
      searchStringDT$logical_word[i] <- unlist(searchString)
    }
    else{
      searchString <- paste(searchString,userInput[i])
      searchStringDT$logical_word[i] <- unlist(searchString)
    }
    reset <- "FALSE"
    searchResult <- dictionary[grep( paste0("^",searchStringDT[i]), dictionary$logical_word), ]
    if(nrow(searchResult) == 0){
      if( grepl("\\s", searchStringDT[i]) ){
        listTest[i-1,] <- merge(searchStringDT[i-1], dictionary, "logical_word", all.x = TRUE, sort = FALSE)
        i <- i
      }
      else{
        listTest[i,] <- searchStringDT[i]
        i <- i+1
      }
      reset <- "TRUE"
    }
    else if(nrow(searchResult) == 1){
      listTest[i,] <- merge(searchStringDT[i], dictionary, "logical_word", all.x = TRUE, sort = FALSE)
      i <- i+1
    }
    else{
      i <-i+1
    }
  }
  print("out for loop listTest")
  print(listTest)
}
```
1 Answer
R supports Perl-compatible regular expressions, and these support `\b`, which matches at word boundaries. You can use this to avoid matching `abstraction`:
grep(paste0("^", word, "\\b"), haystack, perl = TRUE)
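As a quick check of that pattern, here is a minimal sketch on a hypothetical upper-cased sample; `find_phrase` is an illustrative wrapper, not part of the original code. It assumes the phrase contains only letters and spaces, since regex metacharacters in the input are not escaped here.

```r
# Hypothetical upper-cased sample of the dictionary column.
logical_word <- c("ABSTRACT", "ABSTRACTION", "BUSINESS-TO-CONSUMER",
                  "CONSUMER", "CONSUMER CREDIT COUNSELING SERVICE")

# Illustrative wrapper around the pattern above; returns matching entries.
find_phrase <- function(phrase, haystack) {
  grep(paste0("^", phrase, "\\b"), haystack, perl = TRUE, value = TRUE)
}

find_phrase("ABSTRACT", logical_word)         # only "ABSTRACT"
find_phrase("CONSUMER CREDIT", logical_word)  # the full four-word entry
find_phrase("CONSUMER", logical_word)         # both entries starting with CONSUMER

# Caveat: a hyphen also counts as a word boundary.
find_phrase("BUSINESS", logical_word)         # "BUSINESS-TO-CONSUMER"
```

Note that a single word such as `CONSUMER` still returns every entry that begins with that word, so the calling logic has to decide what to do with more than one hit.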
Instead of the string literals `"TRUE"` and `"FALSE"`, you should use the logical literals for the `reset` variable; just remove the quotes.
The spacing in your code is inconsistent. Have a look at https://style.tidyverse.org/ and either apply these rules manually to your code, or use an automatic formatter. RStudio certainly has one, and as of a few days ago IntelliJ also has an R plugin with a good formatter.
Instead of printing the result at the end of the function, you should rather just return it by leaving out the `print` and the parentheses. This makes it easier to write unit tests for it. If you haven't done so already, have a look at the `testthat` package.
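To illustrate why a returned value is easier to test, here is a minimal sketch. `lookup_abbreviation` and `dict` are hypothetical stand-ins, not your function or data; the expectations only run if `testthat` happens to be installed.

```r
# Hypothetical stand-in: returns its result instead of printing it,
# so callers and tests can inspect the value.
lookup_abbreviation <- function(phrase, dictionary) {
  dictionary$abbreviation[match(toupper(phrase), dictionary$logical_word)]
}

# Hypothetical mini-dictionary, already upper-cased.
dict <- data.frame(
  logical_word = c("ABSTRACT", "ZORRO"),
  abbreviation = c("ABST", "ZR"),
  stringsAsFactors = FALSE
)

# Write the expected behaviour down so it can be checked automatically.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("known phrases map to their abbreviations", {
    testthat::expect_equal(lookup_abbreviation("abstract", dict), "ABST")
    testthat::expect_equal(lookup_abbreviation("zorro", dict), "ZR")
    testthat::expect_true(is.na(lookup_abbreviation("unknown", dict)))
  })
}
```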
The `if` branches for `i == 1` and for `reset == TRUE` are the same. You should merge them by making the condition `i == 1 || reset` (after you have removed the quotes, as I suggested above).
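A stripped-down sketch of the merged branch, with a logical `reset`, could look like this. The word vector and the rule that sets `reset` are made up for illustration and merely stand in for the dictionary lookup.

```r
words <- c("CONSUMER", "CREDIT", "ZORRO")  # hypothetical input words
reset <- FALSE                             # logical literal, not the string "FALSE"
searchString <- ""
phrases <- character(length(words))

for (i in seq_along(words)) {
  if (i == 1 || reset) {
    # Start a new phrase: first word, or the previous phrase was finished.
    searchString <- words[i]
  } else {
    # Otherwise extend the current phrase by one word.
    searchString <- paste(searchString, words[i])
  }
  phrases[i] <- searchString
  # Stand-in for the lookup: pretend the phrase ends after CREDIT.
  reset <- (words[i] == "CREDIT")
}

print(phrases)
# "CONSUMER" "CONSUMER CREDIT" "ZORRO"
```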
Comments:

- Thank you! Worked like a charm. But I still have a doubt: when searching for "consumer" the grep returns vals 1309-1315 but I would like to only return the exact match. If I add "$" then when searching for "consumer credit" I wouldn't get anything back. Basically what I'm trying to do is search the dictionary DT one word at a time and if there is only 1 match then retrieve that abbreviation, if not then add the next word to the existing and search for the new val. If this is out of scope, can you recommend documentation to look into? I looked into some but still struggling with regex. – rNewb23, Oct 30, 2019 at 19:49
- As I suggested in my answer, you should write `testthat` tests. You already know how the code should behave in all the situations, therefore it makes sense to write down your expectations so that they can be checked automatically. Next, use Git or another version control system, so that you cannot lose any code that worked in the past. With that done, it's no longer risky to throw away your code, completely rewrite it or try out new ideas. – Roland Illig, Oct 30, 2019 at 19:55