I am interested in seeing how people would refactor this code to make it more readable, remove local variables, etc. The code checks a string against a number of arrays. The first (0) field of each array is the root word and the rest of the words are its synonyms. The idea is to replace any word in the user-provided string with the root word or provide the user-supplied word if none were found.
import java.util.Arrays;
public class Synonymiser {
private final static int ROOT_WORD = 0;
private String[][] mSynonyms;
public Synonymiser(String[]... synonyms) {
super();
this.mSynonyms = synonyms;
}
public String normaliseString(String inStr){
String[] inArr = inStr.split(" ");
boolean matchFound;
for (int i = 0; i < inArr.length; i++){
matchFound=false;
for (String[] synonymArr : mSynonyms){
for(int j = 0 ; j < synonymArr.length; j++){
if (inArr[i].equalsIgnoreCase(synonymArr[j])){
inArr[i] = synonymArr[ROOT_WORD];
matchFound = true;
}
if (matchFound) break;
}
if (matchFound) break;
}
}
return Arrays.toString(inArr);
}
public static void main(String [] args){
String[] ar1 = new String[] {"one", "two", "three", "four", "five"};
String[] ar2 = new String[] {"six", "seven", "eight"};
Synonymiser s = new Synonymiser(ar1, ar2);
System.out.println(s.normaliseString("one two seven bulb"));
}
}
5 Answers 5
1, You could eliminate the matchFound
flag if you create a label for your outer for
loop and call break
with this label:
matchLoop: for (final String[] synonymArr : mSynonyms) {
for (int j = 0; j < synonymArr.length; j++) {
if (inArr[i].equalsIgnoreCase(synonymArr[j])) {
inArr[i] = synonymArr[ROOT_WORD];
break matchLoop;
}
}
}
Anyway, I think using labels is more or less bad smell, so try to extract out the loops to separate methods.
2, I'd create at least a Word
class which stores the (root)word and its synonyms. I should result more readable the code in the Synonymiser
class too:
public class Word {
private final String rootWord;
private final Set<String> synomins;
public Word(final String rootWord, final Collection<String> synomins) {
super();
// TODO: empty String/null check
this.rootWord = rootWord;
// TODO: convert the input to lowercase (it helps contains())
this.synomins = new HashSet<String>(synomins);
this.synomins.add(rootWord);
}
public static Word createWord(final String rootWord, final String... synonims) {
final List<String> synonimList = Arrays.asList(synonims);
return new Word(rootWord, synonimList);
}
public boolean contains(final String word) {
// TODO: convert word to lowercase for proper comparison
if (synomins.contains(word)) {
return true;
}
return false;
}
public String getRootWord() {
return rootWord;
}
}
I would like to go beyond just making the code prettier, and change the method that you are using. Instead of looping through the arrays looking for matches, you can put the synonyms in a hashed list. That will give you a performance close to O(1) for each lookup, instead of O(n).
Example in C#:
public class Synonymiser {
private Dictionary<string, string> _synonyms = new Dictionary<string, string>();
public Synonymiser AddWord(string word, params string[] synonyms) {
foreach (string synonym in synonyms) {
_synonyms.Add(synonym.ToUpper(), word);
}
return this;
}
public string NormaliseString(string inStr){
string replacement;
return String.Join(
" ",
inStr.Split(' ')
.Select(w => _synonyms.TryGetValue(w.ToUpper(), out replacement) ? replacement : w)
);
}
}
Usage:
Synonymiser s =
new Synonymiser()
.AddWord("one", "two", "three", "four", "five")
.AddWord("six", "seven", "eight");
Console.WriteLine(s.NormaliseString("one two seven bulb"));
Output:
one one six bulb
-
\$\begingroup\$ I think you mean HashMap/Dictionary, not HashSet... \$\endgroup\$Chris Marasti-Georg– Chris Marasti-Georg2011年12月21日 17:45:37 +00:00Commented Dec 21, 2011 at 17:45
-
\$\begingroup\$ @ChrisMarasti-Georg: Yes, you are right. A set would be only keys, so I changed it to read "a hashed list". \$\endgroup\$Guffa– Guffa2011年12月21日 18:20:59 +00:00Commented Dec 21, 2011 at 18:20
This implementation will end up having poor performance for large dictionaries - you will have O(n) performance for n root words. I would recommend converting the synonym array into a HashMap, mapping a key word to its root word value. This will give you a slightly more expensive initialization (essentially, the cost that you pay for each call to normalizeString will be paid once, on construction). The memory footprint will be slightly larger as well - one extra pointer per synonym.
import java.util.Arrays;
import java.util.HashMap;
public class Synonymiser {
private final static int ROOT_WORD = 0;
private HashMap<String, String> mSynonyms;
public Synonymiser(String[]... synonyms) {
super();
mSynonyms = new HashMap<String, String>(synonyms.length*5); //Enough initial storage for 5 synonyms per root
for(String[] root: synonyms) {
for(int i = 0; i<root.length; ++i) {
if(i == ROOT_WORD) {
continue;
}
mSynonyms.put(root[i], root[ROOT_WORD]);
}
}
}
public String normaliseString(String inStr){
String[] inArr = inStr.split(" ");
for (int i = 0; i < inArr.length; i++){
String synonym = mSynonyms.get(inArr[i]);
if(synonym != null) {
inArr[i] = synonym;
}
}
return Arrays.toString(inArr);
}
public static void main(String [] args){
String[] ar1 = new String[] {"one", "two", "three", "four", "five"};
String[] ar2 = new String[] {"six", "seven", "eight"};
Synonymiser s = new Synonymiser(ar1, ar2);
System.out.println(s.normaliseString("one two seven bulb"));
}
}
I'd start by extracting the body of the outermost loop. For any given candidate string, find the synonym.
public class Synonymiser {
...
public String normaliseString(String inStr){
String[] inArr = inStr.split(" ");
for (int i = 0; i < inArr.length; i++) {
inArr[i] = synonym(inArr[i]);
}
return Arrays.toString(inArr);
}
private String synonym(String candidate) {
for (String[] array : mSynonyms) {
for (String s : array) {
if (candidate.equalsIgnoreCase(s) {
return array[ROOT_WORD];
}
}
}
return candidate;
}
...
}
Here's my take; it's basically the same as the others. Notes and caveats, in no order:
- I keep two maps, one of primary => synonyms, one of synonym => word.
- Added error condition if a synonym is registered to two different primaries.
- Ignore if a synonym is added to single primary, multiple times.
- Does not ensure a synonym is never registered as a primary, or vice-versa--in reality there would need to be a decision regarding how to handle that situation.
- Code unit-tested but not used outside of tests has been stripped.
- Comments and parameter validation stripped.
- I return a string (joined with Commons Lang), not an array.
- One bug left in intentionally, for fun. Any others, probably not intentional.
public class Synonymizer {
private Map<String, Set<String>> wordSynonymsMap;
private Map<String, String> synonymWordMap;
public Synonymizer() {
wordSynonymsMap = new HashMap<String, Set<String>>();
synonymWordMap = new HashMap<String, String>();
}
public void addSynonyms(String word, String... synonyms) {
Set<String> wordSynonyms = getSynonyms(word);
for (String synonym : synonyms) {
wordSynonyms.add(synonym);
if (synonymWordMap.containsKey(synonym) && !synonymWordMap.get(synonym).equals(word)) {
throw new RuntimeException(String.format("'%s' already used as synonym for '%s'", synonym, synonymWordMap.get(synonym)));
}
synonymWordMap.put(synonym, word);
}
}
public Set<String> getSynonyms(String word) {
if (!wordSynonymsMap.containsKey(word)) {
wordSynonymsMap.put(word, new HashSet<String>());
}
return wordSynonymsMap.get(word);
}
public String normalizeString(String s) {
List<String> l = new ArrayList<String>();
for (String word : s.split(" ")) {
l.add(synonymize(word));
}
return StringUtils.join(l, " ");
}
public String synonymize(String word) {
if (wordSynonymsMap.containsKey(word)) {
return word;
}
String primary = synonymWordMap.get(word);
if (primary != null) {
return primary;
}
return word;
}
}
for (int j = 0;
loop can be replaced with afor (String possibleSynonym : synonymArr)
\$\endgroup\$