Synonymiser refactoring

Question 1

I am interested in seeing how people would refactor this code to make it more readable, remove local variables, etc. The code checks a string against a number of arrays. The first (0) field of each array is the root word and the rest of the words are its synonyms. The idea is to replace any word in the user-provided string with the root word or provide the user-supplied word if none were found.

import java.util.Arrays;
public class Synonymiser {
 private final static int ROOT_WORD = 0;
 private String[][] mSynonyms;
 public Synonymiser(String[]... synonyms) {
 super();
 this.mSynonyms = synonyms;
 }
 public String normaliseString(String inStr){
 String[] inArr = inStr.split(" ");
 boolean matchFound;
 for (int i = 0; i < inArr.length; i++){
 matchFound=false;
 for (String[] synonymArr : mSynonyms){
 for(int j = 0 ; j < synonymArr.length; j++){
 if (inArr[i].equalsIgnoreCase(synonymArr[j])){
 inArr[i] = synonymArr[ROOT_WORD];
 matchFound = true;
 }
 if (matchFound) break; 
 }
 if (matchFound) break; 
 }
 }
 return Arrays.toString(inArr);
 }
 public static void main(String [] args){
 String[] ar1 = new String[] {"one", "two", "three", "four", "five"};
 String[] ar2 = new String[] {"six", "seven", "eight"};
 Synonymiser s = new Synonymiser(ar1, ar2);
 System.out.println(s.normaliseString("one two seven bulb"));
 }
}

Question 2

I'd start with consistent indentation.

Question 3

sorry that's more to do with me making it display as code on this site, it's got standard eclipse indentation in my project

Question 4

Part of the problem as it seems is that you have mixed spaces and tabs for your indentation. Choose one or the other. Once you do that, you'll make a lot of people happy. :)

Question 5

The for (int j = 0; loop can be replaced with a for (String possibleSynonym : synonymArr)

Question 6

1, You could eliminate the matchFound flag if you create a label for your outer for loop and call break with this label:

matchLoop: for (final String[] synonymArr : mSynonyms) {
 for (int j = 0; j < synonymArr.length; j++) {
 if (inArr[i].equalsIgnoreCase(synonymArr[j])) {
 inArr[i] = synonymArr[ROOT_WORD];
 break matchLoop;
 }
 }
}

Anyway, I think using labels is more or less bad smell, so try to extract out the loops to separate methods.

2, I'd create at least a Word class which stores the (root)word and its synonyms. I should result more readable the code in the Synonymiser class too:

public class Word {
 private final String rootWord;
 private final Set<String> synomins;
 public Word(final String rootWord, final Collection<String> synomins) {
 super();
 // TODO: empty String/null check
 this.rootWord = rootWord;
 // TODO: convert the input to lowercase (it helps contains())
 this.synomins = new HashSet<String>(synomins);
 this.synomins.add(rootWord);
 }
 public static Word createWord(final String rootWord, final String... synonims) {
 final List<String> synonimList = Arrays.asList(synonims);
 return new Word(rootWord, synonimList);
 }
 public boolean contains(final String word) {
 // TODO: convert word to lowercase for proper comparison
 if (synomins.contains(word)) {
 return true;
 }
 return false;
 }
 public String getRootWord() {
 return rootWord;
 }
}

Question 7

I would like to go beyond just making the code prettier, and change the method that you are using. Instead of looping through the arrays looking for matches, you can put the synonyms in a hashed list. That will give you a performance close to O(1) for each lookup, instead of O(n).

Example in C#:

public class Synonymiser {
 private Dictionary<string, string> _synonyms = new Dictionary<string, string>();
 public Synonymiser AddWord(string word, params string[] synonyms) {
 foreach (string synonym in synonyms) {
 _synonyms.Add(synonym.ToUpper(), word);
 }
 return this;
 }
 public string NormaliseString(string inStr){
 string replacement;
 return String.Join(
 " ",
 inStr.Split(' ')
 .Select(w => _synonyms.TryGetValue(w.ToUpper(), out replacement) ? replacement : w)
 );
 }
}

Usage:

Synonymiser s =
 new Synonymiser()
 .AddWord("one", "two", "three", "four", "five")
 .AddWord("six", "seven", "eight");
Console.WriteLine(s.NormaliseString("one two seven bulb"));

Output:

one one six bulb

Question 8

I think you mean HashMap/Dictionary, not HashSet...

Question 9

@ChrisMarasti-Georg: Yes, you are right. A set would be only keys, so I changed it to read "a hashed list".

Question 10

This implementation will end up having poor performance for large dictionaries - you will have O(n) performance for n root words. I would recommend converting the synonym array into a HashMap, mapping a key word to its root word value. This will give you a slightly more expensive initialization (essentially, the cost that you pay for each call to normalizeString will be paid once, on construction). The memory footprint will be slightly larger as well - one extra pointer per synonym.

import java.util.Arrays;
import java.util.HashMap;
public class Synonymiser {
 private final static int ROOT_WORD = 0;
 private HashMap<String, String> mSynonyms;
 public Synonymiser(String[]... synonyms) {
 super();
 mSynonyms = new HashMap<String, String>(synonyms.length*5); //Enough initial storage for 5 synonyms per root
 for(String[] root: synonyms) {
 for(int i = 0; i<root.length; ++i) {
 if(i == ROOT_WORD) {
 continue;
 }
 mSynonyms.put(root[i], root[ROOT_WORD]);
 }
 }
 }
 public String normaliseString(String inStr){
 String[] inArr = inStr.split(" ");
 for (int i = 0; i < inArr.length; i++){
 String synonym = mSynonyms.get(inArr[i]);
 if(synonym != null) {
 inArr[i] = synonym;
 }
 }
 return Arrays.toString(inArr);
 }
 public static void main(String [] args){
 String[] ar1 = new String[] {"one", "two", "three", "four", "five"};
 String[] ar2 = new String[] {"six", "seven", "eight"};
 Synonymiser s = new Synonymiser(ar1, ar2);
 System.out.println(s.normaliseString("one two seven bulb"));
 }
}

Question 11

I'd start by extracting the body of the outermost loop. For any given candidate string, find the synonym.

public class Synonymiser {
 ...
 public String normaliseString(String inStr){
 String[] inArr = inStr.split(" ");
 for (int i = 0; i < inArr.length; i++) {
 inArr[i] = synonym(inArr[i]);
 }
 return Arrays.toString(inArr);
 }
 private String synonym(String candidate) {
 for (String[] array : mSynonyms) {
 for (String s : array) {
 if (candidate.equalsIgnoreCase(s) {
 return array[ROOT_WORD];
 }
 }
 }
 return candidate;
 }
 ...
}

Question 12

Here's my take; it's basically the same as the others. Notes and caveats, in no order:

I keep two maps, one of primary => synonyms, one of synonym => word.
Added error condition if a synonym is registered to two different primaries.
Ignore if a synonym is added to single primary, multiple times.
Does not ensure a synonym is never registered as a primary, or vice-versa--in reality there would need to be a decision regarding how to handle that situation.
Code unit-tested but not used outside of tests has been stripped.
Comments and parameter validation stripped.
I return a string (joined with Commons Lang), not an array.
One bug left in intentionally, for fun. Any others, probably not intentional.

public class Synonymizer {
 private Map<String, Set<String>> wordSynonymsMap;
 private Map<String, String> synonymWordMap;
 public Synonymizer() {
 wordSynonymsMap = new HashMap<String, Set<String>>();
 synonymWordMap = new HashMap<String, String>();
 }
 public void addSynonyms(String word, String... synonyms) {
 Set<String> wordSynonyms = getSynonyms(word);
 for (String synonym : synonyms) {
 wordSynonyms.add(synonym);
 if (synonymWordMap.containsKey(synonym) && !synonymWordMap.get(synonym).equals(word)) {
 throw new RuntimeException(String.format("'%s' already used as synonym for '%s'", synonym, synonymWordMap.get(synonym)));
 }
 synonymWordMap.put(synonym, word);
 }
 }
 public Set<String> getSynonyms(String word) {
 if (!wordSynonymsMap.containsKey(word)) {
 wordSynonymsMap.put(word, new HashSet<String>());
 }
 return wordSynonymsMap.get(word);
 }
 public String normalizeString(String s) {
 List<String> l = new ArrayList<String>();
 for (String word : s.split(" ")) {
 l.add(synonymize(word));
 }
 return StringUtils.join(l, " ");
 }
 public String synonymize(String word) {
 if (wordSynonymsMap.containsKey(word)) {
 return word;
 }
 String primary = synonymWordMap.get(word);
 if (primary != null) {
 return primary;
 }
 return word;
 }
}

palacsint palacsint 30.3k9 gold badges82 silver badges157 bronze badges · Accepted Answer · 2011-12-21 09:20:43Z

1, You could eliminate the matchFound flag if you create a label for your outer for loop and call break with this label:

matchLoop: for (final String[] synonymArr : mSynonyms) {
 for (int j = 0; j < synonymArr.length; j++) {
 if (inArr[i].equalsIgnoreCase(synonymArr[j])) {
 inArr[i] = synonymArr[ROOT_WORD];
 break matchLoop;
 }
 }
}

Anyway, I think using labels is more or less bad smell, so try to extract out the loops to separate methods.

2, I'd create at least a Word class which stores the (root)word and its synonyms. I should result more readable the code in the Synonymiser class too:

public class Word {
 private final String rootWord;
 private final Set<String> synomins;
 public Word(final String rootWord, final Collection<String> synomins) {
 super();
 // TODO: empty String/null check
 this.rootWord = rootWord;
 // TODO: convert the input to lowercase (it helps contains())
 this.synomins = new HashSet<String>(synomins);
 this.synomins.add(rootWord);
 }
 public static Word createWord(final String rootWord, final String... synonims) {
 final List<String> synonimList = Arrays.asList(synonims);
 return new Word(rootWord, synonimList);
 }
 public boolean contains(final String word) {
 // TODO: convert word to lowercase for proper comparison
 if (synomins.contains(word)) {
 return true;
 }
 return false;
 }
 public String getRootWord() {
 return rootWord;
 }
}

Stack Exchange Network

Synonymiser refactoring

5 Answers 5

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Synonymiser refactoring

5 Answers 5

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions