I want to improve this code. I know I'm not fully utilizing the data structure. (This is not a homework assignment, I finished and passed the class this fall semester). I would really like to know for future reference.
/**
* Words indexed by this trie.
*/
ArrayList<String> words;
/**
* Root node of this trie.
*/
TrieNode root;
/**
* Initializes a compressed trie with words to be indexed, and root node set to
* null fields.
*
* @param words
*/
public Trie() {
root = new TrieNode(null, null, null);
words = new ArrayList<String>();
}
/**
* Inserts a word into this trie. Converts to lower case before adding.
* The word is first added to the words array list, then inserted into the trie.
*
* @param word Word to be inserted.
*/
public void insertWord(String word) {
words.add(word);
TrieNode node = new TrieNode(new Indexes(words.size() - 1, (short) 0, (short) word.length()), null, null);
if (root.firstChild == null) {
root.firstChild = node;
return;
}
String prefix = findlargestPrefix(root, word);
TrieNode last = findNode(root.firstChild, word, prefix);
if (last == null)
return;
if (!prefix.isEmpty()) {
short len = (short) prefix.length();
TrieNode child = new TrieNode(new Indexes(last.substr.wordIndex, len, last.substr.endIndex), last.firstChild, node);
TrieNode parent = findNode(root.firstChild, word, findSmallestPrefix(root, word));
node.substr.startIndex = last.substr.endIndex = len;
last.substr.startIndex = parent != last ? parent.substr.endIndex : 0;
if (child.substr.startIndex == child.substr.endIndex) {
node.sibling = last.firstChild;
last.firstChild = node;
} else
last.firstChild = child;
return;
}
last.sibling = node;
}
/**
* Given a string prefix, returns its "completion list", i.e. all the words
* in the trie that start with this prefix. For instance, if the tree had
* the words bear, bull, stock, and bell, the completion list for prefix "b"
* would be bear, bull, and bell; for prefix "be" would be bear and bell;
* and for prefix "bell" would be bell. (The last example shows that a
* prefix can be an entire word.) The order of returned words DOES NOT
* MATTER. So, if the list contains bear and bell, the returned list can be
* either [bear,bell] or [bell,bear]
*
* @param prefix
* Prefix to be completed with words in trie
* @return List of all words in tree that start with the prefix, order of
* words in list does not matter. If there is no word in the tree
* that has this prefix, null is returned.
*/
public ArrayList<String> completionList(String prefix) {
TrieNode node = null;
for (int index = 1; index <= prefix.length(); index++) {
node = findNode(root, prefix, prefix.substring(0, index));
if (node != null)
return completionList(node, prefix);
}
return null;
}
private ArrayList<String> completionList(TrieNode root, String prefix) {
ArrayList<String> common = new ArrayList<String>();
if (root == null)
return null;
ArrayList<String> recursive = completionList(root.sibling, prefix);
if (recursive != null)
common.addAll(recursive);
recursive = completionList(root.firstChild, prefix);
if (recursive != null)
common.addAll(recursive);
if (isPrefixNode(root))
return common;
String rootWord = findLiteralWord(root);
if (rootWord.startsWith(prefix))
common.add(rootWord);
if (common.isEmpty())
return null;
return common;
}
private TrieNode findNode(TrieNode root, String word, String prefix) {
if (root == null)
return null;
TrieNode recurse = findNode(root.sibling, word, prefix);
if (recurse != null)
return recurse;
if (prefix.isEmpty())
return root;
recurse = findNode(root.firstChild, word, prefix);
if (recurse != null)
return recurse;
String rootPrefix = findPrefix(findTrieWord(root), word);
if (rootPrefix.equals(prefix))
return root;
if (!rootPrefix.isEmpty() && prefix.startsWith(rootPrefix)) {
String sub = prefix.substring(rootPrefix.length());
return findNode(root.firstChild, sub, sub);
}
return null;
}
private String findPrefix(TrieNode root, String word, final boolean largest) {
ArrayList<String> prefixes = new ArrayList<String>(words.size());
if (root == null)
return "";
for (TrieNode ptr = root.firstChild; ptr != null; ptr = ptr.sibling) {
String recurse = findPrefix(ptr, word, largest);
if (recurse.isEmpty())
continue;
prefixes.add(recurse);
}
if (!prefixes.isEmpty())
return prefixes.stream().sorted((a, b) -> (largest ? b.length() - a.length() : a.length() - b.length()))
.findFirst().get();
return findPrefix(findLiteralWord(root), word);
}
private String findPrefix(String a, String b) {
int len = Math.min(a.length(), b.length());
for (int index = 0; index < len; index++) {
if (a.charAt(index) != b.charAt(index))
return a.substring(0, index);
}
return a.substring(0, len);
}
private String findlargestPrefix(TrieNode root, String word) {
return findPrefix(root, word, true);
}
private String findSmallestPrefix(TrieNode root, String word) {
return findPrefix(root, word, false);
}
private String findWord(TrieNode root, boolean sub) {
if (root.substr == null)
return "3ROOTWORD";
String word = words.get(root.substr.wordIndex);
return sub ? word.substring(root.substr.startIndex, root.substr.endIndex) : word;
}
private String findTrieWord(TrieNode root) {
return findWord(root, true);
}
private String findLiteralWord(TrieNode root) {
return findWord(root, false);
}
private boolean isPrefixNode(TrieNode root) {
return root.firstChild != null;
}
public void print() {
print(root, 1, words);
}
private static void print(TrieNode root, int indent, ArrayList<String> words) {
if (root == null) {
return;
}
for (int i=0; i < indent-1; i++) {
System.out.print(" ");
}
if (root.substr != null) {
System.out.println(" " + words.get(root.substr.wordIndex).substring(root.substr.startIndex, root.substr.endIndex));
}
for (int i=0; i < indent-1; i++) {
System.out.print(" ");
}
System.out.print(" ---");
System.out.println("(" + root.substr + ")");
for (TrieNode ptr=root.firstChild; ptr != null; ptr=ptr.sibling) {
for (int i=0; i < indent-1; i++) {
System.out.print(" ");
}
System.out.println(" |");
print(ptr, indent+1, words);
}
}
}
package structures;
/**
* This class encapsulates the set of 3 indexes that point to a substring
* stored in an array of strings. The array of strings is the collection of
* words that are indexed by the trie. Each node of the trie will have an
* instance of Indexes.
*
* Example: consider the words "have", "hit", "see", "data" stored in an
* array in that order. Then, the substring "ave" in "have" will be indexed
* by the triplet (0,1,3) ["have" is at position 0 in the array, 1 is the index
* of character 'a' in "have" and 3 is the index of character 'e' in "have"].
* Similarly, substring "ee" in the word "see" will be indexed by the triplet
* (2,1,2).
*
* Substrings may be single characters, as in the first "a" in "data",
* which will be indexed by the triplet (3,1,1), or the second "a" in "data",
* which will be indexes by the triplet (3,3,3)
*
*
* @author Sesh Venugopal
*
*/
class Indexes {
/**
* Index into the word collection array.
*/
int wordIndex;
/**
* Start index of substring in word.
*/
short startIndex;
/**
* End index of substring in word.
*/
short endIndex;
/**
* Initializes this instance with all indexes.
*
* @param wordIndex Index of word in array of words
* @param startIndex Starting index of substring
* @param endIndex Ending index of substring
*/
public Indexes(int wordIndex, short startIndex, short endIndex) {
this.wordIndex = wordIndex;
this.startIndex = startIndex;
this.endIndex = endIndex;
}
/* (non-Javadoc)
* @see java.lang.Object#toString()
*/
public String toString() {
return "(" + wordIndex + "," + startIndex + "," + endIndex + ")";
}
/* (non-Javadoc)
* @see java.lang.Object#equals(java.lang.Object)
*/
public boolean equals(Object o) {
if (o == null || !(o instanceof Indexes)) {
return false;
}
Indexes oi = (Indexes)o;
return wordIndex == oi.wordIndex &&
startIndex == oi.startIndex &&
endIndex == oi.endIndex;
}
}
/**
* This class encapsulates a compressed trie node with fields for the following:
* - an Indexes instance, pointing to the substring that is held at that node
* - the first child node
* - the sibling node
*
* @author Sesh Venugopal
*
*/
public class TrieNode {
/**
* Substring held at this node (could be a single character)
*/
Indexes substr;
/**
* First child of this node
*/
TrieNode firstChild;
/**
* Sibling of this node
*/
TrieNode sibling;
/**
* Initializes this trie node with substring, first child, and sibling
*
* @param substr Substring held at this node
* @param firstChild First child of this node
* @param sibling Sibling of this node
*/
public TrieNode(Indexes substr, TrieNode firstChild, TrieNode sibling) {
this.substr = substr;
this.firstChild = firstChild;
this.sibling = sibling;
}
/* (non-Javadoc)
* @see java.lang.Object#toString()
*/
public String toString() {
return substr.toString();
}
}
2 Answers 2
Prefer interfaces to implementations
ArrayList<String> words;
Normally we'd prefer the interface to the implementation. In this case
List<String> words;
Now we can easily change the implementation so long as it still uses the same interface.
Very rarely we use the implementation directly because there is some method available on the implementation that is not available on the interface. However, that doesn't seem to be true here.
Type inference
ArrayList<String> common = new ArrayList<String>();
Consider
List<String> common = new ArrayList<>();
This saves a bit of typing and fixes things so that you don't accidentally use different types on either side of the equals sign.
Use methods for repeated code
for (int i=0; i < indent-1; i++) { System.out.print(" "); }
You do this three times. Consider
printIndent(indent);
Which you could implement
public void printIndent(int indent) {
for (; indent > 0; indent--) {
System.out.print(" ");
}
}
This also saves doing indent - 1
.
findWord
private String findWord(TrieNode root, boolean sub) { if (root.substr == null) return "3ROOTWORD"; String word = words.get(root.substr.wordIndex); return sub ? word.substring(root.substr.startIndex, root.substr.endIndex) : word; } private String findTrieWord(TrieNode root) { return findWord(root, true); } private String findLiteralWord(TrieNode root) { return findWord(root, false); }
What's sub
?
Consider
private String findWord(TrieNode root, boolean sub) {
return sub ? findTrieWord(root) : findLiteralWord(root);
}
private String findWord(TrieNode root) {
if (root.substr == null) {
return "3ROOTWORD";
}
return words.get(root.substr.wordIndex);
}
private String findTrieWord(TrieNode root) {
return findWord(root).substring(root.substr.startIndex, root.substr.endIndex);
}
private String findLiteralWord(TrieNode root) {
return findWord(root);
}
If the first method is not necessary, this would allow us to get rid of sub
altogether. We could also get rid of findLiteralWord
and just call findWord
directly. Otherwise, this supports the original functionality without the extra check.
I prefer the block form of control structures like if
statements.
-
\$\begingroup\$ The use of an arraylist was a requirement, I couldn't change that. As well as the print method, it was something that was given and I couldn't remove from the class. As for type inference, nifty. Didn't know Java added capability for that! The tips are great and all but I was looking more on how to improve the performance as well. I was told that my implementation was similar to searching every prefix. That is more so my goal. \$\endgroup\$Khaled Jaber– Khaled Jaber2017年01月11日 04:51:19 +00:00Commented Jan 11, 2017 at 4:51
KISS
My first thought when I started looking at the code was that the first few lines seemed unnecessary:
/** * Words indexed by this trie. */ ArrayList<String> words;
Surely the point of the trie is to store the words in the nodes, so why should they be stored elsewhere? If Trie.words
is eliminated and the class Indexes
is eliminated then TrieNode.substr
can just store a direct substring of one of the strings. (Before anyone complains that this would increase memory usage: Java's String.substring
shares the char[]
with the original string, which it can do safely because strings are immutable in Java. The only memory increase would be a boolean
per node to record whether the node ends a word, and that's balanced by the memory saving of 64-plus bytes per word from not having the list).
A search shouldn't force a second search
String prefix = findlargestPrefix(root, word); TrieNode last = findNode(root.firstChild, word, prefix);
That recurses down the tree to find a node, and then returns enough information to recurse down the tree again looking for the same node. That's a clear sign that refactoring is necessary.
Special cases
* @return List of all words in tree that start with the prefix, order of * words in list does not matter. If there is no word in the tree * that has this prefix, null is returned.
Why? If there is no word, returning a list containing no words would be more consistent and would simplify things for the caller.
private String findWord(TrieNode root, boolean sub) { if (root.substr == null) return "3ROOTWORD";
What happens if I decide to call insertWord("3ROOTWORD")
? Does the test suite cover that?
Check the documentation
* @param words */ public Trie() {
Looks like either a refactor or a copy-paste which didn't tidy up the JavaDoc. I would expect the IDE to give a warning.
/** * Inserts a word into this trie. Converts to lower case before adding. * The word is first added to the words array list, then inserted into the trie. * * @param word Word to be inserted. */ public void insertWord(String word) { words.add(word);
I don't see the conversion to lower case. Is the doc buggy or the code?
Also, maybe because the posted code doesn't include the class header for the main class, I don't see a full explanation of the data structure. I see mention of it being a "compressed trie", and it's possible to deduce from the fact that a TrieNode
contains a substring which isn't necessarily a prefix or a suffix roughly what is meant by this, but e.g. what guarantees do we get about common prefixes between siblings? And is there any constraint on the order of the singly linked list implemented with TrieNode.sibling
? That information is essential for the maintenance programmer who has to deal with this code in two years' time, and they shouldn't have to try to reverse engineer the code to work it out.
Optimisation
I mentioned the singly linked list implemented with TrieNode.sibling
. I would prefer to use a suitable Collection<TrieNode> children
for two reasons: firstly, because there doesn't seem to be a good reason for reimplementing something which the standard library already supports; and secondly, because with the right collection (I'm thinking TreeSet<>
) you could use binary chop rather than linear search.
-
\$\begingroup\$ As I said I was required to use all the resources posted above. As for your search within a search, how would you modify it to keep the same functionality. \$\endgroup\$Khaled Jaber– Khaled Jaber2017年01月14日 05:54:30 +00:00Commented Jan 14, 2017 at 5:54
TrieNode
andIndexes
. \$\endgroup\$