Implementation of Trie Structure

Question 1

I want to improve this code. I know I'm not fully utilizing the data structure. (This is not a homework assignment, I finished and passed the class this fall semester). I would really like to know for future reference.

 /**
 * Words indexed by this trie.
 */
 ArrayList<String> words;
 /**
 * Root node of this trie.
 */
 TrieNode root;
 /**
 * Initializes a compressed trie with words to be indexed, and root node set to
 * null fields.
 * 
 * @param words
 */
 public Trie() {
 root = new TrieNode(null, null, null);
 words = new ArrayList<String>();
 }
 /**
 * Inserts a word into this trie. Converts to lower case before adding.
 * The word is first added to the words array list, then inserted into the trie.
 * 
 * @param word Word to be inserted.
 */
 public void insertWord(String word) {
 words.add(word);
 TrieNode node = new TrieNode(new Indexes(words.size() - 1, (short) 0, (short) word.length()), null, null);
 if (root.firstChild == null) {
 root.firstChild = node;
 return;
 }
 String prefix = findlargestPrefix(root, word);
 TrieNode last = findNode(root.firstChild, word, prefix);
 if (last == null)
 return;
 if (!prefix.isEmpty()) {
 short len = (short) prefix.length();
 TrieNode child = new TrieNode(new Indexes(last.substr.wordIndex, len, last.substr.endIndex), last.firstChild, node);
 TrieNode parent = findNode(root.firstChild, word, findSmallestPrefix(root, word));
 node.substr.startIndex = last.substr.endIndex = len;
 last.substr.startIndex = parent != last ? parent.substr.endIndex : 0;
 if (child.substr.startIndex == child.substr.endIndex) {
 node.sibling = last.firstChild;
 last.firstChild = node;
 } else
 last.firstChild = child;
 return;
 }
 last.sibling = node;
 }
 /**
 * Given a string prefix, returns its "completion list", i.e. all the words
 * in the trie that start with this prefix. For instance, if the tree had
 * the words bear, bull, stock, and bell, the completion list for prefix "b"
 * would be bear, bull, and bell; for prefix "be" would be bear and bell;
 * and for prefix "bell" would be bell. (The last example shows that a
 * prefix can be an entire word.) The order of returned words DOES NOT
 * MATTER. So, if the list contains bear and bell, the returned list can be
 * either [bear,bell] or [bell,bear]
 * 
 * @param prefix
 * Prefix to be completed with words in trie
 * @return List of all words in tree that start with the prefix, order of
 * words in list does not matter. If there is no word in the tree
 * that has this prefix, null is returned.
 */
 public ArrayList<String> completionList(String prefix) {
 TrieNode node = null;
 for (int index = 1; index <= prefix.length(); index++) {
 node = findNode(root, prefix, prefix.substring(0, index));
 if (node != null)
 return completionList(node, prefix);
 }
 return null;
 }
 private ArrayList<String> completionList(TrieNode root, String prefix) {
 ArrayList<String> common = new ArrayList<String>();
 if (root == null)
 return null;
 ArrayList<String> recursive = completionList(root.sibling, prefix);
 if (recursive != null)
 common.addAll(recursive);
 recursive = completionList(root.firstChild, prefix);
 if (recursive != null)
 common.addAll(recursive);
 if (isPrefixNode(root))
 return common;
 String rootWord = findLiteralWord(root);
 if (rootWord.startsWith(prefix))
 common.add(rootWord);
 if (common.isEmpty())
 return null;
 return common;
 }
 private TrieNode findNode(TrieNode root, String word, String prefix) {
 if (root == null)
 return null;
 TrieNode recurse = findNode(root.sibling, word, prefix);
 if (recurse != null)
 return recurse;
 if (prefix.isEmpty())
 return root;
 recurse = findNode(root.firstChild, word, prefix);
 if (recurse != null)
 return recurse;
 String rootPrefix = findPrefix(findTrieWord(root), word);
 if (rootPrefix.equals(prefix))
 return root;
 if (!rootPrefix.isEmpty() && prefix.startsWith(rootPrefix)) {
 String sub = prefix.substring(rootPrefix.length());
 return findNode(root.firstChild, sub, sub);
 }
 return null;
 }
 private String findPrefix(TrieNode root, String word, final boolean largest) {
 ArrayList<String> prefixes = new ArrayList<String>(words.size());
 if (root == null)
 return "";
 for (TrieNode ptr = root.firstChild; ptr != null; ptr = ptr.sibling) {
 String recurse = findPrefix(ptr, word, largest);
 if (recurse.isEmpty())
 continue;
 prefixes.add(recurse);
 }
 if (!prefixes.isEmpty())
 return prefixes.stream().sorted((a, b) -> (largest ? b.length() - a.length() : a.length() - b.length()))
 .findFirst().get();
 return findPrefix(findLiteralWord(root), word);
 }
 private String findPrefix(String a, String b) {
 int len = Math.min(a.length(), b.length());
 for (int index = 0; index < len; index++) {
 if (a.charAt(index) != b.charAt(index))
 return a.substring(0, index);
 }
 return a.substring(0, len);
 }
 private String findlargestPrefix(TrieNode root, String word) {
 return findPrefix(root, word, true);
 }
 private String findSmallestPrefix(TrieNode root, String word) {
 return findPrefix(root, word, false);
 }
 private String findWord(TrieNode root, boolean sub) {
 if (root.substr == null)
 return "3ROOTWORD";
 String word = words.get(root.substr.wordIndex);
 return sub ? word.substring(root.substr.startIndex, root.substr.endIndex) : word;
 }
 private String findTrieWord(TrieNode root) {
 return findWord(root, true);
 }
 private String findLiteralWord(TrieNode root) {
 return findWord(root, false);
 }
 private boolean isPrefixNode(TrieNode root) {
 return root.firstChild != null;
 }
 public void print() {
 print(root, 1, words);
 }
 private static void print(TrieNode root, int indent, ArrayList<String> words) {
 if (root == null) {
 return;
 }
 for (int i=0; i < indent-1; i++) {
 System.out.print(" ");
 }
 if (root.substr != null) {
 System.out.println(" " + words.get(root.substr.wordIndex).substring(root.substr.startIndex, root.substr.endIndex));
 }
 for (int i=0; i < indent-1; i++) {
 System.out.print(" ");
 }
 System.out.print(" ---");
 System.out.println("(" + root.substr + ")");
 for (TrieNode ptr=root.firstChild; ptr != null; ptr=ptr.sibling) {
 for (int i=0; i < indent-1; i++) {
 System.out.print(" ");
 }
 System.out.println(" |");
 print(ptr, indent+1, words);
 }
 }
}
package structures;
/**
 * This class encapsulates the set of 3 indexes that point to a substring 
 * stored in an array of strings. The array of strings is the collection of
 * words that are indexed by the trie. Each node of the trie will have an
 * instance of Indexes.
 * 
 * Example: consider the words "have", "hit", "see", "data" stored in an
 * array in that order. Then, the substring "ave" in "have" will be indexed
 * by the triplet (0,1,3) ["have" is at position 0 in the array, 1 is the index
 * of character 'a' in "have" and 3 is the index of character 'e' in "have"]. 
 * Similarly, substring "ee" in the word "see" will be indexed by the triplet
 * (2,1,2). 
 * 
 * Substrings may be single characters, as in the first "a" in "data", 
 * which will be indexed by the triplet (3,1,1), or the second "a" in "data", 
 * which will be indexes by the triplet (3,3,3)
 * 
 * 
 * @author Sesh Venugopal
 *
 */
class Indexes {
 /**
 * Index into the word collection array.
 */
 int wordIndex;
 /**
 * Start index of substring in word.
 */
 short startIndex;
 /**
 * End index of substring in word.
 */
 short endIndex;
 /**
 * Initializes this instance with all indexes.
 * 
 * @param wordIndex Index of word in array of words
 * @param startIndex Starting index of substring
 * @param endIndex Ending index of substring
 */
 public Indexes(int wordIndex, short startIndex, short endIndex) {
 this.wordIndex = wordIndex;
 this.startIndex = startIndex;
 this.endIndex = endIndex;
 }
 /* (non-Javadoc)
 * @see java.lang.Object#toString()
 */
 public String toString() {
 return "(" + wordIndex + "," + startIndex + "," + endIndex + ")";
 }
 /* (non-Javadoc)
 * @see java.lang.Object#equals(java.lang.Object)
 */
 public boolean equals(Object o) {
 if (o == null || !(o instanceof Indexes)) {
 return false;
 }
 Indexes oi = (Indexes)o;
 return wordIndex == oi.wordIndex &&
 startIndex == oi.startIndex &&
 endIndex == oi.endIndex;
 }
}
/**
 * This class encapsulates a compressed trie node with fields for the following:
 * - an Indexes instance, pointing to the substring that is held at that node
 * - the first child node
 * - the sibling node
 * 
 * @author Sesh Venugopal
 *
 */
public class TrieNode {
 /**
 * Substring held at this node (could be a single character)
 */
 Indexes substr;
 /**
 * First child of this node
 */
 TrieNode firstChild;
 /**
 * Sibling of this node
 */
 TrieNode sibling;
 /**
 * Initializes this trie node with substring, first child, and sibling
 * 
 * @param substr Substring held at this node
 * @param firstChild First child of this node
 * @param sibling Sibling of this node
 */
 public TrieNode(Indexes substr, TrieNode firstChild, TrieNode sibling) {
 this.substr = substr;
 this.firstChild = firstChild;
 this.sibling = sibling;
 }
 /* (non-Javadoc)
 * @see java.lang.Object#toString()
 */
 public String toString() {
 return substr.toString();
 }
}

Question 2

This code is not self-contained. We have to guess at the interfaces of the critical classes TrieNode and Indexes.

Question 3

@PeterTaylor I've added them but I just assumed you could guess the function of both classes.

Question 4

Prefer interfaces to implementations

ArrayList<String> words;

Normally we'd prefer the interface to the implementation. In this case

List<String> words;

Now we can easily change the implementation so long as it still uses the same interface.

Very rarely we use the implementation directly because there is some method available on the implementation that is not available on the interface. However, that doesn't seem to be true here.

Type inference

 ArrayList<String> common = new ArrayList<String>();

Consider

 List<String> common = new ArrayList<>();

This saves a bit of typing and fixes things so that you don't accidentally use different types on either side of the equals sign.

Use methods for repeated code

 for (int i=0; i < indent-1; i++) {
 System.out.print(" ");
 }

You do this three times. Consider

 printIndent(indent);

Which you could implement

public void printIndent(int indent) {
 for (; indent > 0; indent--) {
 System.out.print(" ");
 }
}

This also saves doing indent - 1.

`findWord`

private String findWord(TrieNode root, boolean sub) {
 if (root.substr == null)
 return "3ROOTWORD";
 String word = words.get(root.substr.wordIndex);
 return sub ? word.substring(root.substr.startIndex, root.substr.endIndex) : word;
}
private String findTrieWord(TrieNode root) {
 return findWord(root, true);
}
private String findLiteralWord(TrieNode root) {
 return findWord(root, false);
}

What's sub?

Consider

private String findWord(TrieNode root, boolean sub) {
 return sub ? findTrieWord(root) : findLiteralWord(root);
}
private String findWord(TrieNode root) {
 if (root.substr == null) {
 return "3ROOTWORD";
 }
 return words.get(root.substr.wordIndex);
}
private String findTrieWord(TrieNode root) {
 return findWord(root).substring(root.substr.startIndex, root.substr.endIndex);
}
private String findLiteralWord(TrieNode root) {
 return findWord(root);
}

If the first method is not necessary, this would allow us to get rid of sub altogether. We could also get rid of findLiteralWord and just call findWord directly. Otherwise, this supports the original functionality without the extra check.

I prefer the block form of control structures like if statements.

Question 5

The use of an arraylist was a requirement, I couldn't change that. As well as the print method, it was something that was given and I couldn't remove from the class. As for type inference, nifty. Didn't know Java added capability for that! The tips are great and all but I was looking more on how to improve the performance as well. I was told that my implementation was similar to searching every prefix. That is more so my goal.

Question 6

KISS

My first thought when I started looking at the code was that the first few lines seemed unnecessary:

 /**
 * Words indexed by this trie.
 */
 ArrayList<String> words;

Surely the point of the trie is to store the words in the nodes, so why should they be stored elsewhere? If Trie.words is eliminated and the class Indexes is eliminated then TrieNode.substr can just store a direct substring of one of the strings. (Before anyone complains that this would increase memory usage: Java's String.substring shares the char[] with the original string, which it can do safely because strings are immutable in Java. The only memory increase would be a boolean per node to record whether the node ends a word, and that's balanced by the memory saving of 64-plus bytes per word from not having the list).

A search shouldn't force a second search

 String prefix = findlargestPrefix(root, word);
 TrieNode last = findNode(root.firstChild, word, prefix);

That recurses down the tree to find a node, and then returns enough information to recurse down the tree again looking for the same node. That's a clear sign that refactoring is necessary.

Special cases

 * @return List of all words in tree that start with the prefix, order of
 * words in list does not matter. If there is no word in the tree
 * that has this prefix, null is returned.

Why? If there is no word, returning a list containing no words would be more consistent and would simplify things for the caller.

 private String findWord(TrieNode root, boolean sub) {
 if (root.substr == null)
 return "3ROOTWORD";

What happens if I decide to call insertWord("3ROOTWORD")? Does the test suite cover that?

Check the documentation

 * @param words
 */
 public Trie() {

Looks like either a refactor or a copy-paste which didn't tidy up the JavaDoc. I would expect the IDE to give a warning.

 /**
 * Inserts a word into this trie. Converts to lower case before adding.
 * The word is first added to the words array list, then inserted into the trie.
 * 
 * @param word Word to be inserted.
 */
 public void insertWord(String word) {
 words.add(word);

I don't see the conversion to lower case. Is the doc buggy or the code?

Also, maybe because the posted code doesn't include the class header for the main class, I don't see a full explanation of the data structure. I see mention of it being a "compressed trie", and it's possible to deduce from the fact that a TrieNode contains a substring which isn't necessarily a prefix or a suffix roughly what is meant by this, but e.g. what guarantees do we get about common prefixes between siblings? And is there any constraint on the order of the singly linked list implemented with TrieNode.sibling? That information is essential for the maintenance programmer who has to deal with this code in two years' time, and they shouldn't have to try to reverse engineer the code to work it out.

Optimisation

I mentioned the singly linked list implemented with TrieNode.sibling. I would prefer to use a suitable Collection<TrieNode> children for two reasons: firstly, because there doesn't seem to be a good reason for reimplementing something which the standard library already supports; and secondly, because with the right collection (I'm thinking TreeSet<>) you could use binary chop rather than linear search.

Question 7

As I said I was required to use all the resources posted above. As for your search within a search, how would you modify it to keep the same functionality.

mdfst13 mdfst13 22.4k6 gold badges34 silver badges70 bronze badges · Answer 1 · 2017-01-11 03:30:01Z

Prefer interfaces to implementations

ArrayList<String> words;

Normally we'd prefer the interface to the implementation. In this case

List<String> words;

Now we can easily change the implementation so long as it still uses the same interface.

Very rarely we use the implementation directly because there is some method available on the implementation that is not available on the interface. However, that doesn't seem to be true here.

Type inference

 ArrayList<String> common = new ArrayList<String>();

Consider

 List<String> common = new ArrayList<>();

This saves a bit of typing and fixes things so that you don't accidentally use different types on either side of the equals sign.

Use methods for repeated code

 for (int i=0; i < indent-1; i++) {
 System.out.print(" ");
 }

You do this three times. Consider

 printIndent(indent);

Which you could implement

public void printIndent(int indent) {
 for (; indent > 0; indent--) {
 System.out.print(" ");
 }
}

This also saves doing indent - 1.

`findWord`

private String findWord(TrieNode root, boolean sub) {
 if (root.substr == null)
 return "3ROOTWORD";
 String word = words.get(root.substr.wordIndex);
 return sub ? word.substring(root.substr.startIndex, root.substr.endIndex) : word;
}
private String findTrieWord(TrieNode root) {
 return findWord(root, true);
}
private String findLiteralWord(TrieNode root) {
 return findWord(root, false);
}

What's sub?

Consider

private String findWord(TrieNode root, boolean sub) {
 return sub ? findTrieWord(root) : findLiteralWord(root);
}
private String findWord(TrieNode root) {
 if (root.substr == null) {
 return "3ROOTWORD";
 }
 return words.get(root.substr.wordIndex);
}
private String findTrieWord(TrieNode root) {
 return findWord(root).substring(root.substr.startIndex, root.substr.endIndex);
}
private String findLiteralWord(TrieNode root) {
 return findWord(root);
}

If the first method is not necessary, this would allow us to get rid of sub altogether. We could also get rid of findLiteralWord and just call findWord directly. Otherwise, this supports the original functionality without the extra check.

I prefer the block form of control structures like if statements.

The use of an arraylist was a requirement, I couldn't change that. As well as the print method, it was something that was given and I couldn't remove from the class. As for type inference, nifty. Didn't know Java added capability for that! The tips are great and all but I was looking more on how to improve the performance as well. I was told that my implementation was similar to searching every prefix. That is more so my goal.

Peter Taylor Peter Taylor 24.5k1 gold badge49 silver badges94 bronze badges · Answer 2 · 2017-01-12 11:50:35Z

KISS

My first thought when I started looking at the code was that the first few lines seemed unnecessary:

 /**
 * Words indexed by this trie.
 */
 ArrayList<String> words;

Surely the point of the trie is to store the words in the nodes, so why should they be stored elsewhere? If Trie.words is eliminated and the class Indexes is eliminated then TrieNode.substr can just store a direct substring of one of the strings. (Before anyone complains that this would increase memory usage: Java's String.substring shares the char[] with the original string, which it can do safely because strings are immutable in Java. The only memory increase would be a boolean per node to record whether the node ends a word, and that's balanced by the memory saving of 64-plus bytes per word from not having the list).

A search shouldn't force a second search

 String prefix = findlargestPrefix(root, word);
 TrieNode last = findNode(root.firstChild, word, prefix);

That recurses down the tree to find a node, and then returns enough information to recurse down the tree again looking for the same node. That's a clear sign that refactoring is necessary.

Special cases

 * @return List of all words in tree that start with the prefix, order of
 * words in list does not matter. If there is no word in the tree
 * that has this prefix, null is returned.

Why? If there is no word, returning a list containing no words would be more consistent and would simplify things for the caller.

 private String findWord(TrieNode root, boolean sub) {
 if (root.substr == null)
 return "3ROOTWORD";

What happens if I decide to call insertWord("3ROOTWORD")? Does the test suite cover that?

Check the documentation

 * @param words
 */
 public Trie() {

Looks like either a refactor or a copy-paste which didn't tidy up the JavaDoc. I would expect the IDE to give a warning.

 /**
 * Inserts a word into this trie. Converts to lower case before adding.
 * The word is first added to the words array list, then inserted into the trie.
 * 
 * @param word Word to be inserted.
 */
 public void insertWord(String word) {
 words.add(word);

I don't see the conversion to lower case. Is the doc buggy or the code?

Also, maybe because the posted code doesn't include the class header for the main class, I don't see a full explanation of the data structure. I see mention of it being a "compressed trie", and it's possible to deduce from the fact that a TrieNode contains a substring which isn't necessarily a prefix or a suffix roughly what is meant by this, but e.g. what guarantees do we get about common prefixes between siblings? And is there any constraint on the order of the singly linked list implemented with TrieNode.sibling? That information is essential for the maintenance programmer who has to deal with this code in two years' time, and they shouldn't have to try to reverse engineer the code to work it out.

Optimisation

I mentioned the singly linked list implemented with TrieNode.sibling. I would prefer to use a suitable Collection<TrieNode> children for two reasons: firstly, because there doesn't seem to be a good reason for reimplementing something which the standard library already supports; and secondly, because with the right collection (I'm thinking TreeSet<>) you could use binary chop rather than linear search.

As I said I was required to use all the resources posted above. As for your search within a search, how would you modify it to keep the same functionality.

Stack Exchange Network

Implementation of Trie Structure

2 Answers 2

Prefer interfaces to implementations

Type inference

Use methods for repeated code

`findWord`

KISS

A search shouldn't force a second search

Special cases

Check the documentation

Optimisation

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Implementation of Trie Structure

2 Answers 2

Prefer interfaces to implementations

Type inference

Use methods for repeated code

findWord

KISS

A search shouldn't force a second search

Special cases

Check the documentation

Optimisation

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`findWord`