Name Filtering using a Trie

Question 1

This question suggests a different (improved?) implementation for searching/filtering names based on a search query. This is based on the question here: Name filtering in Java

A Trie allows the prefix of terms to be used as a search mechanism. The Trie requires pre-processing the data in to a data structure that has very efficient lookups. Lookups in the order of \$O(n)\$ where \$n\$ is the size of the prefix, not the amount of data.

For the purpose of this task, the system indexes a number of names. Names like "Rob Smith". The goal will be to filter names in a text box, and to do that really fast. As the user types in a name, we need to filter them. As the user types, we can use the Trie as follows:

Rob Smith

As the user types, starting with 'r' for 'Rob', we can start at the root, then navigate to the 'r' entry. Any names under the 'r' all start with 'r'.

When the user types the 'o', we can go from the root to 'r' and then to 'o'. Any names under that 'o' node start with 'ro'. Wen the user types the 'b', we get to the node containing the name 'Rob Smith'.

The Trie tree above shows only one name, but there are thousands in there. 'Roland', 'Roger', etc. would all be names descendent from the r -> o branch.

Additionally, since people names have first names, and last names, all names are in multiple locations in the Trie. The name would be in one place based on 'Rob', and in another place based on 'Smith'.

By managing the query carefully, when the user enters r s (r space s), we can find all names that have a part starting with 'r', and the other name starting with 's'. This would find names like 'Rob Smith', and also 'Stan Rogers'.

The following is a class which 'ingests' and indexes the names in the Trie, then allows you to filter them based on search terms. It is this code that is the primary focus of this Code Review question.

Of particular interest is:

performance
scalability
memory usage

Of course, all other input is valued too.

package nameit;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
public class NameFilter {
 private static final Comparator<String> LONGEST_FIRST = new Comparator<String>() {
 @Override
 public int compare(String a, String b) {
 if (a.length() == b.length()) {
 return a.compareTo(b);
 }
 return Integer.compare(b.length(), a.length());
 }
 };
 private static final List<String> sortBySize(Iterable<String> names) {
 List<String> bySize = new ArrayList<>();
 for (String name : names) {
 if(name.isEmpty()) {
 continue;
 }
 bySize.add(name.toLowerCase());
 }
 bySize.sort(LONGEST_FIRST);
 return bySize;
 }
 private static final class NameControl {
 private final String[] nameParts;
 private final String name;
 public NameControl(String name) {
 this.name = name;
 nameParts = name.trim().split("\\s+");
 for (int i = 0; i < nameParts.length; i++) {
 nameParts[i] = nameParts[i].toLowerCase();
 }
 }
 public int size() {
 return nameParts.length;
 }
 public String getName() {
 return name;
 }
 public String getNamePart(int index) {
 return nameParts[index];
 }
 @Override
 public String toString() {
 return "Control for: " + name;
 }
 @Override
 public int hashCode() {
 return name.hashCode();
 }
 @Override
 public boolean equals(Object other) {
 return (other instanceof NameControl) && name.equals(((NameControl)other).name);
 }
 }
 private static final class NameByPart {
 private final int part;
 private final NameControl name;
 public NameByPart(int part, NameControl name) {
 super();
 this.part = part;
 this.name = name;
 }
 @Override
 public String toString() {
 return String.format("NameByPart: %d (%s) of %s", part, name.getNamePart(part), name);
 }
 @Override
 public int hashCode() {
 return name.hashCode();
 }
 @Override
 public boolean equals(Object other) {
 if (other instanceof NameByPart) {
 NameByPart their = (NameByPart)other;
 return part == their.part && name.equals(their.name);
 }
 return false;
 }
 public NameControl getNameControl() {
 return name;
 }
 }
 private static final class TrieNode {
 private final char key;
 private List<NameByPart> matchingNames = null;
 private TrieNode[] kids = null;
 private int kidSize = 0;
 public TrieNode(char key) {
 this.key = key;
 }
 public TrieNode getOrCreateChild(final char c) {
 if (kids == null) {
 kids = new TrieNode[8];
 }
 int ip = locateIndex(c);
 if (ip < 0) {
 // insert new node.
 ip = -ip - 1;
 if (kidSize == kids.length) {
 kids = Arrays.copyOf(kids, kidSize + 8);
 }
 System.arraycopy(kids, ip, kids, ip + 1, kidSize - ip);
 kids[ip] = new TrieNode(c);
 kidSize++;
 }
 return kids[ip];
 }
 public TrieNode getChild(final char c) {
 if (kids == null) {
 return null;
 }
 int ip = locateIndex(c);
 if (ip < 0) {
 return null;
 }
 return kids[ip];
 }
 private int locateIndex(final char c) {
 // binary search for the kid TrieNode with the right char key.
 // use the same -ip-1 convention from Arrays.binarySearch(...) to indicate a missing entry
 int left = 0;
 int right = kidSize - 1;
 while (right >= left) {
 int mid = left + (right - left) / 2;
 char k = kids[mid].key;
 if (k == c) {
 return mid;
 }
 if (c < k) {
 right = mid - 1;
 } else {
 left = mid + 1;
 }
 }
 return -left - 1;
 }
 public void addName(NameByPart nameByPart) {
 if (matchingNames == null) {
 matchingNames = new ArrayList<>();
 }
 matchingNames.add(nameByPart);
 }
 }
 private final Set<NameControl> allNames = new HashSet<>();
 private final TrieNode root = new TrieNode('0円');
 public NameFilter() {
 // nothing to do in constructor.
 }
 public String[] getSomeNames(int count) {
 int size = Math.min(allNames.size(), count);
 String[] names = new String[size];
 Iterator<NameControl> it = allNames.iterator();
 for (int i = 0; i < size; i++) {
 names[i] = it.next().getName();
 }
 return names;
 }
 public boolean addName(String name) {
 synchronized (allNames) {
 NameControl toAdd = new NameControl(name);
 if (!allNames.add(toAdd)) {
 return false;
 }
 for (int p = 0; p < toAdd.size(); p++) {
 String part = toAdd.getNamePart(p);
 TrieNode node = root;
 for (char c : part.toCharArray()) {
 node = node.getOrCreateChild(c);
 }
 node.addName(new NameByPart(p, toAdd));
 }
 return true;
 }
 }
 public String[] filterNames(Iterable<String> names) {
 final List<String> bySize = sortBySize(names);
 if (bySize.isEmpty()) {
 return new String[0];
 }
 Set<NameByPart> seen = new HashSet<>();
 Set<NameControl> candidates = null;
 for (String prefix : bySize) {
 candidates = refineFilter(prefix, candidates, seen);
 }
 List<String> results = new ArrayList<>(candidates.size());
 for (NameControl c : candidates) {
 results.add(c.getName());
 }
 results.sort(Comparator.naturalOrder());
 return results.toArray(new String[results.size()]);
 }
 private Set<NameControl> refineFilter(String prefix, Set<NameControl> candidates, Set<NameByPart> seen) {
 Set<NameControl> result = candidates == null ? new HashSet<>() : new HashSet<>(candidates.size());
 TrieNode node = root;
 for (char c : prefix.toCharArray()) {
 node = node.getChild(c);
 if (node == null) {
 // nothing matches the prefix.
 return result;
 }
 }
 Set<NameControl> roundHit = new HashSet<>();
 // node, and all its children have this prefix...
 walkTree(result, node, candidates, seen, roundHit);
 return result;
 }
 private void walkTree(final Set<NameControl> result, TrieNode node, Set<NameControl> candidates, Set<NameByPart> seen, Set<NameControl> roundHit) {
 if (node.matchingNames != null) {
 refineNames(result, node.matchingNames, candidates, seen, roundHit);
 }
 for (int i = 0; i < node.kidSize; i++) {
 walkTree(result, node.kids[i], candidates, seen, roundHit);
 }
 }
 private void refineNames(Set<NameControl> result, List<NameByPart> matchingNames, Set<NameControl> candidates, Set<NameByPart> seen, Set<NameControl> roundHit) {
 for (NameByPart found : matchingNames) {
 if (candidates == null || candidates.contains(found.getNameControl())) {
 if (seen.contains(found)) {
 // never reconsider a name part twice.
 continue;
 }
 if (!roundHit.add(found.getNameControl())) {
 // only visit a whole name once per round.
 continue;
 }
 seen.add(found); // mark that a specific name part was seen
 roundHit.add(found.getNameControl()); // mark that a name has been identified .
 result.add(found.getNameControl());
 }
 }
 }
}

The remaining code provides a way to use, play with, and visualize how the Filter above is used.

First, a Swing application:

App

package nameit;
import java.awt.BorderLayout;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.JList;
import javax.swing.JPanel;
import javax.swing.JScrollPane;
import javax.swing.JTextField;
import javax.swing.event.DocumentEvent;
import javax.swing.event.DocumentListener;
public class Namer extends JFrame {
 private static final long serialVersionUID = -7954710587171367489L;
 public static void main(String[] args) {
 NameFilter filter = new NameFilter();
 populateRandomNames(filter, 10000);
 Namer namer = new Namer(filter, filter.getSomeNames(30));
 namer.setSize(400, 800);
 namer.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
 namer.setVisible(true);
 }
 private static void populateRandomNames(NameFilter filter, int count) {
 // From http://www.census.gov/topics/population/genealogy/data/1990_census/1990_census_namefiles.html
 Path[] fnPaths = {
 Paths.get("names/dist.female.first"),
 Paths.get("names/dist.male.first"),
 };
 Path[] lnPaths = {
 Paths.get("names/dist.all.last"),
 };
 long start = System.nanoTime();
 NameBuilder builder = new NameBuilder(Arrays.asList(fnPaths), Arrays.asList(lnPaths));
 long read = System.nanoTime();
 for (String name : builder.getRandomNames(count)) {
 filter.addName(name);
 }
 long done = System.nanoTime();
 System.out.printf("Generated Name Trei: Input names in %.3fms, Trie in %.3fms%n", (read - start) / 1000000.0, (done - read) / 1000000.0);
 }
 private final class QueryListener implements DocumentListener {
 @Override
 public void changedUpdate(DocumentEvent arg0) {
 queryChanged();
 }
 @Override
 public void insertUpdate(DocumentEvent arg0) {
 queryChanged();
 }
 @Override
 public void removeUpdate(DocumentEvent arg0) {
 queryChanged();
 }
 }
 private final JTextField queryBox;
 private final JList<String> resultList;
 private final String[] friends;
 private final NameFilter filter;
 public Namer(NameFilter filter, String[] friends) {
 setTitle("Name Filter");
 this.friends = friends;
 this.filter = filter;
 Arrays.sort(friends);
 queryBox = new JTextField(50);
 queryBox.setText("");
 queryBox.getDocument().addDocumentListener(new QueryListener());
 JPanel queryPane = new JPanel(new BorderLayout());
 queryPane.add(new JLabel("Query for:"), BorderLayout.LINE_START);
 queryPane.add(queryBox, BorderLayout.CENTER);
 resultList = new JList<>();
 JScrollPane listScroller = new JScrollPane(resultList);
 BorderLayout pageLayout = new BorderLayout();
 JPanel page = new JPanel(pageLayout); 
 page.add(queryPane, BorderLayout.PAGE_START);
 page.add(listScroller, BorderLayout.CENTER);
 this.add(page);
 queryChanged();
 }
 private void queryChanged() {
 String query = queryBox.getText().trim();
 String[] names = query.isEmpty() ? friends : filter.filterNames(Arrays.asList(query.split("\\s+")));
 resultList.setListData(names);
 }
}

Finally, to generate some dummy names, I have used Census data:

last names
female first names
male first names

The following code creates random combinations.

package nameit;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Stream;
public class NameBuilder {
 private static final String extractName(final String source) {
 int pos = source.indexOf(' ');
 String name = pos < 0 ? source : source.substring(0,pos);
 if (name.length() < 2) {
 return name.toUpperCase();
 }
 name = name.substring(0, 1).toUpperCase() + name.substring(1).toLowerCase();
 return name;
 }
 private static final void addNames(Path source, List<String> dest) {
 try (Stream<String> lines = Files.lines(source)) {
 lines.map(s -> extractName(s)).forEach(name -> dest.add(name));
 } catch (IOException e) {
 // lazy
 e.printStackTrace();
 }
 }
 private static final String[] pullNames(Iterable<Path> sources) {
 List<String> collector = new ArrayList<>();
 for (Path source : sources) {
 addNames(source, collector);
 }
 return collector.toArray(new String[collector.size()]);
 }
 private final String[] firstNames;
 private final String[] lastNames;
 public NameBuilder(Iterable<Path> firstNameSouces, Iterable<Path> lastNameSources) {
 firstNames = pullNames(firstNameSouces);
 lastNames = pullNames(lastNameSources);
 }
 public String[] getRandomNames(final int size) {
 String[] names = new String[size];
 Random rand = new Random(size);
 for (int i = 0; i < size; i++) {
 names[i] = String.format("%s %s", firstNames[rand.nextInt(firstNames.length)], lastNames[rand.nextInt(lastNames.length)]);
 }
 return names;
 }
}

Question 2

Thread safety

NameParts (and thus child nodes) can still be added via NameBuilder.addName whilst refineNames is running. Your code is not completely thread safe. (partial credit to @AdriaanKoster for making me look in greater detail at thread-safety).

 public TrieNode getOrCreateChild(final char c) {
 if (kids == null) {
 kids = new TrieNode[8];
 }
 int ip = locateIndex(c);
 if (ip < 0) {
 // insert new node.
 ip = -ip - 1;
 if (kidSize == kids.length) {
 kids = Arrays.copyOf(kids, kidSize + 8);
 }
 System.arraycopy(kids, ip, kids, ip + 1, kidSize - ip);
 kids[ip] = new TrieNode(c);
 kidSize++;
 }
 return kids[ip];
 }

I have absolutely no idea what's going on here.

If there are no kids, make new ones... Okay, that's the create part...

And then... ip (not an ip address, so what is it?) is locateIndex...?

 private int locateIndex(final char c) {
 // binary search for the kid TrieNode with the right char key.
 // use the same -ip-1 convention from Arrays.binarySearch(...) to indicate a missing entry
 int left = 0;
 int right = kidSize - 1;
 while (right >= left) {
 int mid = left + (right - left) / 2;
 char k = kids[mid].key;
 if (k == c) {
 return mid;
 }
 if (c < k) {
 right = mid - 1;
 } else {
 left = mid + 1;
 }
 }
 return -left - 1;
 }

It.. finds something? Okay, but if you get less than 0, it was not found ... wait, what, Arrays.binarySearch?

Huh?

What does locateIndex even do?

Well, at least we need to add a new node in case it wasn't found. That's what < 0 signals anyway. So then we... set the new index to ... the negative... what the hell?! Why are you reusing an error signal (whose value is arbitrary)? I know you wouldn't do that, so the value is not arbitrary but I still have no idea what it's for. I'm just gonna assume you've translated it into the new spot to put your created node.

Then we check if... kidSize... ... if the array has enough space? And if not, then we... add some space by... copyOf. Yeah! Okay, after that...

Create the child node and stick it in, right...? Wait, we're... doing some serious shifting. From kids to kids starting at ip and placing them at ip+1, but only... Ah, we're shifting the elements of the array by one. And then we stick the new element in the freed position. And then the used element counter goes up by one. And then we return the new element.

I just spent 10 minutes reading your code in an attempt to understand it. I still haven't succeeded.

I think I know why I have troubles decoding your code. It's because it's at the wrong abstraction level.

We're gonna retrieve a child TrieNode, or create one if we don't have the one yet. But then the function ends up actually

Checking if we have children at all
Instantiating a new array with some magic constant
Search for where it is or should be
If it's not there...
Then we're gonna check if we have space
And make some space if there's none
Then we're gonna shift all the boxes to the side
And put our new box there
And then we can say okay I've got your box

This should be documented or split up.

I want to see something like this: (comments for mid code explanation)

 public TrieNode getOrCreateChild(final char c) {
 //no existance check! Make an array of 0 items instead.
 int ip = locateIndex(c);//this is good, although I don't know what ip is.
 if (ip < 0) {
 // insert new node.
 return createChild(-ip -1, c);//Creation can be delegated. In turn, creation should delegate array expansion.
 }
 return kids[ip];
 }

Ahhh. Find or create child. Nice.

I'm still stuck with -ip -1 but that's because I have no idea what it's supposed to do.

If performance is REALLY of key issue as such that you can't split up functions, start putting in some comments. Because the next programmer will have to maintain this and he will not be able to.

Additionally, I saw these:

 public NameControl(String name) {
 this.name = name;
 nameParts = name.trim().split("\\s+");
 for (int i = 0; i < nameParts.length; i++) {
 nameParts[i] = nameParts[i].toLowerCase();
 }
 }

Any reason you can't do name.trim().toLowerCase().split("\\s+"); and be done with it?

System.out.printf("Generated Name Trei: Input names in %.3fms, Trie in %.3fms%n", (read - start) / 1000000.0, (done - read) / 1000000.0);

You've got a typo here - "Trei" -> "Trie".

 for (String name : builder.getRandomNames(count)) {
 filter.addName(name);
 }

Consider adding a helper method boolean addNames(Collection<String> names) to facilitate adding multiple names.

 Set<NameControl> roundHit = new HashSet<>();
 // node, and all its children have this prefix...
 walkTree(result, node, candidates, seen, roundHit);
 return result;

You create a set, give it a name, but don't do anything with it after the function returns. Do you need to keep a reference to it?

 String name = pos < 0 ? source : source.substring(0,pos);

I recommend hitting the auto-formatter in your IDE from time to time. Specifically; before committing to a repository and before posting questions to Code Review.

Question 3

-ip-1 gives you the index the would have had, if it existed. It's a terrible solution to the problem of needed to return two bits of information, but still wanting your return just an int. OP should not use in their implementation just because that is how it is done in a similar method.

Question 4

To clarify, when Arrays.binarySearch doesn't find the desired key, it negates the position where it should be inserted, subtracts 1 (so the 0 index will be negative), and returns that value.

Question 5

My comments focus mainly on readability, so sorry if this is of no interest.

The method sortBySize does not have to be static and can be moved just below filterNames, the only method that uses it.

The default constructor does nothing and there are no other constructors so it can be removed.

I would move the helper classes into separate top level classes to create a more natural reading flow.

Passing seen and roundHit around makes the methods hard to understand. These can be converted to fields. I suggest moving the refining logic into a separate RefineFilter class.

Several methods can be extracted or renamed to make the code easier to understand, see my refactored code below.

The argument passed to filterNames is called names. This is misleading since it actually represents name parts.

The argument passed to filterNames can be changed from Iterable to List. List is a common enough interface, there is no direct need to change to a higher type here. This could be confusing for readers. Also I need access to the size for the next point.

In sortBySize you can initialize the ArrayList with the number of name parts received. Normally there should be no empty name parts received so the capacity will be correct. If an empty namePart is received the only impact will be slight over sizing of the list.

NameBuilder misleadingly suggests the builder pattern. It is better to avoid such unintended references because it can confuse the readers. I suggest the name NameStore.

My suggested renames:

Namer -> TestGui
NameBuilder -> NameStore
NameControl -> Name
NameByPart -> Part
NameFilter -> NameTrie
TrieNode -> NameTrieNode
anonymous Comparator -> LongestFirstComparator

My refactored code looks like this:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
public class NameTrie {
 private static final Comparator<String> LONGEST_FIRST = new LongestFirstComparator();
 private final Set<Name> names = new HashSet<>();
 private final NameTrieNode root = new NameTrieNode('0円');
 public void addName(final String name) {
 synchronized (names) {
 final Name nameToAdd = new Name(name);
 if (names.add(nameToAdd)) {
 indexName(nameToAdd);
 }
 }
 }
 private void indexName(final Name name) {
 for (int p = 0; p < name.size(); p++) {
 final String part = name.getNamePart(p);
 final NameTrieNode node = lookupNode(part);
 node.addName(new Part(p, name));
 }
 }
 private NameTrieNode lookupNode(final String part) {
 NameTrieNode node = root;
 for (final char c : part.toCharArray()) {
 node = node.getOrCreateChild(c);
 }
 return node;
 }
 public String[] getSomeNames(final int count) {
 final int size = Math.min(names.size(), count);
 final String[] someNames = new String[size];
 final Iterator<Name> it = names.iterator();
 for (int i = 0; i < size; i++) {
 someNames[i] = it.next().getName();
 }
 return someNames;
 }
 public String[] findMatchingNames(final List<String> nameParts) {
 final List<String> namePartsBySize = sortDescendingBySizeAndConvertToLowerCase(nameParts);
 final Set<Name> names = matchNames(namePartsBySize);
 return convertToStringArrayAndSortNatural(names);
 }
 private final List<String> sortDescendingBySizeAndConvertToLowerCase(final List<String> names) {
 final List<String> namesBySize = new ArrayList<>(names.size());
 for (final String name : names) {
 if (!name.isEmpty()) {
 namesBySize.add(name.toLowerCase());
 }
 }
 namesBySize.sort(LONGEST_FIRST);
 return namesBySize;
 }
 private Set<Name> matchNames(final List<String> nameParts) {
 Set<Name> names = new HashSet<>();
 if (!nameParts.isEmpty()) {
 final RefineFilter refineFilter = new RefineFilter(root);
 for (final String prefix : nameParts) {
 names = refineFilter.execute(prefix, names);
 }
 }
 return names;
 }
 private String[] convertToStringArrayAndSortNatural(final Set<Name> names) {
 final List<String> result = new ArrayList<>(names.size());
 for (final Name c : names) {
 result.add(c.getName());
 }
 result.sort(Comparator.naturalOrder());
 return result.toArray(new String[result.size()]);
 }
}

And:

public class RefineFilter {
 private final NameTrieNode root;
 private final Set<Part> seen = new HashSet<>();
 private final Set<Name> roundHit = new HashSet<>();
 public RefineFilter(final NameTrieNode root) {
 this.root = root;
 }
 public Set<Name> execute(final String prefix, final Set<Name> candidates) {
 final Set<Name> result = new HashSet<>(candidates.size());
 NameTrieNode node = root;
 for (final char c : prefix.toCharArray()) {
 node = node.getChild(c);
 if (node == null) {
 // nothing matches the prefix.
 return result;
 }
 }
 // node, and all its children have this prefix...
 walkTree(result, node, candidates);
 return result;
 }
 private void walkTree(final Set<Name> result, final NameTrieNode node, final Set<Name> candidates) {
 if (node.getMatchingNames() != null) {
 refineNames(result, node.getMatchingNames(), candidates);
 }
 for (int i = 0; i < node.getKidSize(); i++) {
 walkTree(result, node.getKids()[i], candidates);
 }
 }
 private void refineNames(final Set<Name> result, final List<Part> matchingNames, final Set<Name> candidates) {
 for (final Part found : matchingNames) {
 if (candidates == null || candidates.contains(found.getNameControl())) {
 if (seen.contains(found)) {
 // never reconsider a name part twice.
 continue;
 }
 if (!roundHit.add(found.getNameControl())) {
 // only visit a whole name once per round.
 continue;
 }
 seen.add(found); // mark that a specific name part was seen
 roundHit.add(found.getNameControl()); // mark that a name has been identified .
 result.add(found.getNameControl());
 }
 }
 }
}

Question 6

Making the seen/roundHit fields means that the search method can only be called from a single thread at a time, and is no longer re-entrant. I agree that removing the static nested classes may help, if the code was in it's own package and the permissions were set appropriately. +1. First post to Code Review after being here for a year - nice to finally coax you out ;-)

Question 7

You can make this approach thread safe by moving the refineFilter logic and the seen/roundHit fields into a separate RefineFilter class. Then in filterNames create a new RefineFilter instance. See my updated code (in a sec).

Question 8

NameParts (and thus child nodes) can still be added via NameBuilder.addName whilst RefineFilter is running. It's not thread safe.

Question 9

@Pimgd: good point. This was also the case in the original code though.

Question 10

Interesting... added it to my answer.

Question 11

I'm here to smack you in the head about the implementation of filterNames. In this method you just used one of the code patterns that I hold a huge grudge, not once, but twice!

Set<NameByPart> seen = new HashSet<>();
Set<NameControl> candidates = null;
for (String prefix : bySize) {
 candidates = refineFilter(prefix, candidates, seen);
}

One of the most annoying things that I've been increasingly hating on, my yet short, programmer career, is calling a method, that receives a list and adds elements to that list. Those methods should be banned (unless they are recursive like your walkTree, that's totally fine). Every method that does this should receive enough information so it can return a list with all elements needed. In your case you could do the following:

public Iterable<String> filterNames(Iterable<String> names) {
 final List<String> bySize = sortBySize(names);
 if (bySize.isEmpty()) {
 return new String[0];
 }
 Set<NameControl> candidates = refineFilter(bySize);
 //... (to continue)
}
private Set<NameControl> refineFilter(List<String> bySize) {
 Set<NameControl> result = new HashSet<>();
 Set<NameControl> candidates = new HashSet<>();
 Set<NameControl> seen = new HashSet<>();
 //Edit: I noticed that I screwed the original algorithm logic.
 for(String prefix : bySize){
 TrieNode node = getNodeWithPrefix(root, prefix);
 if(node == null){
 continue;
 }
 Set<NameControl> roundHit = new HashSet<>();
 // node, and all its children have this prefix...
 walkTree(result, node, candidates, seen, roundHit);
 }
 return result;
}
private Node getNodeWithPrefix(Node root, String prefix){
 TrieNode node = root;
 for (char c : prefix.toCharArray()) {
 node = node.getChild(c);
 if (node == null) {
 return null;
 }
 }
 return node;
}

Much better, isn't it?

In the end end of this method you're doing:

return results.toArray(new String[results.size()]);

This is one of the things that annoys me too, couldn't you return the List, or an Iterable? If you could then there wouldn't be the need to copy the elements to the array. Programmers those days abuse the .ToArray() of linq in .Net when they could perfectly return an IEnumerable, abstracting the collection used behind the scenes.

Question 12

You have String prefix as input parameter and another as for iterator.

Question 13

@Heslacher Oops, right, thanks for the tip - too much copy pasting :p. Fixed.

Question 14

How to convert candidates into List<String>? Right now I'm staring at a nice compiler error for type mismatch.

Question 15

@Pimgd List<String> is an interface that ArrayList implements. It is present in package java.util.List. I just had a bit of luck because I'm so attached to writing List in .Net ;), though :p.

Question 16

@rolfl I'm afraid that isn't true. My code as is is equivalent to yours. Everytime tou call refineFilter you don't store roundHit anywhere. You are always creating a new empty set for it, and so am I. Althought you could write only walkTree(result, node, candidates, seen, new HashSet<NameControl>());

Question 17

NameFilter.filterNames()

If you change the initialization of Set<NameControl> candidates before calling

 Set<NameControl> candidates = null;
 for (String prefix : bySize) {
 candidates = refineFilter(prefix, candidates, seen);
 }

to Set<NameControl> candidates = new HashSet<>(); you can remove the tenary condition at the refineFilter() method. This would also affect NameFilter.refineNames() in a good way.

NameFilter.refineNames()

After initializing candidates like mention above, this method would gain from using a guard clause

if (candidates.isEmpty()) {
 return;
}

The second call to roundHit.add(found.getNameControl()); can be removed. I would also change if (seen.contains(found)) to if (!seen.add(found))

Resulting in

private void refineNames(Set<NameControl> result, List<NameByPart> matchingNames, Set<NameControl> candidates, Set<NameByPart> seen, Set<NameControl> roundHit) {
 if (candidates.isEmpty()) {
 return;
 }
 for (NameByPart found : matchingNames) {
 if (candidates.contains(found.getNameControl())) {
 if (!seen.add(found)) {
 // never reconsider a name part twice.
 continue;
 }
 if (!roundHit.add(found.getNameControl())) {
 // only visit a whole name once per round.
 continue;
 }
 result.add(found.getNameControl());
 }
 }
}

NameControl.getNamePart()

This method should throw an ArgumentOutOfRangeException for the case that the passed int index would be < 0 ||>= nameParts.length

Pimgd Pimgd 22.5k5 gold badges68 silver badges144 bronze badges · Accepted Answer · 2015-01-23 11:48:30Z

Thread safety

NameParts (and thus child nodes) can still be added via NameBuilder.addName whilst refineNames is running. Your code is not completely thread safe. (partial credit to @AdriaanKoster for making me look in greater detail at thread-safety).

 public TrieNode getOrCreateChild(final char c) {
 if (kids == null) {
 kids = new TrieNode[8];
 }
 int ip = locateIndex(c);
 if (ip < 0) {
 // insert new node.
 ip = -ip - 1;
 if (kidSize == kids.length) {
 kids = Arrays.copyOf(kids, kidSize + 8);
 }
 System.arraycopy(kids, ip, kids, ip + 1, kidSize - ip);
 kids[ip] = new TrieNode(c);
 kidSize++;
 }
 return kids[ip];
 }

I have absolutely no idea what's going on here.

If there are no kids, make new ones... Okay, that's the create part...

And then... ip (not an ip address, so what is it?) is locateIndex...?

 private int locateIndex(final char c) {
 // binary search for the kid TrieNode with the right char key.
 // use the same -ip-1 convention from Arrays.binarySearch(...) to indicate a missing entry
 int left = 0;
 int right = kidSize - 1;
 while (right >= left) {
 int mid = left + (right - left) / 2;
 char k = kids[mid].key;
 if (k == c) {
 return mid;
 }
 if (c < k) {
 right = mid - 1;
 } else {
 left = mid + 1;
 }
 }
 return -left - 1;
 }

It.. finds something? Okay, but if you get less than 0, it was not found ... wait, what, Arrays.binarySearch?

Huh?

What does locateIndex even do?

Well, at least we need to add a new node in case it wasn't found. That's what < 0 signals anyway. So then we... set the new index to ... the negative... what the hell?! Why are you reusing an error signal (whose value is arbitrary)? I know you wouldn't do that, so the value is not arbitrary but I still have no idea what it's for. I'm just gonna assume you've translated it into the new spot to put your created node.

Then we check if... kidSize... ... if the array has enough space? And if not, then we... add some space by... copyOf. Yeah! Okay, after that...

Create the child node and stick it in, right...? Wait, we're... doing some serious shifting. From kids to kids starting at ip and placing them at ip+1, but only... Ah, we're shifting the elements of the array by one. And then we stick the new element in the freed position. And then the used element counter goes up by one. And then we return the new element.

I just spent 10 minutes reading your code in an attempt to understand it. I still haven't succeeded.

I think I know why I have troubles decoding your code. It's because it's at the wrong abstraction level.

We're gonna retrieve a child TrieNode, or create one if we don't have the one yet. But then the function ends up actually

Checking if we have children at all
Instantiating a new array with some magic constant
Search for where it is or should be
If it's not there...
Then we're gonna check if we have space
And make some space if there's none
Then we're gonna shift all the boxes to the side
And put our new box there
And then we can say okay I've got your box

This should be documented or split up.

I want to see something like this: (comments for mid code explanation)

 public TrieNode getOrCreateChild(final char c) {
 //no existance check! Make an array of 0 items instead.
 int ip = locateIndex(c);//this is good, although I don't know what ip is.
 if (ip < 0) {
 // insert new node.
 return createChild(-ip -1, c);//Creation can be delegated. In turn, creation should delegate array expansion.
 }
 return kids[ip];
 }

Ahhh. Find or create child. Nice.

I'm still stuck with -ip -1 but that's because I have no idea what it's supposed to do.

If performance is REALLY of key issue as such that you can't split up functions, start putting in some comments. Because the next programmer will have to maintain this and he will not be able to.

Additionally, I saw these:

 public NameControl(String name) {
 this.name = name;
 nameParts = name.trim().split("\\s+");
 for (int i = 0; i < nameParts.length; i++) {
 nameParts[i] = nameParts[i].toLowerCase();
 }
 }

Any reason you can't do name.trim().toLowerCase().split("\\s+"); and be done with it?

System.out.printf("Generated Name Trei: Input names in %.3fms, Trie in %.3fms%n", (read - start) / 1000000.0, (done - read) / 1000000.0);

You've got a typo here - "Trei" -> "Trie".

 for (String name : builder.getRandomNames(count)) {
 filter.addName(name);
 }

Consider adding a helper method boolean addNames(Collection<String> names) to facilitate adding multiple names.

 Set<NameControl> roundHit = new HashSet<>();
 // node, and all its children have this prefix...
 walkTree(result, node, candidates, seen, roundHit);
 return result;

You create a set, give it a name, but don't do anything with it after the function returns. Do you need to keep a reference to it?

 String name = pos < 0 ? source : source.substring(0,pos);

I recommend hitting the auto-formatter in your IDE from time to time. Specifically; before committing to a repository and before posting questions to Code Review.

-ip-1 gives you the index the would have had, if it existed. It's a terrible solution to the problem of needed to return two bits of information, but still wanting your return just an int. OP should not use in their implementation just because that is how it is done in a similar method.
To clarify, when Arrays.binarySearch doesn't find the desired key, it negates the position where it should be inserted, subtracts 1 (so the 0 index will be negative), and returns that value.

Stack Exchange Network

Name Filtering using a Trie

4 Answers 4

Thread safety

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Name Filtering using a Trie

4 Answers 4

Thread safety

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions