I have a large word-file which is fixed with over 240 000 words. I need to check if a specific word exists in this list. I thought that it would be a good idea to create a RadixTree
for this problem.
The text file consists of 240 000 lines, each of which contains one word, which is read into a String
. Each string
is then parsed into a char[]
to access the individual characters, each char
is then placed into the tree separately.
Android puts a limit of 16-26MB of RAM usage per application if I am correct. When my application runs it consistently hits this mark during the initialization of the large Vocabulary
object. It takes around 47 seconds to initialize the Vocabulary
object and about 1/10 of a ms to see if a word exists in this Vocabulary
. Right now I am afraid that I made a mistake by implementing the RadixTree
data-structure since it costs too much RAM and takes way too long with 47 seconds...
Would this every be possible to use on a real device? E.g. that the speed would be under 0.5 seconds instead of 47.
My idea was to load the object to later restore it. However this is even slower, loading an object took about 5x longer. Oh dear.
Vocabulary
Class
package nl.mprog.ghost.datastructure;
import android.app.Activity;
import android.content.Context;
import android.util.Log;
import java.io.Serializable;
import nl.mprog.ghost.enumeration.Language;
public class Vocabulary implements Serializable {
RadixTree mRadixTree;
Language mLanguage;
public Vocabulary(Context context, Language language) {
this.mLanguage = language;
mRadixTree = new RadixTree(context, language);
}
public boolean isWord(String word) {
return mRadixTree.isWord(word);
}
RadixTree
class
package nl.mprog.ghost.datastructure;
import android.content.Context;
import android.util.Log;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Scanner;
import nl.mprog.ghost.R;
import nl.mprog.ghost.enumeration.Language;
public class RadixTree implements Serializable {
private RadixTreeNode mRoot = new RadixTreeNode();
private ArrayList<RadixTreeNode> mActiveNodes;
public RadixTree(Context context, Language language) {
readInWords(getFileScanner(context, language));
}
public void readInWords(Scanner scanner) {
String word;
while(scanner.hasNextLine()) {
word = scanner.nextLine();
mRoot.insertWord(word, mRoot);
}
scanner.close();
}
public Scanner getFileScanner(Context context, Language language) {
Scanner scanner;
switch (language) {
case DUTCH:
scanner = new Scanner(context.getResources().openRawResource(R.raw.dutch));
break;
case ENGLISH:
scanner = new Scanner(context.getResources().openRawResource(R.raw.english));
break;
default:
scanner = new Scanner(context.getResources().openRawResource(R.raw.english));
break;
}
return scanner;
}
public boolean isWord(String word) {
RadixTreeNode activeNode = mRoot;
char[] chars = word.toCharArray();
for(int i = 0; i < chars.length; i++) {
activeNode = activeNode.findNode(activeNode, chars[i]);
if(activeNode == null) return false;
}
if(activeNode.mIsWordEnd) {
return true;
} else {
return false;
}
}
}
RadixTreeNode
class
import android.util.Log;
import java.io.Serializable;
public class RadixTreeNode implements Serializable {
RadixTreeNode[] mChildren = new RadixTreeNode[0];
char mCharacter = '0';
boolean mIsWordEnd;
public RadixTreeNode() {
}
public RadixTreeNode(char character) {
this.mCharacter = character;
}
public RadixTreeNode(char character, boolean isWordEnd) {
this.mCharacter = character;
this.mIsWordEnd = isWordEnd;
}
public void insert(char character) {
for(RadixTreeNode node : mChildren) {
if(node.mCharacter == character) {
return;
}
}
addChild(new RadixTreeNode(character));
}
public void insert(char character, boolean isWordEnd) {
for(RadixTreeNode node : mChildren) {
if(node.mCharacter == character) {
return;
}
}
addChild(new RadixTreeNode(character, isWordEnd));
}
private void addChild(RadixTreeNode node) {
if(mChildren.length == 0) {
mChildren = new RadixTreeNode[1];
mChildren[0] = node;
} else {
RadixTreeNode[] temp = new RadixTreeNode[mChildren.length + 1];
System.arraycopy(mChildren, 0, temp, 0, mChildren.length);
temp[mChildren.length] = node;
mChildren = temp;
}
}
public RadixTreeNode[] getChildren() {
return mChildren;
}
public void insertWord(String word, RadixTreeNode root) {
RadixTreeNode activeNode = root;
char[] array = word.toCharArray();
for(int i = 0; i < (array.length - 1); i++) {
activeNode.insert(array[i]);
activeNode = findNode(activeNode, array[i]);
}
activeNode.insert(array[array.length - 1], true);
}
public RadixTreeNode findNode(RadixTreeNode startPosition, char character) {
RadixTreeNode[] children = startPosition.getChildren();
if(children.length == 0) {
return null;
} else {
for(RadixTreeNode rtn : children) {
if(rtn.mCharacter == character) return rtn;
}
}
return null;
}
}
-
1\$\begingroup\$ I suggest you read this Stackoverflow question because it involves nearly the same problem and there is a good answer in there. \$\endgroup\$JS1– JS12015年04月27日 18:36:06 +00:00Commented Apr 27, 2015 at 18:36
3 Answers 3
Think in terms of interfaces instead of implementations
Use interface types instead of implementations whenever possible. So instead of this:
private ArrayList<RadixTreeNode> mActiveNodes;
Declare as a List<RadixTreeNode>
instead:
private List<RadixTreeNode> mActiveNodes;
However, on closer look it turns out that mActiveNodes
is unused.
Then it's better to delete it.
Use boolean
values directly
Instead of this:
if(activeNode.mIsWordEnd) { return true; } else { return false; }
Write simply:
return activeNode.mIsWordEnd;
Extract repetitive code to an another block when possible
The cases are very repetitive of this switch
:
switch (language) { case DUTCH: scanner = new Scanner(context.getResources().openRawResource(R.raw.dutch)); break; case ENGLISH: scanner = new Scanner(context.getResources().openRawResource(R.raw.english)); break; default: scanner = new Scanner(context.getResources().openRawResource(R.raw.english)); break; }
It would be better to change the cases to assign to an id the appropriate R.raw.something
value, and then create the scanner after the switch,
typing all the scanner = new Scanner(...)
stuff only once.
Extract common logic to helper methods
The two insert
methods in RadixTreeNode
are almost identical.
Don't copy-paste code. Extract common logic to a private helper method.
Make member variables final
when possible
Since mRoot
is never reassigned, why not make it final
?
Other practical tips
There is a simpler way to write this:
mChildren = new RadixTreeNode[1]; mChildren[0] = node;
Like this:
mChildren = new RadixTreeNode[]{ node };
-
1\$\begingroup\$ This is awesome. I will add these changes right away and I understand the better reasoning. I think I will be a better programmer because of your post. That's the most a post can ever give! \$\endgroup\$Joop– Joop2015年04月27日 18:57:35 +00:00Commented Apr 27, 2015 at 18:57
-
1\$\begingroup\$ Awesome, I'm glad to hear that! \$\endgroup\$janos– janos2015年04月27日 19:08:45 +00:00Commented Apr 27, 2015 at 19:08
Instead of using a raw array to store the children you can instead use a HashMap<Character, RadixTreeNode>
. This will better optimize the allocations that you do.
Using a map will also remove all the for loops for searching a key in a layer.
Map<Character, RadixTreeNode> mChildren = new HashMap<Character, RadixTreeNode>();
public void insert(char character) {
if(mChildren.containsKey(character)) {
return;
}
addChild(new RadixTreeNode(character));
}
private void addChild(RadixTreeNode node) {
mChildren.put(node.mCharacter, node);
}
Also you can get the individual characters in a String by using charAt
, no need to get the char array.
-
1\$\begingroup\$ I tried the OP's code and then replaced the array with a HashMap. Unfortunately it seems that the HashMap used even more memory than the original. It also made the code run slower. My test input was a list of English words (about 170k words). Note: I ran this on a desktop computer not on an Android device. \$\endgroup\$JS1– JS12015年04月27日 18:31:15 +00:00Commented Apr 27, 2015 at 18:31
-
\$\begingroup\$ That's what I was afraid of @JS1 I think using less memory is more important right now.. \$\endgroup\$Joop– Joop2015年04月27日 19:00:51 +00:00Commented Apr 27, 2015 at 19:00
If you really only need to "check if a specific word exists in this list", then you need no radix tree. A HashMap
would be the fastest, a sorted ArrayList
would also perform pretty well.
If you can stick with English (or a 8-bit charset), then you could replace char
by byte
and save half the memory.
If you need to save even more, then store only the indexes and/or hashcodes of the words in memory and use a RandomAccessFile
to get the word. The memory consumption could be something like 8 * 240000 bytes, i.e., 2 MB (not exactly negligible, but your data structure must need much more). A single disk access should be pretty fast (sub-millisecond, and if you're really lucky, your file may get cached by the OS outside of your poor 16 MB).
AFAIK Serializable
is much slower than Parcelable
.
Explore related questions
See similar questions with these tags.