Saving and restoring RadixTree object

Question 1

I have a large word-file which is fixed with over 240 000 words. I need to check if a specific word exists in this list. I thought that it would be a good idea to create a RadixTree for this problem.

The text file consists of 240 000 lines, each of which contains one word, which is read into a String. Each string is then parsed into a char[] to access the individual characters, each char is then placed into the tree separately.

Android puts a limit of 16-26MB of RAM usage per application if I am correct. When my application runs it consistently hits this mark during the initialization of the large Vocabulary object. It takes around 47 seconds to initialize the Vocabulary object and about 1/10 of a ms to see if a word exists in this Vocabulary. Right now I am afraid that I made a mistake by implementing the RadixTree data-structure since it costs too much RAM and takes way too long with 47 seconds...

Would this every be possible to use on a real device? E.g. that the speed would be under 0.5 seconds instead of 47.

My idea was to load the object to later restore it. However this is even slower, loading an object took about 5x longer. Oh dear.

`Vocabulary` Class

package nl.mprog.ghost.datastructure;
import android.app.Activity;
import android.content.Context;
import android.util.Log;
import java.io.Serializable;
import nl.mprog.ghost.enumeration.Language;
public class Vocabulary implements Serializable {
 RadixTree mRadixTree;
 Language mLanguage;
 public Vocabulary(Context context, Language language) {
 this.mLanguage = language;
 mRadixTree = new RadixTree(context, language);
 }
 public boolean isWord(String word) {
 return mRadixTree.isWord(word);
 }

`RadixTree` class

package nl.mprog.ghost.datastructure;
import android.content.Context;
import android.util.Log;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Scanner;
import nl.mprog.ghost.R;
import nl.mprog.ghost.enumeration.Language;
public class RadixTree implements Serializable {
 private RadixTreeNode mRoot = new RadixTreeNode();
 private ArrayList<RadixTreeNode> mActiveNodes;
 public RadixTree(Context context, Language language) {
 readInWords(getFileScanner(context, language));
 }
 public void readInWords(Scanner scanner) {
 String word;
 while(scanner.hasNextLine()) {
 word = scanner.nextLine();
 mRoot.insertWord(word, mRoot);
 }
 scanner.close();
 }
 public Scanner getFileScanner(Context context, Language language) {
 Scanner scanner;
 switch (language) {
 case DUTCH:
 scanner = new Scanner(context.getResources().openRawResource(R.raw.dutch));
 break;
 case ENGLISH:
 scanner = new Scanner(context.getResources().openRawResource(R.raw.english));
 break;
 default:
 scanner = new Scanner(context.getResources().openRawResource(R.raw.english));
 break;
 }
 return scanner;
 }
 public boolean isWord(String word) {
 RadixTreeNode activeNode = mRoot;
 char[] chars = word.toCharArray();
 for(int i = 0; i < chars.length; i++) {
 activeNode = activeNode.findNode(activeNode, chars[i]);
 if(activeNode == null) return false;
 }
 if(activeNode.mIsWordEnd) {
 return true;
 } else {
 return false;
 }
 }
}

`RadixTreeNode` class

import android.util.Log;
import java.io.Serializable;
public class RadixTreeNode implements Serializable {
 RadixTreeNode[] mChildren = new RadixTreeNode[0];
 char mCharacter = '0';
 boolean mIsWordEnd;
 public RadixTreeNode() {
 }
 public RadixTreeNode(char character) {
 this.mCharacter = character;
 }
 public RadixTreeNode(char character, boolean isWordEnd) {
 this.mCharacter = character;
 this.mIsWordEnd = isWordEnd;
 }
 public void insert(char character) {
 for(RadixTreeNode node : mChildren) {
 if(node.mCharacter == character) {
 return;
 }
 }
 addChild(new RadixTreeNode(character));
 }
 public void insert(char character, boolean isWordEnd) {
 for(RadixTreeNode node : mChildren) {
 if(node.mCharacter == character) {
 return;
 }
 }
 addChild(new RadixTreeNode(character, isWordEnd));
 }
 private void addChild(RadixTreeNode node) {
 if(mChildren.length == 0) {
 mChildren = new RadixTreeNode[1];
 mChildren[0] = node;
 } else {
 RadixTreeNode[] temp = new RadixTreeNode[mChildren.length + 1];
 System.arraycopy(mChildren, 0, temp, 0, mChildren.length);
 temp[mChildren.length] = node;
 mChildren = temp;
 }
 }
 public RadixTreeNode[] getChildren() {
 return mChildren;
 }
 public void insertWord(String word, RadixTreeNode root) {
 RadixTreeNode activeNode = root;
 char[] array = word.toCharArray();
 for(int i = 0; i < (array.length - 1); i++) {
 activeNode.insert(array[i]);
 activeNode = findNode(activeNode, array[i]);
 }
 activeNode.insert(array[array.length - 1], true);
 }
 public RadixTreeNode findNode(RadixTreeNode startPosition, char character) {
 RadixTreeNode[] children = startPosition.getChildren();
 if(children.length == 0) {
 return null;
 } else {
 for(RadixTreeNode rtn : children) {
 if(rtn.mCharacter == character) return rtn;
 }
 }
 return null;
 }
}

Question 2

I suggest you read this Stackoverflow question because it involves nearly the same problem and there is a good answer in there.

Question 3

Think in terms of interfaces instead of implementations

Use interface types instead of implementations whenever possible. So instead of this:

private ArrayList<RadixTreeNode> mActiveNodes;

Declare as a List<RadixTreeNode> instead:

private List<RadixTreeNode> mActiveNodes;

However, on closer look it turns out that mActiveNodes is unused. Then it's better to delete it.

Use `boolean` values directly

Instead of this:

 if(activeNode.mIsWordEnd) {
 return true;
 } else {
 return false;
 }

Write simply:

 return activeNode.mIsWordEnd;

Extract repetitive code to an another block when possible

The cases are very repetitive of this switch:

 switch (language) {
 case DUTCH:
 scanner = new Scanner(context.getResources().openRawResource(R.raw.dutch));
 break;
 case ENGLISH:
 scanner = new Scanner(context.getResources().openRawResource(R.raw.english));
 break;
 default:
 scanner = new Scanner(context.getResources().openRawResource(R.raw.english));
 break;
 }

It would be better to change the cases to assign to an id the appropriate R.raw.something value, and then create the scanner after the switch, typing all the scanner = new Scanner(...) stuff only once.

Extract common logic to helper methods

The two insert methods in RadixTreeNode are almost identical. Don't copy-paste code. Extract common logic to a private helper method.

Make member variables `final` when possible

Since mRoot is never reassigned, why not make it final?

Other practical tips

There is a simpler way to write this:

 mChildren = new RadixTreeNode[1];
 mChildren[0] = node;

Like this:

 mChildren = new RadixTreeNode[]{ node };

Question 4

This is awesome. I will add these changes right away and I understand the better reasoning. I think I will be a better programmer because of your post. That's the most a post can ever give!

Question 5

Awesome, I'm glad to hear that!

Question 6

Instead of using a raw array to store the children you can instead use a HashMap<Character, RadixTreeNode>. This will better optimize the allocations that you do.

Using a map will also remove all the for loops for searching a key in a layer.

Map<Character, RadixTreeNode> mChildren = new HashMap<Character, RadixTreeNode>();
public void insert(char character) {
 if(mChildren.containsKey(character)) {
 return;
 }
 addChild(new RadixTreeNode(character));
}
private void addChild(RadixTreeNode node) {
 mChildren.put(node.mCharacter, node);
}

Also you can get the individual characters in a String by using charAt, no need to get the char array.

Question 7

I tried the OP's code and then replaced the array with a HashMap. Unfortunately it seems that the HashMap used even more memory than the original. It also made the code run slower. My test input was a list of English words (about 170k words). Note: I ran this on a desktop computer not on an Android device.

Question 8

That's what I was afraid of @JS1 I think using less memory is more important right now..

Question 9

If you really only need to "check if a specific word exists in this list", then you need no radix tree. A HashMap would be the fastest, a sorted ArrayList would also perform pretty well.

If you can stick with English (or a 8-bit charset), then you could replace char by byte and save half the memory.

If you need to save even more, then store only the indexes and/or hashcodes of the words in memory and use a RandomAccessFile to get the word. The memory consumption could be something like 8 * 240000 bytes, i.e., 2 MB (not exactly negligible, but your data structure must need much more). A single disk access should be pretty fast (sub-millisecond, and if you're really lucky, your file may get cached by the OS outside of your poor 16 MB).

AFAIK Serializable is much slower than Parcelable.

janos janos 113k15 gold badges154 silver badges396 bronze badges · Accepted Answer · 2015-04-27 18:45:28Z

Think in terms of interfaces instead of implementations

Use interface types instead of implementations whenever possible. So instead of this:

private ArrayList<RadixTreeNode> mActiveNodes;

Declare as a List<RadixTreeNode> instead:

private List<RadixTreeNode> mActiveNodes;

However, on closer look it turns out that mActiveNodes is unused. Then it's better to delete it.

Use `boolean` values directly

Instead of this:

 if(activeNode.mIsWordEnd) {
 return true;
 } else {
 return false;
 }

Write simply:

 return activeNode.mIsWordEnd;

Extract repetitive code to an another block when possible

The cases are very repetitive of this switch:

 switch (language) {
 case DUTCH:
 scanner = new Scanner(context.getResources().openRawResource(R.raw.dutch));
 break;
 case ENGLISH:
 scanner = new Scanner(context.getResources().openRawResource(R.raw.english));
 break;
 default:
 scanner = new Scanner(context.getResources().openRawResource(R.raw.english));
 break;
 }

It would be better to change the cases to assign to an id the appropriate R.raw.something value, and then create the scanner after the switch, typing all the scanner = new Scanner(...) stuff only once.

Extract common logic to helper methods

The two insert methods in RadixTreeNode are almost identical. Don't copy-paste code. Extract common logic to a private helper method.

Make member variables `final` when possible

Since mRoot is never reassigned, why not make it final?

Other practical tips

There is a simpler way to write this:

 mChildren = new RadixTreeNode[1];
 mChildren[0] = node;

Like this:

 mChildren = new RadixTreeNode[]{ node };

This is awesome. I will add these changes right away and I understand the better reasoning. I think I will be a better programmer because of your post. That's the most a post can ever give!

Stack Exchange Network

Saving and restoring RadixTree object

`Vocabulary` Class

`RadixTree` class

`RadixTreeNode` class

3 Answers 3

Think in terms of interfaces instead of implementations

Use `boolean` values directly

Extract repetitive code to an another block when possible

Extract common logic to helper methods

Make member variables `final` when possible

Other practical tips

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Saving and restoring RadixTree object

Vocabulary Class

RadixTree class

RadixTreeNode class

3 Answers 3

Think in terms of interfaces instead of implementations

Use boolean values directly

Extract repetitive code to an another block when possible

Extract common logic to helper methods

Make member variables final when possible

Other practical tips

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`Vocabulary` Class

`RadixTree` class

`RadixTreeNode` class

Use `boolean` values directly

Make member variables `final` when possible