I wrote a simple huffman coding algorithm for learning and practice. I just used the technique given on the wikipedia page.
Could you tell me my missing points and mistakes?
Node.java
package ged.gont.bst.huffmancode;
public class Node implements Comparable<Node> {
private char letter;
private int freq;
private Node leftChild;
private Node rightChild;
public Node(char letter, int freq, Node leftChild, Node rightChild) {
this.letter = letter;
this.freq = freq;
this.leftChild = leftChild;
this.rightChild = rightChild;
}
public char getLetter() {
return letter;
}
public int getFreq() {
return freq;
}
public Node getLeftChild() {
return leftChild;
}
public Node getRightChild() {
return rightChild;
}
public boolean isLeaf() {
return this.leftChild == null && this.rightChild == null;
}
@Override
public int compareTo(Node arg0) {
return this.freq - arg0.freq;
}
}
HuffmanCode.java
package ged.gont.bst.huffmancode;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.PriorityQueue;
public class HuffmanCode {
private Node root;
Map<Character, String> charMap = new LinkedHashMap<>();
/**
* Encodes string in which most used characters have min codeword length
*
* @param inputString compressed string
* @return encoded string
* @throws IllegelArgumentException if inputString contains invalid ASCII character
*/
public String encode(String inputString) {
char[] letters = inputString.toCharArray();
Map<Character, Integer> charFreq = new LinkedHashMap<>();
PriorityQueue<Node> priorityQueue = new PriorityQueue<>();
String encodedString = "";
for (char c : letters) {
if ((int) c > 255) {
throw new IllegalArgumentException("Input contains invalid ASCII character");
}
if (charFreq.containsKey(c)) {
charFreq.put(c, charFreq.get(c) + 1);
} else {
charFreq.put(c, 1);
}
}
for (Character c : charFreq.keySet()) {
priorityQueue.offer(new Node(c, charFreq.get(c), null, null));
}
while (priorityQueue.size() > 1) {
Node leftChild = priorityQueue.remove();
Node rightChild = priorityQueue.remove();
priorityQueue.offer(new Node(Character.MIN_VALUE, leftChild.getFreq() + rightChild.getFreq(), leftChild, rightChild));
}
root = priorityQueue.remove();
generatePrefix(root, "");
for (int i = 0; i < inputString.length(); i++) {
encodedString += (charMap.get(inputString.charAt(i)));
}
return encodedString;
}
/**
* Generates prefix code in bit string format
*
* @param root
* @param code
*/
private void generatePrefix(Node root, String prefix) {
if (!root.isLeaf()) {
generatePrefix(root.getLeftChild(), prefix.concat("0"));
generatePrefix(root.getRightChild(), prefix.concat("1"));
} else {
charMap.put(root.getLetter(), prefix);
}
}
/**
* Decodes the given encoded string
*
* @param encodedString
* @return decoded string
*/
public String decode(String encodedString) {
String decodedString = "";
Node currentNode = root;
for (int i = 0; i < encodedString.length(); i++) {
if (encodedString.charAt(i) == '0') {
currentNode = currentNode.getLeftChild();
} else if (encodedString.charAt(i) == '1') {
currentNode = currentNode.getRightChild();
}
if (currentNode.isLeaf()) {
decodedString += currentNode.getLetter();
currentNode = root;
}
}
return decodedString;
}
}
TestHuffmanCode.java
package ged.gont.testbst.testhuffmancode;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.*;
import ged.gont.bst.huffmancode.*;
public class TestHuffmanCode {
static HuffmanCode huffmanCode;
static String inputString;
static String expectedEncodedString;
@BeforeAll
public static void init() {
huffmanCode = new HuffmanCode();
inputString = "A_DEAD_DAD_CEDED_A_BAD_BABE_A_BEADED_ABACA_BED";
expectedEncodedString = "1000011101001000110010011101100111001001000111110010011111011111100010001111110100111001001011111011101000111111001";
}
@Test
public void testEncode() {
assertEquals(expectedEncodedString, huffmanCode.encode(inputString));
}
@Test
public void testDecode() {
assertEquals(inputString, huffmanCode.decode(huffmanCode.encode(inputString)));
}
}
1 Answer 1
General
Classes not carefully designed for extension should be marked `final`.root
is problematic as a class-level variable. The API of HuffmanCode implies that decode()
depends only on the encoded string, and that one could call decode
more than once with different values. That will fail, though. The code only works if you call encode
directly before the corresponding decode
.
- It would be better to localize
root
and do whatever computation is necessary fordecode
in that method. This would mean changing the output ofencode
so that the tree can be rebuilt`. - Another option would be to separate out the building of the tree into its own method, and then have
encode
anddecode
take the tree as an argument. The downside here is needing to track both the tree and the encoded string. - A third option is to build the HuffmanCode class using a static
encode
factory method. The class would keep the tree as an instance variable. It could have a publictoString
method to return the encoded value, and a publicdecode
method to return the decoded value.
Which of these is best depends on your needs. But the API right now will result in cranky users calling encode()
multiple times and then trying to decode()
multiple times and getting nonsense results.
Node
This class is immutable, so all the variables can be marked as final
. This conveys the intent that they cannot be changed, and prevents accidental modification after construction time.
This class is only intended for use inside its package, and should therefore have default (package-private) access, not public.
This class has one constructor, but it should have two, one for leaf node construction and one for internal node construction.
arg0
might be better named as node
or otherNode
.
HuffmanCode
`charMap` should be `private`.Abbreviations are confusing and should be avoided. characterFrequency
would be preferable to charFreq
. Likewise charMap
could be characterMap
or characterEncodings
.
characters
can be inlined and still clear to readers.
letters
is not a good variable name, since the values are actually ASCII characters, not letters.
It's not necessary to convert a char to an int for numerical comparisons. The compiler can do it natively.
It would be preferable if the invalid ASCII character exception contained both the invalid character and the input string.
Setting the map value to 1 or adding 1 to the map value is cleaner using Map.merge()
. It would look like charFreq.merge(c, 1, Integer::sum);
The code could use an int[] instead of a map, since ASCII is a constrained set of ints from 0-255. Using an int[256]
and incrementing the values there should be more time and space efficient. Prefer whichever is easier to read unless you have a documented performance bottleneck.
It's preferable to localize variables so they're initialized as close to where they're first used as is reasonable. The convention of declaring them at the top of a method is a holdover from languages where it's not possible to declare them later.
When modifying String
s, it is preferable to use StringBuilder
(or StringBuffer
if thread safety is an issue). This is for efficiency reasons and clarity-of-intent reasons. The compiler will generally figure it out, but the hint doesn't hurt.
This block:
for (int i = 0; i < inputString.length(); i++) {
encodedString.append(characterEncodings.get(inputString.charAt(i)));
}
can be replaced with
for (char c : inputString.toCharArray()) {
encodedString.append(characterEncodings.get(c));
}
In generatePrefix
, it's easier to read if the non-negated case is in the if
part and the negation is in the else
.
generatePrefix
is a confusing name, because it's really generating the character encodings.
In generatePrefix
, prefix
might be better named encoding
. root
might be better named node
.
It would be preferable if decode()
handled invalid input more aggressively, rather than silently continuing if it doesn't see a 0
or 1
.
TestHuffmanCode
Only having one test is not especially useful. There should be tests for all boundary conditions, and a variety of interesting inputs. For instance, zero characters, one character, non-ASCII characters only, non-ascii characters embedded in ASCII characters, etc.-
1\$\begingroup\$
class-level variable
- what is that? Re. Java, I'd accept that designation forstatic
data members. While I'd preferHuffmanCoDec
, what's problematic about instance data memberroot
inHuffmanCode
? I agree that documentation of use and interdependency ofencode()
&decode()
, while present, is lacking. Then again:learning and practice
. \$\endgroup\$greybeard– greybeard2022年03月13日 20:04:33 +00:00Commented Mar 13, 2022 at 20:04 -
1\$\begingroup\$ @greybeard
HuffmanCode hc = new HuffmanCode();
String s1 = hc.encode("abcde");
String s2 = hc.encode("wxyz");
System.out.println(hc.decode(s1));
I would expect the last statement to actually decode s1. It will not, because root is now built to "wxyz". Either HuffmanCode is restricted to the string it's built from, in which case it needs to be constructed with the instance it's encoding, or it's not, in which case the implementation ofencode
needs to change to embed the node tree. \$\endgroup\$Eric Stein– Eric Stein2022年03月14日 00:27:40 +00:00Commented Mar 14, 2022 at 0:27 -
1\$\begingroup\$ And I fixed the typo. You can feel free to edit stuff like that directly in the future. :) \$\endgroup\$Eric Stein– Eric Stein2022年03月14日 11:55:10 +00:00Commented Mar 14, 2022 at 11:55
-
\$\begingroup\$ thank u so much for review and advise, i gonna try refactoring my code in accordance with your answer. \$\endgroup\$gedofgont– gedofgont2022年03月21日 19:58:01 +00:00Commented Mar 21, 2022 at 19:58
Explore related questions
See similar questions with these tags.
code.prefix("0")
? It doesn't look likecode
is defined (but it looks like it used to be a parameter and isn't anymore) \$\endgroup\$for learning and practice
(tagging reinventing-the-wheel is definite where a wheel is well-known). \$\endgroup\$