I am working on a fast implementation of simple Huffman code compression of text. The idea was to write it using only the standard library, but I can't seem to find a way to make it faster. I am also looking for advice on how to write it in a more "Pythonic" way, without sacrificing speed.
I am aware that if I want speed I shouldn't use Python, but I've taken it as an exercise to test pure Python performance.
from collections import Counter, defaultdict
def huffman_compress(input_file, output_file, encoding='utf8'):
"""This functions compresses a txt file using Huffman code compression."""
# Store the text in memory since it is faster than reading twice
text = open(input_file, "r", encoding=encoding).read()
# Count the times each letter appears on the text
letter_freq = Counter(text)
alphabet = defaultdict(str)
# Obtain the huffman code for each letter
while len(letter_freq) > 1:
(letter1, count1), (letter2, count2) = letter_freq.most_common(2)
letter_freq[letter1+letter2] = count1 + count2
for bit, combination in enumerate([letter1, letter2]):
for letter in combination:
alphabet[letter] = str(bit) + alphabet[letter]
del letter_freq[combination]
# Save the transformation to ascii for possible the 256 characters
bit_to_ascii = {format(x, '08b'): chr(x) for x in range(256)}
with open(output_file, 'w') as output:
# Transform each letter to its huffman code
me = ''.join(alphabet[ch] for ch in text)
# Add 0's so that the string is multiple of 8
extra_bits = 8 - len(me) % 8
me += extra_bits * '0'
# Write the number of letters compressed and the number of bits added
output.write(f'{chr(len(alphabet))}{extra_bits}')
# Write the letters compressed and their huffman code for the decompression
output.write('|'.join(c for item in alphabet.items() for c in item))
# Transform the huffman bits to ascii and save them on the compressed file.
output.write(''.join(bit_to_ascii[me[j:j+8]] for j in range(0, len(me), 8)))
- How long does it take? Have you tried writing the same algorithm in a different language to check the difference in time? – pacmaninbw, Aug 22, 2020 at 12:03
- Use a profiler from the stdlib to see where your code is spending most of its time. Then you know where to optimize. – RootTwo, Aug 22, 2020 at 17:51 (an example invocation follows these comments)
- As an aside: it's probably better to keep the input reading, the output writing, and the Huffman encoding algorithm separated. Your function huffman_compress only needs to take text as input and return text as well. – Julien Rousé, Aug 22, 2020 at 18:25
- There should also be a decompression function. Right now I can't simply do a little test to see whether it might be correct at all. And why would we care about speed when we don't even know whether it's correct? – superb rain, Aug 22, 2020 at 21:14
- Do you have an example file to use for testing (input and expected output)? – Martin Thoma, Aug 23, 2020 at 14:55
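Pointing the stdlib profiler at the script needs no code changes; for instance (reusing the file names that appear in the first answer below):

$ python -m cProfile -s cumulative huffman.py war-peace.txt encoded

The -s cumulative flag sorts the report by cumulative time, which makes it easy to see whether the time goes into building the code table or into writing the output.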
2 Answers
I started with your code, added sys.argv so I could pass file paths on the command line, downloaded a big text file (War and Peace, of course), ran your program, and checked file sizes:
$ curl 'https://www.gutenberg.org/files/2600/2600-0.txt' -o war-peace.txt -k
$ time python huffman.py war-peace.txt encoded
real 0m11.052s
user 0m10.462s
sys 0m0.389s
$ ls -lh
-rw-r--r-- 1 fmc staff 40M Aug 24 13:51 encoded
-rw-r--r-- 1 fmc staff 3.3M Aug 24 13:50 war-peace.txt
Looks like you have inadvertently invented an expansion algorithm: it creates a file roughly 12x bigger! Also, 11 seconds seems slow to process a meager 40M of text. Normally Python can crunch data of that size much more quickly.
I temporarily assigned a short string (huffman) to the text variable, bypassing file reading, and printed out some of your intermediate variables. Although letter_freq looked fine, alphabet was the opposite of what we want:
f 00000 # The most frequent letter has the longest code.
h 00001
u 0001
m 001
a 01
n 1
The Huffman algorithm repeatedly combines the two elements with the lowest frequency, but you are doing the opposite: combining the two most frequent. So I tweaked your code like this:
(letter1, count1), (letter2, count2) = letter_freq.most_common()[:-3:-1]
With that change, alphabet at least looks more plausible, the output file ends up being smaller than the input file (although not by as much as I expect, so there are probably other problems in your code), and it finishes in about 1 second rather than 11 (most likely because it's writing a much smaller output file).
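For comparison, here is one conventional way to build the code table in pure Python, using a heap of partial trees. It is only a sketch of the textbook algorithm (the name huffman_code_table and the tuple layout are my own, not from the question), not a drop-in replacement for the function above:

import heapq
from collections import Counter

def huffman_code_table(text):
    """Build a prefix code by repeatedly merging the two least frequent nodes."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Degenerate case: a single distinct character still needs one bit.
        return {letter: '0' for letter in freq}
    # Each heap entry: (frequency, tie_breaker, {letter: code_so_far}).
    heap = [(count, i, {letter: ''}) for i, (letter, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        count1, _, codes1 = heapq.heappop(heap)
        count2, _, codes2 = heapq.heappop(heap)
        # Prepend '0' to every code in the cheaper subtree and '1' in the other.
        merged = {letter: '0' + code for letter, code in codes1.items()}
        merged.update({letter: '1' + code for letter, code in codes2.items()})
        heapq.heappush(heap, (count1 + count2, tie, merged))
        tie += 1
    return heap[0][2]

Each heappop costs O(log n), whereas every call to most_common has to scan the whole counter again, and the frequent letters end up with the shortest codes instead of the longest.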
Some suggestions:
- Focus on correctness first. Worry about speed later, and only if it truly matters (and it might, if for no other reason than the educational one).
- Algorithms and side effects don't mix. Reorganize your code to facilitate testing and debugging. The huffman_compress() function itself should not concern itself with file reading and writing. It should take a blob of text and return a blob of bytes, period. Highly algorithmic code (as Huffman is) should never have side effects; it should live in the realm of pure functions.
- Roundtrip the data. Also write a huffman_expand() function: take bytes, return text. Without that, you cannot have any confidence in the process. In particular, you want to be able to do the following: assert original_text == huffman_expand(huffman_compress(original_text)). That doesn't prove that you've correctly implemented Huffman (perhaps you will invent your own special encoding scheme, which could be cool), but at least it will prove that you can make a lossless roundtrip. A sketch of that shape follows below.
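A minimal sketch of that shape, building on the huffman_code_table sketch above. The names and the tuple-returning interface are placeholders (a real version would serialize the table and the padding into a header so that only bytes cross the boundary):

def huffman_compress(text):
    """Sketch: pure function from text to (table, padding, payload bytes)."""
    table = huffman_code_table(text)
    bits = ''.join(table[ch] for ch in text)
    pad = -len(bits) % 8                 # 0..7 filler bits to reach a whole byte
    bits += '0' * pad
    payload = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return table, pad, payload

def huffman_expand(table, pad, payload):
    """Sketch: the inverse, walking the bit stream back through the code table."""
    decode = {code: letter for letter, code in table.items()}
    bits = ''.join(format(byte, '08b') for byte in payload)
    bits = bits[:len(bits) - pad]
    out, current = [], ''
    for bit in bits:
        current += bit
        if current in decode:
            out.append(decode[current])
            current = ''
    return ''.join(out)

original_text = "abracadabra"
assert original_text == huffman_expand(*huffman_compress(original_text))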
- What's the output file size after your change? – superb rain, Aug 25, 2020 at 7:24
- @superbrain I don't recall exactly, but I think it was on the order of 70% of original size, not nearly the amount of compression I would expect. – FMc, Aug 25, 2020 at 14:45
- Thanks. Should get down to about 47% when not using UTF-8, as I mentioned in my answer. What amount would you expect? – superb rain, Aug 25, 2020 at 15:20
- @superbrain I'm far from an expert on this topic, but default gzip gets the file down to 36% of original size. Seems like Huffman should be closer to gzip than to the roughly 70% I observed from the partly fixed OP's code. – FMc, Aug 25, 2020 at 15:44
Save the transformation to ascii for the 256 possible characters
ASCII doesn't have 256 characters. It has 128.
And you write with the default encoding, which is UTF-8, so you write the non-ASCII half of your 256 characters as two bytes for no good reason whatsoever, making your file about 1.5 times as large as it should be.
You should really just produce bytes.
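A minimal sketch of that change, assuming the bit string me and the output_file name from the question (the header with the code table would need the same treatment, which is left out here):

# Pack the '0'/'1' string into real bytes and write them in binary mode,
# so every 8 bits costs exactly one byte regardless of any text encoding.
packed = bytes(int(me[i:i + 8], 2) for i in range(0, len(me), 8))
with open(output_file, 'wb') as output:
    output.write(packed)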