I am working on a fast implementation of simple Huffman code compression of text. The idea was to write it using only the standard library, but I can't seem to find a way to make it faster. I am also looking for advice on how to write it in a more "Pythonic" way, without sacrificing speed.
I am aware that if I want speed I shouldn't use Python, but I've taken it as an exercise to test pure Python performance.
from collections import Counter, defaultdict
def huffman_compress(input_file, output_file, encoding='utf8'):
"""This functions compresses a txt file using Huffman code compression."""
# Store the text in memory since it is faster than reading twice
text = open(input_file, "r", encoding=encoding).read()
# Count the times each letter appears on the text
letter_freq = Counter(text)
alphabet = defaultdict(str)
# Obtain the huffman code for each letter
while len(letter_freq) > 1:
(letter1, count1), (letter2, count2) = letter_freq.most_common(2)
letter_freq[letter1+letter2] = count1 + count2
for bit, combination in enumerate([letter1, letter2]):
for letter in combination:
alphabet[letter] = str(bit) + alphabet[letter]
del letter_freq[combination]
# Save the transformation to ascii for possible the 256 characters
bit_to_ascii = {format(x, '08b'): chr(x) for x in range(256)}
with open(output_file, 'w') as output:
# Transform each letter to its huffman code
me = ''.join(alphabet[ch] for ch in text)
# Add 0's so that the string is multiple of 8
extra_bits = 8 - len(me) % 8
me += extra_bits * '0'
# Write the number of letters compressed and the number of bits added
output.write(f'{chr(len(alphabet))}{extra_bits}')
# Write the letters compressed and their huffman code for the decompression
output.write('|'.join(c for item in alphabet.items() for c in item))
# Transform the huffman bits to ascii and save them on the compressed file.
output.write(''.join(bit_to_ascii[me[j:j+8]] for j in range(0, len(me), 8)))
- How long does it take? Have you tried writing the same algorithm in a different language to check the difference in time? – pacmaninbw, Aug 22, 2020 at 12:03
- Use a profiler from the stdlib to see where your code is spending most of its time. Then you know where to optimize. – RootTwo, Aug 22, 2020 at 17:51 (an example invocation follows these comments)
- As an aside: it's probably better to keep the input reading, the output writing, and the Huffman encoding algorithm separated. Your function huffman_compress only needs to take text as input and return text as well. – Julien Rousé, Aug 22, 2020 at 18:25
- There should also be a decompression function. Right now I can't simply do a little test to see whether it might be correct at all. And why would we care about speed when we don't even know whether it's correct? – superb rain, Aug 22, 2020 at 21:14
- Do you have an example file to use for testing (input and expected output)? – Martin Thoma, Aug 23, 2020 at 14:55
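Pointing the stdlib profiler at the script needs no code changes; for instance (reusing the file names that appear in the first answer below):

$ python -m cProfile -s cumulative huffman.py war-peace.txt encoded

The -s cumulative flag sorts the report by cumulative time, which makes it easy to see whether the time goes into building the code table or into writing the output.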
2 Answers
I started with your code, added sys.argv so I could pass file paths on the command line, downloaded a big text file (War and Peace, of course), ran your program, and checked file sizes:
$ curl 'https://www.gutenberg.org/files/2600/2600-0.txt' -o war-peace.txt -k
$ time python huffman.py war-peace.txt encoded
real 0m11.052s
user 0m10.462s
sys 0m0.389s
$ ls -lh
-rw-r--r-- 1 fmc staff 40M Aug 24 13:51 encoded
-rw-r--r-- 1 fmc staff 3.3M Aug 24 13:50 war-peace.txt
Looks like you have inadvertently invented an expansion algorithm: it creates a file roughly 12x bigger! Also, 11 seconds seems slow to process a meager 40M of text. Normally Python can crunch data of that size much more quickly.
I temporarily assigned a short string (huffman) to the text variable, bypassing file reading, and printed out some of your intermediate variables. Although letter_freq looked fine, alphabet was the opposite of what we want:
f 00000 # The most frequent letter has the longest code.
h 00001
u 0001
m 001
a 01
n 1
The Huffman algorithm repeatedly combines the two elements with the lowest frequency, but you are doing the opposite: combining the two most frequent. So I tweaked your code like this:
(letter1, count1), (letter2, count2) = letter_freq.most_common()[:-3:-1]
With that change, alphabet at least looks more plausible, the output file ends up being smaller than the input file (although not by as much as I expect, so there are probably other problems in your code), and it finishes in about 1 second rather than 11 (most likely because it's writing a much smaller output file).
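For comparison, here is one conventional way to build the code table in pure Python, using a heap of partial trees. It is only a sketch of the textbook algorithm (the name huffman_code_table and the tuple layout are my own, not from the question), not a drop-in replacement for the function above:

import heapq
from collections import Counter

def huffman_code_table(text):
    """Build a prefix code by repeatedly merging the two least frequent nodes."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Degenerate case: a single distinct character still needs one bit.
        return {letter: '0' for letter in freq}
    # Each heap entry: (frequency, tie_breaker, {letter: code_so_far}).
    heap = [(count, i, {letter: ''}) for i, (letter, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        count1, _, codes1 = heapq.heappop(heap)
        count2, _, codes2 = heapq.heappop(heap)
        # Prepend '0' to every code in the cheaper subtree and '1' in the other.
        merged = {letter: '0' + code for letter, code in codes1.items()}
        merged.update({letter: '1' + code for letter, code in codes2.items()})
        heapq.heappush(heap, (count1 + count2, tie, merged))
        tie += 1
    return heap[0][2]

Each heappop costs O(log n), whereas every call to most_common has to scan the whole counter again, and the frequent letters end up with the shortest codes instead of the longest.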
Some suggestions:
- Focus on correctness first. Worry about speed later, and only if it truly matters (and it might, if for no other reason than the educational one).
- Algorithms and side effects don't mix. Reorganize your code to facilitate testing and debugging. The huffman_compress() function itself should not concern itself with file reading and writing. It should take a blob of text and return a blob of bytes, period. Highly algorithmic code (as Huffman is) should never have side effects; it should live in the realm of pure functions.
- Roundtrip the data. Also write a huffman_expand() function: take bytes, return text. Without that, you cannot have any confidence in the process. In particular, you want to be able to do the following: assert original_text == huffman_expand(huffman_compress(original_text)). That doesn't prove that you've correctly implemented Huffman (perhaps you will invent your own special encoding scheme, which could be cool), but at least it will prove that you can make a lossless roundtrip. A sketch of that shape follows below.
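A minimal sketch of that shape, building on the huffman_code_table sketch above. The names and the tuple-returning interface are placeholders (a real version would serialize the table and the padding into a header so that only bytes cross the boundary):

def huffman_compress(text):
    """Sketch: pure function from text to (table, padding, payload bytes)."""
    table = huffman_code_table(text)
    bits = ''.join(table[ch] for ch in text)
    pad = -len(bits) % 8                 # 0..7 filler bits to reach a whole byte
    bits += '0' * pad
    payload = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return table, pad, payload

def huffman_expand(table, pad, payload):
    """Sketch: the inverse, walking the bit stream back through the code table."""
    decode = {code: letter for letter, code in table.items()}
    bits = ''.join(format(byte, '08b') for byte in payload)
    bits = bits[:len(bits) - pad]
    out, current = [], ''
    for bit in bits:
        current += bit
        if current in decode:
            out.append(decode[current])
            current = ''
    return ''.join(out)

original_text = "abracadabra"
assert original_text == huffman_expand(*huffman_compress(original_text))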
- What's the output file size after your change? – superb rain, Aug 25, 2020 at 7:24
- @superbrain I don't recall exactly, but I think it was on the order of 70% of original size, not nearly the amount of compression I would expect. – FMc, Aug 25, 2020 at 14:45
- Thanks. Should get down to about 47% when not using UTF-8, as I mentioned in my answer. What amount would you expect? – superb rain, Aug 25, 2020 at 15:20
- @superbrain I'm far from an expert on this topic, but default gzip gets the file down to 36% of original size. Seems like Huffman should be closer to gzip than to the roughly 70% I observed from the partly fixed OP's code. – FMc, Aug 25, 2020 at 15:44
Save the transformation to ascii for the 256 possible characters
ASCII doesn't have 256 characters. It has 128.
And you write with the default encoding, which is UTF-8, so you write the non-ASCII half of your 256 characters as two bytes for no good reason whatsoever, making your file about 1.5 times as large as it should be.
You should really just produce bytes.
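A minimal sketch of that change, assuming the bit string me and the output_file name from the question (the header with the code table would need the same treatment, which is left out here):

# Pack the '0'/'1' string into real bytes and write them in binary mode,
# so every 8 bits costs exactly one byte regardless of any text encoding.
packed = bytes(int(me[i:i + 8], 2) for i in range(0, len(me), 8))
with open(output_file, 'wb') as output:
    output.write(packed)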