#!/usr/bin/python
# -*- coding: utf-8 -*-
"""Measuring the effectiveness of huffman coding.

Francis Bacon’s invention of Baudot code
----------------------------------------

Francis Bacon invented a code called the "biliteral alphabet", which
he described as a means of making anything mean anything, in which the
coded message was represented by sequences of two possible letters,
which you could represent as "a" and "b"; you could encode, for
example, "A" as "aaaaa", "B" as "aaaab", "C" as "aaaba", "D" as
"aaabb", and so on.  I’m not totally clear on how much of this was
explained in Bacon’s work and how much was discovered in the 1890s.
There is an excellent article on this in the Winter 2010/11 issue of
Cabinet magazine, [How to Make Anything Signify Anything][0], by
William H. Sherman.

[0]: http://www.cabinetmagazine.org/issues/40/sherman.php

An ingenious aspect of this code is that anything that can appear in
sequences, each of which can be in one of two recognizable states, can
be used to bear the message.  Apparently Bacon used two fonts of type,
such that any letter drawn from the first font represented an "a", and
any letter drawn from the second font represented a "b".  This allowed
him to encode a secret message in any printed text with only a
fivefold expansion in the message size.  Of course, the two fonts were
in a single harmonious style, so it might not be obvious from looking
which letters belonged to which font; and in cases of handwriting,
still more variation is possible.

Variable-length codes for better efficiency and the huffman algorithm
----------------------------------------------------------------------

From a modern perspective, of course, what Bacon invented was the
Baudot code, which represents letters and potentially other characters
as binary sequences; and a natural thing to wonder is whether encoding
efficiency could have been improved with a variable-length code, like
Morse Code, in which the more common letters are encoded with shorter
code sequences.  Such a code is still eminently practical to encode
and decode by hand.  The huffman algorithm computes an "optimal"
version of such a code.  For the letters of the Project Gutenberg
version of the King James Bible, a huffman code as computed by this
program is:

    ('000000', 'v')
    ('0000010000', 'q')
    ('0000010001', 'x')
    ('000001001', 'z')
    ('00000101', 'j')
    ('0000011', 'k')
    ('00001', 'm')
    ('0001', 'd')
    ('001', 't')
    ('01000', 'u')
    ('01001', 'f')
    ('0101', 'r')
    ('0110', 's')
    ('0111', 'i')
    ('100', 'e')
    ('101000', 'p')
    ('101001', 'b')
    ('101010', 'c')
    ('101011', 'g')
    ('1011', 'n')
    ('1100', 'o')
    ('110100', 'y')
    ('110101', 'w')
    ('11011', 'l')
    ('1110', 'a')
    ('1111', 'h')

So you see the most common letters (‘e’ and ‘t’) are represented by
only 3 bits, other common letters (‘a’, ‘d’, ‘h’, ‘i’, ‘r’, ‘s’, ‘n’,
‘o’) are represented by 4 bits, and the least common letters ‘z’, ‘q’,
and ‘x’ have 9- and 10-bit codes.  And no valid code begins with any
other valid code --- for example, ‘e’ has been assigned ‘100’, so no
other letter can have a code that begins with ‘100’.  This avoids any
need for marking where the code for one letter ends and the next
begins.  Encoding the 3 239 443 letters of this book with this code,
we need, on average, only 4.15 bits per letter rather than 5.
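
To make the prefix property concrete, here is a small illustrative
sketch (separate from the program below) of encoding a short word with
three entries taken from the table above:

    code = {'t': '001', 'h': '1111', 'e': '100'}  # entries from the table above
    bits = ''.join(code[c] for c in 'the')        # -> '0011111100'
    # To decode, read bits until they match some letter's code, emit that
    # letter, and start over; since no code is a prefix of another, there
    # is never any doubt about where one letter ends and the next begins.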

Context-dependent variable-length codes
---------------------------------------

A somewhat more difficult, but still practical, approach is to use a
number of different variable-length codes depending on context.  For
example, in English, after a ‘j’, it is quite unlikely to find any
consonant (in Bacon’s day, ‘j’ was not a distinct letter, so this did
not apply), and after ‘q’ it is quite unlikely to find anything but
‘u’.  If we use 26 different huffman codes instead of one, then
encoding and decoding will be trickier, but we can achieve better
efficiency.  Using such a set of codes for this same version of the
Bible, we reduce the space needed from 4.15 bits per letter to 3.5.

However, the code used is not complete; because every ‘q’ in the text
is followed by a ‘u’, for example, the ‘u’ in that position is
assigned the zero-length code ‘’, and it is therefore impossible to
represent any other letter following a ‘q’; and similarly, it is only
possible to represent vowels following a ‘j’.  (‘e’ in that position
is assigned the code ‘0’, ‘o’ the code ‘10’.)  Making the code
complete might worsen its efficiency slightly.  For example, ‘u’
following ‘q’ would require at least one bit.
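
As a sketch of how such context-dependent tables behave (the values
here are the ones mentioned above; the real tables are computed by the
AdaptiveEncoder class in the program below):

    # Hypothetical fragment of the 26 per-context tables:
    tables = {'q': {'u': ''},              # every 'q' in the text is followed by 'u'
              'j': {'e': '0', 'o': '10'}}  # only vowels ever follow 'j' (partial)
    prev_char = 'q'
    bits = tables[prev_char]['u']          # -> ''; the 'u' after a 'q' costs no bits
    # Encoding a whole string means looking up each letter in the table
    # selected by the previous letter, updating prev_char as you go.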

Optimizing huffman codes for manual use
---------------------------------------

Bacon’s original invention was intended for human beings to read and
write, not automatic computers.  In modern use of huffman codes, it is
common to transmit the huffman table along with the encoded text, but
I believe that for human beings, it would probably be more practical
to standardize on a common code.

Today it is more common to use character codes that represent more
than just the letters of a text; ASCII includes case, spaces,
punctuation, and numbers, for example, while Unicode also includes
accents, not to mention other alphabets.  If we encode the entire
text, including all of these distinctions, with a single huffman code,
it requires 4.58 bits per character, a total of 19 939 830 bits; if we
use a separate huffman table for each preceding character, it's 3.42
bits per character, 14 874 433 bits in all.  This compares to
13 450 484 bits to represent all the letters with a single huffman
table and 11 339 286 bits to represent them with 26 different tables.

I wrote another program (previously posted to kragen-hacks under the
rubric "reducing charset size for compressibility with case-shift
characters (in Python)") to reduce the number of different characters
that need to be encoded, by using a single code for each letter, and
changing between upper-case and lower-case letters with special
"control characters".  It uses \x11 to switch to lowercase, \x12 to
switch to uppercase, \x13 to switch to digits (0 is a, 1 is b, and so
on), and \x14 to switch only the next character to uppercase.  (A
rough sketch of this preprocessing appears after the encoding key
below.)  This way, the poor monk who’s manually enciphering and
deciphering text only needs about 40 characters in his codebook for
ordinary text, and only 92 for all of ASCII.  You might expect that he
would have to do more work, though, since the case-shift control
characters are more things he has to encipher.  But it also allows the
huffman algorithm to use shorter codes for the letters.  It turns out
that these effects cancel out almost exactly, and the difference on my
test file is about 0.08%.

The code I came out with, after a little bit of manual tweaking, is as
follows:

### Optimized binary code for manual encoding ###

    ('0000', 'i')
    ('0001 0', '\n')        (line break.  Could use for paragraph breaks.)
    ('0001 1000 00', "'")
    ('0001 1000 01', 'z')
    ('0001 1000 10', '?')
    ('0001 1000 110', 'x')
    ('0001 1000 1110', 'q')
    ('0001 1001', 'j')
    ('0001 101', '.')
    ('0001 11', 'y')
    ('001', 'e')
    ('0100', 'n')
    ('0101', 'o')
    ('0110 000', 'v')
    ('0110 001', '\x13')    (switch to numbers: 0 for a, 1 for b, etc.)
    ('0110 01', 'g')
    ('0110 1', 'l')
    ('0111', 'a')
    ('1000 00', 'w')
    ('1000 01', ',')
    ('1000 10', 'c')
    ('1000 11', 'b')
    ('1001', 'h')
    ('1010', 't')
    ('1011 000', '\x11')    (switch to lowercase)
    ('1011 0010 0', '\x12') (switch to uppercase)
    ('1011 0010 1', ';')
    ('1011 0011', 'k')
    ('1011 01', 'm')
    ('1011 1', 'r')
    ('1100 0', 'd')
    ('1100 10', 'u')
    ('1100 11', '\x14')     (next character is an uppercase letter)
    ('1101 000', 'p')
    ('1101 001', ':')
    ('1101 01', 'f')
    ('1101 1', 's')
    ('111', ' ')

0001 1000 1111... was a bunch of poorly-represented characters, which
appeared only in the Project Gutenberg blurb.  It remains open for
further expansion with rarely-used characters.

This encodes the test file at 4.45 bits per character if you count the
inserted control characters as characters, or 4.6 bits per original
character if you don't.

#### Encoding key ####

    ("'", '0001 1000 00')
    (' ', '111')
    (',', '1000 01')
    ('.', '0001 101')
    (':', '1101 001')
    (';', '1011 0010 1')
    ('?', '0001 1000 10')
    ('\n', '0001 0')        (line break.  Could use for paragraph breaks.)
    ('\x11', '1011 000')    (switch to lowercase)
    ('\x12', '1011 0010 0') (switch to uppercase)
    ('\x13', '0110 001')    (switch to numbers: 0 for a, 1 for b, etc.)
    ('\x14', '1100 11')     (next character is an uppercase letter)
    ('a', '0111')
    ('b', '1000 11')
    ('c', '1000 10')
    ('d', '1100 0')
    ('e', '001')
    ('f', '1101 01')
    ('g', '0110 01')
    ('h', '1001')
    ('i', '0000')
    ('j', '0001 1001')
    ('k', '1011 0011')
    ('l', '0110 1')
    ('m', '1011 01')
    ('n', '0100')
    ('o', '0101')
    ('p', '1101 000')
    ('q', '0001 1000 1110')
    ('r', '1011 1')
    ('s', '1101 1')
    ('t', '1010')
    ('u', '1100 10')
    ('v', '0110 000')
    ('w', '1000 00')
    ('x', '0001 1000 110')
    ('y', '0001 11')
    ('z', '0001 1000 01')
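
That charset-reducing program is not reproduced here, but a minimal
sketch of the idea, handling only the single-character uppercase shift
(\x14) and ignoring the \x11/\x12/\x13 shifts, might look like this
(my reconstruction, not the program posted to kragen-hacks):

    def case_shift(text):
        # A sketch: mark each capital letter with the "next character is
        # uppercase" control character and emit it in lowercase.
        out = []
        for char in text:
            if char.isupper():
                out.append('\x14' + char.lower())
            else:
                out.append(char)
        return ''.join(out)

    # case_shift('In the beginning God created')
    #   -> '\x14in the beginning \x14god created'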

Other compression algorithms
----------------------------

This is far from the performance of modern compression algorithms:

|----------------------------------------------------------+-------------|
| encoding                                                 |  output bits |
|----------------------------------------------------------+-------------|
| ASCII                                                    |   34 814 776 |
| Bacon (5-bit, letters only)                              |   16 197 215 |
|----------------------------------------------------------+-------------|
| simple huffman with case-shift coding (the table above)  |   20 094 861 |
| simple huffman                                           |   19 939 830 |
| context-sensitive huffman                                |   14 874 433 |
| simple huffman (letters only)                            |   13 450 484 |
| context-sensitive huffman (letters only)                 |   11 339 286 |
|----------------------------------------------------------+-------------|
| compress (LZW)                                           |   12 595 576 |
| gzip -9 (LZ77, huffman)                                  |   10 968 576 |
| LZMA                                                     |    8 319 576 |
| bzip2 -9 (BWT, MTF, huffman, etc.)                       |    8 022 672 |
|----------------------------------------------------------+-------------|

However, I don’t believe that any of the more modern algorithms are
practical to carry out by hand --- even LZW, which is roughly as
simple as huffman coding in software.

Errors
------

When considering human encoding, it's important to think about the
effect of errors.  Huffman coding differs from fixed-width coding
(e.g. ASCII, Baudot) in its response to errors.  With fixed-width
coding, a bit substitution creates a single-character error, but the
rest of the text is unharmed.  Huffman coding can be desynchronized by
a bit substitution error, but it will eventually resynchronize.
Fixed-width coding, by contrast, becomes permanently desynchronized by
bit insertion and deletion errors, while Huffman coding will
eventually resynchronize in these cases as well.

Fixed-width coding errors will tend to be fairly obvious; the
substituted letter will usually be an uncommon letter, and one that
wouldn’t make sense in context.  Huffman coding, by contrast, will
tend to prefer common letters, and the context-sensitive huffman
coding I suggest using here will additionally tend to produce
plausible *sequences* of letters, dissociated-press style, and
occasionally even words.  The superstitious might amuse themselves by
generating random bits from a series of coin flips, decoding the
results, and attempting to interpret them as a supernatural message.

As an example, using the context-sensitive huffman method, the first
verse of Genesis 'inthebeginninggodcreatedtheheavenandtheearth'
encodes as

    ‘000101111101111000111101110000101101101100001001010001100’
    ‘10110010111101101000001100101010110101001100000101111101011110101101’
    ‘011001000111011101’.

If we change the first bit from ‘0’ to ‘1’, we get the result
‘tanthestjendtteldsomenoondspawndsanergtho’, which consists mostly of
words: "tan the stj end tteld some no on d spawn ds an erg tho".  If
instead of substituting, we prepend a ‘1’, we get instead
‘omthgsulivesndlndtheatedtheheavenandtheearth’ --- it resynchronized
partway through, giving us "Omthgsu Live Sndlndth eated the heaven and
the earth", a sort of Lovecraft edit of Genesis.

Using the context-sensitive huffman tables for the full ASCII version
of the PG KJV, inserting a '1' at the beginning transforms "this is a
simple piece of text. It should be relatively easy to correct errors
in before too long, given the self synchronizing property of huff man
codes." into the following:

    nd, msoug thef perunodix 10 thendethaicowan thth y whersemesof er
    I the plond ofunewin ileathinse,) t th t r ed themicut y it ghity
    chit hronizing property of huff man codes.

I had to search for a while to find an error that left it
desynchronized for such a long time, though.  It's not impossible to
construct a text that fails to resynchronize for an arbitrarily long
time; simple repetition will generally do it.  Consequently, although
encoding and decoding errors are likely to be easily recoverable in
this scheme, substitution errors will have a bigger effect, and the
erroneous decoded text will appear much more plausible than with
Bacon's less efficient method.
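
These experiments can be reproduced, roughly, with the classes defined
in the program below; here 'bible.txt' is only a placeholder name for
a letters-only test file:

    lines = list(letters_only(file('bible.txt')))   # placeholder filename
    encoder = AdaptiveEncoder(freq_follow(lines))
    bits = encoder.encode('inthebeginninggodcreatedtheheavenandtheearth')
    encoder.reset()   # decoding starts from the same initial context
    flipped = ('1' if bits[0] == '0' else '0') + bits[1:]  # substitute the first bit
    print encoder.decode(flipped)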

"""

import re, collections, sys


def huff(freqs):
    "Given a dictionary of letter frequencies, compute a huffman tree for them."
    # A tree is either a letter (a leaf) or a 2-tuple of subtrees.
    items = [(v, k) for k, v in freqs.items()]
    while len(items) > 1:
        # find the two smallest items.
        items.sort()            # heapq would be faster, but this is simpler
        ((firstv, firstk), (secondv, secondk)) = items[:2]
        # combine them.
        new_item = (firstv + secondv, (firstk, secondk))
        items[0] = new_item
        items[1] = items[-1]
        items.pop()
    return items[0][1]


def mkcode(tree, prefix=''):
    "Given a huffman tree, generate the code for it."
    if type(tree) is type(()):
        for item in mkcode(tree[0], prefix + '0'):
            yield item
        for item in mkcode(tree[1], prefix + '1'):
            yield item
    else:
        yield prefix, tree


def letters_only(infile):
    "Strip out nonletters and case information from a sequence of lines."
    for line in infile:
        yield re.sub('[^a-z]', '', line.lower())


def freq(infile):
    "Tabulate the frequencies of characters in a sequence of lines."
    rv = collections.defaultdict(lambda: 0)
    for line in infile:
        for char in line:
            rv[char] += 1
    return rv


class SimpleEncoder:
    "Encodes according to a simple huffman tree."
    def __init__(self, tree):
        self.dict = dict((letter, code) for code, letter in mkcode(tree))

    def encode(self, instring):
        return ''.join(self.dict[char] for char in instring)


# This is kind of a silly way to do this.  You could simply calculate
# them from the frequency tables used to compute the huffman code in
# the first place.
def evaluate_code(encoder, infile):
    "Compute input and output size for an encoder applied to a file."
    in_chars = 0
    out_chars = 0
    for line in infile:
        in_chars += len(line)
        out_chars += len(encoder.encode(line))
    return in_chars, out_chars


def freq_follow(infile):
    "Tabulate character frequencies for single-character contexts."
    rvs = collections.defaultdict(lambda: collections.defaultdict(lambda: 0))
    prev_char = 'a'
    for line in infile:
        for char in line:
            rvs[prev_char][char] += 1
            prev_char = char
    return rvs


class AdaptiveEncoder:
    "Encode with a different huffman table for each single-character context."
    def __init__(self, freqs):
        trees = [(prev_char, huff(next_char_freqs))
                 for prev_char, next_char_freqs in freqs.items()]
        self.dicts = {}
        self.decode_dicts = {}
        for prev_char, tree in trees:
            self.dicts[prev_char] = dict((letter, code)
                                         for code, letter in mkcode(tree))
            self.decode_dicts[prev_char] = dict(mkcode(tree))
        self.reset()

    def reset(self):
        self.prev_char = 'a'

    def encode(self, instring):
        rv = []
        for char in instring:
            rv.append(self.dicts[self.prev_char][char])
            self.prev_char = char
        return ''.join(rv)

    def decode(self, encoded):
        rv = []
        buf = ''
        bits = iter(encoded)
        while True:
            # Checking before reading a bit allows zero-length codes,
            # such as the code for 'u' following 'q' in letters-only text.
            if buf in self.decode_dicts[self.prev_char]:
                self.prev_char = self.decode_dicts[self.prev_char][buf]
                rv.append(self.prev_char)
                buf = ''
            else:
                try:
                    bit = bits.next()
                except StopIteration:
                    return ''.join(rv)
                buf += bit


def test(input_file):
    "Try different codes on a particular input file."
    print "Single huffman table:"
    freqs = freq(input_file())
    print freqs
    tree = huff(freqs)
    print tree
    for item in mkcode(tree):
        print item
    in_chars, out_chars = evaluate_code(SimpleEncoder(tree), input_file())
    print "%d chars in, %d bits out, %.2f bits/char" % (
        in_chars, out_chars, float(out_chars)/in_chars)
    print

    print "Using one-character context:"
    ffs = freq_follow(input_file())
    print ffs
    trees = [(prev_char, huff(next_char_freqs))
             for prev_char, next_char_freqs in ffs.items()]
    trees.sort()
    for prev_char, tree in trees:
        print "following %r:" % prev_char
        for item in mkcode(tree):
            print item
        print
    in_chars, out_chars = evaluate_code(AdaptiveEncoder(ffs), input_file())
    print "with 1-char adaptive, %d chars in, %d bits out, %.2f bits/char" % (
        in_chars, out_chars, float(out_chars)/in_chars)


if __name__ == '__main__':
    print "Letters only:"
    test(lambda: letters_only(file(sys.argv[1])))
    print
    print "All characters:"
    test(lambda: file(sys.argv[1]))

# Local Variables:
# compile-command: "./huffman.py"
# End:
