I think understanding basic data compression and pattern finding / entropy reduction primitives is really important. Mostly the basic algorithms (before the many tweaks and optimizations are made) are really elegant and inspiring. What's even more impressive is how widespread they are, but perhaps also how little they are really understood except outside of people working in compression directly. So I want to understand.
I understand and have implemented the naive implementations of LZ77 (just find the maximal prefix that existed before the current factor) and LZ78 (just see if we can extend a factor that occurred before at least twice, or we make a new record and start a new factor). I also understand the naive version of Burrows-Wheeler Transform (maintain the invariant that the first column is sorted, and keep adding that column and resorting until the matrix is square).
I also understand why all these (parts of) compression algorithms work:
- LZ77 asymptotically optimal since in the long term and with an infinite window all factors that exist are composed of prefixes earlier in the input which LZ77 can find.
- LZ78 is also pretty good since if a factor occurs at least once before it can be compressed the second time, so long as it is not a suffix of another factor found earlier.
- BWT, by grouping same letters on the left hand side (of the matrix), and finding patterns in the preceding letters in the right hand side, can exploit bigram repetitions, recursively.
The BWT code below works, but the inverse is not efficient. What are the optimizations, and how can I understand why they work?
BWT
def rot(v):
return v[-1] + v[:-1]
def bwt_matrix(s):
return sorted(reduce(lambda m,s : m + [rot(m[-1])],xrange(len(s)-1),[s]))
def last_column(m):
return ''.join(zip(*m)[-1])
def bwt(s):
bwtm = bwt_matrix(s)
print 'BWT matrix : ', bwtm
return bwtm.index(s), ''.join(last_column(bwtm))
def transpose(m):
return [''.join(i) for i in zip(*m)]
def ibwt(s):
return reduce(lambda m, s: transpose(sorted(transpose([s]+m))),[s]*len(s),[])
s = 'sixtysix'
index, bwts = bwt(s)
print 'BWT, index : ', bwts, index
print 'iBWT : ', ibwt(bwts)
BWT output
BWT matrix :
ixsixtys
ixtysixs
sixsixty
sixtysix
tysixsix
xsixtysi
xtysixsi
ysixsixt
BWT, index : ssyxxiit 3
iBWT :
ixsixtys
ixtysixs
sixsixty
sixtysix
tysixsix
xsixtysi
xtysixsi
ysixsixt
Which is correct.
LZ77
def lz77(s):
lens = len(s)
factors = []
i = 0
while i < lens:
max_match_len = 0
current_word = i
j = 0
while j < lens-1:
if current_word >= lens or s[current_word] != s[j]:
j -= (current_word - i)
if current_word - i >= max_match_len:
max_match_len = current_word - i
if current_word >= lens:
break
current_word = i
else:
current_word += 1
j += 1
if max_match_len > 1:
factors.append(s[i:i+max_match_len])
i += max_match_len
else:
factors.append(s[i])
i += 1
return ','.join(factors)
LZ78
def lz78(s):
empty = ''
factors = []
word = empty
for c in s:
word += c
if word not in factors:
factors.append(word)
word = empty
if word is not empty:
factors.append(word)
return ','.join(factors)
Correct outputs against 7th Fibonacci word:
LZ77( abaababaabaab ): a,b,a,aba,baaba,ab
LZ78( abaababaabaab ): a,b,aa,ba,baa,baab
1 Answer 1
1. Introduction
The only actual question you asked is about how to make the inverse Burrows–Wheeler transform go faster. But as you'll see below, that's plenty for one question here on Code Review. If you have anything to ask about your LZ77 and LZ78 implementations, I suggest you start a new question.
2. A bug
First, the output of the function ibwt
doesn't match the example output in the question:
>>> from pprint import pprint
>>> pprint(ibwt('ssyxxiit'))
['iisstxxy',
'xxiiysts',
'stxxsiyi',
'iystixsx',
'xsiyxtis',
'tixssyxi',
'yxtiissx',
'ssyxxiit']
You should always prepare your question by copying and pasting input and output from source files and from an interactive session, rather than typing in what you think it should be. It's easy to introduce or cover up mistakes if you do the latter.
So it looks as if there is a missing transpose
:
>>> pprint(transpose(ibwt('ssyxxiit')))
['ixsixtys',
'ixtysixs',
'sixsixty',
'sixtysix',
'tysixsix',
'xsixtysi',
'xtysixsi',
'ysixsixt']
3. Avoiding transpositions
With that bug fixed, let's see how long the function takes on a medium-sized amount of data:
>>> import random, string
>>> from timeit import timeit
>>> s = ''.join(random.choice(string.printable) for _ in range(1024))
>>> t = bwt(s)
>>> timeit(lambda:ibwt(t[1]), number=1)
49.83987808227539
Not so good! What can we change? Well, an obvious idea is to cut out all those calls to transpose
. The function ibwt
transposes the matrix in order to sort it, and then transposes it back again each time round the loop. By keeping the matrix in transposed form throughout, it would be possible to avoid having to call transpose
at all:
def ibwt2(s):
"""Return the inverse Burrows-Wheeler Transform matrix based on the
string s.
>>> ibwt2('SSYXXIIT')[3]
'SIXTYSIX'
"""
matrix = [''] * len(s)
for _ in s:
matrix = sorted(i + j for i, j in zip(s, matrix))
return matrix
That simple change results in a big improvement:
>>> s = ''.join(random.choice(string.printable) for _ in range(1024))
>>> t = bwt(s)
>>> timeit(lambda:ibwt2(t[1]), number=1)
0.9414288997650146
4. Aside on programming style
Something that's perhaps worth mentioning at this point is that you seem quite keen on transforming code so as to use a functional style (for example in this question, where you wanted to express a loop using reduce
). Transforming code into pure functional form is an excellent intellectual exercise, but you have to be aware of two problems that often arise:
Pure functional programs can easily end up doing a lot of allocation and copying. For example, it is often tempting to transform your data into a format suitable for passing in to a function (rather than transforming your function so that it operates on the data you have). Thus in
ibwt
it is convenient to callsorted
on the data in row-major order, but convenient to append a new instance of the strings
in column-major order. In order to express both of these operations in the most convenient way, the function has to transpose the matrix back and forth, and since it do so functionally, each transposition has to copy the whole matrix.Pursuit of pure functional style often results in tacit or point-free programming where variable names are avoided and data is routed from one function to another using combinators. This can seem very elegant, but variable names have two roles in a program: they don't just move data around, they also act as mnemonics to remind programmers of what the data means. Point-free programs can end up being totally mysterious because you lose track of exactly what is being operated on.
Anyway, aside over. Can we make any more progress on the inverse Burrows–Wheeler transform?
5. An efficient implementation
If you look at what the inverse Burrows–Wheeler transform does, you'll see that it repeatedly permutes an array of strings (by sorting them). The permutation is exactly the same at each step (exercise: prove this!). For example, if we are given the string SSYXXIIT
, then at each step we apply the permutation SSYXXIIT
→ IISSTXXY
, which (taking note of where repeated letters go) is the permutation 01234567
→ 56017342
.
This observation allows us to reconstruct a row of the matrix without having to compute the whole matrix. In this example we want to reconstruct row number 3. Well, we know that the final matrix looks like this:
I.......
I.......
S.......
S*......
T.......
X.......
X.......
Y.......
so the first element of the row is S
. What's the next letter of this row (marked with *
)? Where did this come from? Well, we know that the last step of the transform must have looked like this:
S0...... I5......
S1...... I6......
Y2...... -> S0......
X3...... -> S1......
X4...... -> T7......
I5...... -> X3......
I6...... X4......
T7...... Y2......
So the *
must have come from row 1 (which is the image of 3 under the permutation), and so it must be I
(which is element number 1 of the sorted string IISSTXXY
).
We can follow this approach all the way along the row. Here's an implementation in Python:
def ibwt3(k, s):
"""Return the kth row of the inverse Burrows-Wheeler Transform
matrix based on the string s.
>>> ibwt3(3, 'SSYXXIIT')
'SIXTYSIX'
"""
def row(k):
permutation = sorted((t, i) for i, t in enumerate(s))
for _ in s:
t, k = permutation[k]
yield t
return ''.join(row(k))
Let's check that works:
>>> ibwt3(3, 'SSYXXIIT')
'SIXTYSIX'
>>> s = ''.join(random.choice(string.printable) for _ in range(1024))
>>> t = bwt(s)
>>> ibwt3(*t) == s
True
and is acceptably fast:
>>> timeit(lambda:ibwt3(*t), number=1)
0.0013821125030517578
(If you compare my ibwt3
to the sample Python implementation on Wikipedia you'll see that mine is not only much simpler, but also copes with arbitrary characters, not just bytes in the range 0–255. So either I've missed something important, or the Wikipedia article could be improved.)
Explore related questions
See similar questions with these tags.
rot
v[-1]
should bev[-1:]
(ii) inlz78
there's a reference to an undefined variableempty
. \$\endgroup\$v[-1]
it only works for strings: tryrot(range(3))
for example. Withv[-1:]
it works for all sequences. (Perhaps "bug" was a bit strong.) \$\endgroup\$