Is my algorithm slow because it has problems, or is any bitwise common-prefix solution simply not going to give me the performance I'm looking for?
After profiling my algorithm, I found that over 60% of the time was spent on one line: len(os.path.commonprefix((item1, item2))). I'm only looking for the length of the prefix.
To solve this, I tried to write a bitwise prefix solution:
def bit_prefix(a, b):
    min_len = min(len(a), len(b))
    if min_len > 0:
        x = str(bin(
            int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)))
        y = x.strip('0')
        if len(y) is 1:
            return min_len
        else:
            return (len(x) - len(y)) / 8
    else:
        return 0
I've only gotten a marginal improvement in speed with long prefixes:
a = 'a' * 1000000 + 'z'
b = 'a' * 900000 + 'z'
timeit.timeit(lambda: bit_prefix(a, b), number=100)
Out[34]: 6.340372534867129
timeit.timeit(lambda: len(os.path.commonprefix((a, b))), number=100)
Out[35]: 7.5483549056534684
print bit_prefix(a, b), len(os.path.commonprefix((a, b)))
900000 900000
And my algorithm performs worse with short prefixes:
a = 'aaz'
b = 'az'
timeit.timeit(lambda: bit_prefix(a, b), number=1000000)
Out[42]: 3.968956086175467
timeit.timeit(lambda: len(os.path.commonprefix((a, b))), number=1000000)
Out[43]: 1.1592788235707303
print bit_prefix(a, b), len(os.path.commonprefix((a, b)))
1 1
If my algorithm isn't broken and a bitwise solution won't give me the performance boost I'm looking for, can you refer me to a common prefix solution that would outperform os.path.commonprefix?
Here's the bit_prefix profile:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
29 @profile
30 def bit_prefix(a, b):
31 36140 81099 2.2 4.2 min_len = min(len(a), len(b))
32 36140 49386 1.4 2.5 if min_len > 0:
33 36140 49739 1.4 2.6 x = str(
34 36140 47846 1.3 2.5 bin(
35 36140 47232 1.3 2.4 int(
36 36140 89499 2.5 4.6 a[::-1]
37 36140 242994 6.7 12.5 .encode('hex'),
38 36140 164442 4.6 8.4 16)
39 ^
40 36140 49601 1.4 2.5 int(
41 36140 88745 2.5 4.6 b[::-1]
42 36140 216488 6.0 11.1 .encode('hex'),
43 36140 504425 14.0 25.9 16)))
44 36140 187571 5.2 9.6 y = x.strip('0')
45 36140 61027 1.7 3.1 if len(y) is 1:
46 return min_len
47 else:
48 36140 67507 1.9 3.5 return (len(x) - len(y)) / 8
49 else:
50 return 0
Update
I've created another algorithm that seems to be faster than others. Here's the Code Review for that.
Is my algorithm slow because it has problems?
Yes. The fundamental problem is that it always processes every character of the input.
There are some obvious improvements which don't address this fundamental problem:
- Since the common prefix can't be more than min_len, truncate both strings to min_len before encoding them.
- bin returns a string, so str(bin(...)) is overkill.
- bin(int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)) is also overkill: what you care about is the position of the lowest set bit of int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16). You can extract the lowest set bit directly from the integer as

      differences = int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)
      least_significant_difference = differences ^ (differences - 1)

  (That actually gives a binary number which is all 1s from the lowest set bit down to bit zero.) Then you can either convert that to a string and take its length, or take the logarithm base 2.
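Putting those suggestions together might look like the following sketch (Python 3, where bytes.hex() replaces the Python 2 str.encode('hex'); the function name bit_prefix_fast is illustrative, not from the original post):

```python
def bit_prefix_fast(a, b):
    # Sketch of the suggested rework, assuming 1-byte (ASCII) characters
    # as the original Python 2 byte strings have.
    min_len = min(len(a), len(b))
    if min_len == 0:
        return 0
    # Truncate first: the common prefix can't be longer than min_len.
    a, b = a[:min_len], b[:min_len]
    # Reversing puts the common prefix in the low-order bytes of the integers.
    differences = (int(a[::-1].encode().hex(), 16) ^
                   int(b[::-1].encode().hex(), 16))
    if differences == 0:
        return min_len
    # All 1s from the lowest set bit down to bit zero.
    least_significant_difference = differences ^ (differences - 1)
    # Matching low-order bits, converted to whole bytes.
    return (least_significant_difference.bit_length() - 1) // 8
```

int.bit_length() here plays the role of the base-2 logarithm, avoiding the string conversion entirely.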
But the fundamental problem is still there: the fastest code is the code which isn't executed, and when the prefix is very short then a simple loop from the start which compares characters is hard to beat. If you really have to beat it, I think you're probably looking at writing some C which coerces a pointer to (16-bit?) characters into a pointer to 64-bit integers and uses that to compare a block of characters at a time. Beware endianness.
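A rough pure-Python approximation of that block-at-a-time idea is to compare fixed-size slices (slice comparison runs at C speed, like memcmp) and only fall back to a per-character scan inside the first differing block; block_prefix and the 8-character block size are illustrative choices, not from the answer:

```python
def block_prefix(a, b):
    # Illustrative sketch: compare 8-character chunks first, then scan the
    # first differing chunk character by character. A real C version would
    # compare 64-bit words instead of slices.
    n = min(len(a), len(b))
    step = 8
    i = 0
    while i + step <= n and a[i:i + step] == b[i:i + step]:
        i += step
    while i < n and a[i] == b[i]:
        i += 1
    return i
```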
I expected enumerate() with itertools.izip() to be blisteringly fast, since they return generators and so a minimum of characters are iterated, but even that came out just barely better or just barely worse: about the same.
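That lazy pairwise scan might look like this (Python 3, where the built-in zip is already lazy like the old itertools.izip; loop_prefix is an illustrative name):

```python
def loop_prefix(a, b):
    # Lazy pairwise scan: no intermediate list is built, and iteration
    # stops at the first differing character.
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))
```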