I'm comparing my algorithms, `slice_prefix` and `bit_prefix`, with existing ones to find the common prefix length of 2 strings as fast as possible. Another program I'm working with spends most of its time on one line, `len(os.path.commonprefix((item1, item2)))`. I've searched for other algorithms in Python and created my own to reduce this bottleneck. Here's what I've found:
import timeit

times = [
    ['os.path.commonprefix', ospathcommonprefix],
    ['bit_prefix', bit_prefix],
    ['slice_prefix', slice_prefix],
    ['index_prefix', index_prefix],
    ['zip_prefix', zip_prefix],
    ['izip_prefix', izip_prefix]]
a = 'a' * 1000000
b = 'a' * 900000
for i, algorithm in enumerate(times):
    times[i][1] = timeit.timeit(lambda: algorithm[1](a, b), number=100)
for result in sorted(times, key=lambda x: x[-1]): print result
['slice_prefix', 0.0408039022572666]
['bit_prefix', 6.330115954430312]
['izip_prefix', 6.653096312379603]
['os.path.commonprefix', 7.151773449750181]
['index_prefix', 8.905739173445]
['zip_prefix', 12.252280194828245]
times = [
    ['os.path.commonprefix', ospathcommonprefix],
    ['bit_prefix', bit_prefix],
    ['slice_prefix', slice_prefix],
    ['index_prefix', index_prefix],
    ['zip_prefix', zip_prefix],
    ['izip_prefix', izip_prefix]]
a = 'a' * 2
b = 'a' * 1
for i, algorithm in enumerate(times):
    times[i][1] = timeit.timeit(lambda: algorithm[1](a, b), number=1000000)
for result in sorted(times, key=lambda x: x[-1]): print result
['slice_prefix', 0.7089188635036408]
['index_prefix', 0.8099312869949244]
['izip_prefix', 0.9187295778019688]
['zip_prefix', 1.115640751833098]
['os.path.commonprefix', 1.3019800865420166]
['bit_prefix', 3.7144623631063496]
a = 'a' * 1000000
b = 'a' * 900000
times = [
    ['os.path.commonprefix', ospathcommonprefix],
    ['bit_prefix', bit_prefix],
    ['slice_prefix', slice_prefix],
    ['index_prefix', index_prefix],
    ['zip_prefix', zip_prefix],
    ['izip_prefix', izip_prefix]]
for i, algorithm in enumerate(times):
    times[i][1] = algorithm[1](a, b)
print 'Sanity check. All results should be 900000'
for result in sorted(times, key=lambda x: x[-1]): print result

Sanity check. All results should be 900000
['os.path.commonprefix', 900000]
['bit_prefix', 900000]
['slice_prefix', 900000]
['index_prefix', 900000]
['zip_prefix', 900000]
['izip_prefix', 900000]
The algorithms
from itertools import izip, count
import os
def bit_prefix(a, b):
    min_len = min(len(a), len(b))
    if min_len > 0:
        # XOR the reversed, hex-decoded strings as big integers; matching
        # prefix bytes end up as trailing zero bits of the result.
        x = bin(
            int(a[::-1].encode('hex'), 16) ^
            int(b[::-1].encode('hex'), 16))
        y = x.strip('0')
        if len(y) == 1:  # XOR is zero: the shorter string is a full prefix
            return min_len
        else:
            return (len(x) - len(y)) / 8
    else:
        return 0
def ospathcommonprefix(a, b):
    return len(os.path.commonprefix((a, b)))

def slice_prefix(a, b, start=0, length=1):
    while 1:
        # Grow the comparison window geometrically while the slices match,
        # then drop back to single characters to pin down the exact position.
        while a[start:start + length] == b[start:start + length]:
            start += length
            length += length
        if length > 1:
            length = 1
        else:
            return start

def index_prefix(a, b):
    i = -1
    min_len = min(len(a), len(b))
    for i in xrange(0, min_len):
        if a[i] != b[i]:
            i -= 1
            break
    return i + 1

def zip_prefix(a, b):
    i = -1
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            i -= 1
            break
    return i + 1

def izip_prefix(a, b):
    i = -1
    for i, x, y in izip(count(), a, b):
        if x != y:
            i -= 1
            break
    return i + 1
While `slice_prefix` is faster than the other solutions, I still need another 50% improvement to remove my bottleneck. Maybe learning to write a Python C extension would work?

Do you think a C extension would give me the 50%+ performance boost I'm looking for? Any other algorithm suggestions? Any optimizations for what I've got here?
A directional variant of slice_prefix
def directional_prefix(a, b, start=0, length=1):
    while 1:
        while a[start:start + length] == b[start:start + length]:
            start = start + length
            length += length
        length = length / 2
        while length > 0 and a[start: start + length] != b[start: start + length]:
            length = length / 2
        if length > 0:
            start = start + length
            length = 1
        else:
            return start
a = 'a' * 1000000
b = 'a' * 900000
i = 0
while 2 ** i < len(b):
    length = 2 ** i
    print length, timeit.timeit(lambda: directional_prefix(a, b, length=length), number=100)
    i += 1
1 0.0380843005421
2 0.0303687541309
4 0.0326917143407
8 0.0325356651429
16 0.0319154189644
32 0.0329461337922
64 0.0315618391367
128 0.0301917666861
256 0.0344714653179
512 0.0325779366976
1024 0.0378156588852
2048 0.0330583311902
4096 0.0388345218751
8192 0.0310024323921
16384 0.0407225196375
32768 0.0332771951284
65536 0.0506670016787
131072 0.0531404802284
262144 0.0809286942078
524288 0.0748048496141
i = 0
while 2 ** i < len(b):
    length = 2 ** i
    print length, timeit.timeit(lambda: slice_prefix(a, b, length=length), number=100)
    i += 1
1 0.041934962645
2 0.0349668721381
4 0.0359395129606
8 0.0346022305948
16 0.0389324970677
32 0.0366881540488
64 0.0372364990776
128 0.0352363039174
256 0.0375355604517
512 0.0350917114961
1024 0.0369461290516
2048 0.034510971444
4096 0.035836797033
8192 0.0356835132641
16384 0.0378843995443
32768 0.0352453903265
65536 0.054919046022
131072 0.0592635347
262144 0.0652649103033
524288 0.0740941344068
3 Answers
`slice_prefix` and `index_prefix` look all right algorithmically, but they impose a lot of interpreter overhead for each and every character. As always in such a situation, you could unlock speed reserves by pushing some of the processing into the engine. In other words, the engine can probably compare hundreds of characters in the time it takes your loops to step through a small handful.
If you cannot find an existing string compare that reports the mismatch position, then you could take a long hard look at the distribution of your input strings and pick an initial block size for comparisons, like 32 or 64. Use `slice_prefix` to find the first mismatching block of that size, and then use block-size halving to find the mismatch within that block. When the block size gets lower than some threshold - something on the order of 8 or 16 - switch to a linear scan. Run copious tests on representative inputs in order to refine the two thresholds (initial block size and linear scan threshold).
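A rough sketch of that scheme might look as follows; the name `block_prefix`, the block size of 64 and the linear-scan threshold of 8 are placeholders to be tuned on representative input, not recommended values:

def block_prefix(a, b, block=64, linear_threshold=8):
    # Sketch only: block and linear_threshold are assumed, untuned defaults.
    n = min(len(a), len(b))
    if a[:n] == b[:n]:            # one string is a complete prefix of the other
        return n
    start = 0
    # Phase 1: march through fixed-size blocks until one of them mismatches.
    while a[start:start + block] == b[start:start + block]:
        start += block
    # Phase 2: the mismatch now lies in [start, start + length); halve the window.
    length = block
    while length > linear_threshold:
        length //= 2
        if a[start:start + length] == b[start:start + length]:
            start += length       # first half matches, so the mismatch is in the second
    # Phase 3: linear scan over the last few characters.
    while a[start] == b[start]:
        start += 1
    return start

The up-front `a[:n] == b[:n]` check covers the case where the shorter string is a complete prefix, which would otherwise let the block loop run past the end of both strings.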
Some additional observations. At this point there are two fruitful avenues: on one hand you could defer to a function written in C, and on the other you could create an uglified (optimised) version of `slice_prefix` that is finely tuned to the specifics of your application, like average prefix length and so on.

Gauging the potential of either approach requires hard data, like the raw overhead for passing two strings to C (which gives you an absolute limit of what could possibly be achieved via that route) and the actual performance of a good library `memcmp()` when called from Python under production conditions on representative inputs.
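For a rough sense of that overhead, here is a hedged sketch that calls the platform's `memcmp()` through ctypes. It assumes a Unix-like libc, and it only reports whether a block differs, not where, so it measures the cost of handing a comparison to C rather than solving the problem outright:

import ctypes
import ctypes.util

# find_library may return None on some platforms; this sketch assumes a Unix-like libc.
_libc = ctypes.CDLL(ctypes.util.find_library('c'))
_libc.memcmp.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t]
_libc.memcmp.restype = ctypes.c_int

def c_equal(a, b, n):
    # True if the first n bytes of a and b are identical. It does not say
    # where the first difference is, which is why plain memcmp is not enough.
    return _libc.memcmp(a, b, n) == 0

Timing `c_equal(a, b, min(len(a), len(b)))` against a pure-Python slice comparison of the same length gives a feel for both the per-call overhead and the raw compare speed.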
There are some promising avenues for tuning `slice_prefix`. The first is a slight tweak of the tail end:
def slice_prefix(a, b, start=0, length=64):
    while 1:
        while a[start:start + length] == b[start:start + length]:
            start += length
            length += length
        if length >= 4:
            length /= 4
        elif length == 1:
            return start
        else:
            length = 1
Try factors between 2 and 8 on representative input data.
Your binary approach (a.k.a. `directional_prefix`) still needs some work, though. The basic idea is this: when the first while loop exits, we know that there must be a mismatch somewhere in the block of length `length` starting at `start`. If it isn't in the first half then it must be in the second half. So if the first half compares equal, you add the current half-length to `start`; if it differs, you don't. Then halve again and continue until the linear threshold is reached, at which point you do a linear scan of the rest.
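Applied to your growing-window variant, a hypothetical rework might look like this; it is the same halving tail as in the earlier sketch, again with an assumed linear threshold of 8:

def directional_prefix2(a, b, start=0, length=1, linear=8):
    # Hypothetical rework, not the original directional_prefix.
    n = min(len(a), len(b))
    if a[:n] == b[:n]:
        return n
    while a[start:start + length] == b[start:start + length]:
        start += length
        length += length
    # The mismatch now lies somewhere in [start, start + length).
    while length > linear:
        length //= 2
        if a[start:start + length] == b[start:start + length]:
            start += length
    while a[start] == b[start]:
        start += 1
    return start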
P.S.: there is no one ring to bind them all, which is why I keep mentioning representative inputs. The `memcmp()` of a mature C runtime can come close to the ideal and typically contains a sh*tload of optimisations for different cases in order to give balanced high performance (note: the standard `memcmp()` is no use here because it doesn't return the mismatch position, but there are variants that do). But the effort of bringing C code into the fold and the complications arising from it are non-negligible, as is the call overhead. Tuning `slice_prefix` to meet your requirements - if that is possible - might be the better solution. It is already in a state of grace as it is, though, and from here on can only come uglification...
- I'm trying to understand your suggestion to use blocks. I think you mean scan the 2 items in blocks of some arbitrary size like 64, then use slice_prefix with the start variable equal to the last matched block. But how do I scan in blocks of 64 in an optimized way? Do you mean add a for loop before 'while 1:' and simply iterate in blocks of 64 to find the last one where both items are equal? – 12345678910111213, Mar 29, 2016 at 5:55
- @mattkaeo: As a first approximation you could use a two-level scheme by simply calling `slice_prefix` with `length` set to something like 32, because the function already does everything for you. It finds the first mismatching block, switches `length` to 1 and then finds the exact mismatch location. It's really neat, actually. – DarthGizka, Mar 29, 2016 at 6:22
- I think I see what you're saying. I've added a directional variant to the question. I saw some gains but also some losses. Can you tell me if you think I missed something? – 12345678910111213, Mar 29, 2016 at 7:17
- @mattkaeo: I've added some more details and explanation. Perhaps you should provisionally un-accept my answer, in order to attract more replies... ;-) – DarthGizka, Mar 29, 2016 at 8:17
- Why start with length=32 and not a fraction of the length of the shorter string? Bisect this problem as much as possible. – IceArdor, Mar 29, 2016 at 8:29
A good approach depends greatly on the distribution of your real cases (the lengths of the prefix, a and b). For instance, when I add

a, b = memoryview(a), memoryview(b)

to `slice_prefix`, I get these results:

len(a) = 1000000, len(b) = 900000
['slice2_prefix', 0.1448716410180293]
['slice_prefix', 0.8104352267954309]
len(a) = 2, len(b) = 1
['slice_prefix', 0.08141100138638435]
['slice2_prefix', 0.21893188399847152]

When a == b, `slice_prefix` doesn't work for me (the empty slices past the end of the strings always compare equal, so it never terminates).

It looks ideal for C extension optimization: the code is easy to move to C, it removes hacky code from the Python side (this is a replacement for os.path.commonprefix), and the time savings would be huge. But maybe try another library first (like numpy)?
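For what it's worth, a minimal NumPy sketch of that idea; the name `numpy_prefix` and the uint8 view are my assumptions, and it is not benchmarked against the results above:

import numpy as np

def numpy_prefix(a, b):
    # View both strings as byte arrays (no copy) and let NumPy find the
    # first position where they disagree.
    n = min(len(a), len(b))
    if n == 0:
        return 0
    x = np.frombuffer(a, dtype=np.uint8, count=n)
    y = np.frombuffer(b, dtype=np.uint8, count=n)
    mismatches = np.nonzero(x != y)[0]
    return int(mismatches[0]) if mismatches.size else n

Note that `x != y` scans the whole shorter string, so any win comes from doing that scan in C rather than from an early exit.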
I still haven't found 50%+ gains, but I did test DarthGizka's halving technique, fixed the issue with identical strings, and was surprised to find large gains on large inputs using `memoryview`. It's probably time to learn to extend Python with C.
Benchmark
import copy
import timeit

algorithms = [
    ['slice_prefix', slice_prefix],
    ['smemory_prefix', smemory_prefix],
    ['dmemory_prefix', dmemory_prefix],
    ['darth_prefix', darth_prefix]]
for i in xrange(25):
    a = 'a' * 2 ** i + 'z'
    b = 'a' * 2 ** i + 'y'
    times = copy.deepcopy(algorithms)
    for j, algorithm in enumerate(times):
        times[j][1] = timeit.timeit(lambda: algorithm[1](a, b), number=10000)
    print 2 ** i
    for result in sorted(times, key=lambda x: x[-1]):
        print ' ', result
1
['darth_prefix', 0.008644335433928063]
['slice_prefix', 0.011235937300625665]
['dmemory_prefix', 0.027204313437323435]
['smemory_prefix', 0.02916026173534192]
2
['slice_prefix', 0.0119423068335891]
['darth_prefix', 0.012402158140503161]
['smemory_prefix', 0.043039553928792884]
['dmemory_prefix', 0.04372577533740696]
4
['slice_prefix', 0.014079983313422417]
['darth_prefix', 0.014680476427201938]
['dmemory_prefix', 0.05297890017300233]
['smemory_prefix', 0.0532965294260066]
8
['slice_prefix', 0.01648985699921468]
['darth_prefix', 0.019445310286755557]
['smemory_prefix', 0.06081533533051697]
['dmemory_prefix', 0.07198924801286921]
16
['slice_prefix', 0.019568569399780245]
['darth_prefix', 0.022567084364709444]
['smemory_prefix', 0.06823577097929956]
['dmemory_prefix', 0.0775797599053476]
32
['slice_prefix', 0.02139256723876315]
['darth_prefix', 0.02606219133394916]
['smemory_prefix', 0.07920267156259797]
['dmemory_prefix', 0.09199277986044763]
64
['slice_prefix', 0.026915523655588913]
['darth_prefix', 0.03019137162482366]
['smemory_prefix', 0.08551061470370769]
['dmemory_prefix', 0.10126170714647742]
128
['slice_prefix', 0.02780757198070205]
['darth_prefix', 0.034469885073121986]
['smemory_prefix', 0.09218596481696295]
['dmemory_prefix', 0.11656005938493763]
256
['slice_prefix', 0.030256951794399356]
['darth_prefix', 0.03702000550674711]
['smemory_prefix', 0.10429222207312705]
['dmemory_prefix', 0.12654602285874716]
512
['slice_prefix', 0.03457813185832492]
['darth_prefix', 0.04374592346175632]
['smemory_prefix', 0.1124016445601228]
['dmemory_prefix', 0.14454501387263008]
1024
['slice_prefix', 0.040601630892524554]
['darth_prefix', 0.050192138043712475]
['smemory_prefix', 0.12381456930688728]
['dmemory_prefix', 0.15493986575074814]
2048
['slice_prefix', 0.04844004135520663]
['darth_prefix', 0.060962693179135385]
['smemory_prefix', 0.13431885315276304]
['dmemory_prefix', 0.17624868000166316]
4096
['slice_prefix', 0.05967676877753547]
['darth_prefix', 0.07432366499870113]
['smemory_prefix', 0.14581316051771864]
['dmemory_prefix', 0.18878831946130958]
8192
['slice_prefix', 0.07948672060865647]
['darth_prefix', 0.09363346927420935]
['smemory_prefix', 0.16827436846506316]
['dmemory_prefix', 0.21853485210704093]
16384
['slice_prefix', 0.11653991126149776]
['darth_prefix', 0.12642908472662384]
['smemory_prefix', 0.20402701744933438]
['dmemory_prefix', 0.25172552881849697]
32768
['slice_prefix', 0.17343268333934247]
['darth_prefix', 0.1912483659280042]
['smemory_prefix', 0.25608661007026967]
['dmemory_prefix', 0.308776720462447]
65536
['slice_prefix', 0.2999407803172289]
['darth_prefix', 0.30973829956928967]
['smemory_prefix', 0.3549724187987522]
['dmemory_prefix', 0.4129709673425168]
131072
['slice_prefix', 0.5387379293970298]
['smemory_prefix', 0.5455981681798221]
['darth_prefix', 0.555022354540597]
['dmemory_prefix', 0.6017983978681514]
262144
['smemory_prefix', 0.9152490880996993]
['dmemory_prefix', 0.990508653224424]
['slice_prefix', 1.0029807372084178]
['darth_prefix', 1.0165337088001252]
524288
['smemory_prefix', 1.6633493372646626]
['dmemory_prefix', 1.8189718688727226]
['slice_prefix', 1.9119993141739542]
['darth_prefix', 2.253921279302631]
1048576
['dmemory_prefix', 3.3815675477717377]
['smemory_prefix', 3.852834939850254]
['darth_prefix', 9.262993861143514]
['slice_prefix', 10.2719803196278]
2097152
['smemory_prefix', 6.760387839540272]
['dmemory_prefix', 6.84966378311492]
['slice_prefix', 23.33799268583607]
['darth_prefix', 24.301179692429287]
4194304
['dmemory_prefix', 14.16732977699212]
['smemory_prefix', 14.208940789403641]
['darth_prefix', 45.43554476774898]
['slice_prefix', 48.59348946944465]
8388608
['dmemory_prefix', 28.564412565634484]
['smemory_prefix', 28.571759914951144]
['darth_prefix', 93.14922846511217]
['slice_prefix', 97.82516800967824]
16777216
['smemory_prefix', 57.030781198086515]
['dmemory_prefix', 61.980545998364505]
['slice_prefix', 206.76891168128986]
['darth_prefix', 207.6305335736888]
Functions
def slice_prefix(a, b, start=0, length=1):
    if a == b:
        return len(a)
    while 1:
        while a[start:start + length] == b[start:start + length]:
            start += length
            length += length
        if length > 1:
            length = 1
        else:
            return start

def smemory_prefix(a, b, start=0, length=1):
    if a == b:
        return len(a)
    while 1:
        # memoryview slices compare without copying the underlying bytes
        while (memoryview(a)[start: start + length] ==
               memoryview(b)[start: start + length]):
            start += length
            length += length
        if length > 1:
            length = 1
        else:
            return start

def dmemory_prefix(a, b, start=0, length=1):
    if a == b:
        return len(a)
    while 1:
        while (memoryview(a)[start: start + length] ==
               memoryview(b)[start: start + length]):
            start += length
            length += length
        if length >= 4:
            length /= 4
        elif length == 1:
            return start
        else:
            length = 1

def darth_prefix(a, b, start=0, length=1):
    if a == b:
        return len(a)
    while 1:
        while a[start:start + length] == b[start:start + length]:
            start += length
            length += length
        if length >= 4:
            length /= 4
        elif length == 1:
            return start
        else:
            length = 1
- Regarding `slice_prefix`: I'd try replacing `if length > 1: length = 1 ...` with `if length > 1: length = length / 2 ...` – when running ahead you double the step length each time, so why not shorten it by half when slowing down? Reducing from hundreds or thousands straight back to 1 seems like too strong a reduction.