I'm comparing my algorithms, `slice_prefix` and `bit_prefix`, with existing ones to find the common prefix length of 2 strings as fast as possible. Another program I'm working with spends most of its time on one line, `len(os.path.commonprefix((item1, item2)))`. I've searched for other algorithms in Python and created my own to reduce this bottleneck. Here's what I've found:
import timeit

times = [
    ['os.path.commonprefix', ospathcommonprefix],
    ['bit_prefix', bit_prefix],
    ['slice_prefix', slice_prefix],
    ['index_prefix', index_prefix],
    ['zip_prefix', zip_prefix],
    ['izip_prefix', izip_prefix]]
a = 'a' * 1000000
b = 'a' * 900000
for i, algorithm in enumerate(times):
    times[i][1] = timeit.timeit(lambda: algorithm[1](a, b), number=100)
for result in sorted(times, key=lambda x: x[-1]): print result
['slice_prefix', 0.0408039022572666]
['bit_prefix', 6.330115954430312]
['izip_prefix', 6.653096312379603]
['os.path.commonprefix', 7.151773449750181]
['index_prefix', 8.905739173445]
['zip_prefix', 12.252280194828245]
times = [
    ['os.path.commonprefix', ospathcommonprefix],
    ['bit_prefix', bit_prefix],
    ['slice_prefix', slice_prefix],
    ['index_prefix', index_prefix],
    ['zip_prefix', zip_prefix],
    ['izip_prefix', izip_prefix]]
a = 'a' * 2
b = 'a' * 1
for i, algorithm in enumerate(times):
    times[i][1] = timeit.timeit(lambda: algorithm[1](a, b), number=1000000)
for result in sorted(times, key=lambda x: x[-1]): print result
['slice_prefix', 0.7089188635036408]
['index_prefix', 0.8099312869949244]
['izip_prefix', 0.9187295778019688]
['zip_prefix', 1.115640751833098]
['os.path.commonprefix', 1.3019800865420166]
['bit_prefix', 3.7144623631063496]
a = 'a' * 1000000
b = 'a' * 900000
times = [
    ['os.path.commonprefix', ospathcommonprefix],
    ['bit_prefix', bit_prefix],
    ['slice_prefix', slice_prefix],
    ['index_prefix', index_prefix],
    ['zip_prefix', zip_prefix],
    ['izip_prefix', izip_prefix]]
for i, algorithm in enumerate(times):
    times[i][1] = algorithm[1](a, b)
print 'Sanity check. All results should be 900000'
for result in sorted(times, key=lambda x: x[-1]): print result

Sanity check. All results should be 900000
['os.path.commonprefix', 900000]
['bit_prefix', 900000]
['slice_prefix', 900000]
['index_prefix', 900000]
['zip_prefix', 900000]
['izip_prefix', 900000]
The algorithms
from itertools import izip, count
import os
def bit_prefix(a, b):
    min_len = min(len(a), len(b))
    if min_len > 0:
        # XOR the reversed, hex-decoded strings as big integers; matching
        # prefix bytes end up as trailing zero bits of the result.
        x = bin(
            int(a[::-1].encode('hex'), 16) ^
            int(b[::-1].encode('hex'), 16))
        y = x.strip('0')
        if len(y) == 1:  # XOR is zero: the shorter string is a full prefix
            return min_len
        else:
            return (len(x) - len(y)) / 8
    else:
        return 0
def ospathcommonprefix(a, b):
    return len(os.path.commonprefix((a, b)))

def slice_prefix(a, b, start=0, length=1):
    while 1:
        # Grow the comparison window geometrically while the slices match,
        # then drop back to single characters to pin down the exact position.
        while a[start:start + length] == b[start:start + length]:
            start += length
            length += length
        if length > 1:
            length = 1
        else:
            return start

def index_prefix(a, b):
    i = -1
    min_len = min(len(a), len(b))
    for i in xrange(0, min_len):
        if a[i] != b[i]:
            i -= 1
            break
    return i + 1

def zip_prefix(a, b):
    i = -1
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            i -= 1
            break
    return i + 1

def izip_prefix(a, b):
    i = -1
    for i, x, y in izip(count(), a, b):
        if x != y:
            i -= 1
            break
    return i + 1
While `slice_prefix` is faster than the other solutions, I still need another 50% improvement to remove my bottleneck. Maybe learning to write a Python C extension would work?

Do you think a C extension would give me the 50%+ performance boost I'm looking for? Any other algorithm suggestions? Any optimizations for what I've got here?
A directional variant of slice_prefix
def directional_prefix(a, b, start=0, length=1):
    while 1:
        while a[start:start + length] == b[start:start + length]:
            start = start + length
            length += length
        length = length / 2
        while length > 0 and a[start: start + length] != b[start: start + length]:
            length = length / 2
        if length > 0:
            start = start + length
            length = 1
        else:
            return start
a = 'a' * 1000000
b = 'a' * 900000
i = 0
while 2 ** i < len(b):
    length = 2 ** i
    print length, timeit.timeit(lambda: directional_prefix(a, b, length=length), number=100)
    i += 1
1 0.0380843005421
2 0.0303687541309
4 0.0326917143407
8 0.0325356651429
16 0.0319154189644
32 0.0329461337922
64 0.0315618391367
128 0.0301917666861
256 0.0344714653179
512 0.0325779366976
1024 0.0378156588852
2048 0.0330583311902
4096 0.0388345218751
8192 0.0310024323921
16384 0.0407225196375
32768 0.0332771951284
65536 0.0506670016787
131072 0.0531404802284
262144 0.0809286942078
524288 0.0748048496141
i = 0
while 2 ** i < len(b):
    length = 2 ** i
    print length, timeit.timeit(lambda: slice_prefix(a, b, length=length), number=100)
    i += 1
1 0.041934962645
2 0.0349668721381
4 0.0359395129606
8 0.0346022305948
16 0.0389324970677
32 0.0366881540488
64 0.0372364990776
128 0.0352363039174
256 0.0375355604517
512 0.0350917114961
1024 0.0369461290516
2048 0.034510971444
4096 0.035836797033
8192 0.0356835132641
16384 0.0378843995443
32768 0.0352453903265
65536 0.054919046022
131072 0.0592635347
262144 0.0652649103033
524288 0.0740941344068
3 Answers
`slice_prefix` and `index_prefix` look all right algorithmically, but they impose a lot of interpreter overhead for each and every character. As always in such a situation, you could unlock speed reserves by pushing some of the processing into the engine. In other words, the engine can probably compare hundreds of characters in the time it takes your loops to step through a small handful.
If you cannot find an existing string compare that reports the mismatch position, then you could take a long hard look at the distribution of your input strings and pick an initial block size for comparisons, like 32 or 64. Use `slice_prefix` to find the first mismatching block of that size, and then use block-size halving to find the mismatch within that block. When the block size gets lower than some threshold - something on the order of 8 or 16 - switch to a linear scan. Run copious tests on representative inputs in order to refine the two thresholds (initial block size and linear scan threshold).
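A rough sketch of that scheme might look as follows; the name `block_prefix`, the block size of 64 and the linear-scan threshold of 8 are placeholders to be tuned on representative input, not recommended values:

def block_prefix(a, b, block=64, linear_threshold=8):
    # Sketch only: block and linear_threshold are assumed, untuned defaults.
    n = min(len(a), len(b))
    if a[:n] == b[:n]:            # one string is a complete prefix of the other
        return n
    start = 0
    # Phase 1: march through fixed-size blocks until one of them mismatches.
    while a[start:start + block] == b[start:start + block]:
        start += block
    # Phase 2: the mismatch now lies in [start, start + length); halve the window.
    length = block
    while length > linear_threshold:
        length //= 2
        if a[start:start + length] == b[start:start + length]:
            start += length       # first half matches, so the mismatch is in the second
    # Phase 3: linear scan over the last few characters.
    while a[start] == b[start]:
        start += 1
    return start

The up-front `a[:n] == b[:n]` check covers the case where the shorter string is a complete prefix, which would otherwise let the block loop run past the end of both strings.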
Some additional observations. At this point there are two fruitful avenues: on one hand you could defer to a function written in C, and on the other you could create an uglified (optimised) version of `slice_prefix` that is finely tuned to the specifics of your application, like average prefix length and so on.

Gauging the potential of either approach requires hard data, like the raw overhead for passing two strings to C (which gives you an absolute limit of what could possibly be achieved via that route) and the actual performance of a good library `memcmp()` when called from Python under production conditions on representative inputs.
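For a rough sense of that overhead, here is a hedged sketch that calls the platform's `memcmp()` through ctypes. It assumes a Unix-like libc, and it only reports whether a block differs, not where, so it measures the cost of handing a comparison to C rather than solving the problem outright:

import ctypes
import ctypes.util

# find_library may return None on some platforms; this sketch assumes a Unix-like libc.
_libc = ctypes.CDLL(ctypes.util.find_library('c'))
_libc.memcmp.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t]
_libc.memcmp.restype = ctypes.c_int

def c_equal(a, b, n):
    # True if the first n bytes of a and b are identical. It does not say
    # where the first difference is, which is why plain memcmp is not enough.
    return _libc.memcmp(a, b, n) == 0

Timing `c_equal(a, b, min(len(a), len(b)))` against a pure-Python slice comparison of the same length gives a feel for both the per-call overhead and the raw compare speed.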
There are some promising avenues for tuning `slice_prefix`. The first is a slight tweak of the tail end:
def slice_prefix(a, b, start=0, length=64):
    while 1:
        while a[start:start + length] == b[start:start + length]:
            start += length
            length += length
        if length >= 4:
            length /= 4
        elif length == 1:
            return start
        else:
            length = 1
Try factors between 2 and 8 on representative input data.
Your binary approach (a.k.a. `directional_prefix`) still needs some work, though. The basic idea is this: when the first while loop exits, we know that there must be a mismatch somewhere in the block of length `length` starting at `start`. If it isn't in the first half then it must be in the second half. So if the first half compares equal, you add the current half-length to `start`; if it differs, you don't. Then halve again and continue until the linear threshold is reached, at which point you do a linear scan of the rest.
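Applied to your growing-window variant, a hypothetical rework might look like this; it is the same halving tail as in the earlier sketch, again with an assumed linear threshold of 8:

def directional_prefix2(a, b, start=0, length=1, linear=8):
    # Hypothetical rework, not the original directional_prefix.
    n = min(len(a), len(b))
    if a[:n] == b[:n]:
        return n
    while a[start:start + length] == b[start:start + length]:
        start += length
        length += length
    # The mismatch now lies somewhere in [start, start + length).
    while length > linear:
        length //= 2
        if a[start:start + length] == b[start:start + length]:
            start += length
    while a[start] == b[start]:
        start += 1
    return start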
P.S.: there is no one ring to bind them all, which is why I keep mentioning representative inputs. The `memcmp()` of a mature C runtime can come close to the ideal and typically contains a sh*tload of optimisations for different cases in order to give balanced high performance (note: the standard `memcmp()` is no use here because it doesn't return the mismatch position, but there are variants that do). But the effort of bringing C code into the fold and the complications arising from it are non-negligible, as is the call overhead. Tuning `slice_prefix` to meet your requirements - if that is possible - might be the better solution. It is already in a state of grace as it is, though, and from here on can only come uglification...
- I'm trying to understand your suggestion to use blocks. I think you mean scan the 2 items in blocks of some arbitrary size like 64, then use slice_prefix with the start variable equal to the last matched block. But how do I scan in blocks of 64 in an optimized way? Do you mean add a for loop before 'while 1:' and simply iterate in blocks of 64 to find the last one where both items are equal? – 12345678910111213, Mar 29, 2016 at 5:55
- @mattkaeo: As a first approximation you could use a two-level scheme by simply calling `slice_prefix` with `length` set to something like 32, because the function already does everything for you. It finds the first mismatching block, switches `length` to 1 and then finds the exact mismatch location. It's really neat, actually. – DarthGizka, Mar 29, 2016 at 6:22
- I think I see what you're saying. I've added a directional variant to the question. I saw some gains but also some losses. Can you tell me if you think I missed something? – 12345678910111213, Mar 29, 2016 at 7:17
- @mattkaeo: I've added some more details and explanation. Perhaps you should provisionally un-accept my answer, in order to attract more replies... ;-) – DarthGizka, Mar 29, 2016 at 8:17
- Why start with length=32 and not a fraction of the length of the shorter string? Bisect this problem as much as possible. – IceArdor, Mar 29, 2016 at 8:29
A good approach depends greatly on the distribution of your real cases (the lengths of the prefix, a and b). For instance, when I add

a, b = memoryview(a), memoryview(b)

to `slice_prefix`, I get these results:

len(a) = 1000000, len(b) = 900000
['slice2_prefix', 0.1448716410180293]
['slice_prefix', 0.8104352267954309]
len(a) = 2, len(b) = 1
['slice_prefix', 0.08141100138638435]
['slice2_prefix', 0.21893188399847152]

When a == b, `slice_prefix` doesn't work for me (the empty slices past the end of the strings always compare equal, so it never terminates).

It looks ideal for C extension optimization: the code is easy to move to C, it removes hacky code from the Python side (this is a replacement for os.path.commonprefix), and the time savings would be huge. But maybe try another library first (like numpy)?
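For what it's worth, a minimal NumPy sketch of that idea; the name `numpy_prefix` and the uint8 view are my assumptions, and it is not benchmarked against the results above:

import numpy as np

def numpy_prefix(a, b):
    # View both strings as byte arrays (no copy) and let NumPy find the
    # first position where they disagree.
    n = min(len(a), len(b))
    if n == 0:
        return 0
    x = np.frombuffer(a, dtype=np.uint8, count=n)
    y = np.frombuffer(b, dtype=np.uint8, count=n)
    mismatches = np.nonzero(x != y)[0]
    return int(mismatches[0]) if mismatches.size else n

Note that `x != y` scans the whole shorter string, so any win comes from doing that scan in C rather than from an early exit.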
I still haven't found 50%+ gains, but I did test DarthGizka's halving technique, fixed the issue with identical strings, and was surprised to find large gains on large inputs using `memoryview`. It's probably time to learn to extend Python with C.
Benchmark
import copy
import timeit

algorithms = [
    ['slice_prefix', slice_prefix],
    ['smemory_prefix', smemory_prefix],
    ['dmemory_prefix', dmemory_prefix],
    ['darth_prefix', darth_prefix]]
for i in xrange(25):
    a = 'a' * 2 ** i + 'z'
    b = 'a' * 2 ** i + 'y'
    times = copy.deepcopy(algorithms)
    for j, algorithm in enumerate(times):
        times[j][1] = timeit.timeit(lambda: algorithm[1](a, b), number=10000)
    print 2 ** i
    for result in sorted(times, key=lambda x: x[-1]):
        print ' ', result
1
['darth_prefix', 0.008644335433928063]
['slice_prefix', 0.011235937300625665]
['dmemory_prefix', 0.027204313437323435]
['smemory_prefix', 0.02916026173534192]
2
['slice_prefix', 0.0119423068335891]
['darth_prefix', 0.012402158140503161]
['smemory_prefix', 0.043039553928792884]
['dmemory_prefix', 0.04372577533740696]
4
['slice_prefix', 0.014079983313422417]
['darth_prefix', 0.014680476427201938]
['dmemory_prefix', 0.05297890017300233]
['smemory_prefix', 0.0532965294260066]
8
['slice_prefix', 0.01648985699921468]
['darth_prefix', 0.019445310286755557]
['smemory_prefix', 0.06081533533051697]
['dmemory_prefix', 0.07198924801286921]
16
['slice_prefix', 0.019568569399780245]
['darth_prefix', 0.022567084364709444]
['smemory_prefix', 0.06823577097929956]
['dmemory_prefix', 0.0775797599053476]
32
['slice_prefix', 0.02139256723876315]
['darth_prefix', 0.02606219133394916]
['smemory_prefix', 0.07920267156259797]
['dmemory_prefix', 0.09199277986044763]
64
['slice_prefix', 0.026915523655588913]
['darth_prefix', 0.03019137162482366]
['smemory_prefix', 0.08551061470370769]
['dmemory_prefix', 0.10126170714647742]
128
['slice_prefix', 0.02780757198070205]
['darth_prefix', 0.034469885073121986]
['smemory_prefix', 0.09218596481696295]
['dmemory_prefix', 0.11656005938493763]
256
['slice_prefix', 0.030256951794399356]
['darth_prefix', 0.03702000550674711]
['smemory_prefix', 0.10429222207312705]
['dmemory_prefix', 0.12654602285874716]
512
['slice_prefix', 0.03457813185832492]
['darth_prefix', 0.04374592346175632]
['smemory_prefix', 0.1124016445601228]
['dmemory_prefix', 0.14454501387263008]
1024
['slice_prefix', 0.040601630892524554]
['darth_prefix', 0.050192138043712475]
['smemory_prefix', 0.12381456930688728]
['dmemory_prefix', 0.15493986575074814]
2048
['slice_prefix', 0.04844004135520663]
['darth_prefix', 0.060962693179135385]
['smemory_prefix', 0.13431885315276304]
['dmemory_prefix', 0.17624868000166316]
4096
['slice_prefix', 0.05967676877753547]
['darth_prefix', 0.07432366499870113]
['smemory_prefix', 0.14581316051771864]
['dmemory_prefix', 0.18878831946130958]
8192
['slice_prefix', 0.07948672060865647]
['darth_prefix', 0.09363346927420935]
['smemory_prefix', 0.16827436846506316]
['dmemory_prefix', 0.21853485210704093]
16384
['slice_prefix', 0.11653991126149776]
['darth_prefix', 0.12642908472662384]
['smemory_prefix', 0.20402701744933438]
['dmemory_prefix', 0.25172552881849697]
32768
['slice_prefix', 0.17343268333934247]
['darth_prefix', 0.1912483659280042]
['smemory_prefix', 0.25608661007026967]
['dmemory_prefix', 0.308776720462447]
65536
['slice_prefix', 0.2999407803172289]
['darth_prefix', 0.30973829956928967]
['smemory_prefix', 0.3549724187987522]
['dmemory_prefix', 0.4129709673425168]
131072
['slice_prefix', 0.5387379293970298]
['smemory_prefix', 0.5455981681798221]
['darth_prefix', 0.555022354540597]
['dmemory_prefix', 0.6017983978681514]
262144
['smemory_prefix', 0.9152490880996993]
['dmemory_prefix', 0.990508653224424]
['slice_prefix', 1.0029807372084178]
['darth_prefix', 1.0165337088001252]
524288
['smemory_prefix', 1.6633493372646626]
['dmemory_prefix', 1.8189718688727226]
['slice_prefix', 1.9119993141739542]
['darth_prefix', 2.253921279302631]
1048576
['dmemory_prefix', 3.3815675477717377]
['smemory_prefix', 3.852834939850254]
['darth_prefix', 9.262993861143514]
['slice_prefix', 10.2719803196278]
2097152
['smemory_prefix', 6.760387839540272]
['dmemory_prefix', 6.84966378311492]
['slice_prefix', 23.33799268583607]
['darth_prefix', 24.301179692429287]
4194304
['dmemory_prefix', 14.16732977699212]
['smemory_prefix', 14.208940789403641]
['darth_prefix', 45.43554476774898]
['slice_prefix', 48.59348946944465]
8388608
['dmemory_prefix', 28.564412565634484]
['smemory_prefix', 28.571759914951144]
['darth_prefix', 93.14922846511217]
['slice_prefix', 97.82516800967824]
16777216
['smemory_prefix', 57.030781198086515]
['dmemory_prefix', 61.980545998364505]
['slice_prefix', 206.76891168128986]
['darth_prefix', 207.6305335736888]
Functions
def slice_prefix(a, b, start=0, length=1):
    if a == b:
        return len(a)
    while 1:
        while a[start:start + length] == b[start:start + length]:
            start += length
            length += length
        if length > 1:
            length = 1
        else:
            return start

def smemory_prefix(a, b, start=0, length=1):
    if a == b:
        return len(a)
    while 1:
        # memoryview slices compare without copying the underlying bytes
        while (memoryview(a)[start: start + length] ==
               memoryview(b)[start: start + length]):
            start += length
            length += length
        if length > 1:
            length = 1
        else:
            return start

def dmemory_prefix(a, b, start=0, length=1):
    if a == b:
        return len(a)
    while 1:
        while (memoryview(a)[start: start + length] ==
               memoryview(b)[start: start + length]):
            start += length
            length += length
        if length >= 4:
            length /= 4
        elif length == 1:
            return start
        else:
            length = 1

def darth_prefix(a, b, start=0, length=1):
    if a == b:
        return len(a)
    while 1:
        while a[start:start + length] == b[start:start + length]:
            start += length
            length += length
        if length >= 4:
            length /= 4
        elif length == 1:
            return start
        else:
            length = 1
- Regarding `slice_prefix`: I'd try replacing `if length > 1: length = 1 ...` with `if length > 1: length = length / 2 ...` – when running ahead you double the step length each time, so why not shorten it by half when slowing down? Reducing from hundreds or thousands straight back to 1 seems like too strong a reduction.