2
\$\begingroup\$

as I was reading the Git Internals in the Pro Git book, I decided to see what objects I need to create to have the short hash say something funny. To do this, I have written the following code:

messages = ['badcafe','deaddad','badbeef','feedbac']
possible_letters = string.ascii_lowercase + string.ascii_uppercase + ' '
def getHash(data):
 return hashlib.sha1(str("blob " + str(len(data)) + "0円" + data).encode('utf-8')).hexdigest()
def populate(n):
 for data in product(possible_letters, repeat=n):
 s = getHash(''.join(data))[0:7]
 if s in messages:
 print(s,data)

Unfortunately, this is too slow for n> 4. For smaller n, it only finds badcafe. Is there any way to speed this up?

asked Jun 7, 2014 at 17:28
\$\endgroup\$

1 Answer 1

1
\$\begingroup\$

That's a brute force search of hashes for a particular prefix, meaning it's complexity is \$O(2^n)\,ドル where \$n\$ is the number of bits in the prefix. In your case that's \2ドル^{28}\$.

You can't make it faster in terms of complexity or the hash would be broken, but you can increase the constant factor.


Using cProfile I see that out of 30 seconds of runtime the worst offenders are:

8042221 10.763 0.000 24.243 0.000 tmp.py:10(getHash)
 5 5.138 1.028 30.701 6.140 tmp.py:13(populate)
8042222 3.805 0.000 3.805 0.000 {_hashlib.openssl_sha1}
8042221 5.143 0.000 5.143 0.000 {method 'encode' of 'str' objects}
8042221 3.876 0.000 3.876 0.000 {method 'hexdigest' of '_hashlib.HASH' objects}
8042221 1.320 0.000 1.320 0.000 {method 'join' of 'str' objects}

The actual hashing is done in hexdigest and OpenSSL. Those pretty much cannot be reduced, so at best it could be made ~3 times as fast.


In python 2 you can remove the encode invocation. That alone saves a few seconds. A set is faster for in testing, so that's also an easy fix. Take advantage of hashlib.sha1's copy and update by noticing the length stays constant and you are at 2x the speed:

def getHash(data, h):
 h = h.copy() # "fork" the hash
 h.update(data)
 return h.hexdigest()
def populate(n):
 h = hashlib.sha1("blob " + str(n) + "0円") # shared part of the hash
 for data in product(possible_letters, repeat=n):
 s = getHash(''.join(data), h)[0:7]
 if s in messages:
 print(s,data)

On Python 3 you can encode the letters into bytes and avoid encode inside the loop, for similar performance.

You might still make it about 1.5x as fast if you replaced the product -> join algorithm with something faster (bytearray(n) and depth first search?), but that's about it.

answered Jun 7, 2014 at 18:12
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.