Here is an implementation of the cryptographic hash function SHA1 written in Python. It does not use any external libraries, only built-in functions. I know that it would be faster to use an external library, but my code is for learning purposes.
I want to know if I am properly implementing the algorithm. Am I taking any unnecessary steps? What changes can I make to speed up the code?
The code works correctly; I've compared its output against a trusted implementation.
    def sha1(data):
        bytes = ""
        h0 = 0x67452301
        h1 = 0xEFCDAB89
        h2 = 0x98BADCFE
        h3 = 0x10325476
        h4 = 0xC3D2E1F0

        for n in range(len(data)):
            bytes += '{0:08b}'.format(ord(data[n]))
        bits = bytes + "1"
        pBits = bits
        #pad until length equals 448 mod 512
        while len(pBits) % 512 != 448:
            pBits += "0"
        #append the original length
        pBits += '{0:064b}'.format(len(bits) - 1)

        def chunks(l, n):
            return [l[i:i+n] for i in range(0, len(l), n)]

        def rol(n, b):
            return ((n << b) | (n >> (32 - b))) & 0xffffffff

        for c in chunks(pBits, 512):
            words = chunks(c, 32)
            w = [0] * 80
            for n in range(0, 16):
                w[n] = int(words[n], 2)
            for i in range(16, 80):
                w[i] = rol((w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]), 1)

            a = h0
            b = h1
            c = h2
            d = h3
            e = h4

            #Main loop
            for i in range(0, 80):
                if 0 <= i <= 19:
                    f = (b & c) | ((~b) & d)
                    k = 0x5A827999
                elif 20 <= i <= 39:
                    f = b ^ c ^ d
                    k = 0x6ED9EBA1
                elif 40 <= i <= 59:
                    f = (b & c) | (b & d) | (c & d)
                    k = 0x8F1BBCDC
                elif 60 <= i <= 79:
                    f = b ^ c ^ d
                    k = 0xCA62C1D6

                temp = rol(a, 5) + f + e + k + w[i] & 0xffffffff
                e = d
                d = c
                c = rol(b, 30)
                b = a
                a = temp

            h0 = h0 + a & 0xffffffff
            h1 = h1 + b & 0xffffffff
            h2 = h2 + c & 0xffffffff
            h3 = h3 + d & 0xffffffff
            h4 = h4 + e & 0xffffffff

        return '%08x%08x%08x%08x%08x' % (h0, h1, h2, h3, h4)

    print sha1("hello world")
- docs.python.org/2/library/hashlib.html (Alex Buchanan, Dec 18, 2013 at 2:47)
- @AlexBuchanan I said that my code was not for practical use. (kyle k, Dec 18, 2013 at 5:09)
- That's what the reinventing-the-wheel tag is for. I've added it for you. (200_success, Dec 18, 2013 at 9:51)
- Ah, my bad, I must have skimmed over that line. (Alex Buchanan, Dec 18, 2013 at 17:26)
3 Answers
I'm no expert on SHA1; I see mostly small things, but one big thing gave me pause. As an overall observation, you seem to value readability over memory usage. Since hashes are often computed over large files, this is a risky tradeoff, and Python provides good ways to handle it. More on this in the third observation below. On to the observations:
- pBits being represented as a series of characters for each bit seems unusual for such low-level work. I would typically expect to see this as a hexadecimal string, or even a number. Yet except for possible difficulties comparing it to other implementations that favor less space usage, this does make things easy to read.

- Filling out pBits to lengths congruent to 448 mod 512 could be done in a single step by figuring out how many bits of padding are necessary and doing a single pBits += "0" * padding; there's no need for a while loop.

- This one's important: chunks() should probably be a generator so that you don't have three copies of your data around (one original, one expanded to a byte per bit, and one in chunked strings); correspondingly it could serve out the trailing zero and length bits. Let's examine what this would look like, as 3 copies of large data is really large. Removing the sliced copies and creating only one chunk at a time:

      def iterchunks(bits, chunkbits=512):
          for i in range(0, len(bits), chunkbits):
              yield bits[i:i + chunkbits]

  Note that if the pBits generation happened in here as well (checking i + chunkbits against len(bits)), along with the bytes-to-bits conversion, you could process a file-like stream instead of a string. That takes you further down: from the original 3 copies (two of them 8 times larger than the data) to not even a full copy of your data in memory at one time. A rough sketch combining this with the single-step padding appears after this list.

- Back to small ideas. Your bit string could be more literally a bit string, storing \0 and \1 characters. This would let you simplify setting up w into w = map(ord, words). (Of course, with an appropriate helper you could do the same with the '0' and '1' characters of your current string.)

- I'm not a big fan of the for i in range(0, 80): if 0 <= i <= 19: ... section. It feels like it might read better as a series of for loops that each do their own thing and then call a helper:

      for i in range(0, 20):
          f = (b & c) | ((~b) & d)
          k = 0x5A827999
          a, b, c, d, e = update(a, b, c, d, e, f, k, w[i])
      for i in range(20, 40):
          f = b ^ c ^ d
          k = 0x6ED9EBA1
          a, b, c, d, e = update(a, b, c, d, e, f, k, w[i])
      ...

  but that seems pretty ridiculous, and would probably hurt performance to boot. After looking more closely at the obvious alternative, I'm going to withdraw this complaint.
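Putting the single-step padding and the generator idea together, a rough sketch might look like the following. The name padded_chunks is made up here, and it keeps the question's bit-string representation rather than streaming from a file:

    def padded_chunks(data):
        # Message bits followed by the mandatory '1' bit.
        bits = ''.join('{0:08b}'.format(ord(ch)) for ch in data) + '1'
        # Single-step padding: append just enough '0' bits to reach 448 mod 512.
        bits += '0' * ((448 - len(bits)) % 512)
        # Append the original message length in bits as a 64-bit field.
        bits += '{0:064b}'.format(8 * len(data))
        # Serve out one 512-bit chunk at a time instead of building a list of slices.
        for i in range(0, len(bits), 512):
            yield bits[i:i + 512]

The main loop would then start with for c in padded_chunks(data): and the separate pBits bookkeeping goes away. It still builds the full bit string once, though; true streaming, as described in the note above, would also move the bytes-to-bits conversion inside the loop and read the input in blocks.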
Serializing the data into a string of '1' and '0' characters, then parsing the string back into numbers, seems incredibly wasteful. You should aim to do the whole thing without stringifying the numbers. I've done it using struct.pack() and struct.unpack(), though you could also use basic functions like ord() with more bit manipulation.
I've kept everything starting from the main loop about the same. The minor changes I've made are to use parallel assignments and inclusive-exclusive ranges, both of which are in the spirit of Python.
    from struct import pack, unpack

    def sha1(data):
        """ Returns the SHA1 sum as a 40-character hex string """
        h0 = 0x67452301
        h1 = 0xEFCDAB89
        h2 = 0x98BADCFE
        h3 = 0x10325476
        h4 = 0xC3D2E1F0

        def rol(n, b):
            return ((n << b) | (n >> (32 - b))) & 0xffffffff

        # After the data, append a '1' bit, then pad data to a multiple of 64 bytes
        # (512 bits). The last 64 bits must contain the length of the original
        # string in bits, so leave room for that (adding a whole padding block if
        # necessary).
        padding = chr(128) + chr(0) * (55 - len(data) % 64)
        if len(data) % 64 > 55:
            padding += chr(0) * (64 + 55 - len(data) % 64)
        padded_data = data + padding + pack('>Q', 8 * len(data))

        thunks = [padded_data[i:i+64] for i in range(0, len(padded_data), 64)]
        for thunk in thunks:
            w = list(unpack('>16L', thunk)) + [0] * 64
            for i in range(16, 80):
                w[i] = rol((w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]), 1)

            a, b, c, d, e = h0, h1, h2, h3, h4

            # Main loop
            for i in range(0, 80):
                if 0 <= i < 20:
                    f = (b & c) | ((~b) & d)
                    k = 0x5A827999
                elif 20 <= i < 40:
                    f = b ^ c ^ d
                    k = 0x6ED9EBA1
                elif 40 <= i < 60:
                    f = (b & c) | (b & d) | (c & d)
                    k = 0x8F1BBCDC
                elif 60 <= i < 80:
                    f = b ^ c ^ d
                    k = 0xCA62C1D6

                a, b, c, d, e = rol(a, 5) + f + e + k + w[i] & 0xffffffff, \
                                a, rol(b, 30), c, d

            h0 = h0 + a & 0xffffffff
            h1 = h1 + b & 0xffffffff
            h2 = h2 + c & 0xffffffff
            h3 = h3 + d & 0xffffffff
            h4 = h4 + e & 0xffffffff

        return '%08x%08x%08x%08x%08x' % (h0, h1, h2, h3, h4)
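As a quick sanity check of this rewrite (my addition, not part of the answer), the output can be compared against the standard library; this assumes Python 2, which the code above is written for:

    import hashlib

    print sha1("hello world")
    print hashlib.sha1("hello world").hexdigest()  # the two lines should print the same digest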
- You could avoid the if in the calculation of the padding if you wrote 63 - (len(data) + 8) % 64. (Gareth Rees, Dec 18, 2013 at 12:14)
- FYI, the same algorithm using the hashlib.sha1 streaming interface (.update and .digest methods): github.com/pts/tinyveracrypt/blob/… (pts, Oct 31, 2019 at 20:06)
There's no documentation. What does this code do and how am I supposed to use it?
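Even a one-line docstring would answer both questions; the wording below is my own sketch, not something from this answer:

    def sha1(data):
        """Return the SHA-1 digest of the byte string data as a
        40-character hex string."""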
There are no test cases. Since the hashlib module has a sha1 function, it would be easy to write a test case that generates some random data and hashes it with both algorithms. For example:

    import hashlib
    from random import randrange
    from unittest import TestCase

    class TestSha1(TestCase):
        def test_sha1(self):
            for l in range(129):
                data = bytearray(randrange(256) for _ in range(l))
                self.assertEqual(hashlib.sha1(data).hexdigest(), sha1(data))
You start by converting the message to a string containing its representation in binary. But then later on you convert it back to integers again. It would be simpler and faster to work with bytes throughout. Python provides the bytearray type for manipulating sequences of bytes and the struct module for interpreting sequences of bytes as binary data. So you could initialize and pad the message like this:

    import struct

    message = bytearray(data)
    # Append a 1-bit.
    message.append(0x80)
    # Pad with zeroes until message length in bytes is 56 (mod 64).
    message.extend([0] * (63 - (len(message) + 7) % 64))
    # Append the original length (big-endian, 64 bits).
    message.extend(struct.pack('>Q', len(data) * 8))
and then convert the chunks to 32-bit integers using struct.unpack like this:

    for chunk in range(0, len(message), 64):
        w = list(struct.unpack('>16L', message[chunk:chunk + 64]))
        for i in range(16, 80):
            w.append(rol((w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]), 1))
This runs substantially faster than your code.
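If you want to put a number on that, here is a rough timing sketch using timeit. The module name sha1_compare and the function names sha1_bits and sha1_bytes are placeholders for wherever you save the two versions:

    import timeit

    # Assumes both versions were saved in sha1_compare.py under distinct names
    # (placeholders): sha1_bits is the bit-string version, sha1_bytes the struct one.
    setup = "from sha1_compare import sha1_bits, sha1_bytes; data = 'x' * 100000"
    print timeit.timeit("sha1_bits(data)", setup=setup, number=10)
    print timeit.timeit("sha1_bytes(data)", setup=setup, number=10)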
I would write the conditions in the main loop as 0 <= i < 20, 20 <= i < 40 and so on, to match the half-open intervals that Python uses elsewhere (in list slices, range and so on).

I would replace:

    elif 60 <= i <= 79:

by:

    else:
        assert 60 <= i < 80

to make it clear that this case must match.
You can avoid the need for temp by using a tuple assignment:

    a, b, c, d, e = ((rol(a, 5) + f + e + k + w[i]) & 0xffffffff,
                     a, rol(b, 30), c, d)