Here is an implementation of the cryptographic hash function SHA1 written in Python. It does not use any external libraries, only built-in functions. I know that it would be faster to use an external library, but my code is for learning purposes.
I want to know if I am properly implementing the algorithm. Am I taking any unnecessary steps? What changes can I make to speed up the code?
The code works correctly; I've compared its output against a trusted implementation.
    def sha1(data):
        bytes = ""
        h0 = 0x67452301
        h1 = 0xEFCDAB89
        h2 = 0x98BADCFE
        h3 = 0x10325476
        h4 = 0xC3D2E1F0

        for n in range(len(data)):
            bytes += '{0:08b}'.format(ord(data[n]))
        bits = bytes + "1"
        pBits = bits
        #pad until length equals 448 mod 512
        while len(pBits) % 512 != 448:
            pBits += "0"
        #append the original length
        pBits += '{0:064b}'.format(len(bits) - 1)

        def chunks(l, n):
            return [l[i:i+n] for i in range(0, len(l), n)]

        def rol(n, b):
            return ((n << b) | (n >> (32 - b))) & 0xffffffff

        for c in chunks(pBits, 512):
            words = chunks(c, 32)
            w = [0] * 80
            for n in range(0, 16):
                w[n] = int(words[n], 2)
            for i in range(16, 80):
                w[i] = rol((w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]), 1)

            a = h0
            b = h1
            c = h2
            d = h3
            e = h4

            #Main loop
            for i in range(0, 80):
                if 0 <= i <= 19:
                    f = (b & c) | ((~b) & d)
                    k = 0x5A827999
                elif 20 <= i <= 39:
                    f = b ^ c ^ d
                    k = 0x6ED9EBA1
                elif 40 <= i <= 59:
                    f = (b & c) | (b & d) | (c & d)
                    k = 0x8F1BBCDC
                elif 60 <= i <= 79:
                    f = b ^ c ^ d
                    k = 0xCA62C1D6

                temp = rol(a, 5) + f + e + k + w[i] & 0xffffffff
                e = d
                d = c
                c = rol(b, 30)
                b = a
                a = temp

            h0 = h0 + a & 0xffffffff
            h1 = h1 + b & 0xffffffff
            h2 = h2 + c & 0xffffffff
            h3 = h3 + d & 0xffffffff
            h4 = h4 + e & 0xffffffff

        return '%08x%08x%08x%08x%08x' % (h0, h1, h2, h3, h4)

    print sha1("hello world")
- docs.python.org/2/library/hashlib.html (Alex Buchanan, Dec 18, 2013 at 2:47)
- @AlexBuchanan I said that my code was not for practical use. (kyle k, Dec 18, 2013 at 5:09)
- That's what the reinventing-the-wheel tag is for. I've added it for you. (200_success, Dec 18, 2013 at 9:51)
- Ah, my bad, I must have skimmed over that line. (Alex Buchanan, Dec 18, 2013 at 17:26)
3 Answers
I'm no expert on SHA1; I see mostly small things, but one big thing gave me pause. As an overall observation, you seem to value readability over memory usage. Since hashes are often computed over large files, this is a risky tradeoff, and Python provides good ways to handle it. More on this in the third observation below. On to the observations:
- pBits being represented as a series of characters for each bit seems unusual for such low-level work. I would typically expect to see this as a hexadecimal string, or even a number. Yet except for possible difficulties comparing it to other implementations that favor less space usage, this does make things easy to read.

- Filling out pBits to lengths congruent to 448 mod 512 could be done in a single step by figuring out how many bits of padding are necessary and doing a single pBits += "0" * padding; there's no need for a while loop.

- This one's important: chunks() should probably be a generator so that you don't have three copies of your data around (one original, one expanded to a byte per bit, and one in chunked strings); correspondingly it could serve out the trailing zero and length bits. Let's examine what this would look like, as 3 copies of large data is really large. Removing the sliced copies and creating only one chunk at a time:

      def iterchunks(bits, chunkbits=512):
          for i in range(0, len(bits), chunkbits):
              yield bits[i:i + chunkbits]

  Note that if the pBits generation happened in here as well (checking i + chunkbits against len(bits)), along with the bytes-to-bits conversion, you could process a file-like stream instead of a string. That takes you further down: from the original 3 copies (two of them 8 times larger than the data) to not even a full copy of your data in memory at one time. A rough sketch combining this with the single-step padding appears after this list.

- Back to small ideas. Your bit string could be more literally a bit string, storing \0 and \1 characters. This would let you simplify setting up w into w = map(ord, words). (Of course, with an appropriate helper you could do the same with the '0' and '1' characters of your current string.)

- I'm not a big fan of the for i in range(0, 80): if 0 <= i <= 19: ... section. It feels like it might read better as a series of for loops that each do their own thing and then call a helper:

      for i in range(0, 20):
          f = (b & c) | ((~b) & d)
          k = 0x5A827999
          a, b, c, d, e = update(a, b, c, d, e, f, k, w[i])
      for i in range(20, 40):
          f = b ^ c ^ d
          k = 0x6ED9EBA1
          a, b, c, d, e = update(a, b, c, d, e, f, k, w[i])
      ...

  but that seems pretty ridiculous, and would probably hurt performance to boot. After looking more closely at the obvious alternative, I'm going to withdraw this complaint.
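Putting the single-step padding and the generator idea together, a rough sketch might look like the following. The name padded_chunks is made up here, and it keeps the question's bit-string representation rather than streaming from a file:

    def padded_chunks(data):
        # Message bits followed by the mandatory '1' bit.
        bits = ''.join('{0:08b}'.format(ord(ch)) for ch in data) + '1'
        # Single-step padding: append just enough '0' bits to reach 448 mod 512.
        bits += '0' * ((448 - len(bits)) % 512)
        # Append the original message length in bits as a 64-bit field.
        bits += '{0:064b}'.format(8 * len(data))
        # Serve out one 512-bit chunk at a time instead of building a list of slices.
        for i in range(0, len(bits), 512):
            yield bits[i:i + 512]

The main loop would then start with for c in padded_chunks(data): and the separate pBits bookkeeping goes away. It still builds the full bit string once, though; true streaming, as described in the note above, would also move the bytes-to-bits conversion inside the loop and read the input in blocks.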
Serializing the data into a string of '1' and '0' characters, then parsing the string back into numbers, seems incredibly wasteful. You should aim to do the whole thing without stringifying the numbers. I've done it using struct.pack() and struct.unpack(), though you could also use basic functions like ord() with more bit manipulation.
I've kept everything starting from the main loop about the same. The minor changes I've made are to use parallel assignments and inclusive-exclusive ranges, both of which are in the spirit of Python.
    from struct import pack, unpack

    def sha1(data):
        """ Returns the SHA1 sum as a 40-character hex string """
        h0 = 0x67452301
        h1 = 0xEFCDAB89
        h2 = 0x98BADCFE
        h3 = 0x10325476
        h4 = 0xC3D2E1F0

        def rol(n, b):
            return ((n << b) | (n >> (32 - b))) & 0xffffffff

        # After the data, append a '1' bit, then pad data to a multiple of 64 bytes
        # (512 bits). The last 64 bits must contain the length of the original
        # string in bits, so leave room for that (adding a whole padding block if
        # necessary).
        padding = chr(128) + chr(0) * (55 - len(data) % 64)
        if len(data) % 64 > 55:
            padding += chr(0) * (64 + 55 - len(data) % 64)
        padded_data = data + padding + pack('>Q', 8 * len(data))

        thunks = [padded_data[i:i+64] for i in range(0, len(padded_data), 64)]
        for thunk in thunks:
            w = list(unpack('>16L', thunk)) + [0] * 64
            for i in range(16, 80):
                w[i] = rol((w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]), 1)

            a, b, c, d, e = h0, h1, h2, h3, h4

            # Main loop
            for i in range(0, 80):
                if 0 <= i < 20:
                    f = (b & c) | ((~b) & d)
                    k = 0x5A827999
                elif 20 <= i < 40:
                    f = b ^ c ^ d
                    k = 0x6ED9EBA1
                elif 40 <= i < 60:
                    f = (b & c) | (b & d) | (c & d)
                    k = 0x8F1BBCDC
                elif 60 <= i < 80:
                    f = b ^ c ^ d
                    k = 0xCA62C1D6

                a, b, c, d, e = rol(a, 5) + f + e + k + w[i] & 0xffffffff, \
                                a, rol(b, 30), c, d

            h0 = h0 + a & 0xffffffff
            h1 = h1 + b & 0xffffffff
            h2 = h2 + c & 0xffffffff
            h3 = h3 + d & 0xffffffff
            h4 = h4 + e & 0xffffffff

        return '%08x%08x%08x%08x%08x' % (h0, h1, h2, h3, h4)
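As a quick sanity check of this rewrite (my addition, not part of the answer), the output can be compared against the standard library; this assumes Python 2, which the code above is written for:

    import hashlib

    print sha1("hello world")
    print hashlib.sha1("hello world").hexdigest()  # the two lines should print the same digest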
- You could avoid the if in the calculation of the padding if you wrote 63 - (len(data) + 8) % 64. (Gareth Rees, Dec 18, 2013 at 12:14)
- FYI, the same algorithm using the hashlib.sha1 streaming interface (.update and .digest methods): github.com/pts/tinyveracrypt/blob/… (pts, Oct 31, 2019 at 20:06)
There's no documentation. What does this code do and how am I supposed to use it?
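Even a one-line docstring would answer both questions; the wording below is my own sketch, not something from this answer:

    def sha1(data):
        """Return the SHA-1 digest of the byte string data as a
        40-character hex string."""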
There are no test cases. Since the hashlib module has a sha1 function, it would be easy to write a test case that generates some random data and hashes it with both algorithms. For example:

    import hashlib
    from random import randrange
    from unittest import TestCase

    class TestSha1(TestCase):
        def test_sha1(self):
            for l in range(129):
                data = bytearray(randrange(256) for _ in range(l))
                self.assertEqual(hashlib.sha1(data).hexdigest(), sha1(data))
You start by converting the message to a string containing its representation in binary. But then later on you convert it back to integers again. It would be simpler and faster to work with bytes throughout. Python provides the bytearray type for manipulating sequences of bytes and the struct module for interpreting sequences of bytes as binary data. So you could initialize and pad the message like this:

    import struct

    message = bytearray(data)
    # Append a 1-bit.
    message.append(0x80)
    # Pad with zeroes until message length in bytes is 56 (mod 64).
    message.extend([0] * (63 - (len(message) + 7) % 64))
    # Append the original length (big-endian, 64 bits).
    message.extend(struct.pack('>Q', len(data) * 8))
and then convert the chunks to 32-bit integers using struct.unpack like this:

    for chunk in range(0, len(message), 64):
        w = list(struct.unpack('>16L', message[chunk:chunk + 64]))
        for i in range(16, 80):
            w.append(rol((w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]), 1))
This runs substantially faster than your code.
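If you want to put a number on that, here is a rough timing sketch using timeit. The module name sha1_compare and the function names sha1_bits and sha1_bytes are placeholders for wherever you save the two versions:

    import timeit

    # Assumes both versions were saved in sha1_compare.py under distinct names
    # (placeholders): sha1_bits is the bit-string version, sha1_bytes the struct one.
    setup = "from sha1_compare import sha1_bits, sha1_bytes; data = 'x' * 100000"
    print timeit.timeit("sha1_bits(data)", setup=setup, number=10)
    print timeit.timeit("sha1_bytes(data)", setup=setup, number=10)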
I would write the conditions in the main loop as 0 <= i < 20, 20 <= i < 40 and so on, to match the half-open intervals that Python uses elsewhere (in list slices, range and so on).

I would replace:

    elif 60 <= i <= 79:

by:

    else:
        assert 60 <= i < 80

to make it clear that this case must match.
You can avoid the need for temp by using a tuple assignment:

    a, b, c, d, e = ((rol(a, 5) + f + e + k + w[i]) & 0xffffffff,
                     a, rol(b, 30), c, d)