A basic pure Python hash map using a tree

Question 1

I would like feedback on my first attempt to understand how hash tables work. I've read some on how hash tables use modulo operations to find the location of a key.

Can you help me understand, compare the modulo approach with my attempt to use a tree?

import random
import hashlib
from collections import defaultdict
import timeit
def insert(key, value, table=[]):
 alphabet = lambda: [[] for i in xrange(23)]
 if not table:
 table = alphabet()
 get_hash = lambda x: hashlib.sha1(str(x)).hexdigest().upper()
 key_hash = get_hash(key)
 entry = table
 for i, char in enumerate(key_hash):
 ordinal = ord(char) - 48
 entry = entry[ordinal]
 #entry is a list representing 'char' at index 'i'
 entry_len = len(entry)
 if entry_len is 0:
 entry.extend((key, [value]))
 break
 elif entry_len is 2:
 extant_key, extant_values = entry[0], entry[1]
 if key == extant_key:
 extant_values.append(value)
 break
 del entry[:]
 #entry = []
 extant_hash = get_hash(extant_key)
 for i in xrange(i + 1, len(key_hash)):
 char, extant_char = key_hash[i], extant_hash[i]
 if char == extant_char:
 entry.extend(alphabet())
 ordinal = ord(char) - 48
 entry = entry[ordinal]
 #entry = []
 else: break
 if char != extant_char:
 entry.extend(alphabet())
 char_ord = ord(char) - 48
 entry[char_ord].extend((key, [value]))
 extant_ord = ord(extant_char) - 48
 entry[extant_ord].extend((extant_key, extant_values))
 #entry has empty lists except 2
 break
 else: pass
 #hash collision, to be continued
 return table
def get(key, table, pop=False):
 key_hash = hashlib.sha1(str(key)).hexdigest().upper()
 entry = table
 for i, char in enumerate(key_hash):
 ordinal = ord(char) - 48
 entry = entry[ordinal]
 entry_len = len(entry)
 if entry_len is 0:
 raise KeyError(key)
 elif entry_len is 2:
 items = entry[1]
 if pop:
 del entry[:]
 if not any(item for item in entry):
 del entry
 return items
 raise KeyError(key)
def make_defaultdict(kv):
 d = defaultdict(list)
 for items in kv:
 k, v = items
 d[k].append(v)
 return d
def make_my_table(kv):
 t = []
 for items in kv:
 k, v = items
 t = insert(k, v, t)
 return t
kv = [(random.randint(1000, 9999), random.randint(1000, 9999))
 for i in xrange(1000000)]
dd = make_defaultdict(kv)
mt = make_my_table(kv)
validate = []
for key in dd.keys():
 if dd[key] == get(key, mt):
 validate.append(True)
 else:
 validate.append(False)
defaultdict_time = timeit.timeit(lambda: make_defaultdict(kv), number=1)
my_table_time = timeit.timeit(lambda: make_my_table(kv), number=1)
print 'Time to build defaultdict', defaultdict_time
print 'Time to build my table', my_table_time
print 'Both tables are identical', all(i for i in validate)
Time to build defaultdict 0.342477702698
Time to build my table 4.21419315689
Both tables are identical True

Question 2

I wanted to know how your program worked and so I removed all the module level checking code and used the following:

table = []
def check(key, value):
 global table
 table = insert(key, value, table)
 print table
 print get(key, table)
check(5, 'value')
check('key', 'value2')
check('key', 'value3')
check('key2', 'value4')

This gave me a somewhat strange output.

[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [5, ['value']], [], [], [], [], []]
['value']
[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [[], [], [], [], [], [], ['key', ['value2']], [], [], [], [], [], [], [], [], [], [], [], [], [5, ['value']], [], [], []], [], [], [], [], []]
['value2']
[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [[], [], [], [], [], [], ['key', ['value2', 'value3']], [], [], [], [], [], [], [], [], [], [], [], [], [5, ['value']], [], [], []], [], [], [], [], []]
['value2', 'value3']
[[], [], [], [], [], [], [], [], ['key2', ['value4']], [], [], [], [], [], [], [], [], [[], [], [], [], [], [], ['key', ['value2', 'value3']], [], [], [], [], [], [], [], [], [], [], [], [], [5, ['value']], [], [], []], [], [], [], [], []]
['value4']

The keys 5 and key collide and so the results should be a list of two items, but you spam it with lots of items. This can be due to two things:

A bad algorithm that spams list.append
Your algorithm being a 'hash-tree' not hash-table.

From the looks of your algorithm it's probably the latter. This is as you do:

entry = table
for i, char in enumerate(key_hash):
 ordinal = ord(char) - 48
 entry = entry[ordinal]

This is fine and all, but then it's not a hash-table.

A hash-table should be a 2D-table of items. Where items here is just a [key, value] pair.

If you change your code to a hash-table then it'd be so much more simpler. The algorithm is hash key, modulo it, append to table:

def add(key, value, table):
 hashed_key = hash(key) % len(table)
 table[hased_key].append([key, value])

This should bring up some concerns:

What if the size of the table changes?
What if the key is already in the table?
What if the key is already in the table and the size of the table changes?

These can be resolved, (1) by using a class, (2) by checking the list.

class Dict(object):
 def __init__(self, size=8192):
 self._table = [[] for _ in xrange(size)]
 self._mod = size
 def _get_item_index(self, key):
 hashed_key = hash(key) % self._mod
 list_ = self._table[hashed_key]
 return list_, next((i for i in xrange(len(list_)) if list_[i][0] == key), None)

This then makes insert and get easy to make.

def insert(self, key, value):
 list_, index = self._get_item_index(key)
 if index is None:
 list_.append([key, [value]])
 else:
 list_[index][1].append(value)
def get(self, key):
 list_, index = self._get_item_index(key)
 if index is None:
 raise KeyError(key)
 return list_[index][1]

Your method seems alright, it's hard to read, and is way more complex then it needs to be, so I'd recommend at least making a _get_item_index if you do carry on with your hash tree. But it seems faster than my one.

For in-depth review of the code:

If I were to make _get_item then I'd look at the get function first. I don't think that you should have pop in that function, and so I'll remove it. This leaves you with:

def get(key, table):
 key_hash = hashlib.sha1(str(key)).hexdigest().upper()
 entry = table
 for i, char in enumerate(key_hash):
 ordinal = ord(char) - 48
 entry = entry[ordinal]
 entry_len = len(entry)
 if entry_len is 0:
 raise KeyError(key)
 elif entry_len is 2:
 return entry[1]
 raise KeyError(key)

From this we can see the basic looping structure. I'd first merge ordinal with entry, to get entry = entry[ord(char) - 48]. You shouldn't use 1 is 1 instead use 1 == 1, it makes more sense as 1 == 1.0 is true where 1 is 1.0 is not. As we need to get the item, entry, even if it's not there (we'll be using this for update too), we should change the ifs to just return the entry. Finally you should notice that entry_len == 0 is the same as not entry.
Merging these all together gets us:

def _get_item(key, table):
 key_hash = hashlib.sha1(str(key)).hexdigest().upper()
 entry = table
 for i, char in enumerate(key_hash):
 entry = entry[ord(char) - 48]
 if not entry:
 return entry
 elif len(entry) == 2:
 return entry
 else:
 raise KeyError(key)

To use it is simple:

def get(key, table):
 item = _get_item(key)
 if item and item[0] == key:
 return item[1]
 raise KeyError(key)

We still need to get _get_item working for update however.

You should be able to see that the code is roughly the same for all except elif entry_len is 2:. To help with this we should break out of the loop so that we're in a 'separate environment'. To make things simpler I'd also return if key == entry[0]. And so:

elif len(entry) == 2:
 if key == entry[0]:
 return entry
 break

You should notice that we don't want the new code to run if we are using the other version of this function. To do this we can pass a flag that returns the entry after exiting the for.

if extend:
 return entry

Finally we're left with:

extant_key, extant_values = entry[0], entry[1]
del entry[:]
#entry = []
extant_hash = get_hash(extant_key)
for i in xrange(i + 1, len(key_hash)):
 char, extant_char = key_hash[i], extant_hash[i]
 if char == extant_char:
 entry.extend(alphabet())
 ordinal = ord(char) - 48
 entry = entry[ordinal]
 #entry = []
 else: break
if char != extant_char:
 entry.extend(alphabet())
 char_ord = ord(char) - 48
 entry[char_ord].extend((key, [value]))
 extant_ord = ord(extant_char) - 48
 entry[extant_ord].extend((extant_key, extant_values))
 #entry has empty lists except 2
 break
else: pass
 #hash collision, to be continued

This leads to the following changes:

Using both extant_key and extant_value is more verbose than I like, and would just use extant with the slice item[0:2].
Use izip rather than xrange(i + 1, len(key_hash)). This is quite simple add from itertools import izip at the top, and then use:
```
for char, extant_char in izip(key_hash[i + 1:], extant_hash[i + 1:]):
```
Use item[ord(...) - 48] rather than setting intermarry variables.
Raise an error if you don't break out the for loop.
Change the if to char != extant_char and then break. This adds a little bit of readability, and prevents large amounts of indents.
Finally return entry[ord(char) - 48].

This should lead to something like:

extant = entry[0:2]
del entry[:]
extant_hash = get_hash(extant[0])
for char, extant_char in izip(key_hash[i + 1:], extant_hash[i + 1:]):
 if char != extant_char:
 break
 entry.extend(alphabet())
 entry = entry[ord(char) - 48]
else:
 raise KeyError(key)
entry.extend(alphabet())
entry[ord(extant_char) - 48].extend(extant)
return entry[ord(char) - 48]

Finally I should note that the two KeyErrors raised in this function, aren't really errors with the key, it's actually that there is actually a sha1 collision. And so I'd change the errors to custom errors.

What is the likelihood of these? Not very likely but my favorite programmers.SE answer makes me think that ruling this out is not a good idea.

Peilonrayz ♦Peilonrayz 44.4k7 gold badges80 silver badges157 bronze badges · Accepted Answer · 2016-04-14 14:57:21Z

I wanted to know how your program worked and so I removed all the module level checking code and used the following:

table = []
def check(key, value):
 global table
 table = insert(key, value, table)
 print table
 print get(key, table)
check(5, 'value')
check('key', 'value2')
check('key', 'value3')
check('key2', 'value4')

This gave me a somewhat strange output.

[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [5, ['value']], [], [], [], [], []]
['value']
[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [[], [], [], [], [], [], ['key', ['value2']], [], [], [], [], [], [], [], [], [], [], [], [], [5, ['value']], [], [], []], [], [], [], [], []]
['value2']
[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [[], [], [], [], [], [], ['key', ['value2', 'value3']], [], [], [], [], [], [], [], [], [], [], [], [], [5, ['value']], [], [], []], [], [], [], [], []]
['value2', 'value3']
[[], [], [], [], [], [], [], [], ['key2', ['value4']], [], [], [], [], [], [], [], [], [[], [], [], [], [], [], ['key', ['value2', 'value3']], [], [], [], [], [], [], [], [], [], [], [], [], [5, ['value']], [], [], []], [], [], [], [], []]
['value4']

The keys 5 and key collide and so the results should be a list of two items, but you spam it with lots of items. This can be due to two things:

A bad algorithm that spams list.append
Your algorithm being a 'hash-tree' not hash-table.

From the looks of your algorithm it's probably the latter. This is as you do:

entry = table
for i, char in enumerate(key_hash):
 ordinal = ord(char) - 48
 entry = entry[ordinal]

This is fine and all, but then it's not a hash-table.

A hash-table should be a 2D-table of items. Where items here is just a [key, value] pair.

If you change your code to a hash-table then it'd be so much more simpler. The algorithm is hash key, modulo it, append to table:

def add(key, value, table):
 hashed_key = hash(key) % len(table)
 table[hased_key].append([key, value])

This should bring up some concerns:

What if the size of the table changes?
What if the key is already in the table?
What if the key is already in the table and the size of the table changes?

These can be resolved, (1) by using a class, (2) by checking the list.

class Dict(object):
 def __init__(self, size=8192):
 self._table = [[] for _ in xrange(size)]
 self._mod = size
 def _get_item_index(self, key):
 hashed_key = hash(key) % self._mod
 list_ = self._table[hashed_key]
 return list_, next((i for i in xrange(len(list_)) if list_[i][0] == key), None)

This then makes insert and get easy to make.

def insert(self, key, value):
 list_, index = self._get_item_index(key)
 if index is None:
 list_.append([key, [value]])
 else:
 list_[index][1].append(value)
def get(self, key):
 list_, index = self._get_item_index(key)
 if index is None:
 raise KeyError(key)
 return list_[index][1]

Your method seems alright, it's hard to read, and is way more complex then it needs to be, so I'd recommend at least making a _get_item_index if you do carry on with your hash tree. But it seems faster than my one.

For in-depth review of the code:

If I were to make _get_item then I'd look at the get function first. I don't think that you should have pop in that function, and so I'll remove it. This leaves you with:

def get(key, table):
 key_hash = hashlib.sha1(str(key)).hexdigest().upper()
 entry = table
 for i, char in enumerate(key_hash):
 ordinal = ord(char) - 48
 entry = entry[ordinal]
 entry_len = len(entry)
 if entry_len is 0:
 raise KeyError(key)
 elif entry_len is 2:
 return entry[1]
 raise KeyError(key)

From this we can see the basic looping structure. I'd first merge ordinal with entry, to get entry = entry[ord(char) - 48]. You shouldn't use 1 is 1 instead use 1 == 1, it makes more sense as 1 == 1.0 is true where 1 is 1.0 is not. As we need to get the item, entry, even if it's not there (we'll be using this for update too), we should change the ifs to just return the entry. Finally you should notice that entry_len == 0 is the same as not entry.
Merging these all together gets us:

def _get_item(key, table):
 key_hash = hashlib.sha1(str(key)).hexdigest().upper()
 entry = table
 for i, char in enumerate(key_hash):
 entry = entry[ord(char) - 48]
 if not entry:
 return entry
 elif len(entry) == 2:
 return entry
 else:
 raise KeyError(key)

To use it is simple:

def get(key, table):
 item = _get_item(key)
 if item and item[0] == key:
 return item[1]
 raise KeyError(key)

We still need to get _get_item working for update however.

You should be able to see that the code is roughly the same for all except elif entry_len is 2:. To help with this we should break out of the loop so that we're in a 'separate environment'. To make things simpler I'd also return if key == entry[0]. And so:

elif len(entry) == 2:
 if key == entry[0]:
 return entry
 break

You should notice that we don't want the new code to run if we are using the other version of this function. To do this we can pass a flag that returns the entry after exiting the for.

if extend:
 return entry

Finally we're left with:

extant_key, extant_values = entry[0], entry[1]
del entry[:]
#entry = []
extant_hash = get_hash(extant_key)
for i in xrange(i + 1, len(key_hash)):
 char, extant_char = key_hash[i], extant_hash[i]
 if char == extant_char:
 entry.extend(alphabet())
 ordinal = ord(char) - 48
 entry = entry[ordinal]
 #entry = []
 else: break
if char != extant_char:
 entry.extend(alphabet())
 char_ord = ord(char) - 48
 entry[char_ord].extend((key, [value]))
 extant_ord = ord(extant_char) - 48
 entry[extant_ord].extend((extant_key, extant_values))
 #entry has empty lists except 2
 break
else: pass
 #hash collision, to be continued

This leads to the following changes:

Using both extant_key and extant_value is more verbose than I like, and would just use extant with the slice item[0:2].
Use izip rather than xrange(i + 1, len(key_hash)). This is quite simple add from itertools import izip at the top, and then use:
```
for char, extant_char in izip(key_hash[i + 1:], extant_hash[i + 1:]):
```
Use item[ord(...) - 48] rather than setting intermarry variables.
Raise an error if you don't break out the for loop.
Change the if to char != extant_char and then break. This adds a little bit of readability, and prevents large amounts of indents.
Finally return entry[ord(char) - 48].

This should lead to something like:

extant = entry[0:2]
del entry[:]
extant_hash = get_hash(extant[0])
for char, extant_char in izip(key_hash[i + 1:], extant_hash[i + 1:]):
 if char != extant_char:
 break
 entry.extend(alphabet())
 entry = entry[ord(char) - 48]
else:
 raise KeyError(key)
entry.extend(alphabet())
entry[ord(extant_char) - 48].extend(extant)
return entry[ord(char) - 48]

Finally I should note that the two KeyErrors raised in this function, aren't really errors with the key, it's actually that there is actually a sha1 collision. And so I'd change the errors to custom errors.

What is the likelihood of these? Not very likely but my favorite programmers.SE answer makes me think that ruling this out is not a good idea.

Stack Exchange Network

A basic pure Python hash map using a tree

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

A basic pure Python hash map using a tree

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions