I would like feedback on my first attempt to understand how hash tables work. I've read some on how hash tables use modulo operations to find the location of a key.
Can you help me understand, compare the modulo approach with my attempt to use a tree?
import random
import hashlib
from collections import defaultdict
import timeit
def insert(key, value, table=[]):
alphabet = lambda: [[] for i in xrange(23)]
if not table:
table = alphabet()
get_hash = lambda x: hashlib.sha1(str(x)).hexdigest().upper()
key_hash = get_hash(key)
entry = table
for i, char in enumerate(key_hash):
ordinal = ord(char) - 48
entry = entry[ordinal]
#entry is a list representing 'char' at index 'i'
entry_len = len(entry)
if entry_len is 0:
entry.extend((key, [value]))
break
elif entry_len is 2:
extant_key, extant_values = entry[0], entry[1]
if key == extant_key:
extant_values.append(value)
break
del entry[:]
#entry = []
extant_hash = get_hash(extant_key)
for i in xrange(i + 1, len(key_hash)):
char, extant_char = key_hash[i], extant_hash[i]
if char == extant_char:
entry.extend(alphabet())
ordinal = ord(char) - 48
entry = entry[ordinal]
#entry = []
else: break
if char != extant_char:
entry.extend(alphabet())
char_ord = ord(char) - 48
entry[char_ord].extend((key, [value]))
extant_ord = ord(extant_char) - 48
entry[extant_ord].extend((extant_key, extant_values))
#entry has empty lists except 2
break
else: pass
#hash collision, to be continued
return table
def get(key, table, pop=False):
key_hash = hashlib.sha1(str(key)).hexdigest().upper()
entry = table
for i, char in enumerate(key_hash):
ordinal = ord(char) - 48
entry = entry[ordinal]
entry_len = len(entry)
if entry_len is 0:
raise KeyError(key)
elif entry_len is 2:
items = entry[1]
if pop:
del entry[:]
if not any(item for item in entry):
del entry
return items
raise KeyError(key)
def make_defaultdict(kv):
d = defaultdict(list)
for items in kv:
k, v = items
d[k].append(v)
return d
def make_my_table(kv):
t = []
for items in kv:
k, v = items
t = insert(k, v, t)
return t
kv = [(random.randint(1000, 9999), random.randint(1000, 9999))
for i in xrange(1000000)]
dd = make_defaultdict(kv)
mt = make_my_table(kv)
validate = []
for key in dd.keys():
if dd[key] == get(key, mt):
validate.append(True)
else:
validate.append(False)
defaultdict_time = timeit.timeit(lambda: make_defaultdict(kv), number=1)
my_table_time = timeit.timeit(lambda: make_my_table(kv), number=1)
print 'Time to build defaultdict', defaultdict_time
print 'Time to build my table', my_table_time
print 'Both tables are identical', all(i for i in validate)
Time to build defaultdict 0.342477702698
Time to build my table 4.21419315689
Both tables are identical True
1 Answer 1
I wanted to know how your program worked and so I removed all the module level checking code and used the following:
table = []
def check(key, value):
global table
table = insert(key, value, table)
print table
print get(key, table)
check(5, 'value')
check('key', 'value2')
check('key', 'value3')
check('key2', 'value4')
This gave me a somewhat strange output.
[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [5, ['value']], [], [], [], [], []]
['value']
[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [[], [], [], [], [], [], ['key', ['value2']], [], [], [], [], [], [], [], [], [], [], [], [], [5, ['value']], [], [], []], [], [], [], [], []]
['value2']
[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [[], [], [], [], [], [], ['key', ['value2', 'value3']], [], [], [], [], [], [], [], [], [], [], [], [], [5, ['value']], [], [], []], [], [], [], [], []]
['value2', 'value3']
[[], [], [], [], [], [], [], [], ['key2', ['value4']], [], [], [], [], [], [], [], [], [[], [], [], [], [], [], ['key', ['value2', 'value3']], [], [], [], [], [], [], [], [], [], [], [], [], [5, ['value']], [], [], []], [], [], [], [], []]
['value4']
The keys 5
and key
collide and so the results should be a list of two items, but you spam it with lots of items.
This can be due to two things:
- A bad algorithm that spams
list.append
- Your algorithm being a 'hash-tree' not hash-table.
From the looks of your algorithm it's probably the latter. This is as you do:
entry = table
for i, char in enumerate(key_hash):
ordinal = ord(char) - 48
entry = entry[ordinal]
This is fine and all, but then it's not a hash-table.
A hash-table should be a 2D-table of items. Where items here is just a [key, value]
pair.
If you change your code to a hash-table then it'd be so much more simpler. The algorithm is hash key, modulo it, append to table:
def add(key, value, table):
hashed_key = hash(key) % len(table)
table[hased_key].append([key, value])
This should bring up some concerns:
- What if the size of the table changes?
- What if the key is already in the table?
- What if the key is already in the table and the size of the table changes?
These can be resolved, (1) by using a class, (2) by checking the list.
class Dict(object):
def __init__(self, size=8192):
self._table = [[] for _ in xrange(size)]
self._mod = size
def _get_item_index(self, key):
hashed_key = hash(key) % self._mod
list_ = self._table[hashed_key]
return list_, next((i for i in xrange(len(list_)) if list_[i][0] == key), None)
This then makes insert
and get
easy to make.
def insert(self, key, value):
list_, index = self._get_item_index(key)
if index is None:
list_.append([key, [value]])
else:
list_[index][1].append(value)
def get(self, key):
list_, index = self._get_item_index(key)
if index is None:
raise KeyError(key)
return list_[index][1]
Your method seems alright, it's hard to read, and is way more complex then it needs to be, so I'd recommend at least making a _get_item_index
if you do carry on with your hash tree.
But it seems faster than my one.
For in-depth review of the code:
If I were to make _get_item
then I'd look at the get
function first.
I don't think that you should have pop
in that function, and so I'll remove it.
This leaves you with:
def get(key, table):
key_hash = hashlib.sha1(str(key)).hexdigest().upper()
entry = table
for i, char in enumerate(key_hash):
ordinal = ord(char) - 48
entry = entry[ordinal]
entry_len = len(entry)
if entry_len is 0:
raise KeyError(key)
elif entry_len is 2:
return entry[1]
raise KeyError(key)
From this we can see the basic looping structure. I'd first merge ordinal with entry, to get entry = entry[ord(char) - 48]
.
You shouldn't use 1 is 1
instead use 1 == 1
, it makes more sense as 1 == 1.0
is true where 1 is 1.0
is not.
As we need to get the item, entry
, even if it's not there (we'll be using this for update
too), we should change the if
s to just return the entry.
Finally you should notice that entry_len == 0
is the same as not entry
.
Merging these all together gets us:
def _get_item(key, table):
key_hash = hashlib.sha1(str(key)).hexdigest().upper()
entry = table
for i, char in enumerate(key_hash):
entry = entry[ord(char) - 48]
if not entry:
return entry
elif len(entry) == 2:
return entry
else:
raise KeyError(key)
To use it is simple:
def get(key, table):
item = _get_item(key)
if item and item[0] == key:
return item[1]
raise KeyError(key)
We still need to get _get_item
working for update
however.
You should be able to see that the code is roughly the same for all except elif entry_len is 2:
.
To help with this we should break out of the loop so that we're in a 'separate environment'.
To make things simpler I'd also return if key == entry[0]
.
And so:
elif len(entry) == 2:
if key == entry[0]:
return entry
break
You should notice that we don't want the new code to run if we are using the other version of this function. To do this we can pass a flag that returns the entry after exiting the for.
if extend:
return entry
Finally we're left with:
extant_key, extant_values = entry[0], entry[1]
del entry[:]
#entry = []
extant_hash = get_hash(extant_key)
for i in xrange(i + 1, len(key_hash)):
char, extant_char = key_hash[i], extant_hash[i]
if char == extant_char:
entry.extend(alphabet())
ordinal = ord(char) - 48
entry = entry[ordinal]
#entry = []
else: break
if char != extant_char:
entry.extend(alphabet())
char_ord = ord(char) - 48
entry[char_ord].extend((key, [value]))
extant_ord = ord(extant_char) - 48
entry[extant_ord].extend((extant_key, extant_values))
#entry has empty lists except 2
break
else: pass
#hash collision, to be continued
This leads to the following changes:
- Using both
extant_key
andextant_value
is more verbose than I like, and would just useextant
with the sliceitem[0:2]
. Use
izip
rather thanxrange(i + 1, len(key_hash))
. This is quite simple addfrom itertools import izip
at the top, and then use:for char, extant_char in izip(key_hash[i + 1:], extant_hash[i + 1:]):
Use
item[ord(...) - 48]
rather than setting intermarry variables.- Raise an error if you don't break out the for loop.
- Change the if to
char != extant_char
and then break. This adds a little bit of readability, and prevents large amounts of indents. - Finally
return entry[ord(char) - 48]
.
This should lead to something like:
extant = entry[0:2]
del entry[:]
extant_hash = get_hash(extant[0])
for char, extant_char in izip(key_hash[i + 1:], extant_hash[i + 1:]):
if char != extant_char:
break
entry.extend(alphabet())
entry = entry[ord(char) - 48]
else:
raise KeyError(key)
entry.extend(alphabet())
entry[ord(extant_char) - 48].extend(extant)
return entry[ord(char) - 48]
Finally I should note that the two KeyError
s raised in this function, aren't really errors with the key, it's actually that there is actually a sha1 collision.
And so I'd change the errors to custom errors.
What is the likelihood of these? Not very likely but my favorite programmers.SE answer makes me think that ruling this out is not a good idea.