Motivation: I have a large number of large read-only dicts with large string keys mapped to tuples of two floats. They are taking up a large amount of heap space.
Proposed solution: an n×3 NumPy array, with the first column being the hash value of the key, sorted by that column. Only contains and getitem ops are required for my needs.
I realize that this solution is \$O(\log n)\$ rather than \$O(1)\$, but I assume this isn't a big problem for my application, as there aren't that many lookups per request and my primary problem is memory constraints.
I'm not sure how well searchsorted performs here, though, or whether I should try to change the stride somehow so that the first column is contiguous, or something like that. I should also probably be using a record array.
import numpy

class FCD(object):
    def __init__(self, d):
        data = []
        for key, value in d.iteritems():
            data.append((hash(key), value[0], value[1]))
        data.sort()
        self.data = numpy.array(data)

    def __getitem__(self, key):
        hkey = hash(key)
        ix = numpy.searchsorted(self.data[:, 0], hkey)
        if ix < len(self.data) and self.data[ix][0] == hkey:
            return self.data[ix][1:]
        else:
            raise KeyError

    def __contains__(self, key):
        hkey = hash(key)
        ix = numpy.searchsorted(self.data[:, 0], hkey)
        return ix < len(self.data) and self.data[ix][0] == hkey
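On the stride question: one variant, just as a sketch (the FCD2 name and the use of items() instead of iteritems() are incidental), is to keep a separate contiguous copy of the hash column, so that searchsorted scans a contiguous 1-D array rather than a strided view of the first column:

    import numpy

    class FCD2(object):
        def __init__(self, d):
            data = sorted((hash(key), value[0], value[1]) for key, value in d.items())
            self.data = numpy.array(data)
            # Contiguous copy of the hash column; searchsorted operates on this.
            self.hashes = numpy.ascontiguousarray(self.data[:, 0])

        def __contains__(self, key):
            hkey = hash(key)
            ix = numpy.searchsorted(self.hashes, hkey)
            return ix < len(self.hashes) and self.hashes[ix] == hkey

        def __getitem__(self, key):
            hkey = hash(key)
            ix = numpy.searchsorted(self.hashes, hkey)
            if ix < len(self.hashes) and self.hashes[ix] == hkey:
                return self.data[ix][1:]
            raise KeyError(key)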
1 Answer
I think this looks pretty efficient. Let me add a few more minor, detailed comments (more than 1 year after the question was asked; I hope they are still useful):
After you convert to a NumPy array, the keys are no longer guaranteed to be unique (if someone re-uses your code in a different context, etc.). This will cause searchsorted to fail. You might want to add a check such as:

    assert len(self.data[:, 0]) == len(np.unique(self.data[:, 0]))

to make sure users and readers of your code know that the first column of your array must be unique.
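As a sketch of where that check could live, using the __init__ from the question (the loop is collapsed into sorted() here, which is incidental):

        def __init__(self, d):
            data = sorted((hash(key), value[0], value[1]) for key, value in d.iteritems())
            self.data = numpy.array(data)
            # Fail loudly at construction time if two keys hashed to the same value.
            assert len(self.data[:, 0]) == len(numpy.unique(self.data[:, 0]))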
I agree that record arrays (a.k.a. structured arrays) would be better. Your code would be more readable, because instead of things like:

    self.data[ix][1:]

you could write:

    self.data[ix]['Float_1'], self.data[ix]['Float_2']

(but please use more descriptive names for your context than 'Float_1' etc.!). Also, and more importantly: without using a structured array, your hash keys will be stored as floats instead of ints. That will take extra memory, and make future calls to searchsorted and __eq__ slightly less efficient.
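As a minimal sketch of the structured-array idea (the field names key_hash, value_a and value_b are placeholders, not names from the original code):

    import numpy as np

    dtype = np.dtype([('key_hash', np.int64),
                      ('value_a', np.float64),
                      ('value_b', np.float64)])

    def build(d):
        # Sort by hash so np.searchsorted can be used on the 'key_hash' field.
        records = sorted((hash(k), v[0], v[1]) for k, v in d.items())
        return np.array(records, dtype=dtype)

    data = build({'spam': (1.0, 2.0), 'eggs': (3.0, 4.0)})
    ix = np.searchsorted(data['key_hash'], hash('spam'))
    print(data[ix]['value_a'], data[ix]['value_b'])  # 1.0 2.0 in this example

Here the hash column stays an int64 field while the two floats keep their own fields, which is the memory layout argued for above.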
Your __contains__ method seems overly stringent to me, which would result in it being slower. Do you need to call searchsorted in __contains__ at all? Wouldn't something like this suffice and be more efficient?

    def __contains__(self, key):
        hkey = hash(key)
        return np.in1d(self.data[:, 0], hkey).any()
You are duplicating the __contains__ check in your __getitem__ code. I'd just say it this way:

    if key in self:
        return self.data[ix][1:]
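Spelled out as a full method (a sketch, assuming numpy is imported as np as suggested in the next point; ix still has to be computed before the value can be returned):

        def __getitem__(self, key):
            if key in self:  # delegates the membership test to __contains__
                hkey = hash(key)
                ix = np.searchsorted(self.data[:, 0], hkey)
                return self.data[ix][1:]
            raise KeyError(key)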
Very minor, but the idiom is always import numpy as np. It's probably best to use that so you can do np.searchsorted etc. instead of typing numpy.searchsorted, as that's what most NumPy users do.
Why not just use a dict here? Can you prove that your solution is more memory efficient? The lookups will certainly be slower.
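For what it's worth, a rough way to compare the two on memory, as a sketch (the dict-side accounting is approximate, and the test data below is made up purely to illustrate the method, not to report results):

    import sys
    import numpy as np

    def dict_bytes(d):
        # Rough estimate: the dict's own table plus each key string, each value
        # tuple, and the float objects inside the tuples.
        total = sys.getsizeof(d)
        for key, value in d.items():
            total += sys.getsizeof(key) + sys.getsizeof(value)
            total += sum(sys.getsizeof(x) for x in value)
        return total

    d = {'some fairly long string key %d' % i: (float(i), i * 2.0) for i in range(10000)}
    arr = np.array(sorted((hash(k), v[0], v[1]) for k, v in d.items()))

    print('dict (approx): %d bytes' % dict_bytes(d))
    print('array:         %d bytes' % arr.nbytes)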