Motivation: I have a large number of large read-only dicts with large string keys mapped to tuples of two floats. They are taking up a large amount of heap space.
Proposed solution: an n×3 NumPy array, with the first column being the hash value of the key, sorted by that column. Only contains and getitem ops are required for my needs.
I realize that this solution is \$O(\log n)\$ rather than \$O(1)\$, but I assume this isn't a big problem for my application, as there aren't that many lookups per request and my primary problem is memory constraints.
I'm not sure how well searchsorted performs here, though, or whether I should try to change the stride somehow so that the first column is contiguous, or something like that. I should also probably be using a record array.
import numpy

class FCD(object):
    def __init__(self, d):
        data = []
        for key, value in d.iteritems():
            data.append((hash(key), value[0], value[1]))
        data.sort()
        self.data = numpy.array(data)

    def __getitem__(self, key):
        hkey = hash(key)
        ix = numpy.searchsorted(self.data[:, 0], hkey)
        if ix < len(self.data) and self.data[ix][0] == hkey:
            return self.data[ix][1:]
        else:
            raise KeyError

    def __contains__(self, key):
        hkey = hash(key)
        ix = numpy.searchsorted(self.data[:, 0], hkey)
        return ix < len(self.data) and self.data[ix][0] == hkey
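On the stride question: one variant, just as a sketch (the FCD2 name and the use of items() instead of iteritems() are incidental), is to keep a separate contiguous copy of the hash column, so that searchsorted scans a contiguous 1-D array rather than a strided view of the first column:

    import numpy

    class FCD2(object):
        def __init__(self, d):
            data = sorted((hash(key), value[0], value[1]) for key, value in d.items())
            self.data = numpy.array(data)
            # Contiguous copy of the hash column; searchsorted operates on this.
            self.hashes = numpy.ascontiguousarray(self.data[:, 0])

        def __contains__(self, key):
            hkey = hash(key)
            ix = numpy.searchsorted(self.hashes, hkey)
            return ix < len(self.hashes) and self.hashes[ix] == hkey

        def __getitem__(self, key):
            hkey = hash(key)
            ix = numpy.searchsorted(self.hashes, hkey)
            if ix < len(self.hashes) and self.hashes[ix] == hkey:
                return self.data[ix][1:]
            raise KeyError(key)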
1 Answer
I think this looks pretty efficient. Let me add a few more minor, detailed comments (more than 1 year after the question was asked; I hope they are still useful):
After you convert to a NumPy array, the keys are no longer guaranteed to be unique (if someone re-uses your code in a different context, etc.). This will cause searchsorted to fail. You might want to add a check such as:

    assert len(self.data[:, 0]) == len(np.unique(self.data[:, 0]))

to make sure users and readers of your code know that the first column of your array must be unique.
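As a sketch of where that check could live, using the __init__ from the question (the loop is collapsed into sorted() here, which is incidental):

        def __init__(self, d):
            data = sorted((hash(key), value[0], value[1]) for key, value in d.iteritems())
            self.data = numpy.array(data)
            # Fail loudly at construction time if two keys hashed to the same value.
            assert len(self.data[:, 0]) == len(numpy.unique(self.data[:, 0]))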
I agree that record arrays (a.k.a. structured arrays) would be better. Your code would be more readable, because instead of things like:

    self.data[ix][1:]

you could write:

    self.data[ix]['Float_1'], self.data[ix]['Float_2']

(but please use more descriptive names for your context than 'Float_1' etc.!). Also, and more importantly: without using a structured array, your hash keys will be stored as floats instead of ints. That will take extra memory, and make future calls to searchsorted and __eq__ slightly less efficient.
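As a minimal sketch of the structured-array idea (the field names key_hash, value_a and value_b are placeholders, not names from the original code):

    import numpy as np

    dtype = np.dtype([('key_hash', np.int64),
                      ('value_a', np.float64),
                      ('value_b', np.float64)])

    def build(d):
        # Sort by hash so np.searchsorted can be used on the 'key_hash' field.
        records = sorted((hash(k), v[0], v[1]) for k, v in d.items())
        return np.array(records, dtype=dtype)

    data = build({'spam': (1.0, 2.0), 'eggs': (3.0, 4.0)})
    ix = np.searchsorted(data['key_hash'], hash('spam'))
    print(data[ix]['value_a'], data[ix]['value_b'])  # 1.0 2.0 in this example

Here the hash column stays an int64 field while the two floats keep their own fields, which is the memory layout argued for above.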
Your __contains__ method seems overly stringent to me, which would result in it being slower. Do you need to call searchsorted in __contains__ at all? Wouldn't something like this suffice and be more efficient?

    def __contains__(self, key):
        hkey = hash(key)
        return np.in1d(self.data[:, 0], hkey).any()
You are duplicating the __contains__ check in your __getitem__ code. I'd just say it this way:

    if key in self:
        return self.data[ix][1:]
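Spelled out as a full method (a sketch, assuming numpy is imported as np as suggested in the next point; ix still has to be computed before the value can be returned):

        def __getitem__(self, key):
            if key in self:  # delegates the membership test to __contains__
                hkey = hash(key)
                ix = np.searchsorted(self.data[:, 0], hkey)
                return self.data[ix][1:]
            raise KeyError(key)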
Very minor, but the idiom is always import numpy as np. It's probably best to use that so you can do np.searchsorted etc. instead of typing numpy.searchsorted, as that's what most NumPy users do.
Why not just use a dict here? Can you prove that your solution is more memory efficient? The lookups will certainly be slower.
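For what it's worth, a rough way to compare the two on memory, as a sketch (the dict-side accounting is approximate, and the test data below is made up purely to illustrate the method, not to report results):

    import sys
    import numpy as np

    def dict_bytes(d):
        # Rough estimate: the dict's own table plus each key string, each value
        # tuple, and the float objects inside the tuples.
        total = sys.getsizeof(d)
        for key, value in d.items():
            total += sys.getsizeof(key) + sys.getsizeof(value)
            total += sum(sys.getsizeof(x) for x in value)
        return total

    d = {'some fairly long string key %d' % i: (float(i), i * 2.0) for i in range(10000)}
    arr = np.array(sorted((hash(k), v[0], v[1]) for k, v in d.items()))

    print('dict (approx): %d bytes' % dict_bytes(d))
    print('array:         %d bytes' % arr.nbytes)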