I am using a regular Python 3 dictionary to create a hashmap where both the keys and values are positive integers. The following code shows that a dict with about 6 million keys require 320 MB of memory.
import numpy as np
from sys import getsizeof
N = 10*1000*1000
a = np.random.randint(0, N, N)
b = np.random.randint(0, N, N)
d = dict(zip(a,b))
print('Number of elements:', len(d), 'Memory size (MB):', round(getsizeof(d)/2**20, 3))
print('Element memory size (B):', getsizeof(d[list(d.keys())[0]]))
# Number of elements: 6323010 Memory size (MB): 320.0
# Element memory size (B): 32
How can we create a more memory-efficient hashmap, ideally with O(1) lookup? The required hashmap can be immutable.
In my use case, the expected size of the hashmap can be up to 2 billion. Using Python dictionaries will require an estimated 64 GB of memory. Although this still fits into memory, we will still require some memory for other processes.
-
If it's possible, maybe use a list and treat the index as key?Arunmozhi– Arunmozhi2020年07月05日 16:25:35 +00:00Commented Jul 5, 2020 at 16:25
-
@Arunmozhi I think that's possible. Additionally, maybe a lot of the memory usage is coming from Python representing the integers as objects instead of a true integer?Athena Wisdom– Athena Wisdom2020年07月05日 16:29:13 +00:00Commented Jul 5, 2020 at 16:29
-
Is there an upper bound on the magnitude of those integers?Paul Panzer– Paul Panzer2020年07月05日 18:10:44 +00:00Commented Jul 5, 2020 at 18:10
-
@PaulPanzer Yes, uint32 will be sufficient.Athena Wisdom– Athena Wisdom2020年07月06日 00:00:25 +00:00Commented Jul 6, 2020 at 0:00
-
In that case you can simply make a linear lookup table (a nump array): 2^32 * 4 are 16 GB. And lookup will be faster than with a dict. And as a free bonus you can bulk lookup.Paul Panzer– Paul Panzer2020年07月06日 04:13:04 +00:00Commented Jul 6, 2020 at 4:13
2 Answers 2
Given your numbers 2 * 10^9 key-value pairs of uint32 a memory addressed numpy lookup table will be hard to beat memory and speed wise as well as for sheer simplicity. The dead space will just be ~50% - roughly the same as the space you will be saving by not having to store the keys.
-
What is "the dead space" referring to?Athena Wisdom– Athena Wisdom2020年07月06日 06:22:02 +00:00Commented Jul 6, 2020 at 6:22
-
@AthenaWisdom Addresses in the lookup table that do not correspond to a valid key. One would mark those with a special value.Paul Panzer– Paul Panzer2020年07月06日 06:30:16 +00:00Commented Jul 6, 2020 at 6:30
The most memory-efficient way to store key / value pairs is as a list of pair of tuples/lists, but lookup of course will be very slow (even if you sort the list and use bisect for the lookup, it's still going to be extremely slower than a dict).
Consider using shelve instead -- that will use little memory (since the data reside on disk) and still offer pretty spiffy lookup performance (not as fast as an in-memory dict, of course, but for a large amount of data it will be much faster than lookup on a list of tuples, even a sorted one, can ever be!-).
-
A list of tuples is not the most memory-efficient way to store key-value pairs, since it spends so much memory on tuples. It's easily beaten by a pair of lists, one for keys and one for values. Compressed options can do even better.user2357112– user23571122020年07月06日 01:41:48 +00:00Commented Jul 6, 2020 at 1:41
-
In fact, a dict easily beats a list of tuples in typical cases.user2357112– user23571122020年07月06日 01:53:11 +00:00Commented Jul 6, 2020 at 1:53
-
@user2357112supportsMonica What do you mean by "compressed options"?Athena Wisdom– Athena Wisdom2020年07月06日 03:44:59 +00:00Commented Jul 6, 2020 at 3:44
-
@AthenaWisdom: I was thinking about actually applying compression algorithms to compress your data, assuming it's actually compressible and not just random like your example.user2357112– user23571122020年07月06日 04:22:58 +00:00Commented Jul 6, 2020 at 4:22
-
1Even without compression, you can store your data in a way that has less object overhead, like NumPy arrays or serialized bytestrings holding a large number of ints' worth of data instead of one object per integer.user2357112– user23571122020年07月06日 04:25:12 +00:00Commented Jul 6, 2020 at 4:25
Explore related questions
See similar questions with these tags.