2

I am using a regular Python 3 dictionary to create a hashmap where both the keys and values are positive integers. The following code shows that a dict with about 6 million keys require 320 MB of memory.

import numpy as np
from sys import getsizeof
N = 10*1000*1000
a = np.random.randint(0, N, N)
b = np.random.randint(0, N, N)
d = dict(zip(a,b))
print('Number of elements:', len(d), 'Memory size (MB):', round(getsizeof(d)/2**20, 3))
print('Element memory size (B):', getsizeof(d[list(d.keys())[0]]))
# Number of elements: 6323010 Memory size (MB): 320.0
# Element memory size (B): 32

How can we create a more memory-efficient hashmap, ideally with O(1) lookup? The required hashmap can be immutable.

In my use case, the expected size of the hashmap can be up to 2 billion. Using Python dictionaries will require an estimated 64 GB of memory. Although this still fits into memory, we will still require some memory for other processes.

asked Jul 5, 2020 at 16:17
6
  • If it's possible, maybe use a list and treat the index as key? Commented Jul 5, 2020 at 16:25
  • @Arunmozhi I think that's possible. Additionally, maybe a lot of the memory usage is coming from Python representing the integers as objects instead of a true integer? Commented Jul 5, 2020 at 16:29
  • Is there an upper bound on the magnitude of those integers? Commented Jul 5, 2020 at 18:10
  • @PaulPanzer Yes, uint32 will be sufficient. Commented Jul 6, 2020 at 0:00
  • In that case you can simply make a linear lookup table (a nump array): 2^32 * 4 are 16 GB. And lookup will be faster than with a dict. And as a free bonus you can bulk lookup. Commented Jul 6, 2020 at 4:13

2 Answers 2

1

Given your numbers 2 * 10^9 key-value pairs of uint32 a memory addressed numpy lookup table will be hard to beat memory and speed wise as well as for sheer simplicity. The dead space will just be ~50% - roughly the same as the space you will be saving by not having to store the keys.

answered Jul 6, 2020 at 5:02
2
  • What is "the dead space" referring to? Commented Jul 6, 2020 at 6:22
  • @AthenaWisdom Addresses in the lookup table that do not correspond to a valid key. One would mark those with a special value. Commented Jul 6, 2020 at 6:30
1

The most memory-efficient way to store key / value pairs is as a list of pair of tuples/lists, but lookup of course will be very slow (even if you sort the list and use bisect for the lookup, it's still going to be extremely slower than a dict).

Consider using shelve instead -- that will use little memory (since the data reside on disk) and still offer pretty spiffy lookup performance (not as fast as an in-memory dict, of course, but for a large amount of data it will be much faster than lookup on a list of tuples, even a sorted one, can ever be!-).

answered Jul 5, 2020 at 16:45
7
  • A list of tuples is not the most memory-efficient way to store key-value pairs, since it spends so much memory on tuples. It's easily beaten by a pair of lists, one for keys and one for values. Compressed options can do even better. Commented Jul 6, 2020 at 1:41
  • In fact, a dict easily beats a list of tuples in typical cases. Commented Jul 6, 2020 at 1:53
  • @user2357112supportsMonica What do you mean by "compressed options"? Commented Jul 6, 2020 at 3:44
  • @AthenaWisdom: I was thinking about actually applying compression algorithms to compress your data, assuming it's actually compressible and not just random like your example. Commented Jul 6, 2020 at 4:22
  • 1
    Even without compression, you can store your data in a way that has less object overhead, like NumPy arrays or serialized bytestrings holding a large number of ints' worth of data instead of one object per integer. Commented Jul 6, 2020 at 4:25

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.