Memory Efficient Hashmap Alternative to Python Dictionary (Integer to Integer)

Question 1

I am using a regular Python 3 dictionary as a hashmap where both the keys and values are positive integers. The following code shows that a dict with about 6 million keys requires 320 MB of memory.

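import numpy as np
from sys import getsizeof

N = 10 * 1000 * 1000
# Random keys and values in [0, N); duplicate keys collapse, so len(d) < N.
a = np.random.randint(0, N, N)
b = np.random.randint(0, N, N)
d = dict(zip(a, b))

# Note: getsizeof(d) is shallow. It counts the dict's own table, not the key/value objects.
print('Number of elements:', len(d), 'Memory size (MB):', round(getsizeof(d)/2**20, 3))
# Size of one stored value (a NumPy integer scalar here).
print('Element memory size (B):', getsizeof(d[next(iter(d))]))
# Number of elements: 6323010 Memory size (MB): 320.0
# Element memory size (B): 32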
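As a rough cross-check on the measurement above: `sys.getsizeof` on a dict is shallow, so the key and value objects are not included in the figure it reports. The sketch below adds them in. The helper name `dict_sizes` is hypothetical, and the "deep" figure is only an estimate, since objects shared between entries are counted more than once.

```python
import sys

def dict_sizes(d):
    """Return (shallow, approximate deep) size of a dict in bytes."""
    shallow = sys.getsizeof(d)  # the dict's own hash table only
    # Add every key and value object. Shared objects are counted repeatedly,
    # so treat the result as an estimate rather than an exact footprint.
    deep = shallow + sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items())
    return shallow, deep

example = {i: i * 2 for i in range(1000, 2000)}
shallow, deep = dict_sizes(example)
print("shallow bytes:", shallow, "approx. deep bytes:", deep)
```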

How can we create a more memory-efficient hashmap, ideally with O(1) lookup? The required hashmap can be immutable.

In my use case, the expected size of the hashmap can be up to 2 billion entries. Using Python dictionaries would require an estimated 64 GB of memory. Although this still fits into memory, we also need to leave some memory for other processes.

Question 2

If it's possible, maybe use a list and treat the index as key?

Question 3

@Arunmozhi I think that's possible. Additionally, maybe a lot of the memory usage is coming from Python representing the integers as objects instead of a true integer?

Question 4

Is there an upper bound on the magnitude of those integers?

Question 5

@PaulPanzer Yes, uint32 will be sufficient.

Question 6

In that case you can simply make a linear lookup table (a NumPy array): 2^32 entries * 4 bytes is 16 GiB. Lookup will be faster than with a dict, and as a free bonus you can do bulk lookups.
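A minimal sketch of such a lookup table, scaled down to a key space of 16 so it runs anywhere. The names `build_lookup_table` and `SENTINEL` are hypothetical, and the sketch assumes the largest uint32 value never occurs as a real value, so it can mark unused slots. For the full problem, `key_space` would be 2**32.

```python
import numpy as np

# Assumption: this value is never a legitimate stored value.
SENTINEL = np.iinfo(np.uint32).max

def build_lookup_table(keys, values, key_space):
    """Directly addressed table: the key is the array index."""
    table = np.full(key_space, SENTINEL, dtype=np.uint32)
    # With repeated keys NumPy keeps the last assignment, like dict(zip(...)).
    table[np.asarray(keys)] = values
    return table

table = build_lookup_table([1, 5, 7], [10, 50, 70], key_space=16)

single = int(table[5])              # one lookup
bulk = table[np.array([1, 7, 2])]   # bulk lookup in a single indexing call
found = bulk != SENTINEL            # False where the key was never stored
print(single, bulk, found)
```

The table stores no keys at all, which is where the memory saving comes from; the price is one slot for every possible key, used or not.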


score 1 · Answer 1 · 2020-07-06 05:02:00Z

Given your numbers (2 * 10^9 key-value pairs of uint32), a memory-addressed NumPy lookup table will be hard to beat in memory and speed, as well as for sheer simplicity. The dead space will be only ~50%, roughly the same as the space you save by not having to store the keys.
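The arithmetic behind the "~50% dead space" estimate can be checked with a few lines. This is a back-of-the-envelope sketch that assumes all 2 * 10^9 keys are distinct and that keys and values each take 4 bytes; the constant names are illustrative only.

```python
N_PAIRS = 2 * 10**9    # key-value pairs, assumed to have distinct keys
KEY_SPACE = 2**32      # number of possible uint32 keys, one table slot each
ITEM_BYTES = 4         # bytes per uint32

table_bytes = KEY_SPACE * ITEM_BYTES                # whole lookup table
dead_fraction = 1 - N_PAIRS / KEY_SPACE             # share of unused slots
dead_bytes = (KEY_SPACE - N_PAIRS) * ITEM_BYTES     # memory spent on unused slots
key_bytes_saved = N_PAIRS * ITEM_BYTES              # keys that never get stored

print(f"table size:      {table_bytes / 2**30:.2f} GiB")
print(f"dead space:      {dead_fraction:.1%} ({dead_bytes / 2**30:.2f} GiB)")
print(f"keys not stored: {key_bytes_saved / 2**30:.2f} GiB")
```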

answered Jul 6, 2020 at 5:02 by Paul Panzer

What is "the dead space" referring to?

– Athena Wisdom, commented Jul 6, 2020 at 6:22

@AthenaWisdom Addresses in the lookup table that do not correspond to a valid key. One would mark those with a special value.

– Paul Panzer, commented Jul 6, 2020 at 6:30

score 1 · Answer 2 · 2020-07-05 16:45:44Z

The most memory-efficient way to store key/value pairs is as a list of pairs (tuples or lists), but lookup will of course be very slow (even if you sort the list and use bisect for the lookup, it's still going to be much slower than a dict).
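For illustration, here is a sketch of the sorted-and-bisect idea using two parallel uint32 NumPy arrays rather than a list of tuples, with `np.searchsorted` doing the binary search. This is a variation on what the answer describes, not its exact proposal; the helper names are hypothetical and keys are assumed to be unique. Lookup is O(log n) rather than O(1).

```python
import numpy as np

def build_sorted_map(keys, values):
    """Return (sorted_keys, values_in_key_order) as compact uint32 arrays."""
    keys = np.asarray(keys, dtype=np.uint32)
    values = np.asarray(values, dtype=np.uint32)
    order = np.argsort(keys, kind="stable")
    return keys[order], values[order]

def lookup(sorted_keys, sorted_values, query, default=None):
    """Binary-search one key; return `default` if it is absent."""
    i = int(np.searchsorted(sorted_keys, query))
    if i < len(sorted_keys) and sorted_keys[i] == query:
        return int(sorted_values[i])
    return default

sorted_keys, sorted_values = build_sorted_map([7, 1, 5], [70, 10, 50])
hit = lookup(sorted_keys, sorted_values, 5)
miss = lookup(sorted_keys, sorted_values, 6)
print(hit, miss)
```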

Consider using shelve instead: it will use little memory (since the data reside on disk) and still offer pretty spiffy lookup performance (not as fast as an in-memory dict, of course, but for a large amount of data it will be much faster than lookup on a list of tuples, even a sorted one, can ever be).
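A small sketch of the shelve approach, assuming integer keys are converted to strings (shelve keys must be strings). The helper names and the temporary file location are illustrative only; on-disk size and speed depend on the dbm backend available on the system.

```python
import os
import shelve
import tempfile

def build_shelf(path, pairs):
    """Write (key, value) pairs to a new on-disk shelf at `path`."""
    with shelve.open(path, flag="n") as db:   # "n": always create a new, empty shelf
        for key, value in pairs:
            db[str(key)] = value              # shelve keys must be str

def shelf_lookup(path, key, default=None):
    """Look up one key, reading from disk rather than from memory."""
    with shelve.open(path, flag="r") as db:   # "r": open read-only
        return db.get(str(key), default)

with tempfile.TemporaryDirectory() as tmp:
    shelf_path = os.path.join(tmp, "int_map")
    build_shelf(shelf_path, [(1, 10), (5, 50), (7, 70)])
    hit = shelf_lookup(shelf_path, 5)
    miss = shelf_lookup(shelf_path, 6)

print(hit, miss)
```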

answered Jul 5, 2020 at 16:45 (edited Jul 6, 2020 at 17:01) by Anonymous

A list of tuples is not the most memory-efficient way to store key-value pairs, since it spends so much memory on tuples. It's easily beaten by a pair of lists, one for keys and one for values. Compressed options can do even better.

– user2357112, commented Jul 6, 2020 at 1:41

In fact, a dict easily beats a list of tuples in typical cases.

– user2357112, commented Jul 6, 2020 at 1:53

@user2357112supportsMonica What do you mean by "compressed options"?

– Athena Wisdom, commented Jul 6, 2020 at 3:44

@AthenaWisdom: I was thinking about actually applying compression algorithms to compress your data, assuming it's actually compressible and not just random like your example.

– user2357112, commented Jul 6, 2020 at 4:22

Even without compression, you can store your data in a way that has less object overhead, like NumPy arrays or serialized bytestrings holding a large number of ints' worth of data instead of one object per integer.

– user2357112, commented Jul 6, 2020 at 4:25
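To make the point about object overhead concrete, the sketch below stores the same integers as a uint32 NumPy array and as a serialized bytestring, 4 bytes per integer in both cases. The sample size and variable names are arbitrary choices for the illustration.

```python
import numpy as np

n = 100_000

# One contiguous buffer of 4-byte integers instead of one Python object each.
as_array = np.arange(n, dtype=np.uint32)

# The same data as a plain bytestring, e.g. for writing to disk.
blob = as_array.tobytes()

# Reinterpret the bytes as uint32 again without copying (read-only view).
restored = np.frombuffer(blob, dtype=np.uint32)

print("array bytes:", as_array.nbytes, "blob bytes:", len(blob))
print("value at index 12345:", int(restored[12345]))
```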
