Reducing memory used by a large dict

Question 1

I need to create an in memory object that has keys that are a 9 digit integer and a boolean value associated with each key. I've been using a dict as in the simplified example below:

#!/usr/bin/python
from __future__ import print_function
import sys
myDict = {}
for n in range(56000):
 myDict[n] = True
print('Count:',len(myDict),' Size:', sys.getsizeof(myDict))

I need to be able to look up and retrieve the boolean value associated with each key. The problem is the size of the dict. Using Python 2.7 on a 64 bit Linux system and the above example, the size of the dict is 3.1 megabytes according to sys.getsizeof(). (about 56 bytes per entry to store 9 digits plus a boolean value)

I need to store the boolean state of (approx) 55.000 entries in the dict. Each dict key is a 9 digit integer. I've tried using the integer and str(theInteger) as keys with no change in the size of the dict.

Is there some other sort of data structure or methodology I should be using to conserve memory with such a large data set?

Question 2

Are you truly interested in the tristate (True, False, no key in dict), or just the boolean value? In the latter case, a set for just the True values is more appropriate.

Question 3

If you look up your boolean with an integer key, and the range of keys starts with 0 and is continuous, there's really no reason not to use a list:

my_list = []
for n in range(56000):
 my_list[n] = True

or better:

my_list = [True for n in range(5600])

If that is not enough, try the array module and use one byte per bool:

import array
my_array = array.array("b", (True for n in range(56000)))

And if that is not good enough, try the bitarray module on PyPi.

Another idea is to use a set: Say you have many more False than True, simply have a set:

my_true_numbers = {0, 2323, 23452} # just the True ones

and check with

value = number in my_true_numbers

If you have more True than False, do it the other way round.

Question 4

the set idea is a good one, depending on the distribution (is there more True than False or the opposite) a value in the set can mean both True or False. whichever is best.

Question 5

Thanks to Christian and others who contributed. I'll try Christian's set method. Looks quite promising.

Question 6

To make a list of 56000 True use [ True ] * 56000. Notice however that the same reference is repeated 56000 times, but because True is immutable, it does not matter.

Question 7

The accepted answer of Python: Reducing memory usage of dictionary has the conclusion that there it not much you can do and i agree with it. The overall memory overhead of a dictionary is small but the number of key-value pairs of your example raises the memory footprint.

There might one think possible: If the keys are always linear you could create a list of booleans directly or better use bitarray. The keys would then be implicit. But if this is only in your example the case you can't do much.

Question 8

Thanks for the suggestions. Unfortunately the keys aren't linear. The problem is that even though there are only 56,000 entries, each entry key is some integer from 100,000,000 through 800,000,000 so I think my list or bitarray would need to be 800,000,000 elements long to directly access the boolean in a non-keyed object.

Question 9

One more bit of evidence that leads me to believe that there is more to dict memory allocation than meets the eye. If my dict is using around 12,000 bytes, then when the next single entry is added, the size of the dict jumps to around 48,000 bytes. It then stays at 48,000 for a while and suddenly jumps to 192,000 bytes, etc etc, increasing by 4x the previous size each time it extends.

Question 10

Probably some kind of opimization to prevent memory allocation with every new entry. Most list... implementations do this.

Question 11

Christian Schramm idea with the sets seems worth a try. In the worst case you need two sets. One for true and one for false but you don't need to store the boolean value.

Question 12

@tea2code: that's going to be too expensive. A set saves only 30% space compared to a dictionary; having two sets will require more space than a single dictionary (because of set overallocation).

Question 13

If "key not found" is not an important state for you (i.e. you are OK with treating keys not in the array as False), you may use a set instead to store just the elements mapping to True. This requires about 30% less space because each entry consists of just two 64-bit quantities (hash and key) instead of three quantities (hash, key, value).

Keep in mind that sys.getsizeof(dict) only tells you the size of the dict itself, not the objects contained within. Creating 56000 ints as the keys will also carry its own cost: 24 bytes per integer (type pointer, refcount, value). That will amount to 1.3MB just by itself, in addition to the memory taken by the dictionary.

To really save space, you could use a NumPy compressed sparse row matrix:

from scipy.sparse import lil_matrix # linked-list matrix, used to construct the csr matrix
vals = lil_matrix((1,1000000000), dtype='int8'))
# We will use 0 = no such key, 1 = True, 2 = False
for n in myIndices:
 vals[n] = 1
vals = vals.tocsr()

The memory usage of vals is very small: 56KB for the data, 224KB for the indices, and less than 1KB for other structures. The total size is thus less than 281KB (10x smaller than the dict), with no extra allocated integers. Looking elements up and changing non-zero elements is very fast (binary search in a sorted array), but inserting a new nonzero value or zeroing an existing nonzero value are expensive operations.

Question 14

Why not using a gigantic bitfield ? You code your data on two bits since you need at least three values : true, false and not_init/UB. The overall memory used would be 55.000*2 bits = 110 000 bits = 13 kBytes.

A badly drawn diagram

The set flag is here to ensure that the value has been correctly set by the user (not necessary), and the second bit contains the value.

Using 64 bit unsigned integers, you will only need 203 of them to store the whole array.

Then you can access it using bit index : let's say you want to access the value at index 123. you will need to access bit #246 ans #247 (one for the set bool and the other one for the value).

Since 246 and 247 are inferior to 2**64, they are stored on the first uint. To access them :

return (( (1<<246) & array[0] ) >> 246 )

To access any bit :

return (( (1<<n) & array[ n/(2**64) ] ) >> n)

( not tested the bit accessor)

Set a bit :

array[ n/(2**64) ] = array[ n/(2**64) ] | (1<<n)

Bitwise operations are tricky (arithmetic shifting vs logical one ) and not easily debuggable, but they can be extremely powerful.

Question 15

Depending on what exactly your needs are, you could use a list to store your values. This will only uses around 16% of the space a dictionary uses, but certain operations such as lookup and insertion will be (possibly a lot) slower.

values = list(range(56000))

If you use the bisect module and store your values in a sorted list, your lookups will still be slower than with a dict, but much faster than the naïve x in my_list check.

The list must always be kept in sorted order. To check whether a value is in your list, you can use this function:

def is_in_list(values, x):
 i = bisect_left(values, x)
 return i != len(values) and values[i] == x

Works like this:

>>> is_in_list([2, 4, 14, 15], 4)
True
>>> is_in_list([2, 4, 14, 15], 1)
False
>>> is_in_list([2, 4, 14, 15], 13)
False

This method will reduce memory usage significantly, but – compared to a dict or set – lookup takes O(log n) time instead of O(1) and insertion takes O(n) instead of O(1).

Christian Schramm 3041 silver badge2 bronze badges · Accepted Answer · 2013-09-04 13:32:20Z

If you look up your boolean with an integer key, and the range of keys starts with 0 and is continuous, there's really no reason not to use a list:

my_list = []
for n in range(56000):
 my_list[n] = True

or better:

my_list = [True for n in range(5600])

If that is not enough, try the array module and use one byte per bool:

import array
my_array = array.array("b", (True for n in range(56000)))

And if that is not good enough, try the bitarray module on PyPi.

Another idea is to use a set: Say you have many more False than True, simply have a set:

my_true_numbers = {0, 2323, 23452} # just the True ones

and check with

value = number in my_true_numbers

If you have more True than False, do it the other way round.

the set idea is a good one, depending on the distribution (is there more True than False or the opposite) a value in the set can mean both True or False. whichever is best.
Thanks to Christian and others who contributed. I'll try Christian's set method. Looks quite promising.
To make a list of 56000 True use [ True ] * 56000. Notice however that the same reference is repeated 56000 times, but because True is immutable, it does not matter.

CollectivesTM on Stack Overflow

Reducing memory used by a large dict

5 Answers 5

3 Comments

7 Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

5 Answers 5

3 Comments

7 Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related