I need to create an in memory object that has keys that are a 9 digit integer and a boolean value associated with each key. I've been using a dict as in the simplified example below:
#!/usr/bin/python
from __future__ import print_function
import sys
myDict = {}
for n in range(56000):
myDict[n] = True
print('Count:',len(myDict),' Size:', sys.getsizeof(myDict))
I need to be able to look up and retrieve the boolean value associated with each key. The problem is the size of the dict. Using Python 2.7 on a 64 bit Linux system and the above example, the size of the dict is 3.1 megabytes according to sys.getsizeof(). (about 56 bytes per entry to store 9 digits plus a boolean value)
I need to store the boolean state of (approx) 55.000 entries in the dict. Each dict key is a 9 digit integer. I've tried using the integer and str(theInteger) as keys with no change in the size of the dict.
Is there some other sort of data structure or methodology I should be using to conserve memory with such a large data set?
5 Answers 5
If you look up your boolean with an integer key, and the range of keys starts with 0 and is continuous, there's really no reason not to use a list:
my_list = []
for n in range(56000):
my_list[n] = True
or better:
my_list = [True for n in range(5600])
If that is not enough, try the array module and use one byte per bool:
import array
my_array = array.array("b", (True for n in range(56000)))
And if that is not good enough, try the bitarray module on PyPi.
Another idea is to use a set: Say you have many more False than True, simply have a set:
my_true_numbers = {0, 2323, 23452} # just the True ones
and check with
value = number in my_true_numbers
If you have more True than False, do it the other way round.
3 Comments
True use [ True ] * 56000. Notice however that the same reference is repeated 56000 times, but because True is immutable, it does not matter.The accepted answer of Python: Reducing memory usage of dictionary has the conclusion that there it not much you can do and i agree with it. The overall memory overhead of a dictionary is small but the number of key-value pairs of your example raises the memory footprint.
There might one think possible: If the keys are always linear you could create a list of booleans directly or better use bitarray. The keys would then be implicit. But if this is only in your example the case you can't do much.
7 Comments
set saves only 30% space compared to a dictionary; having two sets will require more space than a single dictionary (because of set overallocation).If "key not found" is not an important state for you (i.e. you are OK with treating keys not in the array as False), you may use a set instead to store just the elements mapping to True. This requires about 30% less space because each entry consists of just two 64-bit quantities (hash and key) instead of three quantities (hash, key, value).
Keep in mind that sys.getsizeof(dict) only tells you the size of the dict itself, not the objects contained within. Creating 56000 ints as the keys will also carry its own cost: 24 bytes per integer (type pointer, refcount, value). That will amount to 1.3MB just by itself, in addition to the memory taken by the dictionary.
To really save space, you could use a NumPy compressed sparse row matrix:
from scipy.sparse import lil_matrix # linked-list matrix, used to construct the csr matrix
vals = lil_matrix((1,1000000000), dtype='int8'))
# We will use 0 = no such key, 1 = True, 2 = False
for n in myIndices:
vals[n] = 1
vals = vals.tocsr()
The memory usage of vals is very small: 56KB for the data, 224KB for the indices, and less than 1KB for other structures. The total size is thus less than 281KB (10x smaller than the dict), with no extra allocated integers. Looking elements up and changing non-zero elements is very fast (binary search in a sorted array), but inserting a new nonzero value or zeroing an existing nonzero value are expensive operations.
Comments
Why not using a gigantic bitfield ?
You code your data on two bits since you need at least three values : true, false and not_init/UB. The overall memory used would be 55.000*2 bits = 110 000 bits = 13 kBytes.
A badly drawn diagram
The set flag is here to ensure that the value has been correctly set by the user (not necessary), and the second bit contains the value.
Using 64 bit unsigned integers, you will only need 203 of them to store the whole array.
Then you can access it using bit index : let's say you want to access the value at index 123. you will need to access bit #246 ans #247 (one for the set bool and the other one for the value).
Since 246 and 247 are inferior to 2**64, they are stored on the first uint. To access them :
return (( (1<<246) & array[0] ) >> 246 )
To access any bit :
return (( (1<<n) & array[ n/(2**64) ] ) >> n)
( not tested the bit accessor)
Set a bit :
array[ n/(2**64) ] = array[ n/(2**64) ] | (1<<n)
Bitwise operations are tricky (arithmetic shifting vs logical one ) and not easily debuggable, but they can be extremely powerful.
Comments
Depending on what exactly your needs are, you could use a list to store your values. This will only uses around 16% of the space a dictionary uses, but certain operations such as lookup and insertion will be (possibly a lot) slower.
values = list(range(56000))
If you use the bisect module and store your values in a sorted list, your lookups will still be slower than with a dict, but much faster than the naïve x in my_list check.
The list must always be kept in sorted order. To check whether a value is in your list, you can use this function:
def is_in_list(values, x):
i = bisect_left(values, x)
return i != len(values) and values[i] == x
Works like this:
>>> is_in_list([2, 4, 14, 15], 4)
True
>>> is_in_list([2, 4, 14, 15], 1)
False
>>> is_in_list([2, 4, 14, 15], 13)
False
This method will reduce memory usage significantly, but – compared to a dict or set – lookup takes O(log n) time instead of O(1) and insertion takes O(n) instead of O(1).
True,False, no key in dict), or just the boolean value? In the latter case, asetfor just theTruevalues is more appropriate.