I create a long list of Python objects with many identical entries. To save some memory, I want to ensure that identical elements share their memory. For example:
In [1]: alist = [(3, 4), (3, 4)]
In [2]: alist[0] == alist[1]
Out[2]: True
In [3]: alist[0] is alist[1]
Out[3]: False
The two list elements are equal, but use separate memory. In this simple example the problem can be fixed by
In [4]: alist[1] = alist[0]
In [5]: alist[1] is alist[0]
Out[5]: True
How can this be done in a more general way? The list is not sorted, so identical elements are usually not next to each other. One solution I came up with is this:
g_dupdict = dict()

def dedup(x):
    try:
        return g_dupdict[x]
    except KeyError:
        g_dupdict[x] = x
        return x

for k in range(len(alist)):
    alist[k] = dedup(alist[k])
This works, but introduces new problems. It seems silly to use a dict; a set should do, but I don't know how to make that work. And the dict holds an additional reference to each object, so the memory doesn't get freed when the elements are deleted from the list. To work around this I delete the dict occasionally, but then it must be re-created whenever new elements are added to the list. Is there a better solution to this problem? Thanks.
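A related idea, for completeness: if the elements support weak references, a weakref.WeakValueDictionary can serve as the interning table, because its entries disappear automatically once the last outside reference to a value is gone. Built-in tuples cannot be weakly referenced, so this does not apply to the (3, 4) example above; the sketch below assumes a hypothetical user-defined Point class:

import weakref

class Point:
    # Hypothetical element type for illustration; instances of
    # user-defined classes support weak references, unlike tuples.
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))

g_dupdict = weakref.WeakValueDictionary()

def dedup(p):
    # Key by the value, not by the object itself: a WeakValueDictionary
    # holds its keys strongly but its values weakly, so the whole entry
    # is removed once the list drops its last reference to the object.
    key = (p.x, p.y)
    try:
        return g_dupdict[key]
    except KeyError:
        g_dupdict[key] = p
        return p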
1 Answer
I imagine you are manipulating huge data structures to be concerned about this. One way to keep g_dupdict scoped to a given data structure is to encapsulate it inside your own list implementation:
class CachedList(list):
    """A list that interns its elements so equal elements share one object."""

    def __init__(self, iterable=(), cache=None):
        self.__cache = cache if cache is not None else {}
        super().__init__(self.__cached(item) for item in iterable)

    def append(self, e):
        super().append(self.__cached(e))

    def clear(self):
        self.__cache.clear()
        super().clear()

    def copy(self):
        # Share the cache so elements stay interned across copies.
        return CachedList(self, self.__cache)

    def extend(self, iterable):
        super().extend(self.__cached(item) for item in iterable)

    def insert(self, index, e):
        super().insert(index, self.__cached(e))

    def remove(self, e):
        super().remove(e)
        if e not in self:
            # Drop the cache entry only once no equal element remains.
            del self.__cache[e]

    def __cached(self, e):
        # Key by the element itself rather than by hash(e): distinct objects
        # can share a hash value, and dict lookup resolves collisions.
        return self.__cache.setdefault(e, e)

    def __add__(self, other):
        cached_other = [self.__cached(item) for item in other]
        return CachedList(super().__add__(cached_other), self.__cache)

    def __delitem__(self, index):
        # __delitem__ receives an index, not an element.
        e = self[index]
        super().__delitem__(index)
        if e not in self:
            del self.__cache[e]

    def __iadd__(self, iterable):
        # On lists, += extends with an iterable; keep that behaviour.
        self.extend(iterable)
        return self

    def __setitem__(self, index, e):
        super().__setitem__(index, self.__cached(e))
if __name__ == '__main__':
    # Build the tuples at run time: CPython may merge equal compile-time
    # constants within one code object, which would make two (3, 4)
    # literals the very same object and defeat the demonstration.
    e1 = tuple([3, 4])
    e2 = tuple([3, 4])
    assert e1 == e2
    assert e1 is not e2
    assert hash(e1) == hash(e2)

    alist = CachedList([e1, e2])
    assert alist[0] == alist[1]
    assert alist[0] is alist[1]

    alist[0] = tuple([3, 4])
    assert alist[0] == alist[1]
    assert alist[0] is alist[1]

    alist += [tuple([3, 4])]
    assert alist[0] == alist[-1]
    assert alist[0] is alist[-1]

    newlist = alist + [tuple([3, 4])]
    assert newlist[0] == newlist[-1]
    assert newlist[0] is newlist[-1]
The implementation may not be complete, but I think you can get the idea from the code snippet above.
1 Comment
It might be easier to subclass collections.UserList than to subclass list, although I guess there may be some performance benefits in subclassing list itself. You could also make the question's dedup function more efficient by using g_dupdict.setdefault(x, x).
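A minimal sketch of the comment's collections.UserList suggestion, combined with the setdefault trick (the DedupList name and the handful of overridden methods are illustrative, not a complete implementation):

from collections import UserList

class DedupList(UserList):
    # UserList stores its contents in the plain list self.data, which is
    # often more convenient to build on than subclassing list directly.
    def __init__(self, iterable=()):
        self.cache = {}
        super().__init__(self.cache.setdefault(e, e) for e in iterable)

    def append(self, e):
        self.data.append(self.cache.setdefault(e, e))

    def __setitem__(self, index, e):
        self.data[index] = self.cache.setdefault(e, e)

dl = DedupList([tuple([3, 4]), tuple([3, 4])])
assert dl[0] == dl[1]
assert dl[0] is dl[1]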