Python lazy dict with iterator in constructor

Question 1

This question began as an off-topic answer to this question, but the code here serves a different goal.

I wrote the following class for the purpose of populating a dict on demand from an iterator. The intent of the iterator passed to the constructor is that it could alternatively be passed to dict, which would consume the entire iterator in its constructor; an instance of this class consumes the iterator just far enough to locate a requested item. Such an iterator would be similar in spirit to a return value from dict.items.

class LazyDict(dict):
    """ A dict built on demand from an iterator """
    def __init__(self, iterator):
        super().__init__()
        self.iterator = iterator
    def __getitem__(self, item):
        while not self.get(item):
            try:
                (key, value) = next(self.iterator)
                self[key] = value
            except StopIteration:
                raise AttributeError
        return super().__getitem__(item)
    def __contains__(self, item):
        try:
            self[item] # pylint: disable=pointless-statement
            return True
        except AttributeError:
            return False
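
A note on the class above, offered as a sketch rather than a definitive diagnosis: because the loop tests `self.get(item)` for truthiness, a key whose value is falsy (0, an empty string, None) appears to be treated as missing. The class is reproduced here so the snippet runs on its own; the sample pairs are invented for illustration.

```python
class LazyDict(dict):
    """A dict built on demand from an iterator (as posted in the question)."""
    def __init__(self, iterator):
        super().__init__()
        self.iterator = iterator

    def __getitem__(self, item):
        # Truthiness test: a cached value of 0 looks the same as "not cached".
        while not self.get(item):
            try:
                (key, value) = next(self.iterator)
                self[key] = value
            except StopIteration:
                raise AttributeError
        return super().__getitem__(item)

    def __contains__(self, item):
        try:
            self[item]
            return True
        except AttributeError:
            return False


# Hypothetical sample data: one key maps to a falsy value.
d = LazyDict(iter([("zero", 0), ("one", 1)]))
one_found = "one" in d     # consumes both pairs, then finds "one"
zero_found = "zero" in d   # "zero" is cached, but 0 is falsy
zero_cached = dict.__contains__(d, "zero")
```

Here `one_found` is True while `zero_found` is False, even though `zero_cached` shows the pair is already stored in the underlying dict.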

Here is my calling code (with other details of the Directory code omitted; if more of that is needed in this context I can provide it):

class Directory(object):
    @Lazy
    def hash(self):
        """ Lazy dict mapping entry names to entries """
        return LazyDict((self.name(entry), entry) for entry in self.readdir())
    def __contains__(self, name):
        return name in self.hash # pylint: disable=unsupported-membership-test
    def __getitem__(self, name):
        return self.hash[name] # pylint: disable=unsubscriptable-object

The code for Lazy is equivalent to lazy_property in this answer.

As I write this, it occurs to me that perhaps Directory should itself be a subclass of LazyDict rather than containing a LazyDict (that might be a better way to stifle those pylint warnings). Whether that seems right might be one specific question to fall out of this. Upon further investigation, here is an alternate version of the calling code:

class Directory(LazyDict):
    def __init__(self, mem, size):
        super().__init__((self.name(entry), entry) for entry in self.readdir())
        self.mem = mem
        self.size = size

As before, some details of Directory are omitted (the implementations of name and readdir are not included for either version), but this version of the code inherits __getitem__ and __contains__ from LazyDict rather than overriding them. The most visible difference between the two versions of the calling code is due to Directory inheriting __repr__ from dict rather than object.

Another specific question might be the role of keys and values and items in LazyDict. The methods inherited from dict reveal how much of the iterator has been consumed; convincing them to reveal all items is one AttributeError away (I can attempt to fetch self[None] within Directory to fully populate the cache). My inclination is to limit on-demand operations to __getitem__ and __contains__, but opinions on that point are welcome.
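
The point about keys, values and items can be seen in a small sketch. The class is reproduced from above so the snippet runs on its own, and the sample pairs are made up; none of this touches the real Directory code.

```python
class LazyDict(dict):
    """A dict built on demand from an iterator (as posted in the question)."""
    def __init__(self, iterator):
        super().__init__()
        self.iterator = iterator

    def __getitem__(self, item):
        while not self.get(item):
            try:
                (key, value) = next(self.iterator)
                self[key] = value
            except StopIteration:
                raise AttributeError
        return super().__getitem__(item)

    def __contains__(self, item):
        try:
            self[item]
            return True
        except AttributeError:
            return False


# Hypothetical sample data with truthy values only.
d = LazyDict(iter([("a", 1), ("b", 2), ("c", 3)]))

first = d["a"]                     # consumes only the first pair
seen_after_one = list(d.keys())    # inherited keys() shows what is cached so far

none_found = None in d             # a failed lookup drains the iterator
seen_after_drain = list(d.keys())  # now every pair is cached
```

After the single lookup `seen_after_one` holds only "a"; after the failed lookup of None, `seen_after_drain` holds all three keys.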

Suggestions for other ways to approach this are also welcome.

Comment

Can you describe a use case for this data structure?

Reply

The Directory class mentioned here is part of a module which lists contents of a squashfs image. Directory.__getitem__ is used for directory lookups. If only readdir is called on a directory (to list all of its entries), there is no need to populate the cache used for lookups within that directory. In a large directory, the time saved by not populating that cache unnecessarily might be significant.

Answer 1

1. Design

I think that it's a mistake to inherit from dict. My reasoning is as follows:

LazyDict is effectively read-only: that is, setting an item does not update the underlying iterator. (In the described use case, you can't update the compressed file image through this dictionary.) So it is misleading to offer (as you do) a __setitem__ method (and similarly for update, pop, setdefault and other mutating methods).
Programmers are used to the equivalence of dictionary methods: for example, they "know" that d.get(k) behaves just the same as d[k] if k in d else None. But in your implementation it does not — calling d.get(k) does not consult the iterator but calling k in d or d[k] does. This seems like a recipe for confusion and bugs.
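
The inconsistency between get and subscripting can be reproduced with the class from the question (copied here so the snippet is self-contained; the sample data is invented):

```python
class LazyDict(dict):
    """A dict built on demand from an iterator (as posted in the question)."""
    def __init__(self, iterator):
        super().__init__()
        self.iterator = iterator

    def __getitem__(self, item):
        while not self.get(item):
            try:
                (key, value) = next(self.iterator)
                self[key] = value
            except StopIteration:
                raise AttributeError
        return super().__getitem__(item)

    def __contains__(self, item):
        try:
            self[item]
            return True
        except AttributeError:
            return False


d = LazyDict(iter([("a", 1), ("b", 2)]))

before = d.get("a")  # inherited dict.get never consults the iterator
value = d["a"]       # the overridden __getitem__ does
after = d.get("a")   # same call as before, different answer
```

The first `d.get("a")` returns None and the second returns 1, so the result of get depends on which other lookups happened earlier.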

So I think the better approach is not to inherit from dict, but to have a dictionary as an attribute. This means that users will only be able to call the methods that you choose to implement, instead of accidentally calling through to methods on the underlying dict. This approach also makes the code clearer, because you can distinguish key in self from key in self._dict without needing to use super() or pylint annotations.

2. Other review comments

The parameter to the __getitem__ and __contains__ methods would be better named key.
The exception raised from a failed key lookup should be KeyError, not AttributeError, for consistency with dict.
The exception raised on a failed key lookup should include the failed key, for consistency with dict and to help programmers track down errors.
It's a good idea to keep try: ... except: ... blocks as small as possible, so that you don't accidentally capture exceptions that you weren't expecting. In this case you are expecting a StopIteration from the call to next so that's the only line that needs to be protected.
It's conventional to give attributes that are not intended to be used outside of the class (like iterator) the prefix _.

3. Revised code

class LazyDict:
    """A dictionary built on demand from an iterator."""
    def __init__(self, iterator):
        self._dict = {}
        self._iterator = iterator
    def __getitem__(self, key):
        if key in self:
            return self._dict[key]
        else:
            raise KeyError(key)
    def __contains__(self, key):
        while key not in self._dict:
            try:
                k, v = next(self._iterator)
            except StopIteration:
                return False
            self._dict[k] = v
        return True

Now someone who tries to call the get method will get an exception instead of silently getting the wrong result.
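
A short usage sketch of the revised class, with made-up sample data, shows both behaviours: get is simply not defined, and a failed lookup raises a KeyError that carries the key.

```python
class LazyDict:
    """A dictionary built on demand from an iterator."""
    def __init__(self, iterator):
        self._dict = {}
        self._iterator = iterator

    def __getitem__(self, key):
        if key in self:
            return self._dict[key]
        else:
            raise KeyError(key)

    def __contains__(self, key):
        while key not in self._dict:
            try:
                k, v = next(self._iterator)
            except StopIteration:
                return False
            self._dict[k] = v
        return True


d = LazyDict(iter([("a", 1), ("b", 2)]))
value = d["a"]

# get() was never implemented, so the call fails loudly.
try:
    d.get("a")
    get_error = None
except AttributeError as exc:
    get_error = type(exc).__name__

# A missing key is reported by name, as with a plain dict.
try:
    d["missing"]
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]
```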

Answer 2

Here is an improved version of LazyDict which incorporates some of the recommendations in this answer (thanks, Gareth!):

class LazyDict(dict):
    """ A dict built on demand from an iterator """
    def __init__(self, iterator):
        super().__init__()
        self.iterator = iterator
    def __getitem__(self, item):
        if item in self:
            return super().__getitem__(item)
        else:
            raise KeyError(item)
    def __contains__(self, item):
        while not super().__contains__(item):
            try:
                (key, value) = next(self.iterator)
            except StopIteration:
                return False
            self[key] = value
        return True

Consuming the iterator in __contains__ rather than in __getitem__ eliminates the pylint cruft, as well as the call to the get method. Minimizing the try: block protects only the call to next. Raising KeyError instead of AttributeError is consistent with dict behavior.

This version still inherits the __setitem__ and __delitem__ (and other) methods from dict, which allows it to provide a copy-on-write cache of the underlying iterator. In a squashfs application, that might provide the basis for an image editor in which the final step of saving the modified image consists of generating a new image from the contents of the cache.
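
One caveat about the copy-on-write idea, offered as an observation on the class above rather than a tested design: a value written before the iterator has reached that key appears to be overwritten once a later lookup pulls the original pair from the iterator. The sketch below uses invented sample data.

```python
class LazyDict(dict):
    """A dict built on demand from an iterator (improved version above)."""
    def __init__(self, iterator):
        super().__init__()
        self.iterator = iterator

    def __getitem__(self, item):
        if item in self:
            return super().__getitem__(item)
        else:
            raise KeyError(item)

    def __contains__(self, item):
        while not super().__contains__(item):
            try:
                (key, value) = next(self.iterator)
            except StopIteration:
                return False
            self[key] = value  # replaces any value already stored for key
        return True


d = LazyDict(iter([("a", 1), ("b", 2), ("c", 3)]))

d["b"] = 99            # written before the iterator has produced "b"
after_write = d["b"]   # found in the cache, iterator untouched

looked_up = d["c"]     # consumes "a", "b" and "c" from the iterator
after_lookup = d["b"]  # the iterator's pair has replaced the written value
```

A possible guard, not checked against the squashfs code, would be to store fetched pairs with `self.setdefault(key, value)` so that entries already in the cache win. That would also change which pair wins when the iterator itself repeats a key.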

Accepted answer (sections 1 to 3 above) by Gareth Rees · 2016-06-22 10:42:41Z

