This question began as an off-topic answer to this question, but the code here serves a different goal.
I wrote the following class to populate a dict on demand from an iterator. The iterator passed to the constructor is one that could alternatively be passed to `dict`, which would consume the entire iterator in its constructor; an instance of this class consumes the iterator only far enough to locate a requested item. Such an iterator would be similar in spirit to the return value of `dict.items`.
    class LazyDict(dict):
        """ A dict built on demand from an iterator """
        def __init__(self, iterator):
            super().__init__()
            self.iterator = iterator
        def __getitem__(self, item):
            while not self.get(item):
                try:
                    (key, value) = next(self.iterator)
                    self[key] = value
                except StopIteration:
                    raise AttributeError
            return super().__getitem__(item)
        def __contains__(self, item):
            try:
                self[item] # pylint: disable=pointless-statement
                return True
            except AttributeError:
                return False
Here is my calling code (with other details of the `Directory` code omitted; I can provide more of it if that is needed in this context):
    class Directory(object):
        @Lazy
        def hash(self):
            """ Lazy dict mapping entry names to entries """
            return LazyDict((self.name(entry), entry) for entry in self.readdir())
        def __contains__(self, name):
            return name in self.hash # pylint: disable=unsupported-membership-test
        def __getitem__(self, name):
            return self.hash[name] # pylint: disable=unsubscriptable-object
The code for `Lazy` is equivalent to `lazy_property` in this answer.
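The `Lazy` decorator itself is not shown here. As a rough sketch of the same idea, and assuming Python 3.8 or later, the standard library's `functools.cached_property` computes an attribute on first access and reuses the result afterwards. The `Example` class and its `table` attribute are hypothetical names used only for illustration; this is not the `lazy_property` code referred to above.

```python
from functools import cached_property


class Example:
    """Hypothetical class with one attribute that is built on first access."""

    def __init__(self):
        self.calls = 0

    @cached_property
    def table(self):
        # Runs once per instance; the result is then stored on the instance.
        self.calls += 1
        return {"built": True}


e = Example()
first = e.table    # triggers the computation
second = e.table   # served from the instance, no recomputation
```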
As I write this, it occurs to me that perhaps `Directory` should itself be a subclass of `LazyDict` rather than containing a `LazyDict` (that might be a better way to stifle those `pylint` warnings). Whether that seems right might be one specific question to fall out of this. Upon further investigation, here is an alternate version of the calling code:
    class Directory(LazyDict):
        def __init__(self, mem, size):
            super().__init__((self.name(entry), entry) for entry in self.readdir())
            self.mem = mem
            self.size = size
As before, some details of `Directory` are omitted (the implementations of `name` and `readdir` are not included for either version), but this version of the code inherits `__getitem__` and `__contains__` from `LazyDict` rather than overriding them. The most visible difference between the two versions of the calling code is due to `Directory` inheriting `__repr__` from `dict` rather than `object`.
Another specific question might be the role of `keys` and `values` and `items` in `LazyDict`. The methods inherited from `dict` reveal how much of the iterator has been consumed; convincing them to reveal all items is one `AttributeError` away (I can attempt to fetch `self[None]` within `Directory` to fully populate the cache). My inclination is to limit on-demand operations to `__getitem__` and `__contains__`, but opinions on that point are welcome.
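To make the point about `keys` concrete, here is a small sketch that reproduces the class above and runs it over a hypothetical three-item iterator (the keys `"a"`, `"b"`, `"c"` are made up for illustration). It shows the inherited `keys` reporting only what has been consumed, and a lookup of a key that cannot be found draining the rest.

```python
class LazyDict(dict):
    """The class from this question, reproduced so the sketch is self-contained."""

    def __init__(self, iterator):
        super().__init__()
        self.iterator = iterator

    def __getitem__(self, item):
        while not self.get(item):
            try:
                (key, value) = next(self.iterator)
                self[key] = value
            except StopIteration:
                raise AttributeError
        return super().__getitem__(item)


d = LazyDict(iter([("a", 1), ("b", 2), ("c", 3)]))

d["a"]                          # consumes only the first pair
keys_partial = list(d.keys())   # reflects what has been consumed so far

try:
    d[None]                     # never found, so the iterator is drained
except AttributeError:
    pass
keys_full = list(d.keys())      # now every pair has been cached
```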
Suggestions for other ways to approach this are also welcome.
2 Answers
1. Design
I think that it's a mistake to inherit from `dict`. My reasoning is as follows:
- `LazyDict` is effectively read-only: that is, setting an item does not update the underlying iterator. (In the described use case, you can't update the compressed file image through this dictionary.) So it is misleading to offer (as you do) a `__setitem__` method (and similarly for `update`, `pop`, `setdefault` and other mutating methods).
- Programmers are used to the equivalence of dictionary methods: for example, they "know" that `d.get(k)` behaves just the same as `d[k] if k in d else None`. But in your implementation it does not: calling `d.get(k)` does not consult the iterator but calling `k in d` or `d[k]` does. This seems like a recipe for confusion and bugs.
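The second point can be seen in a short sketch. It reproduces the class from the question and feeds it a hypothetical three-item iterator; the keys and values are made up for illustration.

```python
class LazyDict(dict):
    """The dict subclass from the question, reproduced for illustration."""

    def __init__(self, iterator):
        super().__init__()
        self.iterator = iterator

    def __getitem__(self, item):
        while not self.get(item):
            try:
                (key, value) = next(self.iterator)
                self[key] = value
            except StopIteration:
                raise AttributeError
        return super().__getitem__(item)

    def __contains__(self, item):
        try:
            self[item]
            return True
        except AttributeError:
            return False


d = LazyDict(iter([("a", 1), ("b", 2), ("c", 3)]))

before = d.get("b")    # inherited from dict: does not consult the iterator
found = "b" in d       # consumes the iterator up to and including "b"
after = d.get("b")     # same call as before, different answer
```

Here `before` is `None` while `after` is `2`, so the result of `get` depends on which other lookups happened to run first.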
So I think the better approach is not to inherit from `dict`, but to have a dictionary as an attribute. This means that users will only be able to call the methods that you choose to implement, instead of accidentally calling through to methods on the underlying `dict`. This approach also makes the code clearer, because you can distinguish `key in self` from `key in self._dict` without needing to use `super()` or have pylint annotations.
2. Other review comments
- The parameter to the `__getitem__` and `__contains__` methods would be better named `key`.
- The exception raised from a failed key lookup should be `KeyError`, not `AttributeError`, for consistency with `dict`.
- The exception raised on a failed key lookup should include the failed key, for consistency with `dict` and to help programmers track down errors.
- It's a good idea to keep `try: ... except: ...` blocks as small as possible, so that you don't accidentally capture exceptions that you weren't expecting. In this case you are expecting a `StopIteration` from the call to `next` so that's the only line that needs to be protected.
- It's conventional to give attributes that are not intended to be used outside of the class (like `iterator`) the prefix `_`.
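As a minimal sketch of two of these points: the helper `next_pair` below is a hypothetical name, not part of the reviewed code. It protects only the call to `next`, and the second half shows the built-in `dict` behaviour that a `KeyError` carrying the failed key would match.

```python
def next_pair(iterator):
    """Return the next (key, value) pair, or None once the iterator is exhausted."""
    try:
        pair = next(iterator)  # the only call expected to raise StopIteration
    except StopIteration:
        return None
    key, value = pair  # a malformed pair raises ValueError here, outside the try
    return key, value


# Built-in behaviour to match: the KeyError carries the failed key.
try:
    {}["missing"]
except KeyError as exc:
    failed_key = exc.args[0]

pairs = iter([("a", 1)])
first = next_pair(pairs)    # the one available pair
second = next_pair(pairs)   # iterator exhausted
```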
3. Revised code
    class LazyDict:
        """A dictionary built on demand from an iterator."""
        def __init__(self, iterator):
            self._dict = {}
            self._iterator = iterator
        def __getitem__(self, key):
            if key in self:
                return self._dict[key]
            else:
                raise KeyError(key)
        def __contains__(self, key):
            while key not in self._dict:
                try:
                    k, v = next(self._iterator)
                except StopIteration:
                    return False
                self._dict[k] = v
            return True
Now someone who tries to call the `get` method will get an exception instead of silently getting the wrong result.
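A short usage sketch of the revised class follows. The generator `pairs` and the `consumed` list are hypothetical scaffolding added only to make the amount of iteration visible; the class body is the one given above.

```python
class LazyDict:
    """A dictionary built on demand from an iterator."""

    def __init__(self, iterator):
        self._dict = {}
        self._iterator = iterator

    def __getitem__(self, key):
        if key in self:
            return self._dict[key]
        else:
            raise KeyError(key)

    def __contains__(self, key):
        while key not in self._dict:
            try:
                k, v = next(self._iterator)
            except StopIteration:
                return False
            self._dict[k] = v
        return True


consumed = []  # records which keys the generator has produced


def pairs():
    for key, value in [("a", 1), ("b", 0), ("c", 3)]:
        consumed.append(key)
        yield key, value


d = LazyDict(pairs())
value_b = d["b"]                   # membership is tested, so a falsy value is fine
consumed_after_b = list(consumed)  # the iterator stopped right after "b"

try:
    d["zzz"]
except KeyError as exc:
    missing = exc.args[0]          # the failed key travels with the exception

has_get = hasattr(d, "get")        # no dict methods leak through
```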
Here is an improved version of `LazyDict` which incorporates some of the recommendations in this answer (thanks, Gareth!):
    class LazyDict(dict):
        """ A dict built on demand from an iterator """
        def __init__(self, iterator):
            super().__init__()
            self.iterator = iterator
        def __getitem__(self, item):
            if item in self:
                return super().__getitem__(item)
            else:
                raise KeyError(item)
        def __contains__(self, item):
            while not super().__contains__(item):
                try:
                    (key, value) = next(self.iterator)
                except StopIteration:
                    return False
                self[key] = value
            return True
Consuming the iterator in `__contains__` rather than in `__getitem__` eliminates the `pylint` cruft, as well as the call to the `get` method. Minimizing the `try:` block protects only the call to `next`. Raising `KeyError` instead of `AttributeError` is consistent with `dict` behavior.
This version still inherits the `__setitem__` and `__delitem__` (and other) methods from `dict`, which allows it to provide a copy-on-write cache of the underlying iterator. In a `squashfs` application, that might provide the basis for an image editor in which the final step of saving the modified image consists of generating a new image from the contents of the cache.
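A small sketch of how such writes behave, using the class above with a hypothetical three-item iterator. One thing that may be worth checking in a design like this: a write to a key the iterator has already passed is kept, but a write made before the iterator reaches that key is overwritten when the pair is later consumed.

```python
class LazyDict(dict):
    """The improved version above, reproduced so the sketch is self-contained."""

    def __init__(self, iterator):
        super().__init__()
        self.iterator = iterator

    def __getitem__(self, item):
        if item in self:
            return super().__getitem__(item)
        else:
            raise KeyError(item)

    def __contains__(self, item):
        while not super().__contains__(item):
            try:
                (key, value) = next(self.iterator)
            except StopIteration:
                return False
            self[key] = value
        return True


d = LazyDict(iter([("a", 1), ("b", 2), ("c", 3)]))

d["a"]                   # consumes ("a", 1)
d["a"] = 10              # the iterator is already past "a"
d["c"] = 30              # written before the iterator has reached "c"
early_c = d["c"]         # served from the cache, nothing consumed

found = "zzz" in d       # drains the iterator, including ("c", 3)
kept_a = d["a"]          # the write to "a" survives
late_c = d["c"]          # the early write to "c" has been replaced
```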
The `Directory` class mentioned here is part of a module which lists the contents of a `squashfs` image. `Directory.__getitem__` is used for directory lookups. If only `readdir` is called on a directory (to list all of its entries), there is no need to populate the cache used for lookups within that directory. In a large directory, the time saved by not populating that cache unnecessarily might be significant.