I have a set of values represented by a class. The class should be initialised with a generator to limit memory use. The class itself is both the iterable, having __iter__, and the iterator, having __next__ (see below).
Since I want to be able to iterate over the set of values multiple times, I cache the generated values in a list. In __iter__, I check whether I have already iterated over all values, and then either return iter of the cached values or self to continue with __next__, respectively.
Here's the code stripped down to the relevant parts:
class SetOfValues(object):
    def __init__(self, values):
        self._values = values
        self._values_generated = []
        self._done = False

    def __iter__(self):
        if self._done:
            return iter(self._values_generated)
        else:
            return self

    def __next__(self):
        try:
            value = next(self._values)
            self._values_generated.append(value)
            return value
        except StopIteration:
            self._done = True
            raise StopIteration("the end")
Then call it with:
x = SetOfValues((value for value in [1, 2, 3]))
for i in x:
    print(i)
- Is this generally a good way to do it when you may need generated values more than once?
- Might it be better if __iter__ yields the value?
- In terms of usage of the class: might it be better to always rewind and let iteration start again at position 0? Currently, iterating through the values, stopping early, then iterating again, would obviously just continue at the last position when not all values have been generated yet (see the small example after this list).
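For example, something like the following shows that behaviour with the class as written:

x = SetOfValues((value for value in [1, 2, 3]))
for i in x:
    print(i)  # prints 1
    break
for i in x:
    print(i)  # prints 2, then 3 -- iteration resumes where the first loop stopped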
Thanks.
1 Answer
You have a bug. Take the following code and its output:
s = SetOfValues(iter([1, 2, 3, 4, 5, 6]))
print(list(zip(s, s)))
print(list(zip(s, s)))
[(1, 2), (3, 4), (5, 6)]
[(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]
This is quite a substantial bug. The first call to zip(s, s) hands the same object to both positions, since __iter__ returns self, so the pairs interleave values from one shared generator; once _done is set, the second call hands out two independent iterators over the cached list, which is why the output changes. I would only use SetOfValues over a list if it gives me the correct values without having to care how much of the iterator has already been consumed. If the iterator has to be fully consumed before the behaviour is consistent, then I've just written the most long-winded version of list known (see the sketch below).
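To illustrate, with the class as posted you only get stable, repeatable iteration once the generator has been drained (a small sketch of mine, not from the question):

s = SetOfValues(iter([1, 2, 3]))
list(s)         # drains the generator and fills the cache, setting _done
print(list(s))  # [1, 2, 3]
print(list(s))  # [1, 2, 3] -- consistent only because everything is cached now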
The way I'd resolve this is to build the list whilst consuming the iterator. This would look something like:
def cache(it, l):
    for i in it:
        l.append(i)
        yield i
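Used on its own, that helper might look like this (the names are purely illustrative):

seen = []
it = cache(iter([1, 2, 3]), seen)
print(next(it))  # 1
print(seen)      # [1] -- but nothing here says whether the iterator still has values left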
However, we don't know whether the iterator has been consumed if we stick with that. So I'd abuse lists a little more, so that we have a 'return' list that holds the stored data and whether the iterator has been consumed.
Other than that, I'd use itertools.chain so that we can chain the list and the iterator together. However, if the iterator has already been consumed, then I'd just return the list as an iterator.
Resulting in:
import itertools

def cache(it, ret):
    l = ret[0]
    for i in it:
        l.append(i)
        yield i
    ret[1] = True  # the underlying iterator is now fully consumed

class SetOfValues(object):
    def __init__(self, values):
        self._info = [[], False]
        self._it = cache(values, self._info)

    def __iter__(self):
        l, finished = self._info
        if finished:
            return iter(l)
        else:
            return itertools.chain(l, self._it)
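As a quick sanity check (my own test, not part of the original answer), repeated iteration and the zip example from above now behave consistently:

s = SetOfValues(iter([1, 2, 3, 4, 5, 6]))
print(list(zip(s, s)))  # both calls print the same pairs,
print(list(zip(s, s)))  # since the second pass reads from the cached list
print(list(s))          # [1, 2, 3, 4, 5, 6]
print(list(s))          # [1, 2, 3, 4, 5, 6]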
However, I'd heed the advice from the itertools.tee documentation:

"In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee()."

And so, I'd just use list instead in most cases.
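For completeness, here are the two simpler alternatives being referred to (a sketch of standard library usage, not code from the answer):

import itertools

it = iter([1, 2, 3])
a, b = itertools.tee(it)  # two independent iterators over the same underlying data
print(list(a))            # [1, 2, 3]
print(list(b))            # [1, 2, 3]

# Or, when one pass will consume most of the data anyway:
values = list(iter([1, 2, 3]))
print(values)             # [1, 2, 3] -- a plain list can be iterated any number of times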