I have a set of values represented by a class. The class should be initialised with a generator to limit memory use. The class itself is both the iterable, having __iter__, and the iterator, having __next__ (see below).
Since I want to be able to iterate over the set of values multiple times, I cache the generated values in a list. In __iter__, I check whether I have already iterated over all values, and then either return iter of the cached values or self to continue with __next__, respectively.
Here's the code stripped down to the relevant parts:
class SetOfValues(object):
    def __init__(self, values):
        self._values = values
        self._values_generated = []
        self._done = False

    def __iter__(self):
        if self._done:
            return iter(self._values_generated)
        else:
            return self

    def __next__(self):
        try:
            value = next(self._values)
            self._values_generated.append(value)
            return value
        except StopIteration:
            self._done = True
            raise StopIteration("the end")
Then call it with:
x = SetOfValues((value for value in [1, 2, 3]))
for i in x:
    print(i)
- Is this generally a good way to do it when you may need generated values more than once?
- Might it be better if __iter__ yields the value?
- In terms of usage of the class: might it be better to always rewind and let iteration start again at position 0? Currently, iterating through the values, stopping early, then iterating again, would obviously just continue at the last position when not all values have been generated yet (see the small example after this list).
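For example, something like the following shows that behaviour with the class as written:

x = SetOfValues((value for value in [1, 2, 3]))
for i in x:
    print(i)  # prints 1
    break
for i in x:
    print(i)  # prints 2, then 3 -- iteration resumes where the first loop stopped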
Thanks.
1 Answer
You have a bug. Take the following code and its output:
s = SetOfValues(iter([1, 2, 3, 4, 5, 6]))
print(list(zip(s, s)))
print(list(zip(s, s)))
[(1, 2), (3, 4), (5, 6)]
[(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]
This is quite a substantial bug. The first call to zip(s, s) hands the same object to both positions, since __iter__ returns self, so the pairs interleave values from one shared generator; once _done is set, the second call hands out two independent iterators over the cached list, which is why the output changes. I would only use SetOfValues over a list if it gives me the correct values without having to care how much of the iterator has already been consumed. If the iterator has to be fully consumed before the behaviour is consistent, then I've just written the most long-winded version of list known (see the sketch below).
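To illustrate, with the class as posted you only get stable, repeatable iteration once the generator has been drained (a small sketch of mine, not from the question):

s = SetOfValues(iter([1, 2, 3]))
list(s)         # drains the generator and fills the cache, setting _done
print(list(s))  # [1, 2, 3]
print(list(s))  # [1, 2, 3] -- consistent only because everything is cached now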
The way I'd resolve this is to build the list whilst consuming the iterator. This would look something like:
def cache(it, l):
    for i in it:
        l.append(i)
        yield i
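Used on its own, that helper might look like this (the names are purely illustrative):

seen = []
it = cache(iter([1, 2, 3]), seen)
print(next(it))  # 1
print(seen)      # [1] -- but nothing here says whether the iterator still has values left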
However, we don't know whether the iterator has been consumed if we stick with that. So I'd abuse lists a little more, so that we have a 'return' list that holds the stored data and whether the iterator has been consumed.
Other than that, I'd use itertools.chain so that we can chain the list and the iterator together. However, if the iterator has already been consumed, then I'd just return the list as an iterator.
Resulting in:
import itertools

def cache(it, ret):
    l = ret[0]
    for i in it:
        l.append(i)
        yield i
    ret[1] = True  # the underlying iterator is now fully consumed

class SetOfValues(object):
    def __init__(self, values):
        self._info = [[], False]
        self._it = cache(values, self._info)

    def __iter__(self):
        l, finished = self._info
        if finished:
            return iter(l)
        else:
            return itertools.chain(l, self._it)
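As a quick sanity check (my own test, not part of the original answer), repeated iteration and the zip example from above now behave consistently:

s = SetOfValues(iter([1, 2, 3, 4, 5, 6]))
print(list(zip(s, s)))  # both calls print the same pairs,
print(list(zip(s, s)))  # since the second pass reads from the cached list
print(list(s))          # [1, 2, 3, 4, 5, 6]
print(list(s))          # [1, 2, 3, 4, 5, 6]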
However, I'd heed the advice from the itertools.tee documentation:

"In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee()."

And so, I'd just use list instead in most cases.
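For completeness, here are the two simpler alternatives being referred to (a sketch of standard library usage, not code from the answer):

import itertools

it = iter([1, 2, 3])
a, b = itertools.tee(it)  # two independent iterators over the same underlying data
print(list(a))            # [1, 2, 3]
print(list(b))            # [1, 2, 3]

# Or, when one pass will consume most of the data anyway:
values = list(iter([1, 2, 3]))
print(values)             # [1, 2, 3] -- a plain list can be iterated any number of times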