Sometimes I need to split data into chunks, so something like str.split would be helpful. However, it comes with two downsides:
- The input has to be a string.
- All of the input is consumed when generating the output.
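For example, both downsides are easy to see:

>>> 'abc def ghi'.split(' ')   # the whole result is built before any of it is used
['abc', 'def', 'ghi']
>>> hasattr([1, 2, 0, 3], 'split')   # and there's nothing like it for other iterables
False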
I have a couple of requirements:
- It needs to work with any iterable or iterator whose items support the != comparison.
- I don't want to consume the chunk of data when returning it; rather than returning a tuple, I need to return a generator.
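For instance, splitting a list of numbers on a sentinel value should work like this with the functions defined below:

>>> [list(chunk) for chunk in isplit([1, 2, 0, 3, 4], 0)]
[[1, 2], [3, 4]]
>>> [list(chunk) for chunk in split([1, 2, 0, 3, 4], 0)]
[[1, 2], [3, 4]]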
This left me with two ways to implement the code: a fully lazy version, isplit, and a semi-lazy version, split, which consumes some of the underlying iterator when moving to the next chunk, without fully consuming it.
And so I created:
from __future__ import generator_stop
import itertools


def _takewhile(predicate, iterator, has_data):
    """
    Return successive entries from an iterable as long as the
    predicate evaluates to true for each entry.

    has_data[0] is set to False if the iterator is fully consumed
    in the process.
    """
    for item in iterator:
        if predicate(item):
            yield item
        else:
            break
    else:
        # The for-else only runs when the iterator was exhausted without
        # hitting the separator, i.e. there is no more data to read.
        has_data[0] = False


def isplit(iterator, value):
    """Return a lazy generator of items in an iterator, separating by value."""
    iterator = iter(iterator)
    has_data = [True]
    while has_data[0]:
        yield _takewhile(value.__ne__, iterator, has_data)


def split(iterator, value):
    """Return a semi-lazy generator of items in an iterator, separating by value."""
    iterator = iter(iterator)
    has_data = [True]
    while True:
        carry = []
        d = _takewhile(value.__ne__, iterator, has_data)
        try:
            first = next(d)
        except StopIteration:
            if not has_data[0]:
                break
            yield iter([])
        else:
            yield itertools.chain([first], d, carry)
            carry.extend(d)
Examples of both functions are below. There is an edge case with isplit which, as far as I know, is inherent to the code being fully lazy. This is shown below too.
print('isplit')
print([list(i) for i in isplit('abc def ghi', ' ')])
print([list(i) for i in isplit(' abc def ghi', ' ')])
s = isplit('abc def ghi', ' ')
print(list(itertools.zip_longest(*itertools.islice(s, 4))))
print('\nsplit')
print([list(i) for i in split('abc def ghi', ' ')])
print([list(i) for i in split(' abc def ghi', ' ')])
s = split('abc def ghi', ' ')
print(list(itertools.zip_longest(*itertools.islice(s, 4))))
Which outputs:
isplit
[['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
[[], ['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
[('a', 'b', 'c', None), ('d', 'e', 'f', None), (None, 'g', 'h', None), (None, 'i', None, None)]
split
[['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
[[], ['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
[('a', 'd', 'g'), ('b', 'e', 'h'), ('c', 'f', 'i')]
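The last isplit line shows the edge case: because nothing is consumed when a chunk is created, all of the chunk generators share the same underlying iterator, and interleaved consumption (which is what zip_longest does) pulls items from wherever that shared iterator currently is. A minimal demonstration:

s = isplit('abc def ghi', ' ')
first, second = next(s), next(s)   # two chunk generators; nothing has been consumed yet
print(next(first))    # prints 'a'
print(next(second))   # prints 'b', not 'd': both chunks read from the same shared iterator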
2 Answers
I would prefer the name iterable for the iterable argument (compare the documentation for the itertools module), and sep for the separator argument (compare the documentation for str.split).

isplit has the unsatisfactory feature that you cannot ignore any of the returned iterators: you have to consume each one fully before moving on to the next, otherwise the iteration goes wrong. For example, suppose we want to select words starting with a capital letter. We might try:
for word in isplit('Abc def Ghi', ' '):
    first = next(word)
    if first == first.upper():
        print(first + ''.join(word))
But this produces the output:
Abc
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
StopIteration
Instead, we have to ensure that we consume each word iterator fully, even if we don't care about it:

for word in isplit('Abc def Ghi', ' '):
    first = next(word)
    if first == first.upper():
        print(first + ''.join(word))
    else:
        for _ in word:
            pass
The same issue arises with the standard library function itertools.groupby, where calling code might move on to the next group before it has finished iterating over the previous group. groupby solves this problem for us by fully consuming the previous group as soon as the caller moves on to the next group. It would be helpful for isplit to do the same.

The similarity with itertools.groupby suggests that we could implement isplit very simply in terms of groupby, like this:

from itertools import groupby

def isplit(iterable, sep):
    """Generate the contiguous groups of items from the iterable that are
    not equal to sep. The returned groups are themselves iterators that
    share the underlying iterable with isplit(). Because the source is
    shared, when the isplit() object is advanced, the previous group is
    no longer visible. So, if that data is needed later, it should be
    stored as a list.
    """
    for key, group in groupby(iterable, sep.__ne__):
        if key:
            yield group
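For example, with two adjacent separators, the groups come out already coalesced:

>>> [list(group) for group in isplit('abc  def', ' ')]
[['a', 'b', 'c'], ['d', 'e', 'f']]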
Note that this code behaves like plain str.split() in that it coalesces adjacent separators. If you need the behaviour to be more like str.split(' '), with empty groups when there are adjacent separators, then it should be straightforward to add an else: clause to generate the necessary empty iterators, like this:

for key, group in groupby(chain((sep,), iterable, (sep,)), sep.__ne__):
    if key:
        yield group
    else:
        for _ in islice(group, 1, None):
            yield iter(())
This uses itertools.chain and itertools.islice.

(There are a couple of minor optimizations you could make here: the 1-element tuple (sep,) could be stored in a variable and used twice, and iter(()) could be a global constant since you don't need a new empty iterator each time.)
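Putting the empty-group variant and those two optimizations together, a complete version might look roughly like this (my sketch rather than code from the answer; reusing a single exhausted iterator is safe because iterating it again simply yields nothing):

from itertools import chain, groupby, islice

_EMPTY = iter(())  # one exhausted iterator, handed out for every empty group

def isplit(iterable, sep):
    """Split iterable on sep like str.split(sep): adjacent, leading and
    trailing separators all produce empty groups."""
    padding = (sep,)  # the 1-element tuple, created once and used twice
    for key, group in groupby(chain(padding, iterable, padding), sep.__ne__):
        if key:
            yield group
        else:
            # A run of n separators delimits n - 1 groups; the padding
            # accounts for separators at the very start and end of the input.
            for _ in islice(group, 1, None):
                yield _EMPTY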
- Holy ****, I didn't think to use groupby here... :O Your split isn't lazy like mine, and consumes the entire group before any of it is ever used. To achieve the same laziness, your split should be your isplit. I think changing most isplits to split should be enough to fix this. (Feb 22, 2018 at 10:52)
- I removed the comment about split. (Gareth Rees, Feb 22, 2018 at 10:59)
- Unfortunately isplit, as you said, needs more code to work in the same way. I'm now unsure which way is simpler. (Feb 22, 2018 at 11:15)
- @Peilonrayz: It doesn't need to be as complicated as that; see the revised answer. (Gareth Rees, Feb 22, 2018 at 14:58)
- That's clever. But you'd still need carry to make it work with zip. (Feb 22, 2018 at 15:02)
There is a bug with your code.
>>> print([list(i) for i in split(' abc def ghi ', ' ')])
[[], ['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
However, this should end in an empty list.
To fix this you only need to change the while True loop to while has_data[0]. Following this, you can merge the except and else together, meaning that you don't need the try at all. And so you can use:
def split(iterator, value):
    iterator = iter(iterator)
    has_data = [True]
    while has_data[0]:
        carry = []
        d = _takewhile(value.__ne__, iterator, has_data)
        yield itertools.chain(d, carry)
        carry.extend(d)
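With this change (and _takewhile unchanged from the question), the failing case should now end with an empty list:

>>> print([list(i) for i in split(' abc def ghi ', ' ')])
[[], ['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i'], []]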
- Is has_data necessary? Furthermore, why is it a list of booleans instead of just a boolean? If you get rid of has_data you can use itertools.takewhile also...
- If you get rid of has_data, or change it to just a boolean, then the code will return an infinite amount of empty generators...
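To illustrate that last point: with itertools.takewhile and no has_data flag, nothing tells the outer loop that the input has run out, so it keeps yielding empty chunks forever. A minimal sketch of that broken variant (hypothetical code, not from the thread):

import itertools

def split_without_flag(iterator, value):
    # Broken on purpose: takewhile cannot report whether it stopped at a
    # separator or because the input was exhausted, so this loop never ends.
    iterator = iter(iterator)
    while True:
        yield itertools.takewhile(value.__ne__, iterator)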