Sometimes I need to split data into chunks, so something like str.split would be helpful. However, it comes with two downsides:
- The input has to be a string.
- All of the input is consumed when generating the output.
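For example, both downsides are easy to see:

>>> 'abc def ghi'.split(' ')   # the whole result is built before any of it is used
['abc', 'def', 'ghi']
>>> hasattr([1, 2, 0, 3], 'split')   # and there's nothing like it for other iterables
False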
I have a couple of requirements:
- It needs to work with any iterable or iterator whose items support the != comparison.
- I don't want to consume the chunk of data when returning it; rather than returning a tuple, I need to return a generator.
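For instance, splitting a list of numbers on a sentinel value should work like this with the functions defined below:

>>> [list(chunk) for chunk in isplit([1, 2, 0, 3, 4], 0)]
[[1, 2], [3, 4]]
>>> [list(chunk) for chunk in split([1, 2, 0, 3, 4], 0)]
[[1, 2], [3, 4]]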
This left me with two ways to implement the code: a fully lazy version, isplit, and a semi-lazy version, split, which consumes some of the underlying iterator when moving to the next chunk, without fully consuming it.
And so I created:
from __future__ import generator_stop
import itertools


def _takewhile(predicate, iterator, has_data):
    """
    Return successive entries from an iterable as long as the
    predicate evaluates to true for each entry.

    has_data[0] is set to False if the iterator is fully consumed
    in the process.
    """
    for item in iterator:
        if predicate(item):
            yield item
        else:
            break
    else:
        # The for-else only runs when the iterator was exhausted without
        # hitting the separator, i.e. there is no more data to read.
        has_data[0] = False


def isplit(iterator, value):
    """Return a lazy generator of items in an iterator, separating by value."""
    iterator = iter(iterator)
    has_data = [True]
    while has_data[0]:
        yield _takewhile(value.__ne__, iterator, has_data)


def split(iterator, value):
    """Return a semi-lazy generator of items in an iterator, separating by value."""
    iterator = iter(iterator)
    has_data = [True]
    while True:
        carry = []
        d = _takewhile(value.__ne__, iterator, has_data)
        try:
            first = next(d)
        except StopIteration:
            if not has_data[0]:
                break
            yield iter([])
        else:
            yield itertools.chain([first], d, carry)
            carry.extend(d)
Examples of both functions are below. There is an edge case with isplit which, as far as I know, is inherent to the code being fully lazy. This is shown below too.
print('isplit')
print([list(i) for i in isplit('abc def ghi', ' ')])
print([list(i) for i in isplit(' abc def ghi', ' ')])
s = isplit('abc def ghi', ' ')
print(list(itertools.zip_longest(*itertools.islice(s, 4))))
print('\nsplit')
print([list(i) for i in split('abc def ghi', ' ')])
print([list(i) for i in split(' abc def ghi', ' ')])
s = split('abc def ghi', ' ')
print(list(itertools.zip_longest(*itertools.islice(s, 4))))
Which outputs:
isplit
[['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
[[], ['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
[('a', 'b', 'c', None), ('d', 'e', 'f', None), (None, 'g', 'h', None), (None, 'i', None, None)]
split
[['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
[[], ['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
[('a', 'd', 'g'), ('b', 'e', 'h'), ('c', 'f', 'i')]
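The last isplit line shows the edge case: because nothing is consumed when a chunk is created, all of the chunk generators share the same underlying iterator, and interleaved consumption (which is what zip_longest does) pulls items from wherever that shared iterator currently is. A minimal demonstration:

s = isplit('abc def ghi', ' ')
first, second = next(s), next(s)   # two chunk generators; nothing has been consumed yet
print(next(first))    # prints 'a'
print(next(second))   # prints 'b', not 'd': both chunks read from the same shared iterator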
2 Answers
I would prefer the name iterable for the iterable argument (compare the documentation for the itertools module), and sep for the separator argument (compare the documentation for str.split).

isplit has the unsatisfactory feature that you cannot ignore any of the returned iterators: you have to consume each one fully before moving on to the next, otherwise the iteration goes wrong. For example, suppose we want to select words starting with a capital letter. We might try:
for word in isplit('Abc def Ghi', ' '):
    first = next(word)
    if first == first.upper():
        print(first + ''.join(word))
But this produces the output:
Abc
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
StopIteration
Instead, we have to ensure that we consume each word iterator fully, even if we don't care about it:

for word in isplit('Abc def Ghi', ' '):
    first = next(word)
    if first == first.upper():
        print(first + ''.join(word))
    else:
        for _ in word:
            pass
The same issue arises with the standard library function itertools.groupby, where calling code might move on to the next group before it has finished iterating over the previous group. groupby solves this problem for us by fully consuming the previous group as soon as the caller moves on to the next group. It would be helpful for isplit to do the same.

The similarity with itertools.groupby suggests that we could implement isplit very simply in terms of groupby, like this:

from itertools import groupby

def isplit(iterable, sep):
    """Generate the contiguous groups of items from the iterable that are
    not equal to sep. The returned groups are themselves iterators that
    share the underlying iterable with isplit(). Because the source is
    shared, when the isplit() object is advanced, the previous group is
    no longer visible. So, if that data is needed later, it should be
    stored as a list.
    """
    for key, group in groupby(iterable, sep.__ne__):
        if key:
            yield group
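For example, with two adjacent separators, the groups come out already coalesced:

>>> [list(group) for group in isplit('abc  def', ' ')]
[['a', 'b', 'c'], ['d', 'e', 'f']]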
Note that this code behaves like plain str.split() in that it coalesces adjacent separators. If you need the behaviour to be more like str.split(' '), with empty groups when there are adjacent separators, then it should be straightforward to add an else: clause to generate the necessary empty iterators, like this:

for key, group in groupby(chain((sep,), iterable, (sep,)), sep.__ne__):
    if key:
        yield group
    else:
        for _ in islice(group, 1, None):
            yield iter(())
This uses itertools.chain and itertools.islice.

(There are a couple of minor optimizations you could make here: the 1-element tuple (sep,) could be stored in a variable and used twice, and iter(()) could be a global constant since you don't need a new empty iterator each time.)
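Putting the empty-group variant and those two optimizations together, a complete version might look roughly like this (my sketch rather than code from the answer; reusing a single exhausted iterator is safe because iterating it again simply yields nothing):

from itertools import chain, groupby, islice

_EMPTY = iter(())  # one exhausted iterator, handed out for every empty group

def isplit(iterable, sep):
    """Split iterable on sep like str.split(sep): adjacent, leading and
    trailing separators all produce empty groups."""
    padding = (sep,)  # the 1-element tuple, created once and used twice
    for key, group in groupby(chain(padding, iterable, padding), sep.__ne__):
        if key:
            yield group
        else:
            # A run of n separators delimits n - 1 groups; the padding
            # accounts for separators at the very start and end of the input.
            for _ in islice(group, 1, None):
                yield _EMPTY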
- Holy ****, I didn't think to use groupby here... :O Your split isn't lazy like mine, and consumes the entire group before any of it is ever used. To achieve the same laziness, your split should be your isplit. I think changing most isplits to split should be enough to fix this. (Feb 22, 2018 at 10:52)
- I removed the comment about split. (Gareth Rees, Feb 22, 2018 at 10:59)
- Unfortunately isplit, as you said, needs more code to work in the same way. I'm now unsure which way is simpler. (Feb 22, 2018 at 11:15)
- @Peilonrayz: It doesn't need to be as complicated as that; see the revised answer. (Gareth Rees, Feb 22, 2018 at 14:58)
- That's clever. But you'd still need carry to make it work with zip. (Feb 22, 2018 at 15:02)
There is a bug with your code.
>>> print([list(i) for i in split(' abc def ghi ', ' ')])
[[], ['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]
However, this should end in an empty list.
To fix this you only need to change the while True loop to while has_data[0]. Following this, you can merge the except and else together, meaning that you don't need the try at all. And so you can use:
def split(iterator, value):
    iterator = iter(iterator)
    has_data = [True]
    while has_data[0]:
        carry = []
        d = _takewhile(value.__ne__, iterator, has_data)
        yield itertools.chain(d, carry)
        carry.extend(d)
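With this change (and _takewhile unchanged from the question), the failing case should now end with an empty list:

>>> print([list(i) for i in split(' abc def ghi ', ' ')])
[[], ['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i'], []]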
- Is has_data necessary? Furthermore, why is it a list of booleans instead of just a boolean? If you get rid of has_data you can use itertools.takewhile also...
- If you get rid of has_data, or change it to just a boolean, then the code will return an infinite amount of empty generators...
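To illustrate that last point: with itertools.takewhile and no has_data flag, nothing tells the outer loop that the input has run out, so it keeps yielding empty chunks forever. A minimal sketch of that broken variant (hypothetical code, not from the thread):

import itertools

def split_without_flag(iterator, value):
    # Broken on purpose: takewhile cannot report whether it stopped at a
    # separator or because the input was exhausted, so this loop never ends.
    iterator = iter(iterator)
    while True:
        yield itertools.takewhile(value.__ne__, iterator)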