This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: str.split(): allow removing empty strings (when sep is not None)
Type: enhancement
Stage: patch review
Components: Library (Lib)
Versions: Python 3.10
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: ZackerySpytz
Nosy List: Catherine.Devlin, Mark.Bell, Philippe Cloutier, ZackerySpytz, andrei.avk, barry, cheryl.sabella, corona10, gvanrossum, karlcow, mrabarnett, serhiy.storchaka, syeberman, veky
Priority: normal
Keywords: patch

Created on 2016-12-11 15:11 by barry, last changed 2022-04-11 14:58 by admin.

Files
split_prune_1.patch (uploaded by abarry, 2016-12-12 16:42)
Pull Requests
PR 26196: open (linked by Catherine.Devlin, 2021-05-17 19:09)
PR 26222: open (linked by Mark.Bell, 2021-05-18 22:47)
Messages (48)
msg282923 - Author: Barry A. Warsaw (barry) * (Python committer) Date: 2016-12-11 15:11
This has finally bugged me enough to file an issue, although I wouldn't be able to use it until Python 3.7. There's a subtle but documented difference in str.split() when sep=None:
>>> help(''.split)
Help on built-in function split:
split(...) method of builtins.str instance
 S.split(sep=None, maxsplit=-1) -> list of strings
 
 Return a list of the words in S, using sep as the
 delimiter string. If maxsplit is given, at most maxsplit
 splits are done. If sep is not specified or is None, any
 whitespace string is a separator and empty strings are
 removed from the result.
I.e., that empty strings are removed from the result. This does not happen when sep is given, leading to this type of unfortunate code:
>>> 'foo,bar,baz'.split(',')
['foo', 'bar', 'baz']
>>> 'foo,bar,baz'.replace(',', ' ').split()
['foo', 'bar', 'baz']
>>> ''.split(',')
['']
>>> ''.replace(',', ' ').split()
[]
Specifically, code that wants to split on say commas, but has to handle the case where the source string is empty, shouldn't have to also filter out the single empty string item.
Obviously we can't change existing behavior, so I propose to add a keyword argument `prune` that would make these two bits of code identical:
>>> ''.split()
[]
>>> ''.split(' ', prune=True)
[]
and would handle the case of ''.split(',') without having to resort to creating an ephemeral intermediate string.
`prune` should be a keyword-only argument, defaulting to False.
msg282925 - Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-12-11 15:26
I understand the feeling. However, in a project I maintain, we want the other way around - to be able to never have an empty list, even if the string is empty (we resorted to using re.split in the end, which has this behaviour). Consider:
rest = re.split(" +", rest)[0].strip()
This gives us None-like behaviour in splitting, at the cost of not actually using str.split.
I'm +1 on the idea, but I'd like some way to better generalize str.split use (not everyone knows you can pass None and/or an integer).
(At the same time, the counter arguments where str has too many methods, or that methods shouldn't do too much, also apply here.)
But I don't like bikeshedding too much, so let's just count me as +1 for your way, if there's no strong momentum for mine :)
msg282926 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-11 15:32
Current behavior is consistent with str.count():
 len(string.split(sep)) == string.count(sep) + 1
and re.split():
 re.split(re.escape(sep), string) == string.split(sep)
Maybe the behavior when sep is None should be changed for consistency with the behavior when sep is not None?
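The two identities above can be checked directly on current Python. This is a minimal sketch; `check_invariants` and the sample strings are illustrative names and data, not part of any patch.

```python
import re

def check_invariants(string, sep):
    """Return True if both identities hold for a non-empty separator."""
    parts = string.split(sep)
    return (len(parts) == string.count(sep) + 1
            and re.split(re.escape(sep), string) == parts)

samples = ['', 'a,b', ',a,,b,', ',,,']
print(all(check_invariants(s, ',') for s in samples))  # True

# The sep=None mode does not satisfy the count identity:
# no separator occurs in '', yet the result has zero parts.
print(len(''.split()), len(''.split(',')))  # 0 1
```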
msg282927 - Author: Barry A. Warsaw (barry) * (Python committer) Date: 2016-12-11 15:35
On Dec 11, 2016, at 03:32 PM, Serhiy Storchaka wrote:
>Current behavior is consistent with str.count():
>
> len(string.split(sep)) == string.count(sep) + 1
>
>and re.split():
>
> re.split(re.escape(sep), string) == string.split(sep)
Yep. My suggestion is a straight up 'practicality beats purity' request.
>Maybe the behavior when sep is None should be changed for consistency with
>the behavior when sep is not None?
I'm very strongly -1 on changing any existing behavior.
msg282928 - Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-12-11 15:38
Changing the behaviour when sep is None is a big backwards-compatibility break, and I'm not sure we'd even want that. It's logical to allow passing None to mean the same thing as NULL (i.e. no arguments), and the behaviour in that case has been like that for... well, long enough that changing it isn't really feasible.
I agree with Barry here, especially since this is a completely opt-in feature, and existing behaviour isn't changed without the user's knowledge.
msg282929 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-11 15:57
I meant adding boolean argument that changes the behavior when sep is None, not when it is not None.
msg282930 - Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-12-11 16:01
That would work for my case, but it wouldn't for Barry's (unless I missed something). He wants a non-None argument to not leave empty strings, but I want a None argument to leave empty strings... I don't think there's a one-size-fits-all solution in this case, but feel free to prove me wrong :)
msg282931 - Author: Barry A. Warsaw (barry) * (Python committer) Date: 2016-12-11 16:03
On Dec 11, 2016, at 03:57 PM, Serhiy Storchaka wrote:
>I meant adding boolean argument that changes the behavior when sep is None,
>not when it is not None.
Ah, I understand now, thanks. However, I'm not sure that addresses my
particular use case. It's actually kind of handy to filter out the empty
strings. But I'm open to counter arguments.
msg282932 - Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-12-11 16:29
Actually, there might be a way. We could make prune default to True if sep is None, and default to False if sep is not None. That way, we get to keep the existing behaviour for either case, while satisfying both of our use cases :)
If that's a bad idea (and it quite probably is), I'll retract it. But it's an interesting possibility to at least consider.
msg282935 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2016-12-11 18:17
So prune would default to None?
None means current behaviour (prune if sep is None else don't prune)
True means prune empty strings
False means don't prune empty string
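The three-valued default can be modelled in pure Python. This is only a sketch of the semantics as described: `split_with_prune` is a hypothetical name, maxsplit is left out, and treating each whitespace character as its own separator when sep is None and prune is False is an assumption.

```python
import re

def split_with_prune(s, sep=None, *, prune=None):
    """Hypothetical model of the proposed tri-state prune flag (no maxsplit)."""
    if prune is None:
        prune = sep is None  # None keeps today's behaviour for either mode
    if sep is None:
        # Assumption: without pruning, every whitespace character separates.
        parts = re.split(r'\s', s)
    else:
        parts = s.split(sep)
    return [p for p in parts if p] if prune else parts

print(split_with_prune(''))                   # []
print(split_with_prune('', ','))              # ['']
print(split_with_prune('', ',', prune=True))  # []
print(split_with_prune('', prune=False))      # ['']
```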
msg282936 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-12-11 19:00
A few randomly ordered thoughts about splitting:
* The best general purpose text splitter I've ever seen is in MS Excel and is called "Text to Columns". It has a boolean flag, "treat consecutive delimiters as one" which is off by default.
* There is a nice discussion on the complexities of the current design on StackOverflow: http://stackoverflow.com/questions/16645083. In addition, there are many other SO questions about the behavior of str.split().
* The learning curve for str.split() is already high. The doc entry for it has been revised many times to try and explain what it does. I'm concerned that adding another algorithmic option to it may make it more difficult to learn and use in the common cases (API design principle: giving users more options can impair usability). Usually in Python courses, I recommend using str.split() for the simple, common cases and using regex when you need more control.
* What I do like about the proposal is that there is no clean way to take the default whitespace splitting algorithm and customize to a particular subset of whitespace (i.e. tabs only).
* A tangential issue is that it was a mistake to expose the maxsplit=-1 implementation detail. In Python 2.7, the help was "S.split([sep [,maxsplit]])". But folks implementing the argument clinic have no way of coping with optional arguments that don't have a default value (like dict.pop), so they changed the API so that the implementation detail was exposed, "S.split(sep=None, maxsplit=-1)". IMO, this is an API regression. We really don't want people passing in -1 to indicate that there are no limits. The Python way would have been to use None as a default or to stick with the existing API where the number of arguments supplied is part of the API (much like type() has two different meanings depending on whether it has an arity of 1 or 3).
Overall, I'm +0 on the proposal but there should be good consideration given to 1) whether there is a sufficient need to warrant increasing API complexity, making split() more difficult to learn and remember, 2) considering whether "prune" is the right word (can someone who didn't write the code read it clearly afterwards), 3) or addressing this through documentation (i.e. showing the simple regexes needed for cases not covered by str.split).
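As an illustration of the documentation route in point 3, a short regex already covers the "subset of whitespace" case. This is a sketch, with `split_on_blanks` as a hypothetical helper name: it splits on runs of spaces or tabs and leaves newlines inside the fields.

```python
import re

def split_on_blanks(s):
    """Split on runs of spaces or tabs only; newlines stay inside fields."""
    return re.findall(r'[^ \t]+', s)

print(split_on_blanks(' a \t b\nc  '))  # ['a', 'b\nc']
print(split_on_blanks(''))              # []
```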
msg282954 - Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-12-11 23:58
Matthew: Yes, that's exactly the way I was going about it.
Thank you Raymond for your comments (and your informative answer on that SO question).
I think that part of the problem is that no delimiter (or None) behaves differently than with a delimiter. If we wanted proper consistency, we would have needed to make passing None (or nothing) the same as passing whitespace, but alas, we have to work with what we have.
As you said, API complexity is a concern that needs to be addressed. I think that the most important part is how it's documented, and, if phrased correctly (which is non-trivial), could actually simplify the explanation.
Consider this draft:
***
The value of the `prune` keyword argument determines whether or not consecutive delimiters should be grouped together. If `prune` is not given or None, it defaults to True if sep is None or not given, and False otherwise.
If `prune` is True, consecutive delimiters (all whitespace if None or not given) are regarded as a single separator, and the result will not contain any empty string. The resulting list may be empty.
Otherwise, if `prune` is False, consecutive delimiters are not grouped together, and the result may contain one or more empty strings. The resulting list will never be empty.
***
I may be oversimplifying this, but it seems to me that this might help some people by hopefully streamlining the explanation.
This still doesn't solve the issue where a user can't say "split on a space or a tab, but not newlines", which IMO is lacking in the design, but that may be for another issue ;)
I've wrapped up a basic patch which probably doesn't work at all; I'll put it up when it's at least partly working for everyone to look at.
msg282958 - Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-12-12 02:31
Here's an initial patch. It works exactly as discussed earlier, doesn't break any tests, and retains full backwards compatibility. No doc changes (except for the docstrings of str.[r]split) and no tests, as this is just a preliminary patch to see if there's any merit to the idea.
msg282961 - Author: Sye van der Veen (syeberman) * Date: 2016-12-12 04:22
In the sep!=None case, there are existing alternatives to prune=True that aren't many more keystrokes:
>>> ''.split(' ', prune=True)
[]
>>> [x for x in ''.split(' ') if x]
[]
>>> list(filter(bool, ''.split(' '))) # or drop list() and use the iterator directly
[]
This becomes even fewer keystrokes for users that create a prune() or split_prune() function.
For the sep==None case, I agree there are no alternatives to prune=False (aside from rolling your own split function). However, instead of prune, what if sep accepted a tuple of strings, similar to startswith. In this case, each string would be considered one possible, yet distinct, delimiter:
>> ''.split(prune=False)
['']
>> ''.split((' ', '\t')) # typical whitespace
['']
>> ''.split(tuple(string.whitespace)) # ASCII whitespace
['']
Once again, this becomes even easier for users that create a split_no_prune() function, or that assign tuple(string.whitespace) to a variable. It would also nicely handle strings with non-homogeneous delimiters:
>>> '1?2,,3;'.split((',', ';', '?'))
['1', '2', '', '3', '']
I personally find the 0-argument str.split() one of the great joys of Python. It's common to have to split out words from a sentence, and having that functionality just 8 characters away at all times has been very useful.
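The tuple-of-separators idea can be approximated today with re.split. This is a sketch rather than a proposed implementation; `split_any` is a hypothetical name, and it assumes every separator is a non-empty string.

```python
import re

def split_any(s, seps):
    """Split s at each occurrence of any string in seps, keeping empties."""
    # Longest first, so a separator is not shadowed by its own prefix.
    ordered = sorted(seps, key=len, reverse=True)
    pattern = '|'.join(re.escape(sep) for sep in ordered)
    return re.split(pattern, s)

print(split_any('1?2,,3;', (',', ';', '?')))  # ['1', '2', '', '3', '']
print(split_any('', (' ', '\t')))             # ['']
```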
msg282962 - Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-12-12 04:38
Yes, I agree that being able to pass in a tuple would be really useful. As far as rolling out a custom function goes, I'd sooner reach for re.split than do that, so I don't really have a strong argument for either side. Feel free to play with the patch or make an entirely new one, though! I mainly submitted the patch to keep the discussion going, and eventually come to a consensus, but I don't have a strong opinion either way :)
msg282963 - Author: Vedran Čačić (veky) * Date: 2016-12-12 05:09
The problem with .split is its split (pun intended) personality: it really is two functions that have separate use cases, and have different algorithms; and people think about them as separate functions. In that view, we have just fallen afoul of Guido's rule against passing literal bool arguments. The true solution would probably be to bite the bullet and have two separate methods. After all, .splitlines is a separate method for precisely such a reason.
msg282966 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-12-12 06:18
Guido, do you have an opinion on this? IIRC, this was an API you created.
Nick's thought (posted on twitter) is that 'filter(None, sep.split(input))' already covers the "drop the empty values" case.
My feelings are mixed. Though I've never needed it in practice, it would be nice if the whitespace removal algorithm could be customized to just a space or just a tab. On the other hand, I think the new parameter would make the API more confusing and harder to learn. It might be better to just document either the filter(None) approach or a simple regex for the less common cases.
msg283011 - Author: Barry A. Warsaw (barry) * (Python committer) Date: 2016-12-12 15:32
I really appreciate all the feedback. Here are some thoughts.
I'm well aware of the filter(), re, and other options, and certainly those can
be made to work, but they're non-obvious. The reason I suggested an
enhancement to str.split() is because I've seen the replace().split() being used
far too often, and what I think is happening is that people take the most
natural path to accomplish their goals: they know they just want to do a
simple string split on a token (usually one character) so they start out with
the obvious str.split(',') or whatever. Then they notice that it doesn't work
consistent with their mental model in some corner cases.
The next common step isn't from there to filter() or re. The former isn't a
well-known API and the latter is viewed as "too complex". Their next mental
step is "oh, so providing a sep has different behavior that I don't want, so
I'll just replace the comma with a space and now don't have to provide sep".
And now str.split() does what they want. Done. Move along.
I do wish the str.split() API was consistent w.r.t. sep=None, but it's what
we have and is a very well known API.
@rhettinger: I'm of mixed opinion on it too! I really wanted to get this in
the tracker and see if we could come up with something better, but so far I
still like `prune` the best.
@ebarry: Thanks for the draft docs, but that's not how I think about this.
I'd be utilitarian and get right to the point, e.g.:
"""
The value of `prune` controls whether empty strings are removed from the
resulting list. The default value (None) says to use the default behavior,
which for backward compatibility reasons is different whether sep is None or
not (see above). Regardless of the value of sep, when prune is True empty
strings are removed and when prune is False they are not.
"""
So @mrabarnett, +1 on the suggested defaults.
Lastly, as for Guido's admonition against boolean arguments, I would make
prune a keyword-only argument, so that forces the code to be readable and
should alleviate those concerns. The trade-off is the extra typing, but
that's actually beside the point. The win here is that the solution is
easily discoverable and avoids the intermediate string object.
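Apart from the intermediate string, the replace().split() idiom has a second cost that is easy to miss: it also splits on whitespace inside the fields. A small illustration on current Python, with made-up sample data:

```python
line = 'New York,Boston,,Los Angeles'

# The workaround also splits on the spaces inside each field.
print(line.replace(',', ' ').split())
# ['New', 'York', 'Boston', 'Los', 'Angeles']

# Filtering the result of split(',') keeps the fields intact.
print([field for field in line.split(',') if field])
# ['New York', 'Boston', 'Los Angeles']
```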
msg283019 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-12-12 16:16
I like the proposal. I agree that filter(None, ...) is not discoverable (and has its own magic).
So the proposal would be: prune=False -> empty strings stay, prune=True, empty strings are dropped, prune=None (default) use True if sep is None, False otherwise. Right?
Some end cases:
- ''.split(None, prune=True) -> ['']
- 'x  x'.split(None, prune=True) -> ['x', '', 'x']
Right?
While we're here I wish there was a specific argument we could translate .split(None) into, e.g. x.split() == x.split((' ', '\t', '\n', '\r', '\f')) # or whatever set of strings
msg283021 - Author: Barry A. Warsaw (barry) * (Python committer) Date: 2016-12-12 16:26
On Dec 12, 2016, at 04:16 PM, Guido van Rossum wrote:
>So the proposal would be: prune=False -> empty strings stay, prune=True,
>empty strings are dropped, prune=None (default) use True if sep is None,
>False otherwise. Right?
Yep!
>Some end cases:
>
>- ''.split(None, prune=True) -> ['']
>- 'x  x'.split(None, prune=True) -> ['x', '', 'x']
>
>Right?
Isn't that what you'd expect if prune=False instead? (i.e. prune=True always
drops empty strings from the results)
>While we're here I wish there was a specific argument we could translate
>.split(None) into, e.g. x.split() == x.split((' ', '\t', '\n', '\r', '\f')) #
>or whatever set of strings
Is that the sep=<some tuple> idea that @syeberman suggested earlier? If so,
then you could do:
>>> x.split(tuple(string.whitespace))
Would that suffice?
msg283025 - Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-12-12 16:42
Barry: Sure, the docs example was just a quick write-up, you can word it however you want!
Guido: Pretty much, except the other way around (when prune is False, i.e. "don't remove empty strings").
The attached patch exposes the behaviour (it's identical to last night's, but I'm re-uploading it as an unrelated file went in), except that the `prune` argument isn't keyword-only (I didn't know how to do this, and didn't bother searching for just a proof-of-concept).
msg283029 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-12-12 17:12
> except the other way around
Whoops. Indeed. So all's well here.
> x.split(tuple(string.whitespace))
Yes, that's what I was after. (But it can be a separate PR.)
msg283031 - Author: Vedran Čačić (veky) * Date: 2016-12-12 17:28
I think Guido's mistake is relevant here. It tripped me too. Too many negatives, and "prune" is not really a well-known verb. Besides, we already have str.splitlines' keepends, which works the opposite way.
msg338412 - Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2019-03-19 22:34
@ebarry, any interest in converting your patch to a GitHub pull request? Thanks!
msg338422 - Author: Barry A. Warsaw (barry) * (Python committer) Date: 2019-03-19 23:41
@veky - Thank you for pointing out splitlines(keepends=True). If we wanted consistency, then we'd change the sense and use something like .split(keepempty=True), however:
* I don't like run-on names, so I would suggest keep_empty
* Maybe just `keep` is enough
* Either way, this should be a keyword only argument
* The default would still be None (i.e. current behavior), but keep_empty=True would be equivalent to prune=False and keep_empty=False would be equivalent to prune=True in the previous discussion.
msg338429 - Author: Anilyka Barry (abarry) * (Python triager) Date: 2019-03-20 01:41
Unfortunately not. I no longer have the time or means to work on this, sorry. I hope someone else can pick it up.
msg354907 - Author: Philippe Cloutier (Philippe Cloutier) Date: 2019-10-18 16:10
I understood the current (only) behavior, but coming from a PHP background, I really didn't expect it. Thank you for this request, I would definitely like the ability to get behavior matching PHP's explode().
msg354908 - Author: Philippe Cloutier (Philippe Cloutier) Date: 2019-10-18 16:26
I assume the "workaround" suggested by Raymond in msg282966 is supposed to read...
filter(None, str.split(sep))
... rather than filter(None, sep.split(input)).
msg384306 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-01-04 02:21
This issue probably needs a new champion. There is broad agreement but some
bike shedding, so a PEP isn’t needed.
--Guido (mobile)
msg384321 - Author: Zackery Spytz (ZackerySpytz) * (Python triager) Date: 2021-01-04 11:15
I am working on this issue.
msg384337 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-01-04 17:50
Excellent!
msg393822 - Author: Catherine Devlin (Catherine.Devlin) * Date: 2021-05-17 19:12
@ZackerySpytz - I made https://github.com/python/cpython/pull/26196 with a test for the desired behavior; hopefully it helps. I could try to adapt Barry's old patch myself, but it's probably better if somebody C-competent does so...
msg393871 - Author: Mark Bell (Mark.Bell) * Date: 2021-05-18 13:13
So I have taken a look at the original patch that was provided and I have been able to update it so that it is compatible with the current release. I have also flipped the logic in the wrapping functions so that they take a `keepempty` flag (which is the opposite of the `prune` flag). 
I had to make a few extra changes since there are now some extra catches in things like PyUnicode_Split which spot that if len(self) < len(sep) then they can just return [self]. However that now needs an extra test since that shortcut can only be used if len(self) > 0. You can find the code here: https://github.com/markcbell/cpython/tree/split-keepempty
However in exploring this, I'm not sure that this patch interacts correctly with maxsplit. For example, 
 '  x y z'.split(maxsplit=1, keepempty=True)
results in
 ['', '', 'x', 'y z']
since the first two empty string items are "free" and don't count towards the maxsplit. I think the length of the result returned must be <= maxsplit + 1, is this right?
I'm about to rework the logic to avoid this, but before I go too far could someone double check my test cases to make sure that I have the correct idea about how this is supposed to work please. Only the 8 lines marked "New case" show new behaviour, all the others come from how string.split works currently. Of course the same patterns should apply to bytestrings and bytearrays.
 ''.split() == []
 ''.split(' ') == ['']
 ''.split(' ', keepempty=False) == [] # New case
 '  '.split(' ') == ['', '', '']
 '  '.split(' ', maxsplit=1) == ['', ' ']
 '  '.split(' ', maxsplit=1, keepempty=False) == [' '] # New case
 '  a b c  '.split() == ['a', 'b', 'c']
 '  a b c  '.split(maxsplit=0) == ['a b c  ']
 '  a b c  '.split(maxsplit=1) == ['a', 'b c  ']
 '  a b c  '.split(' ') == ['', '', 'a', 'b', 'c', '', '']
 '  a b c  '.split(' ', maxsplit=0) == ['  a b c  ']
 '  a b c  '.split(' ', maxsplit=1) == ['', ' a b c  ']
 '  a b c  '.split(' ', maxsplit=2) == ['', '', 'a b c  ']
 '  a b c  '.split(' ', maxsplit=3) == ['', '', 'a', 'b c  ']
 '  a b c  '.split(' ', maxsplit=4) == ['', '', 'a', 'b', 'c  ']
 '  a b c  '.split(' ', maxsplit=5) == ['', '', 'a', 'b', 'c', ' ']
 '  a b c  '.split(' ', maxsplit=6) == ['', '', 'a', 'b', 'c', '', '']
 '  a b c  '.split(' ', keepempty=False) == ['a', 'b', 'c'] # New case
 '  a b c  '.split(' ', maxsplit=0, keepempty=False) == ['  a b c  '] # New case
 '  a b c  '.split(' ', maxsplit=1, keepempty=False) == ['a', 'b c  '] # New case
 '  a b c  '.split(' ', maxsplit=2, keepempty=False) == ['a', 'b', 'c  '] # New case
 '  a b c  '.split(' ', maxsplit=3, keepempty=False) == ['a', 'b', 'c', ' '] # New case
 '  a b c  '.split(' ', maxsplit=4, keepempty=False) == ['a', 'b', 'c'] # New case
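One possible reading of the keepempty=False cases can be written in pure Python. This is a sketch under stated assumptions, not the code in the PR: `split_drop_empty` is a hypothetical name, dropped empty pieces do not count towards maxsplit, and whatever remains once maxsplit is reached is appended verbatim if non-empty. The input is taken to be 'a b c' padded with two spaces on each side, as the seven-element result of split(' ') implies. Under this reading the 'a b c' cases come out as listed, but the two-space string with maxsplit=1 yields [] rather than [' '], so treat it as illustrative only.

```python
def split_drop_empty(s, sep, maxsplit=-1):
    """Hypothetical stand-in for s.split(sep, maxsplit, keepempty=False)."""
    if not sep:
        raise ValueError("empty separator")
    parts = []
    start = 0
    # Each kept piece corresponds to one counted split.
    while maxsplit < 0 or len(parts) < maxsplit:
        end = s.find(sep, start)
        if end == -1:
            break
        if end > start:
            parts.append(s[start:end])  # non-empty piece: kept and counted
        start = end + len(sep)          # empty piece: dropped, not counted
    rest = s[start:]
    if rest:
        parts.append(rest)              # remainder is kept as-is
    return parts

s = '  a b c  '
for n in (0, 1, 2, 3, 4):
    print(n, split_drop_empty(s, ' ', n))
```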
msg393883 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2021-05-18 16:41
The case:
 '  a b c  '.split(maxsplit=1) == ['a', 'b c  ']
suggests that empty strings don't count towards maxsplit, otherwise it would return [' a b c  '] (i.e. the split would give ['', ' a b c  '] and dropping the empty strings would give [' a b c  ']).
msg393889 - Author: Mark Bell (Mark.Bell) * Date: 2021-05-18 17:07
> suggests that empty strings don't count towards maxsplit
Thank you for the confirmation. Although just to clarify I guess you really mean "empty strings *that are dropped from the output* don't count towards maxsplit". Just to double check this, what do we expect the output of
 '  x y z'.split(maxsplit=1, keepempty=True)
to be?
I think it should be ['', ' x y z'] since in this case we are retaining empty strings and they should count towards the maxsplit.
(In the current patch this crashes with a core dump since it tries to write to unallocated memory)
msg393892 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2021-05-18 17:38
The best way to think of it is that .split() is like .split(' '), except that it's splitting on any whitespace character instead of just ' ', and keepempty is defaulting to False instead of True.
Therefore:
 '  x y z'.split(maxsplit=1, keepempty=True) == ['', ' x y z']
because:
 '  x y z'.split(' ', maxsplit=1) == ['', ' x y z']
but:
 '  x y z'.split(maxsplit=1, keepempty=False) == ['x y z']
At least, I think that's the case!
msg393896 - Author: Mark Bell (Mark.Bell) * Date: 2021-05-18 18:34
So I think I agree with you about the difference between .split() and .split(' '). However wouldn't that mean that
 '  x y z'.split(maxsplit=1, keepempty=False) == ['x', 'y z']
since it should do one split.
msg393902 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2021-05-18 19:41
We have that already, although it's spelled:
 '  x y z'.split(maxsplit=1) == ['x', 'y z']
because the keepempty option doesn't exist yet.
msg394012 - Author: Mark Bell (Mark.Bell) * Date: 2021-05-20 10:54
Thank you very much for confirming these test cases. Using these I believe that I have now been able to complete a patch that would implement this feature. The PR is available at https://github.com/python/cpython/pull/26222. As I am a first-time contributor, please could a maintainer approve running the CI workflows so that I can confirm that all the (new) tests pass.
msg394128 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2021-05-21 17:13
I've only just realised that the test cases don't cover all eventualities: none of them test what happens with multiple spaces _between_ the letters, such as:
 '  a  b  c  '.split(maxsplit=1) == ['a', 'b  c  ']
Comparing that with:
 '  a  b  c  '.split(' ', maxsplit=1)
you see that passing None as the split character does not mean "any whitespace character". There's clearly a little more to it than that.
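The difference is visible on current Python. In this small check the string has two spaces around and between the letters; the third call uses re.split with `\s` to show what "split on any single whitespace character" would give instead.

```python
import re

s = '  a  b  c  '

# sep=None: runs of whitespace are one separator, and leading whitespace of
# both the string and the unsplit remainder is skipped.
print(s.split(maxsplit=1))             # ['a', 'b  c  ']

# Explicit separators: every single character counts, nothing is skipped.
print(s.split(' ', maxsplit=1))        # ['', ' a  b  c  ']
print(re.split(r'\s', s, maxsplit=1))  # ['', ' a  b  c  ']
```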
msg394928 - Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-06-02 17:42
I'm not sure I understand why the discussion was focused on removing *all* empty values.
Consider this in the context of a csv-like string:
1. 'a,b,c' => [a,b,c] # of course
2. ',,' => ['','',''] # follows naturally from above
3. '' => [] # arguably most intuitive
4. '' => [''] # less intuitive but can be correct
From the point of view of intent of the initial string, the first two
are clear - 3 values are provided, in 2) they just happen to be empty.
It's up to the later logic to skip empty values if needed.
The empty string is ambiguous because the intent may be no values or a single empty value.
So ideally the new API would let me choose explicitly between 3) and 4). But I don't see why it would affect 2) !!
The processing of 2) is already not ambiguous. That's what I would want any version of split() to do, and later filter or skip empty values.
Current patch either forces me to choose 4) or to explicitly choose but
also break normal, "correct" handling of 2). 
It can lead to bugs as follows:
Let's say I have a csv-like string:
col1,col2,col3
1,2,3
a,b,c
I note that row 2 creates an empty col1 value, which is probably not what I want. I look at split() args and think that keepempty=False is designed for this use case. I use it in my code. Next time the code will break when someone adds a row:
a,,c
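The failure mode can be reproduced today by emulating the flag with a filter. This is a sketch: `drop_empty` is a hypothetical stand-in for `line.split(sep, keepempty=False)` as proposed, and the rows are sample data that include one blank line.

```python
def drop_empty(line, sep=','):
    """Stand-in for the proposed line.split(sep, keepempty=False)."""
    return [field for field in line.split(sep) if field]

rows = ['col1,col2,col3', '1,2,3', '', 'a,,c']
for row in rows:
    print(drop_empty(row))
# ['col1', 'col2', 'col3']
# ['1', '2', '3']
# []
# ['a', 'c']   <- 'c' has moved from col3 to col2
```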
msg395159 - Author: Mark Bell (Mark.Bell) * Date: 2021-06-05 09:46
Andrei: That is a very interesting observation, thank you for pointing it out. I guess your example / argument also currently applies to whitespace separation too. For example, if we have a whitespace separated string with contents:
col1 col2 col3
a b c

x y z
then using [row.split() for row in contents.splitlines()] results in
[['col1', 'col2', 'col3'], ['a', 'b', 'c'], [], ['x', 'y', 'z']]
However if later a user appends the row:
p  q
aiming to have p, an empty cell and then q, then they will actually get
[['col1', 'col2', 'col3'], ['a', 'b', 'c'], [], ['x', 'y', 'z'], ['p', 'q']]
So at least this patch results in behaviour that is consistent with how split currently works. 
Are you suggesting that this is something that could be addressed by clearer documentation or using a different flag name?
msg395167 - Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-06-05 15:43
Mark:
With sep=None, I don't think there is an issue. My only concern is when sep is set to some other value.
The original issue was that the single empty str result is removed when using sep=None and that it's kept when sep is some other value. So the most direct solution would seem to be to have a flag that controls the removal/retention of a single empty str in results.
Instead, the discussion was focused on removing *all* empty strings from the result.
My concern is that this doesn't solve the original issue in some cases, i.e. if I want to use a sep other than None, and I want an empty line to mean there are no values (result=[]), but I do want to keep empty values (a,, => [a,'','']) -- all of these seem like fairly normal, not unusual requirements.
The second concern, as I noted in my previous message, is the potential for bugs if this flag is interpreted narrowly as a solution for the original issue only.
[Note I don't think it would be a very widespread bug but I can see it happening occasionally.]
I think to avoid both of these issues we could change the flag to narrowly target the original issue, i.e. one empty str only. The name of the flag can remain the same or possibly something like `keep_single_empty` would be more explicit (though a bit awkward).
The downside is that we'd lose the convenience of splitting and filtering out all empties in one operation.
Sorry that I bring this up only now when the discussion was finished and the work on PR completed; I wish I had seen the issue sooner.
msg395169 - Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-06-05 15:50
To clarify with pseudocode, this is how it could work:
'' => [] # sep=None, keep_single_empty=False
'' => [''] # sep=None, keep_single_empty=True
'' => [] # sep=',', keep_single_empty=False
'a,,' => ['a','',''] # sep=',', keep_single_empty=False
I guess `keepempty=False` could be too easily confused for filtering out all empties.
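The pseudocode above can be turned into a small wrapper around the existing method. This is a sketch of the narrower semantics only: `split_fields` is a hypothetical function name, and only the lone-empty-string result is affected.

```python
def split_fields(s, sep=None, *, keep_single_empty=False):
    """Hypothetical: only the lone-empty-string result is affected."""
    parts = s.split(sep)
    if keep_single_empty:
        return parts or ['']  # sep=None gives [] for blank input
    return [] if parts == [''] else parts

print(split_fields(''))                          # []
print(split_fields('', keep_single_empty=True))  # ['']
print(split_fields('', ','))                     # []
print(split_fields('a,,', ','))                  # ['a', '', '']
```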
msg395184 - Author: Mark Bell (Mark.Bell) * Date: 2021-06-05 20:18
> Instead, the discussion was focused on removing *all* empty strings from the result.
I imagine that the discussion focussed on this since this is precisely what happens when sep=None. For example, 'a b c '.split() == ['a', 'b', 'c']. I guess that the point was to provide users with explicit, manual control over whether the behaviour of split should drop all empty strings or retain all empty strings instead of this decision just being made on whether sep is None or not.
So I wonder whether the "expected" solution for parsing CSV-like strings is for you to actually filter out the empty strings yourself and never pass them to split at all. For example by doing something like:
[line.split(sep=',') for line in content.splitlines() if line]
but if this is the case then this is the kind of thing that would require careful thought about what is the right name for this parameter / right way to express this in the documentation to make sure that users don't fall into the trap that you mentioned.
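As a rough illustration of that workaround with current str.split(), the sketch below runs it on a small sample. The sample `content` and the variable names are made up for this example. The second variant is an assumption about what a caller might want when a blank line should still produce a row.

```python
content = "a,b\n\nc,,d\n"

# Variant from the message above: blank lines are skipped entirely.
rows_skipped = [line.split(sep=",") for line in content.splitlines() if line]
print(rows_skipped)  # [['a', 'b'], ['c', '', 'd']]

# Alternative: keep one row per line, mapping a blank line to an empty list.
rows_kept = [line.split(sep=",") if line else [] for line in content.splitlines()]
print(rows_kept)  # [['a', 'b'], [], ['c', '', 'd']]
```

Both variants keep the empty field inside 'c,,d'; they differ only in whether the blank line disappears or becomes [].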
> Sorry that I bring this up only now when the discussion was finished and the work on PR completed; I wish I had seen the issue sooner.
Of course, but the main thing is that you spotted this before the PR was merged :)
msg395193 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-06-06 01:22
> I imagine that the discussion focussed on this since this is precisely what happens when sep=None. For example, 'a b c '.split() == ['a', 'b', 'c']. I guess that the point was to provide users with explicit, manual control over whether the behaviour of split should drop all empty strings or retain all empty strings instead of this decision just being made on whether sep is None or not.
That's true on some level but it seems to me that it's somewhat more nuanced than that.
The intent of sep=None is not to remove empties but to collapse invisible whitespace of mixed types into a single separator. ' \t ' probably means a single separator because it looks like one visually. Yes, the effect is the same as removing empties but it's a relevant distinction when designing (and naming) a flag to make split() consistent with this behaviour when sep is ',', ';', etc.
Because when you have 'a,,,', the most likely intent is to have 3 empty values, NOT to collapse 3 commas into a single sep; you might then have additional processing that gets rid of empties as part of the split() operation. So it's quite a different operation, even though the end effect is the same. So is this change really making the behaviour consistent? To me, consistency implies that the intent is roughly the same and the outcome is also roughly the same.
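The distinction can be seen with current str.split() behaviour; the variable names below are only for illustration.

```python
# sep=None: any run of whitespace, of mixed kinds, acts as one separator.
whitespace_fields = "a \t b".split()
print(whitespace_fields)      # ['a', 'b']

# Explicit sep: every occurrence separates, so adjacent commas yield empty fields.
comma_fields = "a,,,".split(",")
print(comma_fields)           # ['a', '', '', '']

# The same holds for an explicit whitespace sep: nothing is collapsed.
space_fields = " \t ".split(" ")
print(space_fields)           # ['', '\t', '']
```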
You might say, but: practicality beats purity?
However, there are some real issues here:
- harder to explain, remember, document.
- naming issue
- not completely solving the initial issue (and it would most likely leave no practical way to patch up that corner case if this PR is accepted)
Re: naming, for example, using keep_empty=False for sep=None is confusing: it would seem that most (or even all) users would think of the operation as collapsing contiguous mixed whitespace into a single separator rather than splitting everything up and then purging empties. So this name could cause a fair bit of confusion for this case.
What if we call it `collapse_contiguous_separators`? I can live with an awkward name, but even then it doesn't work for a case like 'a,,,,' -- it doesn't make sense (mostly) to collapse 4 commas into one separator. Here you are actually purging empty values.
So the consistency seems labored in that any name you pick would be confusing for some cases.
And is the consistency for this case really needed? Is it common to have something like 'a,,,,' and say "I wish to get rid of those empty values but I don't want to use filter(None, values)"?
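For reference, the existing idiom mentioned here looks roughly like this (the variable names are illustrative):

```python
values = "a,,,,".split(",")
print(values)                      # ['a', '', '', '', '']

# filter(None, ...) drops every falsy item, i.e. every empty string here.
non_empty = list(filter(None, values))
print(non_empty)                   # ['a']

# An equivalent list comprehension.
non_empty_lc = [v for v in values if v]
print(non_empty_lc)                # ['a']
```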
In regard to the workaround you suggested, that seems fine. If this PR is accepted, any of the workarounds that people now use for ''.split(',') or similar would still work just as before.
msg395194 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-06-06 01:25
> Of course, but the main thing is that you spotted this before the PR was merged :)
I know, better late than never but also better sooner than late :-)
msg395926 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-06-16 14:27
Just to sum up the current state the way I see it, as well as the history of the discussion, I think there were 2 initial requests based on experience and one additional, more theoretical "nice to have":
A. ''.split() => ['']
B. ''.split(sep) => [] # where sep!=None
C. a way to get the current semantics of sep=None, but with specific whitespace separators like just spaces or just tabs. 'a  b'.split(' ') => ['a','b'] (with two spaces between a and b)
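For comparison, here is what the three cases return with current str.split(), followed by one possible way to get each requested result today. The variable names are illustrative and the idioms are only a sketch, not what either PR implements.

```python
# Current behaviour.
current_a = "".split()          # []              (A asks for a way to get [''])
current_b = "".split(",")       # ['']            (B asks for a way to get [])
current_c = "a  b".split(" ")   # ['a', '', 'b']  (two spaces; C asks for ['a', 'b'])
print(current_a, current_b, current_c)

# Possible idioms for the requested results.
s = ""
wanted_a = s.split() or [""]                        # ['']
wanted_b = s.split(",") if s else []                # []
wanted_c = [p for p in "a  b".split(" ") if p]      # ['a', 'b']
print(wanted_a, wanted_b, wanted_c)
```

Note that the idiom for A also returns [''] for whitespace-only input, which may or may not be wanted.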
The idea was to cover all 3 enhancements with the current patch.
As I pointed out in the comments above, the current patch does not "cleanly" cover case B, potentially leading to confusion and/or bugs.
My suggestion was to cover cases A and B, and leave out C, potentially for some future patch.
If we go with the current patch, there will be no practical way to fix the issue with B later, other than adding a new `str.split2`. Conversely, it would be possible to add a new flag to handle C in the future.
This leads to a few questions:
- will the issue I brought up not really be a problem in practice?
- what's more important, B or C?
- if both B and C are important, can we leave C for a future patch?
History
Date User Action Args
2022-04-11 14:58:40 admin set github: 73123
2021-06-16 14:27:14 andrei.avk set messages: + msg395926
2021-06-06 01:25:14 andrei.avk set messages: + msg395194
2021-06-06 01:22:50 andrei.avk set messages: + msg395193
2021-06-05 20:18:34 Mark.Bell set messages: + msg395184
2021-06-05 15:50:36 andrei.avk set messages: + msg395169
2021-06-05 15:43:11 andrei.avk set messages: + msg395167
2021-06-05 09:46:30 Mark.Bell set messages: + msg395159
2021-06-02 17:42:54 andrei.avk set nosy: + andrei.avk; messages: + msg394928
2021-05-21 17:13:13 mrabarnett set messages: + msg394128
2021-05-20 10:54:07 Mark.Bell set messages: + msg394012
2021-05-18 22:47:21 Mark.Bell set pull_requests: + pull_request24839
2021-05-18 19:41:05 mrabarnett set messages: + msg393902
2021-05-18 18:34:35 Mark.Bell set messages: + msg393896
2021-05-18 17:38:20 mrabarnett set messages: + msg393892
2021-05-18 17:07:52 Mark.Bell set messages: + msg393889
2021-05-18 16:41:41 mrabarnett set messages: + msg393883
2021-05-18 13:13:51 Mark.Bell set nosy: + Mark.Bell; messages: + msg393871
2021-05-17 19:12:49 Catherine.Devlin set messages: + msg393822
2021-05-17 19:09:13 Catherine.Devlin set nosy: + Catherine.Devlin; pull_requests: + pull_request24813; stage: test needed -> patch review
2021-01-12 14:05:00 corona10 set nosy: + corona10
2021-01-04 17:50:49 gvanrossum set messages: + msg384337
2021-01-04 11:15:53 ZackerySpytz set versions: + Python 3.10, - Python 3.8; nosy: + ZackerySpytz; messages: + msg384321; assignee: ZackerySpytz
2021-01-04 03:37:27 rhettinger set nosy: - rhettinger
2021-01-04 02:21:42 gvanrossum set messages: + msg384306
2021-01-04 02:06:27 karlcow set nosy: + karlcow
2019-10-18 16:26:59 Philippe Cloutier set messages: + msg354908
2019-10-18 16:10:05 Philippe Cloutier set nosy: + Philippe Cloutier; messages: + msg354907; title: str.split(): remove empty strings when sep is not None -> str.split(): allow removing empty strings (when sep is not None)
2019-03-20 01:41:46 abarry set nosy: - abarry
2019-03-20 01:41:30 abarry set messages: + msg338429
2019-03-19 23:41:08 barry set messages: + msg338422
2019-03-19 22:34:53 cheryl.sabella set nosy: + cheryl.sabella; messages: + msg338412; versions: + Python 3.8, - Python 3.7
2016-12-12 17:28:41 veky set messages: + msg283031
2016-12-12 17:12:14 gvanrossum set messages: + msg283029
2016-12-12 16:42:20 abarry set files: + split_prune_1.patch; messages: + msg283025
2016-12-12 16:41:55 abarry set files: - split_prune_1.patch
2016-12-12 16:26:08 barry set messages: + msg283021
2016-12-12 16:16:06 gvanrossum set messages: + msg283019
2016-12-12 15:32:35 barry set messages: + msg283011
2016-12-12 06:18:20 rhettinger set nosy: + gvanrossum; messages: + msg282966
2016-12-12 05:09:27 veky set nosy: + veky; messages: + msg282963
2016-12-12 04:38:29 abarry set messages: + msg282962
2016-12-12 04:22:01 syeberman set nosy: + syeberman; messages: + msg282961
2016-12-12 02:31:57 abarry set files: + split_prune_1.patch; keywords: + patch; messages: + msg282958; stage: test needed
2016-12-11 23:58:37 abarry set messages: + msg282954
2016-12-11 19:00:22 rhettinger set nosy: + rhettinger; messages: + msg282936
2016-12-11 18:17:02 mrabarnett set nosy: + mrabarnett; messages: + msg282935
2016-12-11 16:29:47 abarry set messages: + msg282932
2016-12-11 16:03:13 barry set messages: + msg282931
2016-12-11 16:01:56 abarry set messages: + msg282930
2016-12-11 15:57:39 serhiy.storchaka set messages: + msg282929
2016-12-11 15:38:20 abarry set messages: + msg282928
2016-12-11 15:35:54 barry set messages: + msg282927
2016-12-11 15:32:40 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg282926
2016-12-11 15:26:42 abarry set type: enhancement; messages: + msg282925; nosy: + abarry
2016-12-11 15:11:42 barry create
