
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: str.splitlines splitting on non-\r\n characters
Type: behavior Stage: needs patch
Components: Documentation Versions: Python 3.4, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Alexander Schrijver, barry, docs@python, ezio.melotti, gregory.p.smith, jwilk, lemburg, martin.panter, nascheme, python-dev, r.david.murray, scharron, serhiy.storchaka, steven.daprano, terry.reedy, vstinner
Priority: normal Keywords: patch

Created on 2014-08-20 10:01 by scharron, last changed 2022-04-11 14:58 by admin.

Files
File name Uploaded Description Edit
cpython3.5_splitlines.diff Alexander Schrijver, 2016-05-31 21:14 review
cpython3.5_splitlines.diff Alexander Schrijver, 2016-05-31 21:16 review
cpython2.7_splitlines.diff Alexander Schrijver, 2016-05-31 21:20 review
Messages (34)
msg225561 - (view) Author: Samuel Charron (scharron) Date: 2014-08-20 10:01
According to the documentation, str.splitlines uses the universal newlines to split lines.
The documentation says it's all about \r, \n, and \r\n (https://docs.python.org/3.5/glossary.html#term-universal-newlines)
However, it's also splitting on other characters. Reading the code, it seems the list of characters is from Objects/unicodeobject.c , in _PyUnicode_Init, the linebreak array.
When testing any of these characters, it splits the string.
Other libraries use str.splitlines assuming it only breaks on \r and \n. This is the case for email.feedparser, for instance, which http.client uses to parse headers. These HTTP headers should be separated by CRLF, as specified by http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.
Either the documentation should state that splitlines splits on other characters or it should stick to the documentation and split only on \r and \n characters.
If it splits on other characters, the list could be improved, as the Unicode reference lists the mandatory characters for line breaking: http://www.unicode.org/reports/tr14/tr14-32.html#BK
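To make the report concrete, here is a minimal sketch of the mismatch. The header block is made up for illustration (the header names and values are assumptions, not taken from a real request); it compares str.splitlines() with an explicit split on CRLF.

```python
# Hypothetical header block: two CRLF-separated headers, with a vertical tab
# (\x0b) embedded in the value of the second header.
raw_headers = 'Host: example.org\r\nX-Note: a\x0bb'

# str.splitlines() also breaks at \x0b, producing a spurious third "line".
via_splitlines = raw_headers.splitlines()

# Splitting on CRLF only keeps the second header intact.
via_crlf = raw_headers.split('\r\n')

print(via_splitlines)  # ['Host: example.org', 'X-Note: a', 'b']
print(via_crlf)        # ['Host: example.org', 'X-Note: a\x0bb']
```

A parser that expects one header per element would see three entries in the first result and two in the second.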
msg225564 - (view) Author: Samuel Charron (scharron) Date: 2014-08-20 12:21
For an example of a serious bug caused by this, see http://bugs.python.org/issue22233 
msg225705 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-08-22 20:34
Objects/unicodeobject.c linebreak is at 266. With 3.4.1:
>>> 'a\x0ab\x0bc\x0cd\x0d1c\x1c1d\x1d1e\x1e'.splitlines()
['a', 'b', 'c', 'd', '1c', '1d', '1e']
\x0a == \n, \x0d == \r
The \r\n pair is a special case, as promised, but other pairs are not.
>>> 'a\r\nb'.splitlines()
['a', 'b']
>>> 'a\x0b\nb'.splitlines()
['a', '', 'b']
msg225709 - (view) Author: Samuel Charron (scharron) Date: 2014-08-22 20:49
It's also at line #14941 for unicode strings if I understand correctly
With 3.4.0: 
>>> "a\x85b\x1ec".splitlines()
['a', 'b', 'c']
msg225723 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-08-23 02:44
See issue 7643 for some technical background. There are some other interesting issues to read if you search the tracker for 'splitlines unicode', one of which is an open doc issue. Clearly the docs about this are inadequate.
Basically, though, I think you are correct. email should not be using splitlines(). It was more or less correct when email was splitting binary data, but even then it wasn't exactly correct per the letter of RFC.
Unfortunately not using splitlines has some performance implications...but then again we haven't done any sort of performance improvement pass on the new email code, so it may well be marginal in the overall scheme of things.
msg225736 - (view) Author: Samuel Charron (scharron) Date: 2014-08-23 07:47
This is a known issue and will be resolved by improving the documentation, so I'm closing this bug.
Thanks !
msg225747 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-08-23 13:45
OK, we'll use issue 22232 to resolve the issue of email using splitlines.
msg225748 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-08-23 14:00
Maybe add a parameter to str.splitlines() which will switch the behavior to split on '\n', '\r' and '\r\n' only?
msg225755 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-08-23 18:24
Unless there is already another issue for improving the doc, this should at least be left open as a doc issue.
But I had the same thought as Serhiy, that we should at least optionally make the current doc correct. Two possibilities:
newlines=False If true, only split on \r, \n, \r\n; otherwise split on all latin-1 linebreak characters -- <list>. {This is rather awkward.}
linebreak=True If true, split on all latin-1 linebreak characters <list>; otherwise only split on \r, \n, \r\n. {Better, to me}
Changing both code and doc, at least in 3.5, says that both are wrong. If we agree on this, there is still the awkward issue of what to do for 3.4. Just change the doc? Then email must do something different in 3.4 to work around the code behavior. I think this may warrant a pydev discussion.
Another issue is whether latin-1 linebreaks are privileged. Why not implement the full unicode linebreak algorithm?
An additional complication is that in 2.x, .splitlines acts as advertised.
>>> 'a\x0ab\x0bc\x0cd\x0dda\x0d\x0a1c\x1c1d\x1d1e\x1e85\x85end'.splitlines()
['a', 'b\x0bc\x0cd', 'da', '1c\x1c1d\x1d1e\x1e85\x85end']
msg225758 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-08-23 19:25
I don't understand why you mention latin-1. splitlines() supports linebreaks outside the latin-1 range.
>>> [hex(i) for i in range(sys.maxunicode + 1) if len(('%cx' % i).splitlines()) == 2]
['0xa', '0xb', '0xc', '0xd', '0x1c', '0x1d', '0x1e', '0x85', '0x2028', '0x2029']
"newlines" and "linebreak" don't look good to me. And it is not obvious why true or false value corresponds to one or another variant.
msg225766 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-08-23 22:01
I was not aware of the remainder of the undocumented behavior. Thanks for the code that makes it clear.
linebreak (or linebreaks)=True means that splitting occurs on some (approximation?*) of unicode mandatory linebreaks, as opposed to just the ascii 'universal newline' sequences, as defined in our glossary. Possible alternative: restrict=False (restrict to u. newlines?)
*I did not read the annex in enough detail to know either way.
The following pair of experiments, which I should have run before, show that there has been no real change of behavior from 2.x to 3.x.
# 2.7.8
>>> u'a\x0ab\x0bc\x0cd\x0dda\x0d\x0a1c\x1c1d\x1d1e\x1e85\x852028\u20282029\u2029end'.splitlines()
[u'a', u'b', u'c', u'd', u'da', u'1c', u'1d', u'1e', u'85', u'2028', u'2029', u'end']
# 3.4.1
>>> b'a\x0ab\x0bc\x0cd\x0dda\x0d\x0a1c\x1c1d\x1d1e\x1e85\x85end'.splitlines()
[b'a', b'b\x0bc\x0cd', b'da', b'1c\x1c1d\x1d1e\x1e85\x85end']
Given this, I am a bit dubious about adding a new parameter in 3.5 to make the unicode method act like the bytes method. Part of my support for that was thinking that it would help porting code. But that is not true. In both 2 and 3, there is the possibility to latin-1 encode, split, and latin-1 decode the pieces.
The doc correction clearly needed is that the 3.4+ universal newlines glossary entry needs to be updated from 'str.splitlines' to 'bytes.splitlines'. I will try to do this.
A second doc problem is that the docstrings (given by help(x.splitlines)) are exactly the same for bytes.splitlines and unicode.splitlines, in both 2.x and 3.x, even though the behavior is different even for ascii. I think each should list what it splits on. Ditto for the doc, if it does not already. This should have a patch posted for review.
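The encode, split, decode route mentioned above can be sketched as follows. The helper name splitlines_via_bytes is an assumption made for illustration, not an existing API. The sketch relies on bytes.splitlines() breaking only on \n, \r and \r\n. Note that the latin-1 codec raises UnicodeEncodeError for characters above U+00FF; passing encoding='utf-8' avoids that, because the bytes 0x0a and 0x0d never appear inside a multi-byte UTF-8 sequence.

```python
def splitlines_via_bytes(s, keepends=False, encoding='latin-1'):
    # Hypothetical helper: encode, let bytes.splitlines() do the splitting
    # (it only breaks on LF, CR and CRLF), then decode each piece again.
    pieces = s.encode(encoding).splitlines(keepends)
    return [piece.decode(encoding) for piece in pieces]


sample = 'a\x0bb\r\nc\x85d\ne'
print(sample.splitlines())           # ['a', 'b', 'c', 'd', 'e']
print(splitlines_via_bytes(sample))  # ['a\x0bb', 'c\x85d', 'e']
```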
msg225767 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-08-23 22:30
New changeset 3ad59ed0f4f0 by Terry Jan Reedy in branch '3.4':
Issue #22232 (partial fix): update Universal newlines Glossary entry.
http://hg.python.org/cpython/rev/3ad59ed0f4f0 
msg225769 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-08-23 23:12
Glossary fixed. I changed the components to Documentation as you will handle email elsewhere.
For library references: The key sentence currently used in all entries is "This method uses the universal newlines approach to splitting lines.", where *universal newlines* is linked to the glossary.
2.x has one entry for str and unicode. I propose to add "Unicode.splitlines also splits on '\x0b' ('\v'), '\x0c' ('\f'), '\x1c', '\x1d', '\x1e', '\x85', '\u2028', and '\u2029'." 
3.x bytes entry is good as is.
3.x str entry is wrong. Replace with "This method splits on universal newlines and also on '\x0b' ('\v'), '\x0c' ('\f'), '\x1c', '\x1d', '\x1e', '\x85', '\u2028', and '\u2029'." 
The docstrings now contain about the same as the docs, minus the key line above.
" Return a list of the lines in S, breaking at line boundaries.
 Line breaks are not included in the resulting list unless keepends
 is given and true."
Between the sentences, I propose to add:
"Boundaries are indicated by 'universal newlines' ('\x0a' ('\n'), '\x0d' ('\r'), and '\x0d\x0a' ('\r\n'))." for bytes,
 with the addition of "and '\x0b' ('\v'), '\x0c' ('\f'), '\x1c', '\x1d', '\x1e', '\x85', '\u2028', and '\u2029'" for unicode.
msg225873 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-08-25 14:35
The existing related open doc issue is issue 12855.
msg225874 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-08-25 14:40
Ideally str.splitlines would split on whatever the unicode database says are mandatory line break characters. I take it this is currently not true? That is, that the list is hardcoded?
msg230138 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-10-28 11:03
Looks like str.splitlines is using STRINGLIB_ISLINEBREAK which in turn uses Py_UNICODE_ISLINEBREAK, so the behavior should be correct. If splitting on \n, \r, and \r\n only is common enough, we might add a bool arg to splitlines to restrict the splitting to those 3 only, but I can't think of any good name for such an arg.
msg230143 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-28 12:35
With Terry's explanation, "linebreak" looks better to me. Yet another alternative is ascii=False (or unicode=True?). And it may be worth adding this parameter to strip/rstrip/lstrip/split too. On the other hand, regular expressions can be used in such special cases.
msg230144 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-10-28 12:45
There are some ascii line breaks other than \n, \r, \r\n.
unicode=True might be better, but might be confused with unicode strings.
Maybe unicode_linebreaks or unicode_newlines?
msg230152 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-28 14:19
See also issue18236.
msg246547 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-07-10 08:09
The main documentation has been updated and Issue 12855 has been closed. What is left to do here, considering this is marked as a documentation bug? Just modify the doc strings, as Terry suggested in <https://bugs.python.org/issue22232#msg225766>?
msg246570 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-07-10 17:11
If this isn't already mentioned in the 2 to 3 porting notes, it is worth highlighting there. Code which uses a str in Python 2 and still uses a str in Python 3 is now splitting on many more characters.
That seems to be the source of bugs like issue22233. splitlines() used to work for the strict \r\n splitting task. Now that code needs to be made explicit about its splitting desires.
msg266781 - (view) Author: Alexander Schrijver (Alexander Schrijver) Date: 2016-05-31 21:14
This diff updates the cpython (tip) documentation to document the different behaviour when using splitlines on bytes objects or string objects.
msg266782 - (view) Author: Alexander Schrijver (Alexander Schrijver) Date: 2016-05-31 21:16
This diff synchronizes the cpython 2.7 documentation with that from 3.5 and also describes the difference between bytes objects and unicode objects (from the other diff).
msg266785 - (view) Author: Alexander Schrijver (Alexander Schrijver) Date: 2016-05-31 21:20
Oops, wrong diff. Sorry, this is the correct one for 2.7.
msg266789 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-05-31 23:05
For Python 3, the bytes.splitlines() and bytearray.splitlines() documentation has been moved out to a separate section (Issue 21777). I don’t think it is good to add much detail of bytes.splitlines() in the str.splitlines() documentation.
For Python 2, perhaps see Matthew’s patches for 2.7 in Issue 12855. IMO we could reopen that bug if that helps, because only the Python 3 branches were committed.
msg266800 - (view) Author: Alexander Schrijver (Alexander Schrijver) Date: 2016-06-01 07:27
I appear to have missed the reference to that issue when I read this issue the first time. Re-opening that issue makes sense to me.
msg327105 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2018-10-05 06:14
If we introduce a keyword parameter, I think the default of str.splitlines() should be changed to match bytes.splitlines (and match Python 2 str.splitlines()). I.e. split on \r and \n by default. I looked through the stdlib and I can't find any calls that should actually be splitting on the extra characters. I will check it again though.
Does anyone have an example of where the current behaviour is actually wanted?
msg327106 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-10-05 06:42
If we change the default behavior, we need to wait several releases after adding this option. Users should be able to pick the current behavior explicitly.
Currently the workaround is using regular expressions.
For s.splitlines(keepends=False):
 re.split(r'\n|\r\n?', s)
For s.splitlines(keepends=True):
 re.split(r'(?<=\n)|(?<=\r)(?!\n)', s)
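As a sketch of how these two patterns might be wrapped, the helper below (the name ascii_splitlines is hypothetical) also drops the trailing empty string that re.split() returns when the input ends with a line break, which splitlines() does not produce. The keepends pattern splits on empty matches, which re.split() supports from Python 3.7 on.

```python
import re


def ascii_splitlines(s, keepends=False):
    # Hypothetical wrapper around the two regular expressions quoted above.
    if keepends:
        # Zero-width split points: after \n, or after a \r not followed by \n.
        parts = re.split(r'(?<=\n)|(?<=\r)(?!\n)', s)
    else:
        parts = re.split(r'\n|\r\n?', s)
    # re.split() leaves a final '' when s is empty or ends with a line break;
    # splitlines() does not, so remove it to match.
    if parts[-1] == '':
        parts.pop()
    return parts


sample = 'a\nb\r\nc\x0bd\re\x85f\n'
print(ascii_splitlines(sample))                 # ['a', 'b', 'c\x0bd', 'e\x85f']
print(ascii_splitlines(sample, keepends=True))  # ['a\n', 'b\r\n', 'c\x0bd\r', 'e\x85f\n']
```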
msg327113 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2018-10-05 08:17
I am -1 on changing the default behavior. The Unicode standard defines what a linebreak code point is (all code points with character properties Zl or bidirectional property B) and we adhere to that. This may confuse parsers coming from the ASCII world, but that's really a problem with those parsers assuming that .splitlines() only splits on ASCII line breaks, i.e. they are not written in a Unicode compatible way.
As mentioned in https://bugs.python.org/issue18291 we could add a parameter to .splitlines(), but this would render the method not much faster than re.split().
Using re.split() is not a work-around in this case; it's an explicit form of defining the characters you want to split lines on, if the standard defining your file format only accepts ASCII line break characters.
Since there are many such file formats, perhaps adding a parameter asciionly=True/False would make sense. .splitlines() could then be made to only split on ASCII linebreak characters. This new parameter would then have to default to False to maintain compatibility with Unicode and all previous releases.
msg327137 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2018-10-05 14:04
I've created a topic on this inside the "Ideas" area of discuss.python.org. Sorry if that wasn't appropriate; I'm not sure if I should have kept the discussion here.
Inada Naoki suggests creating a new method str.iterlines([keepends]). Given that people are -1 on changing str.splitlines, I think that's a good solution. A new method is better yet if it would only split on '\n', that way fp.read().iterlines() matches fp.readlines(). It is what people seem to expect and is the most handy behaviour. So, str and bytes would both get the new method and they would both split on only '\n'.
If we do that, I think nearly every use of splitlines() should get changed to iterlines().
msg327141 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2018-10-05 15:23
Why not simply add a new parameter, to make people who want
ASCII linebreaks continue to use .splitlines() ?
I think it would be less than ideal to have one method break on
all Unicode line breaks and another only on ASCII ones.
msg327162 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2018-10-05 18:13
> Why not simply add a new parameter, to make people who want ASCII linebreaks continue to use .splitlines() ?
That could work but I think in nearly every case you don't want to use splitlines() without supplying the parameter. So, it seems like a bit of a trap for new users. Worse, because in Python 2, str.splitlines() does what they want, they will do the simple thing which is likely wrong.
If we do stick with just splitlines(), perhaps it should get a 'newline' parameter that mostly matches io.open (i.e. it controls universal newline behavior). So if you don't want to change behavior, str.splitlines(newline=None) would split as it currently does. To make it split like io files do, you would have to do newline='\n'.
To me, it seems attractive that:
fp.readlines() == fp.read().iterlines()
Your suggestion would make it something like:
fp.readlines() == fp.read().splitlines(newline='\n')
I guess I could live with that but it seems unnecessarily ugly and verbose for what is the most common usage.
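A rough sketch of the proposed behaviour, written as a free function because no str.iterlines method exists; the name and signature here are assumptions based on the proposal. It splits only on '\n' and is compared against io.StringIO, whose default newline='\n' setting also breaks lines only at '\n'.

```python
import io


def iterlines(s, keepends=True):
    # Hypothetical stand-in for the proposed str.iterlines(): lazily yield
    # lines, breaking only at '\n' (not at \r, \x0b, \x85, ...).
    start = 0
    while start < len(s):
        end = s.find('\n', start)
        if end == -1:
            # Final line without a trailing newline.
            yield s[start:]
            return
        yield s[start:end + 1] if keepends else s[start:end]
        start = end + 1


text = 'a\nb\r\nc\x0bd\x85e\nlast'
print(list(iterlines(text)))                                   # ['a\n', 'b\r\n', 'c\x0bd\x85e\n', 'last']
print(list(iterlines(text)) == io.StringIO(text).readlines())  # True
```

Returning a generator rather than a list follows the preference expressed in the thread; callers who want a list can wrap the call in list().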
msg327269 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-10-07 08:49
I don't like the idea of adding a second bool parameter to splitlines. Guido has a rough rule of thumb (which I agree with) of "no constant bool parameters". If people will typically call a function with some sort of "mode" parameter using a hard-coded bool, then we should usually prefer to split the two modes into distinct functions.
As an example, we have statistics.stdev and pstdev rather than stdev(data, population=False).
Obviously this is a guideline, not a hard rule, and there are exceptions. Such as str.splitlines :-)
In any case, I suggest a separate string method. Even though the name is slightly inaccurate, I suggest "ascii_splitlines" which I think is accurate enough to capture the spirit of what we intend (split on *only* \n \r and \r\n) and we can leave the details in the docs.
msg327308 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2018-10-07 20:48
I too would prefer a new method name rather than overloading splitlines() with more keyword args (passed as hardcoded constants, usually). Again, I think we want:
list(open(..).read().<splitmethod>()) == list(open(..))
readlines() returns a list but I think this method should return an iterator (seems more Python 3 like to me, call list if you want a list). In that case, iterlines() seems like the right name to me. I think it should take a 'newline' keyword that behaves the same as the open() version of the keyword.
History
Date User Action Args
2022-04-11 14:58:07 admin set github: 66428
2018-10-07 20:48:13 nascheme set messages: + msg327308
2018-10-07 08:49:53 steven.daprano set nosy: + steven.daprano
    messages: + msg327269
2018-10-05 18:13:08 nascheme set messages: + msg327162
2018-10-05 15:23:49 lemburg set messages: + msg327141
2018-10-05 14:04:21 nascheme set messages: + msg327137
2018-10-05 08:17:41 lemburg set nosy: + lemburg
    messages: + msg327113
2018-10-05 06:42:34 serhiy.storchaka set messages: + msg327106
2018-10-05 06:14:04 nascheme set nosy: + nascheme
    messages: + msg327105
2016-06-01 07:27:48 Alexander Schrijver set messages: + msg266800
2016-05-31 23:05:51 martin.panter set messages: + msg266789
2016-05-31 21:20:36 Alexander Schrijver set files: + cpython2.7_splitlines.diff
    messages: + msg266785
2016-05-31 21:16:10 Alexander Schrijver set files: + cpython3.5_splitlines.diff
    messages: + msg266782
2016-05-31 21:14:20 Alexander Schrijver set files: + cpython3.5_splitlines.diff
    nosy: + Alexander Schrijver
    messages: + msg266781
    keywords: + patch
2015-07-10 17:11:24 gregory.p.smith set nosy: + gregory.p.smith
    messages: + msg246570
2015-07-10 16:52:40 gregory.p.smith link issue24601 superseder
2015-07-10 08:09:29 martin.panter set messages: + msg246547
2015-03-17 06:42:25 martin.panter set nosy: + martin.panter
2014-10-28 14:29:38 jwilk set nosy: + jwilk
2014-10-28 14:19:42 serhiy.storchaka set messages: + msg230152
2014-10-28 12:45:46 ezio.melotti set messages: + msg230144
2014-10-28 12:35:12 serhiy.storchaka set messages: + msg230143
2014-10-28 11:03:26 ezio.melotti set messages: + msg230138
2014-08-25 14:40:08 r.david.murray set messages: + msg225874
2014-08-25 14:35:53 r.david.murray set messages: + msg225873
2014-08-23 23:12:15 terry.reedy set assignee: docs@python
    components: + Documentation, - Library (Lib), Unicode, email
    versions: + Python 2.7
    nosy: + docs@python
    messages: + msg225769
    stage: needs patch
2014-08-23 22:30:00 python-dev set nosy: + python-dev
    messages: + msg225767
2014-08-23 22:01:18 terry.reedy set messages: + msg225766
2014-08-23 19:25:20 serhiy.storchaka set messages: + msg225758
2014-08-23 18:24:44 terry.reedy set status: closed -> open
    messages: + msg225755
2014-08-23 14:00:52 serhiy.storchaka set nosy: + serhiy.storchaka
    messages: + msg225748
2014-08-23 13:45:54 r.david.murray set messages: + msg225747
2014-08-23 07:47:26 scharron set status: open -> closed
    messages: + msg225736
2014-08-23 02:44:27 r.david.murray set nosy: + barry
    messages: + msg225723
    components: + email
2014-08-22 20:49:52 scharron set messages: + msg225709
2014-08-22 20:34:34 terry.reedy set nosy: + terry.reedy
    messages: + msg225705
    title: str.splitlines splitting on none-\r\n characters -> str.splitlines splitting on non-\r\n characters
2014-08-20 12:21:58 scharron set messages: + msg225564
2014-08-20 11:59:30 r.david.murray set nosy: + r.david.murray
2014-08-20 10:01:51 scharron create
