Issue 12855: linebreak sequences should be better documented


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57064

classification

Title:	linebreak sequences should be better documented
Type:	behavior	Stage:	resolved
Components:	Documentation, Unicode	Versions:	Python 3.4, Python 3.5, Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	docs@python	Nosy List:	Alexander Schrijver, Matthew.Boehm, SMRUTI RANJAN SAHOO, davidhalter, docs@python, martin.panter, python-dev, r.david.murray, vstinner
Priority:	normal	Keywords:	patch

Created on 2011-08-29 21:42 by Matthew.Boehm, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
linebreakdoc.py27.patch	Matthew.Boehm, 2011-08-30 04:45	review
linebreakdoc.v2.py27.patch	Matthew.Boehm, 2011-08-31 02:35	review
linebreakdoc.v2.py32.patch	Matthew.Boehm, 2011-08-31 02:35	review
linebreakdoc.v3.py3.5.patch	martin.panter, 2015-02-20 02:05	review
python.JPG	SMRUTI RANJAN SAHOO, 2015-03-17 14:52	Bug resolved
linebreakdoc.v4.py3.5.patch	martin.panter, 2015-03-31 01:44	review
linebreakdoc.v5.py2.7.patch	martin.panter, 2016-06-01 09:44	review

Messages (24)
msg143182	Author: Matthew Boehm (Matthew.Boehm)	Date: 2011-08-29 21:42
A file opened with codecs.open() splits on a form feed character (\x0c) while a file opened with open() does not.

>>> with open("formfeed.txt", "w") as f:
...     f.write("line \fone\nline two\n")
...
>>> with open("formfeed.txt", "r") as f:
...     s = f.read()
...
>>> s
'line \x0cone\nline two\n'
>>> print s
line one
line two
>>> import codecs
>>> with open("formfeed.txt", "rb") as f:
...     lines = f.readlines()
...
>>> lines
['line \x0cone\n', 'line two\n']
>>> with codecs.open("formfeed.txt", "r", encoding="ascii") as f:
...     lines2 = f.readlines()
...
>>> lines2
[u'line \x0c', u'one\n', u'line two\n']
>>>

Note that lines contains two items while lines2 has 3. Issue 7643 has a good discussion on newlines in python, but I did not see this discrepancy mentioned.
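For readers on Python 3, the same discrepancy can be reproduced without touching the file system. The following is a minimal sketch, not part of the original report: it assumes in-memory streams are an acceptable stand-in for the file, and uses `io.TextIOWrapper` for the text layer of the built-in `open()` and `codecs.getreader()` for the reader that `codecs.open()` wraps around a file.

```python
import codecs
import io

# Same payload as in the report: a form feed (\f, U+000C) inside the first line.
data = b"line \x0cone\nline two\n"

# Binary readlines() splits on b"\n" only.
byte_lines = io.BytesIO(data).readlines()

# Text-mode reading applies universal newlines, which recognise only
# "\n", "\r" and "\r\n".
text_lines = io.TextIOWrapper(io.BytesIO(data), encoding="ascii").readlines()

# A codecs StreamReader decodes the data and then splits it with
# str.splitlines(), which also treats the form feed as a line boundary.
codec_lines = codecs.getreader("ascii")(io.BytesIO(data)).readlines()

print(byte_lines)
print(text_lines)
print(codec_lines)
```

The first two lists have two items each, while the list produced by the codecs reader has three.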
msg143185	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2011-08-29 21:55
U+000C (form feed) is considered a line boundary in Unicode strings (unicode type), but not in byte strings (str type). Example:

>>> u'line \x0cone\nline two\n'.splitlines(True)
[u'line \x0c', u'one\n', u'line two\n']
>>> 'line \x0cone\nline two\n'.splitlines(True)
['line \x0cone\n', 'line two\n']
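The Python 3 counterpart of this comparison is str versus bytes. A small sketch, assuming a current CPython 3 interpreter (the variable names are illustrative only):

```python
text = "line \x0cone\nline two\n"

# str.splitlines() uses the Unicode notion of a line boundary,
# so the form feed ends a line.
str_lines = text.splitlines(True)

# bytes.splitlines() only recognises the ASCII line endings
# b"\n", b"\r" and b"\r\n", so the form feed stays inside the line.
bytes_lines = text.encode("ascii").splitlines(True)

print(str_lines)
print(bytes_lines)
```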
msg143187	Author: Matthew Boehm (Matthew.Boehm)	Date: 2011-08-29 22:07
Thanks for explaining the reasoning. Perhaps I should add this to the Python wiki (http://wiki.python.org/moin/Unicode)? It would be nice if it fit in the docs somewhere, but I'm not sure where. I'm curious how (or if) 2to3 would handle this as well, but I'm closing this issue as it's now clear to me why these two are expected to act differently.
msg143188	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2011-08-29 22:11
> It would be nice if it fit in the docs somewhere,
> but I'm not sure where.

See: http://docs.python.org/library/codecs.html#codecs.StreamReader.readline

Can you suggest a patch for the documentation? Source code of this document:
http://hg.python.org/cpython/file/bb7b14dd5ded/Doc/library/codecs.rst
msg143189	Author: Matthew Boehm (Matthew.Boehm)	Date: 2011-08-29 22:24
I'll suggest a patch for the documentation when I get to my home computer in an hour or two.
msg143194	Author: Matthew Boehm (Matthew.Boehm)	Date: 2011-08-30 00:57
I'm taking a look at the docs now. I'm considering adding a table/list of characters python treats as newlines, but it seems like this might fit better as a note in http://docs.python.org/library/stdtypes.html#str.splitlines or somewhere else in stdtypes. I'll start working on it now, but please let me know what you think about this. This is my first attempt at a patch, so I greatly appreciate your help so far.
msg143199	Author: Matthew Boehm (Matthew.Boehm)	Date: 2011-08-30 04:45
I've attached a patch for python2.7 that adds a small note to library/stdtypes.html#str.splitlines explaining which sequences are treated as line breaks:

"""
Note: Python recognizes "\r", "\n", and "\r\n" as line boundaries for strings.

In addition to these, Unicode strings can have line boundaries of u"\x0b", u"\x0c", u"\x85", u"\u2028", and u"\u2029"
"""

Additional thoughts:

* Would it be better to put this note in a different place?

* It looks like \x0b and \x0c (vertical tab and form feed) were first considered line breaks in Python 2.7, probably related to this note from "What's New in 2.7": "The Unicode database provided by the unicodedata module is now used internally to determine which characters are numeric, whitespace, or represent line breaks." It might be worth putting a "changed in 2.7" note somewhere in the docs.

Please let me know of any thoughts you have and I'll be glad to make any desired changes and submit a new patch.
msg143204	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2011-08-30 08:22
> Would it be better to put this note in a different place?

You may just say that StreamReader.readline() uses unicode.splitlines(), and so point to unicode.splitlines() doc (use :meth:`unicode.splitlines` syntax).

unicode.splitlines() is not well documented: line boundaries are not listed, even in Python 3 documentation.

Unicode line boundaries used by Python 2.7 and 3.3:

U+000A: Line feed
U+000B: Line tabulation
U+000C: Form feed
U+000D: Carriage return
U+001C: File separator
U+001D: Group separator
U+001E: Record separator
U+0085: "control"
U+2028: Line separator
U+2029: Paragraph separator

> It looks like \x0b and \x0c (vertical tab and form feed) were first
> considered line breaks in Python 2.7

Correct: U+000B and U+000C were added to Python 2.7 and 3.2.

> It might be worth putting a "changed in 2.7" note somewhere in the docs

We use the following syntax exactly for this:

.. versionchanged:: 2.6
   Also unset environment variables when calling :meth:`os.environ.clear`
   and :meth:`os.environ.pop`.

If you downloaded Python source code, go into Doc/ directory and run "make html" to compile the doc to HTML.

http://docs.python.org/devguide/setup.html
http://docs.python.org/devguide/docquality.html
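The list of line boundaries in this message can be checked empirically. The sketch below is an illustration rather than anything from the patches: it assumes a current CPython 3 interpreter, and `is_line_boundary` is a hypothetical helper name. It scans every code point and collects those at which str.splitlines() breaks a string; the result may differ on interpreters with a different Unicode database.

```python
import sys

# Code points listed in the message above.
LISTED_BOUNDARIES = [
    0x000A, 0x000B, 0x000C, 0x000D,
    0x001C, 0x001D, 0x001E,
    0x0085, 0x2028, 0x2029,
]


def is_line_boundary(code_point):
    """Return True if str.splitlines() breaks a string at this code point."""
    return len(("a" + chr(code_point) + "b").splitlines()) == 2


# Scan the whole code point range; this takes on the order of a second.
found = [cp for cp in range(sys.maxunicode + 1) if is_line_boundary(cp)]

for cp in found:
    print("U+%04X" % cp)
```

A CR LF pair is not covered by this per-character scan; str.splitlines() treats "\r\n" as a single boundary rather than two.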
msg143217	Author: Matthew Boehm (Matthew.Boehm)	Date: 2011-08-30 14:46
I can fix the patch to list all the unicode line boundaries. The three places I've considered putting it are:

1. On the howto/unicode.html
2. Somewhere in the stdtypes.html#typesseq description (maybe with other notes at the bottom)
3. As a note to the stdtypes.html#str.splitlines method description (where it is in the previous patch.)

I can move it to any of these places if you think it's a better fit. I'll fix the list so that it's complete, add a note about \x0b and \x0c being added in 2.7/3.2, and possibly reference it from StreamReader.readline.

After confirming that my documentation matches the style guide, I'll make the docs, test the output, and upload a patch. I can do this for 2.7, 3.2 and 3.3 separately.

Let me know if that sounds good and if you have any further thoughts. I should be able to upload new patches in 10 hours (after work today).
msg143220	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2011-08-30 15:02
> 1. On the howto/unicode.html
> 2. Somewhere in the stdtypes.html#typesseq description (maybe with other notes at the bottom)
> 3. As a note to the stdtypes.html#str.splitlines method description (where it is in the previous patch.)

(3) is the best place. For Python 2, you should add a new unicode.splitlines entry, whereas the str.splitlines should be updated in Python 3.

> I can do this for 2.7, 3.2 and 3.3 separately.

You don't have to do it for 3.3: 2.7 and 3.2 are enough (I will do the change in 3.3 using Mercurial).
msg143245	Author: Matthew Boehm (Matthew.Boehm)	Date: 2011-08-31 02:35
I've attached a patch for 2.7 and will attach one for 3.2 in a minute. I built the docs for both 2.7 and 3.2 and verified that there were no warnings and that the resulting web pages looked okay.

Things to consider:

* Placement of unicode.splitlines() method: I placed it next to str.splitlines. I didn't want to place it with the unicode methods further down because docs say "The following methods are present only on unicode objects"

* The docs for codecs.readlines() already mention "Line-endings are implemented using the codec’s decoder method and are included in the list entries if keepends is true."

* Feel free to make any wording/style suggestions.
msg223411	Author: David Halter (davidhalter)	Date: 2014-07-18 14:29
I would vote for the inclusion of that patch. I just stumbled over this.
msg225938	Author: Martin Panter (martin.panter) * (Python committer)	Date: 2014-08-26 23:25
Any reason why characters 1C–1E are excluded?
msg236247	Author: Martin Panter (martin.panter) * (Python committer)	Date: 2015-02-20 02:05
Posting linebreakdoc.v3.py3.5.patch:

* Rebased onto recent "default" (3.5) branch
* Add missing 1C–1E codes
* Dropped reference to "universal newlines", since that only handles CRs and LFs as I understand it

The newlines are already tested by test_unicodedata.UnicodeMiscTest.test_linebreak_7643() when the VT and FF codes were added in Issue 7643.
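To make the distinction concrete: code that wants only the "universal newlines" behaviour (CR, LF and CR LF) on an in-memory string cannot rely on str.splitlines(). One possible workaround is sketched below; the regular expression and the helper name `split_universal` are assumptions made for illustration and are not part of any of the attached patches.

```python
import re

# One line including its terminator, or a final line without a terminator.
_UNIVERSAL_LINE = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")


def split_universal(text):
    """Split on CR, LF and CR LF only, keeping the line endings."""
    return _UNIVERSAL_LINE.findall(text)


sample = "page one\x0cpage two\r\nsecond\u2028line\n"

# Only the CR LF and the LF end a line here.
universal = split_universal(sample)

# str.splitlines() additionally breaks at the form feed and at U+2028.
unicode_aware = sample.splitlines(True)

print(universal)
print(unicode_aware)
```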
msg238262	Author: Martin Panter (martin.panter) * (Python committer)	Date: 2015-03-17 06:40
Note to self, or anyone else handling this patch: See <https://bugs.python.org/issue22232#msg225769> for further improvement ideas:

* Might be good to bring back the reference to universal newlines, but say it accepts additional line boundaries
* Terry also suggested a doc string improvement
msg238298	Author: SMRUTI RANJAN SAHOO (SMRUTI RANJAN SAHOO)	Date: 2015-03-17 14:52
I think that in "line \fone\nline two\n" the space after "line" is taking some garbage value, or you could say the hex value of "\", so that is why it shows a hex value. If you write "\n " instead of "\", you don't find that hex value. I attached my IDLE image here.
msg238450	Author: R. David Murray (r.david.murray) * (Python committer)	Date: 2015-03-18 15:13
SMRUTI: \f is the python escape code for the ASCII formfeed character. It is the handling of that ASCII character (among others) that this issue is discussing.
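A short illustration of that point, written for Python 3 (the variable names are illustrative only): "\f" is a single character, and the "\x0c" seen in the interpreter output is simply how repr() displays it.

```python
form_feed = "\f"

# "\f" is one character: code point 12 (0x0C), the ASCII form feed.
code_point = ord(form_feed)

# repr() shows it with a hex escape; nothing in the string has changed.
shown = repr("line \fone")

print(code_point)
print(form_feed == "\x0c" == chr(12))
print(shown)
```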
msg239653	Author: Martin Panter (martin.panter) * (Python committer)	Date: 2015-03-31 01:44
Patch v4 adds back the reference to "universal newlines". I did not alter the doc string, because I don’t think doc strings need to be as detailed as the main documentation.
msg239767	Author: Roundup Robot (python-dev) (Python triager)	Date: 2015-04-01 01:21
New changeset 6244a5dbaf84 by Benjamin Peterson in branch '3.4':
document what exactly str.splitlines() splits on (closes #12855)
https://hg.python.org/cpython/rev/6244a5dbaf84

New changeset 87af6deb5d26 by Benjamin Peterson in branch 'default':
merge 3.4 (#12855)
https://hg.python.org/cpython/rev/87af6deb5d26
msg266806	Author: Martin Panter (martin.panter) * (Python committer)	Date: 2016-06-01 07:55
Reopening to change the Python 2 documentation. A starting point may be Matthew’s patch and/or Alexander’s patch in Issue 22232.
msg266812	Author: Martin Panter (martin.panter) * (Python committer)	Date: 2016-06-01 09:44
Here is an updated patch for Python 2, based on Benjamin’s commit, Matthew’s earlier py27 patch, and Alexander’s backport of related changes from Python 3. Let me know what you think.
msg268491	Author: Martin Panter (martin.panter) * (Python committer)	Date: 2016-06-14 00:44
Alexander: does my latest patch linebreakdoc.v5.py2.7.patch address your concerns about the 2.7 documentation? If so, I can push it to the repository.
msg268582	Author: Alexander Schrijver (Alexander Schrijver)	Date: 2016-06-14 19:39
Martin: Yes, it does, thank you. Sorry, I didn't know you were waiting for my approval.
msg268599	Author: Roundup Robot (python-dev) (Python triager)	Date: 2016-06-15 01:43
New changeset 2e6fda267a20 by Martin Panter in branch '2.7':
Issue #12855: Document what exactly unicode.splitlines() splits on
https://hg.python.org/cpython/rev/2e6fda267a20

History
Date	User	Action	Args
2022-04-11 14:57:21	admin	set	github: 57064
2016-06-15 02:01:19	martin.panter	set	status: open -> closed stage: patch review -> resolved
2016-06-15 01:43:37	python-dev	set	messages: + msg268599
2016-06-14 19:39:26	Alexander Schrijver	set	messages: + msg268582
2016-06-14 00:44:22	martin.panter	set	messages: + msg268491
2016-06-01 09:44:04	martin.panter	set	files: + linebreakdoc.v5.py2.7.patch nosy: + Alexander Schrijver messages: + msg266812 stage: resolved -> patch review
2016-06-01 07:55:13	martin.panter	set	status: closed -> open messages: + msg266806
2015-04-01 01:21:32	python-dev	set	status: open -> closed nosy: + python-dev messages: + msg239767 resolution: fixed stage: patch review -> resolved
2015-03-31 01:44:24	martin.panter	set	files: + linebreakdoc.v4.py3.5.patch messages: + msg239653
2015-03-18 15:13:14	r.david.murray	set	nosy: + r.david.murray messages: + msg238450
2015-03-17 14:52:02	SMRUTI RANJAN SAHOO	set	files: + python.JPG nosy: + SMRUTI RANJAN SAHOO messages: + msg238298
2015-03-17 06:40:32	martin.panter	set	messages: + msg238262
2015-02-20 02:05:25	martin.panter	set	files: + linebreakdoc.v3.py3.5.patch messages: + msg236247
2014-08-26 23:25:28	martin.panter	set	nosy: + martin.panter messages: + msg225938
2014-07-21 20:12:55	zach.ware	set	stage: patch review versions: + Python 3.4, Python 3.5, - Python 3.2, Python 3.3
2014-07-18 14:29:05	davidhalter	set	nosy: + davidhalter messages: + msg223411
2011-08-31 02:35:54	Matthew.Boehm	set	files: + linebreakdoc.v2.py32.patch
2011-08-31 02:35:38	Matthew.Boehm	set	files: + linebreakdoc.v2.py27.patch messages: + msg143245
2011-08-30 15:02:51	vstinner	set	messages: + msg143220
2011-08-30 14:46:45	Matthew.Boehm	set	messages: + msg143217
2011-08-30 08:23:07	vstinner	set	components: + Unicode versions: + Python 3.2, Python 3.3
2011-08-30 08:22:54	vstinner	set	messages: + msg143204
2011-08-30 04:45:19	Matthew.Boehm	set	files: + linebreakdoc.py27.patch keywords: + patch messages: + msg143199 title: open() and codecs.open() treat form-feed differently -> linebreak sequences should be better documented
2011-08-30 00:57:55	Matthew.Boehm	set	messages: + msg143194
2011-08-29 22:24:37	Matthew.Boehm	set	status: closed -> open assignee: docs@python components: + Documentation, - Interpreter Core nosy: + docs@python messages: + msg143189 resolution: wont fix -> (no value)
2011-08-29 22:11:58	vstinner	set	messages: + msg143188
2011-08-29 22:08:33	Matthew.Boehm	set	status: open -> closed resolution: wont fix
2011-08-29 22:07:57	Matthew.Boehm	set	messages: + msg143187
2011-08-29 21:55:56	vstinner	set	nosy: + vstinner messages: + msg143185
2011-08-29 21:42:30	Matthew.Boehm	create
