Author flox
Recipients flox, lemburg, michael.foord
Date 2010-01-08.10:32:05
Message-id <1262946727.6.0.739397040405.issue7643@psf.upfronthosting.co.za>
Content
Some technical background.
== Unicode ==
According to the Unicode Standard Annex #9, a character with
bidirectional class B is a "Paragraph Separator". And "Because a
Paragraph Separator breaks lines, there will be at most one per line,
at the end of that line."
As a consequence, there are three reasons to identify a character as a
linebreak:
 - General Category Zl "Line Separator"
 - General Category Zp "Paragraph Separator"
 - Bidirectional Class B "Paragraph Separator"
There are 8 linebreaks in the current Unicode Database (5.2):
------------------------------------------------------------------------
Code  Abbr  Name                         GC  Bidi  Notes
------------------------------------------------------------------------
000A  LF    LINE FEED                    Cc  B
000D  CR    CARRIAGE RETURN              Cc  B
001C  FS    INFORMATION SEPARATOR FOUR   Cc  B     (UCD 3.1 FILE SEPARATOR)
001D  GS    INFORMATION SEPARATOR THREE  Cc  B     (UCD 3.1 GROUP SEPARATOR)
001E  RS    INFORMATION SEPARATOR TWO    Cc  B     (UCD 3.1 RECORD SEPARATOR)
0085  NEL   NEXT LINE                    Cc  B     (C1 Control Code)
2028  LS    LINE SEPARATOR               Zl  WS    (Unicode)
2029  PS    PARAGRAPH SEPARATOR          Zp  B     (Unicode)
------------------------------------------------------------------------
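The list can be cross-checked against the Unicode database bundled with Python. What follows is a minimal sketch using the standard unicodedata module; the helper name is_linebreak is hypothetical, and the output depends on the UCD version shipped with the interpreter.

```python
import sys
import unicodedata

def is_linebreak(ch):
    """Apply the three criteria listed above (hypothetical helper)."""
    return (unicodedata.category(ch) in ("Zl", "Zp")
            or unicodedata.bidirectional(ch) == "B")

# Scan every code point known to this interpreter's Unicode database.
linebreaks = [cp for cp in range(sys.maxunicode + 1) if is_linebreak(chr(cp))]

for cp in linebreaks:
    ch = chr(cp)
    # Control characters have no formal name, so fall back to a placeholder.
    print("%04X %-2s %-2s %s" % (cp, unicodedata.category(ch),
                                 unicodedata.bidirectional(ch),
                                 unicodedata.name(ch, "<control>")))
```

Scanning the full code point range is slow but avoids hard-coding any assumption about where such characters live.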
== ASCII ==
The standard ASCII control codes (C0) are in the range 00-1F.
Within ASCII, this limits the list to LF, CR, FS, GS, RS.
The last three are not usually considered linebreaks:
"The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
structure data, usually on a tape, in order to simulate punched cards. End of
medium (EM) warns that the tape (or whatever) is ending. While many systems use
CR/LF and TAB for structuring data, it is possible to encounter the separator
control characters in data that needs to be structured. The separator control
characters are not overloaded; there is no general use of them except to
separate data into structured groupings. Their numeric values are contiguous
with the space character, which can be considered a member of the group, as a
word separator."
(Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)
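To illustrate the data-structuring role described in the quote, here is a small sketch that uses RS to delimit records and US to delimit fields. The sample data and variable names are made up for the example. The last statement shows how such data interacts with str.splitlines(), assuming a recent CPython 3 release.

```python
RS = "\x1e"  # RECORD SEPARATOR
US = "\x1f"  # UNIT SEPARATOR

# Hypothetical sample: two records, each with two fields.
data = "alice" + US + "42" + RS + "bob" + US + "7"

# Split into records first, then into fields within each record.
records = [record.split(US) for record in data.split(RS)]
print(records)

# The four separators sit directly below the space character (0x20).
codes = [ord(c) for c in ("\x1c", "\x1d", "\x1e", "\x1f", " ")]
print([hex(c) for c in codes])

# str.splitlines() also breaks at RS, but leaves US alone.
lines = data.splitlines()
print(lines)
```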
In conclusion, it may be better to leave the behavior unchanged.
We could add a few words to the documentation for str.splitlines() and bytes.splitlines() to explain which characters are considered line breaks.
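The difference such a documentation note would describe can be observed directly. This sketch assumes a recent CPython 3 release; the interpreter versions discussed in this issue may behave differently.

```python
# One sample containing each of the eight characters from the table.
text = "a\nb\rc\x1cd\x1de\x1ef\x85g\u2028h\u2029i"

# str.splitlines() splits on all eight of them.
str_parts = text.splitlines()
print(str_parts)

# bytes.splitlines() only recognises the ASCII line endings LF, CR and CRLF,
# so FS, GS and RS stay inside the last chunk.
raw = b"a\nb\rc\x1cd\x1de\x1ef"
bytes_parts = raw.splitlines()
print(bytes_parts)
```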
References:
 - The Unicode Character Database (UCD): http://www.unicode.org/ucd/
 - UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
 - The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
 - C0 and C1 Control Codes: http://en.wikipedia.org/wiki/C0_and_C1_control_codes
History
Date                 User  Action  Args
2010-01-08 10:32:07  flox  set     recipients: + flox, lemburg, michael.foord
2010-01-08 10:32:07  flox  set     messageid: <1262946727.6.0.739397040405.issue7643@psf.upfronthosting.co.za>
2010-01-08 10:32:06  flox  link    issue7643 messages
2010-01-08 10:32:05  flox  create
