This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: What is a Unicode line break character?
Type: behavior
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.2, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: flox
Nosy List: amaury.forgeotdarc, flox, lemburg
Priority: normal
Keywords: patch

Created on 2010-01-06 08:46 by flox, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name | Uploaded | Description
issue7643_use_LineBreak_v2.diff | flox, 2010-03-19 00:30 | Patch, apply to 2.x
Messages (19)
msg97299 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-06 08:46
Bytes objects and Unicode objects do not agree on ASCII line breaks.
## Python 2
for s in '\x0a\x0d\x1c\x1d\x1e':
    print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)
# [u'a\n', u'b'] ['a\n', 'b']
# [u'a\r', u'b'] ['a\r', 'b']
# [u'a\x1c', u'b'] ['a\x1cb']
# [u'a\x1d', u'b'] ['a\x1db']
# [u'a\x1e', u'b'] ['a\x1eb']
## Python 3
for s in '\x0a\x0d\x1c\x1d\x1e':
    print('a{}b'.format(s).splitlines(1),
          bytes('a{}b'.format(s), 'utf-8').splitlines(1))
# ['a\n', 'b'] [b'a\n', b'b']
# ['a\r', 'b'] [b'a\r', b'b']
# ['a\x1c', 'b'] [b'a\x1cb']
# ['a\x1d', 'b'] [b'a\x1db']
# ['a\x1e', 'b'] [b'a\x1eb']
msg97300 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-06 09:14
Unicode defines more line break characters than ASCII. ASCII
has only a single line break character, \n, but by convention
\r and \r\n are also used to mean "start a new line,
go to position 1".
See e.g. http://en.wikipedia.org/wiki/Ascii#ASCII_control_characters
The three extra code points that Unicode treats as line breaks here
(\x1c, \x1d, \x1e) are the file, group and record separators, which
are not in common use.
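The disagreement between str and bytes described in this thread can be checked on a current Python 3 with a short script. This is only a minimal sketch: the helper name `compare_splits` is made up for illustration and is not part of any patch attached to this issue.

```python
# Compare str.splitlines() and bytes.splitlines() on the five ASCII
# control characters from the original report.
CANDIDATES = "\n\r\x1c\x1d\x1e"


def compare_splits(chars=CANDIDATES):
    """Return (char, str_result, bytes_result) for the text 'a<char>b'."""
    rows = []
    for ch in chars:
        text = "a" + ch + "b"
        # keepends=True so the separator stays visible in the output
        rows.append((ch, text.splitlines(True), text.encode("ascii").splitlines(True)))
    return rows


if __name__ == "__main__":
    for ch, as_text, as_bytes in compare_splits():
        print(repr(ch), as_text, as_bytes)
```

On the interpreters I am aware of, str splits on all five characters while bytes splits only on \n and \r.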
msg97333 - (view) Author: Michael Foord (michael.foord) * (Python committer) Date: 2010-01-07 00:03
'\x85' when decoded using latin-1 is just transcoded to u'\x85', which is treated as NEL (a C1 control code equivalent to end of line). This changes iteration over the file once you decode it, and it actually broke our csv parsing code when we got some latin-1 encoded data containing \x85 from our customer.
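The effect can be reproduced with splitlines() alone. The sketch below uses a made-up payload (`RAW`) and a hypothetical helper (`count_lines`); it only shows that the number of lines changes once the bytes are decoded, not how any particular csv pipeline was written.

```python
# Hypothetical latin-1 encoded payload containing the byte 0x85.
RAW = b"first\x85second\nthird\n"


def count_lines(data):
    """Return (lines seen as bytes, lines seen after latin-1 decoding)."""
    as_bytes = data.splitlines()                    # splits on \n and \r only
    as_text = data.decode("latin-1").splitlines()   # also splits on U+0085 (NEL)
    return len(as_bytes), len(as_text)


if __name__ == "__main__":
    print(count_lines(RAW))
```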
msg97407 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-08 10:32
Some technical background.
== Unicode ==
According to the Unicode Standard Annex #9, a character with
bidirectional class B is a "Paragraph Separator". And "Because a
Paragraph Separator breaks lines, there will be at most one per line,
at the end of that line."
As a consequence, there are three reasons to identify a character as a
line break:
 - General Category Zl "Line Separator"
 - General Category Zp "Paragraph Separator"
 - Bidirectional Class B "Paragraph Separator"
There are 8 line breaks in the current Unicode Database (5.2):
                                     CAT BIDI
------------------------------------------------------------------------
000A LF  LINE FEED                   Cc  B
000D CR  CARRIAGE RETURN             Cc  B
001C FS  INFORMATION SEPARATOR FOUR  Cc  B    (UCD 3.1 FILE SEPARATOR)
001D GS  INFORMATION SEPARATOR THREE Cc  B    (UCD 3.1 GROUP SEPARATOR)
001E RS  INFORMATION SEPARATOR TWO   Cc  B    (UCD 3.1 RECORD SEPARATOR)
0085 NEL NEXT LINE                   Cc  B    (C1 Control Code)
2028 LS  LINE SEPARATOR              Zl  WS   (Unicode)
2029 PS  PARAGRAPH SEPARATOR         Zp  B    (Unicode)
------------------------------------------------------------------------
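The three criteria above can be applied mechanically with the standard unicodedata module. This is a sketch, not the code used by the interpreter: the function name is invented for illustration, and the result depends on the Unicode database bundled with the Python that runs it, so it may differ from the 5.2 table.

```python
import sys
import unicodedata


def linebreaks_by_category_and_bidi():
    """Code points with General Category Zl/Zp or Bidirectional Class B."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if (unicodedata.category(ch) in ("Zl", "Zp")
                or unicodedata.bidirectional(ch) == "B"):
            found.append(cp)
    return found


if __name__ == "__main__":
    for cp in linebreaks_by_category_and_bidi():
        ch = chr(cp)
        print("%04X" % cp, unicodedata.category(ch), unicodedata.bidirectional(ch))
```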
== ASCII ==
The Standard ASCII control codes (C0) are in the range 00-1F.
It limits the list to LF, CR, FS, GS, RS.
Regarding the last three, they are not considered as linebreaks:
"The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
structure data, usually on a tape, in order to simulate punched cards. End of
medium (EM) warns that the tape (or whatever) is ending. While many systems use
CR/LF and TAB for structuring data, it is possible to encounter the separator
control characters in data that needs to be structured. The separator control
characters are not overloaded; there is no general use of them except to
separate data into structured groupings. Their numeric values are contiguous
with the space character, which can be considered a member of the group, as a
word separator."
(Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)
In conclusion, it may be better to keep things unchanged.
We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.
References:
 - The Unicode Character Database (UCD): http://www.unicode.org/ucd/
 - UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
 - The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
 - C0 and C1 Control Codes:
 http://en.wikipedia.org/wiki/C0_and_C1_control_codes 
msg97408 - (view) Author: Michael Foord (michael.foord) * (Python committer) Date: 2010-01-08 10:33
Documenting the characters that splitlines treats as newlines for Unicode should definitely be done.
msg97410 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-08 11:42
It's confusing.
There's a specific annex UAX #14 which defines "Line Breaking Properties".
Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
 BK, CR, LF, NL
And the resulting list is different:
                                     CAT BIDI BRK
------------------------------------------------------------------------
000A LF  LINE FEED                   Cc  B    LF
000B VT  LINE TABULATION             Cc  S    BK  (since Unicode 5.0)
000C FF  FORM FEED                   Cc  WS   BK
000D CR  CARRIAGE RETURN             Cc  B    CR
0085 NEL NEXT LINE                   Cc  B    NL  (C1 Control Code)
2028 LS  LINE SEPARATOR              Zl  WS   BK
2029 PS  PARAGRAPH SEPARATOR         Zp  B    BK
------------------------------------------------------------------------
Differences:
 - VT and FF are mandatory breaks (even if "implementations are not
 required to support the VT character")
 - FS, GS, RS are combining marks (CM): "Prohibit a line break between
 the character and the preceding character"
According to this Annex, the current splitlines() implementation violates the Unicode standard.
References:
 - Unicode Standard Annex #14 - Line Breaking Algorithm
 http://www.unicode.org/reports/tr14/
 - UCD LineBreak.txt
 http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt 
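To see how the mandatory-break list falls out of LineBreak.txt, here is a rough sketch of a parser for its "code point(s);class # comment" layout. The embedded `SAMPLE` is a hand-written excerpt in that style, not the real file, and `mandatory_breaks` is a hypothetical helper rather than anything from makeunicodedata.py.

```python
# Hand-written excerpt in the style of LineBreak.txt (illustrative only).
SAMPLE = """\
# code point or range ; line break class
0009;BA # <control-0009>
000A;LF # <control-000A>
000B;BK # <control-000B>
000C;BK # <control-000C>
000D;CR # <control-000D>
000E..001F;CM # <control-000E>..<control-001F>
0085;NL # <control-0085>
2028;BK # LINE SEPARATOR
2029;BK # PARAGRAPH SEPARATOR
"""

MANDATORY = {"BK", "CR", "LF", "NL"}


def mandatory_breaks(lines):
    """Collect code points whose line break class is BK, CR, LF or NL."""
    result = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not data:
            continue
        span, prop = (part.strip() for part in data.split(";"))
        if prop not in MANDATORY:
            continue
        first, _, last = span.partition("..")  # single code point or range
        start = int(first, 16)
        end = int(last, 16) if last else start
        result.update(range(start, end + 1))
    return result


if __name__ == "__main__":
    print(sorted("%04X" % cp for cp in mandatory_breaks(SAMPLE.split("\n"))))
```

The sample is split with split("\n") rather than splitlines() so that the parser does not depend on the very behavior under discussion.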
msg97438 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-08 20:18
Florent Xicluna wrote:
> 
> Florent Xicluna <laxyf@yahoo.fr> added the comment:
> 
> Some technical background.
> 
> == Unicode ==
> 
> According to the Unicode Standard Annex #9, a character with
> bidirectional class B is a "Paragraph Separator". And "Because a
> Paragraph Separator breaks lines, there will be at most one per line,
> at the end of that line."
> 
> As a consequence, there are three reasons to identify a character as a
> linebreak:
> - General Category Zl "Line Separator"
> - General Category Zp "Paragraph Separator"
> - Bidirectional Class B "Paragraph Separator"
This definition is what we use in Python for Py_UNICODE_ISLINEBREAK(ch).
> There are 8 line breaks in the current Unicode Database (5.2):
> ------------------------------------------------------------------------
> 000A LF LINE FEED Cc B
> 000D CR CARRIAGE RETURN Cc B
> 001C FS INFORMATION SEPARATOR FOUR Cc B (UCD 3.1 FILE SEPARATOR)
> 001D GS INFORMATION SEPARATOR THREE Cc B (UCD 3.1 GROUP SEPARATOR)
> 001E RS INFORMATION SEPARATOR TWO Cc B (UCD 3.1 RECORD SEPARATOR)
> 0085 NEL NEXT LINE Cc B (C1 Control Code)
> 2028 LS LINE SEPARATOR Zl WS (Unicode)
> 2029 PS PARAGRAPH SEPARATOR Zp B (Unicode)
> ------------------------------------------------------------------------
And that's the list we're currently using.
> == ASCII ==
> 
> The Standard ASCII control codes (C0) are in the range 00-1F.
> It limits the list to LF, CR, FS, GS, RS.
> Regarding the last three, they are not considered as linebreaks:
> "The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
> structure data, usually on a tape, in order to simulate punched cards. End of
> medium (EM) warns that the tape (or whatever) is ending. While many systems use
> CR/LF and TAB for structuring data, it is possible to encounter the separator
> control characters in data that needs to be structured. The separator control
> characters are not overloaded; there is no general use of them except to
> separate data into structured groupings. Their numeric values are contiguous
> with the space character, which can be considered a member of the group, as a
> word separator."
> (Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)
> 
> In conclusion, it may be better to keep things unchanged.
Agreed.
> We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.
For ASCII we should make the list of characters explicit.
For Unicode, we should mention the above definition and give
the table as example list (the Unicode database may add more
such characters in the future).
> References:
> - The Unicode Character Database (UCD): http://www.unicode.org/ucd/
> - UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
> - The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
> - C0 and C1 Control Codes:
> http://en.wikipedia.org/wiki/C0_and_C1_control_codes 
msg97440 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-08 21:08
Florent Xicluna wrote:
> 
> Florent Xicluna <laxyf@yahoo.fr> added the comment:
> 
> It's confusing.
> 
> There's a specific annex UAX #14 which defines "Line Breaking Properties".
> Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
> BK, CR, LF, NL
Note that a line breaking algorithm is different from
a line splitting algorithm. The latter separates lines at
pre-defined positions in the text; the former formats
a piece of text to fit e.g. into a certain width of available
character positions.
.splitlines() implements a line splitting algorithm, not a line
breaking one.
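The distinction can be illustrated with the standard library. In the sketch below, splitlines() stands for line splitting and textwrap.wrap() stands in for line breaking; note that textwrap is a simple greedy word wrapper and not an implementation of UAX #14, so it is used here only to show the difference in purpose. The sample text and width are arbitrary.

```python
import textwrap

TEXT = "The quick brown fox\njumps over the lazy dog"

# Line splitting: cut the text at the separators that are already in it.
split_result = TEXT.splitlines()

# Line breaking: choose new break opportunities so each line fits a width.
wrapped = textwrap.wrap(TEXT, width=12)

if __name__ == "__main__":
    print(split_result)
    print(wrapped)
```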
> And the resulting list is different:
> CAT BIDI BRK
> ------------------------------------------------------------------------
> 000A LF LINE FEED Cc B LF
> 000B VT LINE TABULATION Cc S BK (since Unicode 5.0) 
> 000C FF FORM FEED Cc WS BK
> 000D CR CARRIAGE RETURN Cc B CR
> 0085 NEL NEXT LINE Cc B NL (C1 Control Code)
> 2028 LS LINE SEPARATOR Zl WS BK
> 2029 PS PARAGRAPH SEPARATOR Zp B BK
> ------------------------------------------------------------------------
>
> Differences:
> - VT and FF are mandatory breaks (even if "implementations are not
> required to support the VT character")
> - FS, GS, RS are combining marks (CM): "Prohibit a line break between
> the character and the preceding character"
> 
> According to this Annex, the current splitlines() implementation violates the Unicode standard.
It appears so, and I guess that's an oversight on my part when
writing the code: in Unicode 2.1 (the version I started with),
FF was marked as "B". Later on, Unicode 3.0 was published and
the new LineBreak.txt file was added to the standard. FF was
changed to "WS" and instead marked as "BK" in that new LineBreak.txt
file.
Since we only used the main UnicodeData.txt file as basis for
the type database, the "FF" code point dropped out of the
line break code point set.
I guess we'll have to add FF and VT to the generator makeunicodedata.py
to remedy this.
> References:
> - Unicode Standard Annex #14 - Line Breaking Algorithm
> http://www.unicode.org/reports/tr14/
> - UCD LineBreak.txt
> http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt
Thanks,
msg97483 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-10 00:45
Here is a draft of the patch that does what Marc-Andre proposed in msg97440 (add VT and FF).
Additionally, I upgraded the UCD from 5.1 to 5.2.
The implementation uses field 16 as defined in the "py3k" implementation of "makeunicodedata.py". It should minimize differences between the Py2 and Py3 implementations.
Documentation and tests are missing.
I can provide a "diff.gz" containing "Modules/unicodedata_db.h", "Modules/unicodename_db.h" and "Objects/unicodetype_db.h", if needed.
- /* Returns 1 for Unicode characters having the category 'Zl',
- * 'Zp' or type 'B', 0 otherwise.
+ /* Returns 1 for Unicode characters having the line break
+ * property 'BK', 'CR', 'LF' or 'NL' or having bidirectional
+ * type 'B', 0 otherwise.
 */
Note: the "remove_deprecation" patch should be applied first to remove the "-3" warnings.
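The rule in the updated comment can be modelled in Python to compare it with what str.splitlines() does on a given interpreter. This is a sketch under stated assumptions: the unicodedata module does not expose the Line_Break property, so the BK/CR/LF/NL code points are hard-coded from the table earlier in this thread, and both function names are invented for illustration.

```python
import unicodedata

# Code points with line break class BK, CR, LF or NL (hard-coded assumption,
# taken from the UAX #14 table quoted in this issue).
MANDATORY_BREAKS = frozenset({0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029})


def is_linebreak(ch):
    """Model of the patched rule: mandatory break class or bidi type 'B'."""
    return ord(ch) in MANDATORY_BREAKS or unicodedata.bidirectional(ch) == "B"


def splits_line(ch):
    """True if str.splitlines() treats ch as a line boundary."""
    return len(("a" + ch + "b").splitlines()) == 2


if __name__ == "__main__":
    for cp in range(0x3000):
        ch = chr(cp)
        if is_linebreak(ch) != splits_line(ch):
            print("mismatch at U+%04X" % cp)
```

If the script prints nothing, the model and the interpreter agree over the scanned range.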
msg97502 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-10 10:28
I don't know what to do about this:
> - FS, GS, RS are combining marks (CM): "Prohibit a line break between
> the character and the preceding character"
I know they are not commonly used. So we can keep them as line breaks.
But if we comply strictly with UAX 14 we do not consider them as line breaks.
msg97531 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-10 18:04
Florent Xicluna wrote:
> 
> Florent Xicluna <laxyf@yahoo.fr> added the comment:
> 
> I don't know what to do about this:
> 
>> - FS, GS, RS are combining marks (CM): "Prohibit a line break between
>> the character and the preceding character"
> 
> I know they are not commonly used. So we can keep them as line breaks.
> But if we comply strictly with UAX 14 we do not consider them as line breaks.
Right. The only update we'd have to do is add FF and VT.
I am a little worried about the possible breakage this may cause,
though. E.g. if you look at a file with FFs in Emacs, the FFs don't
show up as line breaks. FFs in CSV files are currently also not regarded
as line breaks and thus don't need to be placed in quotes.
VTs are probably a non-issue, since they are not in common use.
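The CSV concern can be made concrete with a small experiment. The sketch below uses a made-up record (`DATA`) with an unquoted FF inside a field and two hypothetical helpers; it contrasts feeding the csv module a file-like object with feeding it the result of str.splitlines() on an interpreter where FF is a line boundary for splitlines().

```python
import csv
import io

# Hypothetical record with an unquoted form feed inside the first field.
DATA = "a\x0cb,c\n"


def rows_from_stream(data=DATA):
    """Let the file-like object find the lines (only \\n and \\r count)."""
    return list(csv.reader(io.StringIO(data, newline="")))


def rows_from_splitlines(data=DATA):
    """Pre-split with str.splitlines(), which may also split on FF."""
    return list(csv.reader(data.splitlines()))


if __name__ == "__main__":
    print(rows_from_stream())
    print(rows_from_splitlines())
```

Code that pre-splits its input with splitlines() is the kind that would see one record turn into two.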
msg98485 - (view) Author: Chris Carter (Chris.Carter) Date: 2010-01-29 00:15
Then I must ask, why did the string attribute behave differently? I added it to allow for that, and the behavior seems inconsistent.
msg98486 - (view) Author: Chris Carter (Chris.Carter) Date: 2010-01-29 00:16
My bad, wrong bug.
msg101294 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-19 00:30
Cleanup committed as r78982
Patch for LineBreak.txt updated after UCD upgrade to 5.2.
See details: http://bugs.python.org/issue7643#msg97483
Tests added to test_unicodedata.
Backward compatibility concern:
 * it adds VT u'\x0b' and FF u'\x0c' as line breaks.
The choice is either to preserve backward compatibility, or to comply with the specification (UAX #14).
msg101306 - (view) Author: Chris Carter (Chris.Carter) Date: 2010-03-19 05:01
unwatched
msg101494 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-03-22 11:56
Florent Xicluna wrote:
> Backward compatibility concern:
> * it adds VT u'\x0b' and FF u'\x0c' as line breaks.
> 
> The choice is either to preserve backward compatibility, or to comply with the specification (UAX #14).
I think we should correct this bug together with a clear warning in
the Misc/NEWS file.
msg101945 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-03-30 16:45
Which functions are affected by this change?
Py_UNICODE_ISLINEBREAK()? unicode.splitlines()?
msg101948 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-30 17:05
Committed to trunk: r79494 and r79496.
Afaict, it changes Py_UNICODE_ISLINEBREAK, _PyUnicode_IsLinebreak and the Unicode functions which depend on it (splitlines(), _sre module).
msg101955 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-30 20:21
Ported to 3.x with r79506
History
Date | User | Action | Args
2022-04-11 14:56:56 | admin | set | github: 51892
2010-03-30 20:21:44 | flox | set | status: open -> closed; resolution: fixed; messages: + msg101955; stage: resolved
2010-03-30 17:05:56 | flox | set | messages: + msg101948
2010-03-30 16:45:25 | amaury.forgeotdarc | set | assignee: flox; messages: + msg101945; nosy: + amaury.forgeotdarc
2010-03-22 11:56:00 | lemburg | set | messages: + msg101494
2010-03-19 06:57:07 | flox | set | nosy: - Chris.Carter
2010-03-19 05:01:51 | Chris.Carter | set | nosy: lemburg, flox, Chris.Carter; messages: + msg101306
2010-03-19 00:31:00 | flox | set | priority: normal; files: + issue7643_use_LineBreak_v2.diff; messages: + msg101294
2010-03-18 23:48:19 | flox | set | files: - issue7643_use_LineBreak.diff
2010-03-18 22:58:00 | michael.foord | set | nosy: - michael.foord
2010-03-18 22:57:18 | flox | set | files: - issue7643_remove_deprecation.diff
2010-01-29 00:16:20 | Chris.Carter | set | messages: + msg98486
2010-01-29 00:15:43 | Chris.Carter | set | nosy: + Chris.Carter; messages: + msg98485
2010-01-10 18:05:00 | lemburg | set | messages: + msg97531
2010-01-10 10:28:24 | flox | set | nosy: lemburg, michael.foord, flox; messages: + msg97502; components: + Unicode; title: What is an ASCII linebreak? -> What is a Unicode line break character?
2010-01-10 00:45:28 | flox | set | files: + issue7643_use_LineBreak.diff; messages: + msg97483
2010-01-10 00:36:01 | flox | set | files: + issue7643_remove_deprecation.diff; keywords: + patch
2010-01-08 21:08:20 | lemburg | set | messages: + msg97440
2010-01-08 20:18:22 | lemburg | set | messages: + msg97438
2010-01-08 11:42:41 | flox | set | messages: + msg97410
2010-01-08 10:33:51 | michael.foord | set | messages: + msg97408
2010-01-08 10:32:06 | flox | set | messages: + msg97407
2010-01-07 00:03:17 | michael.foord | set | nosy: + michael.foord; messages: + msg97333
2010-01-06 09:14:08 | lemburg | set | nosy: + lemburg; messages: + msg97300
2010-01-06 08:46:45 | flox | create |
