Message 97440 - Python tracker

Author	lemburg
Recipients	flox, lemburg, michael.foord
Date	2010-01-08 21:08:19
Message-id	<4B479EC1.8020009@egenix.com>
In-reply-to	<1262950962.97.0.235825438798.issue7643@psf.upfronthosting.co.za>

Content

Florent Xicluna wrote:
> 
> Florent Xicluna <laxyf@yahoo.fr> added the comment:
> 
> It's confusing.
> 
> There's a specific annex UAX #14 which defines "Line Breaking Properties".
> Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
> BK, CR, LF, NL

Note that a line breaking algorithm is different from a line
splitting algorithm. The latter separates lines at pre-defined
positions in the text; the former formats a piece of text to fit,
for example, a given width of available character positions.

.splitlines() implements a line splitting algorithm, not a line
breaking one.

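To illustrate the distinction, here is a minimal sketch. It assumes the standard library's textwrap module as a stand-in for a line breaking algorithm; textwrap is not an implementation of UAX #14 and only approximates the idea of fitting text into a given width. The sample string is made up for the example.

```python
import textwrap

# Made-up sample text with two pre-existing line separators.
text = "first line\nsecond line that is quite a bit longer than the first\n"

# Line splitting: cut only where a separator is already present in the text.
lines = text.splitlines()

# Line breaking (approximated with textwrap, not UAX #14): choose break
# positions so that every resulting line fits into the available width.
wrapped = textwrap.wrap(text, width=20)

print(lines)
print(wrapped)
```

The first call never invents new break positions; the second one ignores where the original separators were and reflows the text.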
> And the resulting list is different:
> Code  Abbr  Name                  CAT  BIDI  BRK
> ------------------------------------------------------------------------
> 000A  LF    LINE FEED             Cc   B     LF
> 000B  VT    LINE TABULATION       Cc   S     BK  (since Unicode 5.0)
> 000C  FF    FORM FEED             Cc   WS    BK
> 000D  CR    CARRIAGE RETURN       Cc   B     CR
> 0085  NEL   NEXT LINE             Cc   B     NL  (C1 Control Code)
> 2028  LS    LINE SEPARATOR        Zl   WS    BK
> 2029  PS    PARAGRAPH SEPARATOR   Zp   B     BK
> ------------------------------------------------------------------------
>
> Differences:
> - VT and FF are mandatory breaks (even if "implementations are not
> required to support the VT character")
> - FS, GS, US are combining marks (CM): "Prohibit a line break between
> the character and the preceding character"
> 
> According to this Annex, the current splitlines() implementation violates the Unicode standard.

It appears so, and I guess that's an oversight on my part when
writing the code. In Unicode 2.1 (the version I started with),
FF was marked as "B". Later on, Unicode 3.0 was published and
the new LineBreak.txt file was added to the standard; FF was
changed to "WS" and instead marked as "BK" in that new file.

Since we only used the main UnicodeData.txt file as the basis for
the type database, the "FF" code point dropped out of the
line break code point set.

I guess we'll have to add FF and VT to the generator
makeunicodedata.py to remedy this.

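The effect described above can be checked with a small sketch. It assumes a rule of the form "bidirectional class B or general category Zl" for deriving line break code points from UnicodeData.txt alone; that rule is an illustration of the reasoning, not a quote from makeunicodedata.py. The helper names are hypothetical. The script prints, for each code point from the quoted table plus FS, GS, RS and US, what the unicodedata module reports and whether str.splitlines() in the running interpreter splits there.

```python
import unicodedata

# Code points from the quoted table, plus FS, GS, RS and US.
CANDIDATES = [0x0A, 0x0B, 0x0C, 0x0D, 0x1C, 0x1D, 0x1E, 0x1F,
              0x85, 0x2028, 0x2029]


def is_break_by_unicodedata(ch):
    """Assumed rule (hypothetical): bidi class "B" or category "Zl"."""
    return (unicodedata.bidirectional(ch) == "B"
            or unicodedata.category(ch) == "Zl")


def splits_on(ch):
    """True if str.splitlines() of this interpreter splits at ch."""
    return len(("a" + ch + "b").splitlines()) == 2


for cp in CANDIDATES:
    ch = chr(cp)
    print("%04X  cat=%-2s  bidi=%-2s  rule=%-5s  splitlines=%s" % (
        cp,
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        is_break_by_unicodedata(ch),
        splits_on(ch),
    ))
```

Under the assumed rule, FF (bidi "WS") and VT (bidi "S") are not selected, which matches the explanation that they can only be picked up from LineBreak.txt or by listing them explicitly in the generator.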
> References:
> - Unicode Standard Annex #14 - Line Breaking Algorithm
> http://www.unicode.org/reports/tr14/
> - UCD LineBreak.txt
> http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt
Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

History
Date	User	Action	Args
2010-01-08 21:08:22	lemburg	set	recipients: + lemburg, michael.foord, flox
2010-01-08 21:08:20	lemburg	link	issue7643 messages
2010-01-08 21:08:19	lemburg	create
