
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author flox
Recipients flox, lemburg, michael.foord
Date 2010-01-08.11:42:40
SpamBayes Score 7.804027e-05
Marked as misclassified No
Message-id <1262950962.97.0.235825438798.issue7643@psf.upfronthosting.co.za>
In-reply-to
Content
It's confusing.
There's a specific annex UAX #14 which defines "Line Breaking Properties".
Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
 BK, CR, LF, NL
And the resulting list is different:
                                    CAT BIDI BRK
------------------------------------------------------------------------
000A LF   LINE FEED                 Cc  B    LF
000B VT   LINE TABULATION           Cc  S    BK (since Unicode 5.0)
000C FF   FORM FEED                 Cc  WS   BK
000D CR   CARRIAGE RETURN           Cc  B    CR
0085 NEL  NEXT LINE                 Cc  B    NL (C1 Control Code)
2028 LS   LINE SEPARATOR            Zl  WS   BK
2029 PS   PARAGRAPH SEPARATOR       Zp  B    BK
------------------------------------------------------------------------
Differences:
 - VT and FF are mandatory breaks (even if "implementations are not
 required to support the VT character")
 - FS, GS, US are combining marks (CM): "Prohibit a line break between
 the character and the preceding character"
According to this Annex, the current splitlines() implementation violates the Unicode standard.
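The behaviour on a given interpreter can be checked by probing str.splitlines() with each of the characters discussed above. The following is only a minimal sketch: the names CANDIDATES, splits_on and report are illustrative and not part of this issue, and the results depend on the Python version it is run on.

```python
import unicodedata

# Characters discussed in this message, keyed by their usual abbreviation.
CANDIDATES = {
    "LF": "\n",
    "VT": "\x0b",
    "FF": "\x0c",
    "CR": "\r",
    "FS": "\x1c",
    "GS": "\x1d",
    "US": "\x1f",
    "NEL": "\x85",
    "LS": "\u2028",
    "PS": "\u2029",
}


def splits_on(ch):
    """Return True if str.splitlines() treats ch as a line boundary."""
    return len(("a" + ch + "b").splitlines()) == 2


def report():
    """Map each candidate abbreviation to whether splitlines() breaks on it."""
    return {name: splits_on(ch) for name, ch in CANDIDATES.items()}


if __name__ == "__main__":
    for name, result in report().items():
        ch = CANDIDATES[name]
        # General category and bidi class come from the interpreter's own UCD.
        print("U+%04X %-3s %-2s %-2s splits=%s" % (
            ord(ch), name, unicodedata.category(ch),
            unicodedata.bidirectional(ch), result))
```

Comparing the printed "splits" column against the BRK column of the table shows where the implementation and UAX #14 disagree.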
References:
 - Unicode Standard Annex #14 - Line Breaking Algorithm
 http://www.unicode.org/reports/tr14/
 - UCD LineBreak.txt
 http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt 
History
Date User Action Args
2010-01-08 11:42:43 flox set recipients: + flox, lemburg, michael.foord
2010-01-08 11:42:42 flox set messageid: <1262950962.97.0.235825438798.issue7643@psf.upfronthosting.co.za>
2010-01-08 11:42:41 flox link issue7643 messages
2010-01-08 11:42:40 flox create
