Message 155361 - Python tracker


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

In-reply-to
Author	poq
Recipients	Arfrever, Nicholas.Cole, ezio.melotti, inigoserna, loewis, poq, tchrist, vstinner, zeha
Date	2012-03-11.00:32:14
SpamBayes Score	1.652261e-05
Marked as misclassified	No
Message-id	<1331425935.51.0.471117522786.issue12568@psf.upfronthosting.co.za>

Content

It seems this is a bit of a minefield...
GNOME Terminal/libvte has an environment variable (VTE_CJK_WIDTH) to override the handling of ambiguous width characters. It bases its default on the locale (with the comment 'This is basically what GNU libc does').
urxvt just uses system wcwidth.
Xterm uses some voodoo to decide between system wcwidth and mk_wcwidth(_cjk): http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
I think the simplest solution is to just expose libc's wc(s)width. It is widely used and is most likely to match the behaviour of the terminal.
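For illustration, here is a rough sketch of what calling libc's wcwidth/wcswidth from Python could look like today via ctypes. This is not a proposed API: the names libc_wcwidth and libc_wcswidth are made up for this example, and it assumes a POSIX system where ctypes.CDLL(None) exposes the C library's symbols and where wchar_t holds a full code point (as on Linux).

```python
import ctypes
import locale

# libc's width functions depend on the active LC_CTYPE locale,
# so try to adopt the user's locale first.
try:
    locale.setlocale(locale.LC_ALL, '')
except locale.Error:
    pass  # keep whatever locale is already active

# Assumption: POSIX only. CDLL(None) gives access to symbols already
# loaded into the process, which includes the C library.
_libc = ctypes.CDLL(None)
_libc.wcwidth.argtypes = [ctypes.c_wchar]
_libc.wcwidth.restype = ctypes.c_int
_libc.wcswidth.argtypes = [ctypes.c_wchar_p, ctypes.c_size_t]
_libc.wcswidth.restype = ctypes.c_int


def libc_wcwidth(c):
    """Column width of one character according to libc (-1 if not printable)."""
    return _libc.wcwidth(c)


def libc_wcswidth(s):
    """Column width of a string according to libc (-1 if any character is not printable)."""
    return _libc.wcswidth(s, len(s))
```

Results for non-ASCII characters will vary with the locale and the libc version, which is the point of delegating: the answer should then agree with terminals that use the system wcwidth.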
FWIW I wrote a little script to test the widths of all Unicode characters, and came up with the following logic to match libvte behaviour:
import unicodedata

def wcwidth(c, legacy_cjk=False):
	# tab, CR, LF, backspace (\10), vertical tab (\13), form feed (\14)
	if c in u'\t\r\n\10\13\14': raise ValueError('character %r has no intrinsic width' % c)
	# NUL (\0), ENQ (\5), BEL (\7), SO (\16), SI (\17)
	if c in u'\0\5\7\16\17': return 0
	if u'\u1160' <= c <= u'\u11ff': return 0 # hangul jamo
	if unicodedata.category(c) in ('Mn', 'Me', 'Cf') and c != u'\u00ad': return 0 # 00ad = soft hyphen
	eaw = unicodedata.east_asian_width(c)
	if eaw in ('F', 'W'): return 2
	if legacy_cjk and eaw == 'A': return 2
	return 1
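As a usage sketch, per-character widths like these can be summed to measure a whole string and pad it to a column. The helpers below (char_width, string_width, pad_to) are hypothetical names, and char_width is a deliberately simplified stand-in that only handles zero-width marks and East Asian width, not the control-character cases.

```python
import unicodedata


def char_width(c, legacy_cjk=False):
    """Simplified stand-in for a per-character width function (assumption)."""
    if unicodedata.category(c) in ('Mn', 'Me', 'Cf') and c != '\u00ad':
        return 0  # combining marks and format characters occupy no cell
    eaw = unicodedata.east_asian_width(c)
    if eaw in ('F', 'W') or (legacy_cjk and eaw == 'A'):
        return 2  # fullwidth/wide, or ambiguous in a legacy CJK context
    return 1


def string_width(s, legacy_cjk=False):
    """Total number of terminal cells the string is expected to occupy."""
    return sum(char_width(c, legacy_cjk) for c in s)


def pad_to(s, columns, legacy_cjk=False):
    """Right-pad s with spaces so that it occupies at least `columns` cells."""
    return s + ' ' * max(0, columns - string_width(s, legacy_cjk))
```

With this, string_width(u'\u00a1') differs depending on legacy_cjk, because U+00A1 is an ambiguous width character.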

History
Date	User	Action	Args
2012-03-11 00:32:15	poq	set	recipients: + poq, loewis, vstinner, ezio.melotti, Arfrever, inigoserna, zeha, Nicholas.Cole, tchrist
2012-03-11 00:32:15	poq	set	messageid: <1331425935.51.0.471117522786.issue12568@psf.upfronthosting.co.za>
2012-03-11 00:32:14	poq	link	issue12568 messages
2012-03-11 00:32:14	poq	create
