Message 153921 - Python tracker



Author	ezio.melotti
Recipients	Ramchandra Apte, amaury.forgeotdarc, ezio.melotti, harveyang, mrabarnett
Date	2012-02-22.02:01:11
Message-id	<1329876072.69.0.349013967937.issue14068@psf.upfronthosting.co.za>

Content

As long as you don't mix str and unicode, everything works.
With str:
>>> s = '与清新。阿德莱'
>>> re.split('。', s)
['\xe4\xb8\x8e\xe6\xb8\x85\xe6\x96\xb0', '\xe9\x98\xbf\xe5\xbe\xb7\xe8\x8e\xb1']
>>> s.split('。')
['\xe4\xb8\x8e\xe6\xb8\x85\xe6\x96\xb0', '\xe9\x98\xbf\xe5\xbe\xb7\xe8\x8e\xb1']
With unicode:
>>> u = u'与清新。阿德莱'
>>> re.split(u'。', u)
[u'\u4e0e\u6e05\u65b0', u'\u963f\u5fb7\u83b1']
>>> u.split(u'。')
[u'\u4e0e\u6e05\u65b0', u'\u963f\u5fb7\u83b1']
Mixing str and unicode:
>>> re.split(u'。', s)
['\xe4\xb8\x8e\xe6\xb8\x85\xe6\x96\xb0\xe3\x80\x82\xe9\x98\xbf\xe5\xbe\xb7\xe8\x8e\xb1']
>>> re.split('。', u)
[u'\u4e0e\u6e05\u65b0\u3002\u963f\u5fb7\u83b1']
>>>
>>> s.split(u'。')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u.split('。')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
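Both failure modes in the session above can be reproduced with explicit encode and decode calls. The following is a minimal sketch in Python 3 syntax, where bytes plays the role of 2.x str and str plays the role of 2.x unicode. The variable names are illustrative only, the latin-1 step is just a rough emulation of how a unicode pattern is compared against the individual bytes of a str, and the explicit decode('ascii') stands in for the implicit coercion that 2.x performs in s.split(u'。').

```python
import re

# Same text as in the session above, written with escapes.
text = '\u4e0e\u6e05\u65b0\u3002\u963f\u5fb7\u83b1'   # role of 2.x unicode
data = text.encode('utf-8')                            # role of 2.x str (UTF-8 bytes)

sep_text = '\u3002'                     # the separator as a single code point
sep_bytes = sep_text.encode('utf-8')    # the same separator as three bytes
print(sep_bytes)                        # b'\xe3\x80\x82'

# Rough emulation of re.split(u'\u3002', s) in 2.x: every byte becomes a
# character with an ordinal in 0..255, so code point 0x3002 never matches
# and the input comes back unsplit.
as_ordinals = data.decode('latin-1')
mixed_result = re.split(sep_text, as_ordinals)
print(len(mixed_result))

# Explicit form of the implicit ASCII decoding behind s.split(u'\u3002').
try:
    data.decode('ascii')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
print(decode_error)
```

The separator is one character in the text but three bytes in the UTF-8 data, which is why neither mixed re.split call finds it.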
The SyntaxError in Python 3 is raised for bytes literals that contain non-ASCII characters, and that change can't be backported to 2.7. Raising an error when str and unicode are mixed in re would not be backward compatible, and re does work with mixed types as long as both are ASCII-only. I'm therefore closing this as invalid.
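One way to avoid mixing the two types is to convert everything to text at the boundary before calling re.split. The following is a minimal sketch in Python 3 syntax, not something taken from this issue: the helper names to_text and split_text are hypothetical, and UTF-8 is assumed as the encoding of any byte input.

```python
import re

def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def split_text(sep, value, encoding='utf-8'):
    """Split value on the literal separator sep after converting both to text."""
    sep = to_text(sep, encoding)
    value = to_text(value, encoding)
    # re.escape makes sure the separator is matched literally.
    return re.split(re.escape(sep), value)

# UTF-8 bytes from the session above, split on a text separator.
data = b'\xe4\xb8\x8e\xe6\xb8\x85\xe6\x96\xb0\xe3\x80\x82\xe9\x98\xbf\xe5\xbe\xb7\xe8\x8e\xb1'
parts = split_text('\u3002', data)
print([ascii(p) for p in parts])
```

Because both arguments are decoded before matching, the result is the same whichever of the two arrives as bytes. In Python 3 itself, passing a text pattern with a bytes string (or the reverse) to re.split raises TypeError instead of returning the unsplit input.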

History
Date	User	Action	Args
2012-02-22 02:01:12	ezio.melotti	set	recipients: + ezio.melotti, amaury.forgeotdarc, mrabarnett, Ramchandra Apte, harveyang
2012-02-22 02:01:12	ezio.melotti	set	messageid: <1329876072.69.0.349013967937.issue14068@psf.upfronthosting.co.za>
2012-02-22 02:01:12	ezio.melotti	link	issue14068 messages
2012-02-22 02:01:11	ezio.melotti	create
