Message 243660 - Python tracker


In-reply-to
Author	eryksun
Recipients	belopolsky, brian.curtin, eryksun, ocean-city, pitrou, python-dev, vstinner
Date	2015-05-20 13:30:40
Message-id	<1432128640.8.0.377125320908.issue10653@psf.upfronthosting.co.za>

Content

This solution no longer works. If the system is configured to use the Japanese system locale and language pack, then 3.4.3 returns code page 932 mojibake for the "%Z" time zone name. Originally [this approach worked][1] because it called PyUnicode_Decode using the 'mbcs' encoding, which decodes with the system ANSI code page.
Currently it calls PyUnicode_DecodeLocaleAndSize, which just ends up calling mbstowcs. That's pretty much what wcsftime does. In the default C locale, mbstowcs simply casts each byte value to a wchar_t, so the code page 932 bytes come back as if they were Latin-1 characters:
 >>> time.strftime('%Z')
 '\x91\xbe\x95\xbd\x97m\x89\xc4\x8e\x9e\x8a\xd4'
 >>> time.strftime('%Z').encode('latin-1').decode('932')
 '太平洋夏時間'
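The byte-to-code-point cast can be reproduced without a Japanese Windows install. This is only a minimal sketch, assuming the CRT holds the time zone name as code page 932 bytes. The helper names are hypothetical, nothing here calls the CRT, and 'latin-1' merely stands in for mbstowcs in the C locale, since both map each byte 0xNN to U+00NN:

```python
# Hypothetical stand-ins for the C-level conversions; illustration only.
TZ_NAME = '太平洋夏時間'  # the name shown in the session above


def crt_bytes(name):
    """Encode the name as the ANSI (code page 932) bytes the CRT would hold."""
    return name.encode('cp932')


def c_locale_mbstowcs(raw):
    """Mimic mbstowcs in the "C" locale: each byte becomes one code point."""
    return raw.decode('latin-1')


def mbcs_decode(raw):
    """Mimic the older 'mbcs' decode on a system whose ANSI code page is 932."""
    return raw.decode('cp932')


raw = crt_bytes(TZ_NAME)
mojibake = c_locale_mbstowcs(raw)

print(ascii(mojibake))              # escaped form of the mojibake string
print(mbcs_decode(raw) == TZ_NAME)  # the 'mbcs'-style decode keeps the name
```

The escaped string printed first should match the 3.4.3 output above, and it is still reversible at this stage because no byte has been dropped.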
The problem is worse for 3.5 built with VC++ 14. In the new CRT, strftime decodes the format string via MultiByteToWideChar, calls _Wcsftime_l, and encodes the result back via WideCharToMultiByte. The outer conversions use the default LC_TIME code page, which is ANSI (ACP), so they're not the problem. The problem is the internal _mbstowcs_s_l conversion of the ANSI time zone name, which creates the mojibake 'unicode' string shown above. This is then compounded by calling WideCharToMultiByte on the result, which substitutes '?' for the characters that the ANSI code page cannot represent:
 >>> time.strftime('%Z')
 '?????m?A???O'
There's no way to fix this by transcoding. The result is just garbage.
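Why transcoding cannot help can be seen by simulating the second conversion. This is a rough sketch, assuming Python's 'cp932' codec with errors='replace' as a stand-in for WideCharToMultiByte. Unlike the Windows call, the codec does no best-fit mapping, so the exact bytes differ from the 3.5 output above (for example, 'Ä' would not become 'A'), but the loss is of the same kind. The function name is hypothetical:

```python
# The mojibake string produced by the C-locale byte cast (from the 3.4.3 session).
MOJIBAKE = '\x91\xbe\x95\xbd\x97m\x89\xc4\x8e\x9e\x8a\xd4'


def lossy_ansi_encode(text):
    """Stand-in for WideCharToMultiByte: unencodable characters become '?'."""
    return text.encode('cp932', errors='replace')


lossy = lossy_ansi_encode(MOJIBAKE)
original_bytes = MOJIBAKE.encode('latin-1')  # the bytes the name started as

print(lossy)
print(lossy == original_bytes)  # False: the original bytes are gone
```

Once most characters have collapsed to the same '?' byte, different time zone names can no longer be told apart, so no later decode or encode step can restore them.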
[1]: https://hg.python.org/cpython/file/79e60977fc04/Modules/timemodule.c#l501

History
Date	User	Action	Args
2015-05-20 13:30:40	eryksun	set	recipients: + eryksun, belopolsky, pitrou, vstinner, ocean-city, brian.curtin, python-dev
2015-05-20 13:30:40	eryksun	set	messageid: <1432128640.8.0.377125320908.issue10653@psf.upfronthosting.co.za>
2015-05-20 13:30:40	eryksun	link	issue10653 messages
2015-05-20 13:30:40	eryksun	create