Message 255133 - Python tracker

In-reply-to
Author	eryksun
Recipients	AndiDog_old, BreamoreBoy, belopolsky, eric.smith, eryksun, ezio.melotti, shimizukawa, terry.reedy, vstinner
Date	2015-11-23.07:05:52
Message-id	<1448262354.84.0.562273558937.issue8304@psf.upfronthosting.co.za>

Content

The problem from issue 10653 is that internally the CRT encodes the time zone name using the ANSI codepage (i.e. the default system codepage). wcsftime decodes this string using mbstowcs (i.e. multibyte string to wide-character string), which uses Latin-1 in the C locale. In other words, in the C locale on Windows, mbstowcs just casts the byte values to wchar_t. 
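The byte-for-byte cast can be imitated in pure Python without touching the CRT. This is only a sketch, assuming the ANSI codepage is 1251 and borrowing the Russian time zone name from the example further down; `c_locale_decode` is an illustrative helper, not a CRT or Python API:

```python
# Illustrative only: imitate what mbstowcs does in the C locale on Windows.
# Assumption: the system ANSI codepage is 1251 (Cyrillic).
tz_name = 'Время в формате UTC'

# The CRT stores the time zone name as ANSI-encoded bytes.
ansi_bytes = tz_name.encode('cp1251')


def c_locale_decode(data):
    """Widen each byte to one character, which is what a Latin-1 decode does."""
    return ''.join(chr(b) for b in data)


mojibake = c_locale_decode(ansi_bytes)

# Casting byte values to wchar_t gives the same result as decoding Latin-1.
assert mojibake == ansi_bytes.decode('latin-1')
assert mojibake != tz_name
```

Here `mojibake` holds the same garbled text that wcsftime returns in the C locale.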
With the new Universal CRT, strftime is implemented by calling wcsftime, so the accepted solution for issue 10653 is broken in 3.5+. A simple way around the problem is to switch back to using wcsftime and temporarily (or permanently) set the thread's LC_CTYPE locale to the system default. This makes the internal mbstowcs call use the ANSI codepage. Note that on POSIX platforms 3.x already sets the default via setlocale(LC_CTYPE, "") in Python/pylifecycle.c. Why not set this for all platforms that have setlocale?
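The "temporarily set LC_CTYPE" idea can be sketched at the Python level as a context manager. This is not the proposed C change to the time module; `temporary_lc_ctype` and `tzname_with_locale` are hypothetical names. Note also that `locale.setlocale` changes the locale for the whole process, not just the current thread:

```python
import contextlib
import locale
import time


@contextlib.contextmanager
def temporary_lc_ctype(name=''):
    """Switch LC_CTYPE for the duration of the block, then restore it.

    An empty name selects the system default, like setlocale(LC_CTYPE, "")
    in C. The change is process-wide, so this is not thread-safe.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    locale.setlocale(locale.LC_CTYPE, name)
    try:
        yield
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


def tzname_with_locale(name=''):
    """Format %Z while LC_CTYPE is temporarily set to `name`."""
    with temporary_lc_ctype(name):
        return time.strftime('%Z')
```

Calling `tzname_with_locale()` with no argument formats the time zone name under the system default locale and leaves the previous setting in place afterwards.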
> I only tested with my default US locale.
If your system locale uses codepage 1252 (a superset of Latin-1), the mojibake is not visible by default, but you can still test this on a per-thread basis if your system has additional language packs. For example:
 import ctypes
 kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
 if kernel32.GetModuleHandleW('ucrtbased'): # debug build
     crt = ctypes.CDLL('ucrtbased', use_errno=True)
 else:
     crt = ctypes.CDLL('ucrtbase', use_errno=True)
 MUI_LANGUAGE_NAME = 8
 LC_CTYPE = 2
 class tm(ctypes.Structure):
     pass
 crt._gmtime64.restype = ctypes.POINTER(tm)
 # set a Russian locale for the current thread
 kernel32.SetThreadPreferredUILanguages(MUI_LANGUAGE_NAME,
                                        'ru-RU\0', None)
 crt._wsetlocale(LC_CTYPE, 'ru-RU')
 # update the time zone name based on the thread locale
 crt._tzset()
 # get a struct tm *
 ltime = ctypes.c_int64()
 crt._time64(ctypes.byref(ltime))
 tmptr = crt._gmtime64(ctypes.byref(ltime))
 # call wcsftime using C and Russian locales
 buf = (ctypes.c_wchar * 100)()
 crt._wsetlocale(LC_CTYPE, 'C')
 size = crt.wcsftime(buf, 100, '%Z\r\n', tmptr)
 tz1 = buf[:size]
 crt._wsetlocale(LC_CTYPE, 'ru-RU')
 size = crt.wcsftime(buf, 100, '%Z\r\n', tmptr)
 tz2 = buf[:size]
 hcon = kernel32.GetStdHandle(-11)
 pn = ctypes.pointer(ctypes.c_uint())
 >>> _ = kernel32.WriteConsoleW(hcon, tz1, len(tz1), pn, None)
 Âðåìÿ â ôîðìàòå UTC
 >>> _ = kernel32.WriteConsoleW(hcon, tz2, len(tz2), pn, None)
 Время в формате UTC
The first result demonstrates the ANSI => Latin-1 mojibake problem in the C locale. You can encode this result as Latin-1 and then decode it back as codepage 1251:
 >>> tz1.encode('latin-1').decode('1251') == tz2
 True
But transcoding isn't a general workaround since the format string shouldn't be restricted to ANSI, unless you can smuggle the Unicode through like Takayuki showed.
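To make the limitation concrete, here is a small sketch. It assumes the format string reaches wcsftime as real wide characters and only the %Z expansion is misdecoded; `untangle` is a hypothetical helper, and the sample strings are made up for illustration:

```python
def untangle(text, codepage):
    """Undo a Latin-1 misdecode of ANSI bytes.

    Raises UnicodeEncodeError if text contains anything outside Latin-1.
    """
    return text.encode('latin-1').decode(codepage)


# Fine when the whole result came from the ANSI codepage.
assert untangle('Âðåìÿ â ôîðìàòå UTC', 'cp1251') == 'Время в формате UTC'

# Literal text from the format string that is outside Latin-1 cannot be
# round-tripped at all.
mixed = '時刻: ' + 'Âðåìÿ'
try:
    untangle(mixed, 'cp1251')
except UnicodeEncodeError:
    transcodable = False
else:
    transcodable = True

# Literal text that happens to be inside Latin-1 is silently corrupted:
# U+00E9 becomes byte 0xE9, which codepage 1251 decodes as a Cyrillic letter.
corrupted = untangle('café', 'cp1251')
```

In the first failure mode the transcoding raises an error; in the second it succeeds and returns the wrong text, which is harder to notice.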

History
Date	User	Action	Args
2015-11-23 07:05:54	eryksun	set	recipients: + eryksun, terry.reedy, belopolsky, vstinner, eric.smith, ezio.melotti, AndiDog_old, BreamoreBoy, shimizukawa
2015-11-23 07:05:54	eryksun	set	messageid: <1448262354.84.0.562273558937.issue8304@psf.upfronthosting.co.za>
2015-11-23 07:05:54	eryksun	link	issue8304 messages
2015-11-23 07:05:52	eryksun	create
