
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: encodings: the "mbcs" alias doesn't work
Type: Stage: resolved
Components: Library (Lib) Versions: Python 3.11
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, vstinner
Priority: normal Keywords: patch

Created on 2022-02-06 23:06 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked
PR 31174 closed vstinner, 2022-02-06 23:17
Messages (8)
msg412678 - Author: STINNER Victor (vstinner) (Python committer) Date: 2022-02-06 23:06
While working on bpo-46659, I found a bug in the encodings "mbcs" alias. Even though the function has two tests (in test_codecs and test_site), both tests missed the bug :-(
I fixed the alias with this change:
---
commit 04dd60e50cd3da48fd19cdab4c0e4cc600d6af30
Author: Victor Stinner <vstinner@python.org>
Date: Sun Feb 6 21:50:09 2022 +0100
    bpo-46659: Update the test on the mbcs codec alias (GH-31168)

    encodings registers the _alias_mbcs() codec search function before
    the search_function() codec search function. Previously, the
    _alias_mbcs() was never used.

    Fix the test_codecs.test_mbcs_alias() test: use the current ANSI code
    page, not a fake ANSI code page number.

    Remove the test_site.test_aliasing_mbcs() test: the alias is now
    implemented in the encodings module, no longer in the site module.
---
But Eryk found two bugs:
"""
This was never true before. With 1252 as my ANSI code page, I checked codecs.lookup('cp1252') in 2.7, 3.4, 3.5, 3.6, 3.9, and 3.10, and none of them return the "mbcs" encoding. It's not equivalent, and not supposed to be. The implementation of "cp1252" should be cross-platform, regardless of whether we're on a Windows system with 1252 as the ANSI code page, as opposed to a Windows system with some other ANSI code page, or a Linux or macOS system.
The differences are that "mbcs" maps every byte, whereas our code-page encodings do not map undefined bytes, and the "replace" handler of "mbcs" uses a best-fit mapping (e.g. "α" -> "a") when encoding text, instead of mapping all undefined characters to "?".
"""
and my new test fails if the PYTHONUTF8=1 environment variable is set:
"""
This will fail if PYTHONUTF8 is set in the environment, because it overrides getpreferredencoding(False) and _get_locale_encoding().
"""
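The differences Eryk describes for the cross-platform "cp1252" codec can be checked on any platform. The following is a minimal sketch, not part of the original report. The "mbcs" comparison runs only on Windows and its results depend on the ANSI code page, so no output is claimed for it. The last line reports whether UTF-8 mode (the PYTHONUTF8 case) is active.

```python
import codecs
import sys

# Byte 0x81 is undefined in Python's cross-platform cp1252 table,
# so decoding it in strict mode raises UnicodeDecodeError.
try:
    b"\x81".decode("cp1252")
    undefined_byte_rejected = False
except UnicodeDecodeError:
    undefined_byte_rejected = True

# With the "replace" handler, an unencodable character becomes "?";
# there is no best-fit mapping such as "α" -> "a".
replaced = "α".encode("cp1252", "replace")

# On released Python versions the lookup returns the cp1252 codec itself.
codec_name = codecs.lookup("cp1252").name

print(undefined_byte_rejected, replaced, codec_name)

if sys.platform == "win32":
    # Windows only: "mbcs" uses the current ANSI code page, so these
    # results vary from system to system.
    print(ascii(b"\x81".decode("mbcs", "replace")))
    print("α".encode("mbcs", "replace"))

# UTF-8 mode (PYTHONUTF8=1 or -X utf8) overrides the locale encoding.
print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))
```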
The code for the "mbcs" alias changed a lot between Python 3.5 and 3.7.
In Python 3.5, site module:
---
def aliasmbcs():
    """On Windows, some default encodings are not provided by Python,
    while they are always available as "mbcs" in each locale. Make
    them usable by aliasing to "mbcs" in such a case."""
    if sys.platform == 'win32':
        import _bootlocale, codecs
        enc = _bootlocale.getpreferredencoding(False)
        if enc.startswith('cp'):            # "cp***" ?
            try:
                codecs.lookup(enc)
            except LookupError:
                import encodings
                encodings._cache[enc] = encodings._unknown
                encodings.aliases.aliases[enc] = 'mbcs'
---
In Python 3.6, encodings module:
---
(...)
codecs.register(search_function)

if sys.platform == 'win32':
    def _alias_mbcs(encoding):
        try:
            import _bootlocale
            if encoding == _bootlocale.getpreferredencoding(False):
                import encodings.mbcs
                return encodings.mbcs.getregentry()
        except ImportError:
            # Imports may fail while we are shutting down
            pass

    codecs.register(_alias_mbcs)
---
Python 3.7, encodings module:
---
(...)
codecs.register(search_function)

if sys.platform == 'win32':
    def _alias_mbcs(encoding):
        try:
            import _winapi
            ansi_code_page = "cp%s" % _winapi.GetACP()
            if encoding == ansi_code_page:
                import encodings.mbcs
                return encodings.mbcs.getregentry()
        except ImportError:
            # Imports may fail while we are shutting down
            pass

    codecs.register(_alias_mbcs)
---
The Python 3.6 and 3.7 "codecs.register(_alias_mbcs)" doesn't work because "search_function()" is tested before and it works for "cpXXX" encodings. My change alters the order in which codec search functions are registered: first the MBCS alias, then the encodings search_function().
In Python 3.5, the alias was only created if Python didn't support the code page.
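The effect of registration order can be reproduced on any platform. The sketch below is illustrative only: the codec name "demo_alias_xyz" and both search functions are hypothetical, and "utf-8" and "latin-1" merely stand in for the codecs each function would return. The lookup consults search functions in registration order and stops at the first one that returns a CodecInfo.

```python
import codecs

calls = []  # records which hypothetical search function was consulted

def first_search(name):
    calls.append(("first", name))
    if name == "demo_alias_xyz":
        return codecs.lookup("utf-8")
    return None

def second_search(name):
    calls.append(("second", name))
    if name == "demo_alias_xyz":
        return codecs.lookup("latin-1")
    return None

# Search functions are consulted in the order they were registered.
codecs.register(first_search)
codecs.register(second_search)
try:
    winner = codecs.lookup("demo_alias_xyz").name
finally:
    # codecs.unregister() exists only in Python 3.10 and later.
    unregister = getattr(codecs, "unregister", None)
    if unregister is not None:
        unregister(first_search)
        unregister(second_search)

consulted = [who for who, name in calls if name == "demo_alias_xyz"]
print(winner, consulted)
```

The second function is never asked about the name, which is the same reason "_alias_mbcs()" is never reached for a "cpXXX" name that "search_function()" already resolves.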
msg412680 - Author: STINNER Victor (vstinner) (Python committer) Date: 2022-02-06 23:10
The alias was created in 2003 to fix bpo-671666:
---
commit 4eab486476c0082087a8460a5ab1064e64cc1a6b
Author: Martin v. Löwis <martin@v.loewis.de>
Date: Mon Mar 3 09:34:01 2003 +0000
    Patch #671666: Alias ANSI code page to "mbcs".
---
In 2003, bpo-671666 was created because Python didn't support the "cp932" encoding, whereas the MBCS codec was available and could be used directly since cp932 was the ANSI code page.
The alias makes it possible to support ANSI code page 932 without implementing it.
But Python got a "cp932" codec the year after:
---
commit 3e2a30692085d32ac63f72b35da39158a471fc68
Author: Hye-Shik Chang <hyeshik@gmail.com>
Date: Sat Jan 17 14:29:29 2004 +0000
    Add CJK codecs support as discussed on python-dev. (SF #873597)

    Several style fixes are suggested by Martin v. Loewis and
    Marc-Andre Lemburg. Thanks!
---
msg412683 - Author: STINNER Victor (vstinner) (Python committer) Date: 2022-02-06 23:13
Python 3.11 supports the following 40 code pages:
* 037
* 273
* 424
* 437
* 500
* 720
* 737
* 775
* 850
* 852
* 855
* 856
* 857
* 858
* 860
* 861
* 862
* 863
* 864
* 865
* 866
* 869
* 874
* 875
* 932
* 949
* 950
* 1006
* 1026
* 1125
* 1140
* 1250
* 1251
* 1252
* 1253
* 1254
* 1255
* 1256
* 1257
* 1258
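The list above can be checked against a given interpreter with codecs.lookup(). This is a minimal sketch; the message does not say how the list was produced, so the check below is an assumption about one way to verify it, not the original method.

```python
import codecs

# Code page numbers copied from the list above.
CODE_PAGES = """
037 273 424 437 500 720 737 775 850 852
855 856 857 858 860 861 862 863 864 865
866 869 874 875 932 949 950 1006 1026 1125
1140 1250 1251 1252 1253 1254 1255 1256 1257 1258
""".split()

found, missing = [], []
for cp in CODE_PAGES:
    try:
        codecs.lookup("cp" + cp)
    except LookupError:
        missing.append(cp)
    else:
        found.append(cp)

print(len(CODE_PAGES), "listed;", len(found), "found; missing:", missing)
```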
msg412691 - Author: Eryk Sun (eryksun) (Python triager) Date: 2022-02-07 00:10
> The Python 3.6 and 3.7 "codecs.register(_alias_mbcs)" doesn't work 
> because "search_function()" is tested before and it works for "cpXXX" 
> encodings.
Isn't the 3.6-3.10 ordering of search_function() and _alias_mbcs() correct as a fallback? In this case, Python doesn't support a cross-platform encoding for the code page. That's why the old implementation of test_mbcs_alias() mocked _winapi.GetACP() to return 123 and then checked that looking up 'cp123' returned the "mbcs" codec.
I'd actually prefer to extend this by implementing _winapi.GetOEMCP() and using "oem" as a fallback for that case.
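The fallback behavior described here, and the idea behind the old test that mocked the ANSI code page as 123, can be sketched without Windows. Everything named below is hypothetical: make_ansi_fallback() is not a CPython function, the get_acp callable stands in for _winapi.GetACP(), and "latin-1" stands in for the "mbcs" codec, which exists only on Windows.

```python
import codecs

def make_ansi_fallback(get_acp, target="latin-1"):
    """Build a search function that aliases 'cp<ACP>' to *target*."""
    def search(name):
        if name == "cp%s" % get_acp():
            return codecs.lookup(target)
        return None
    return search

# Pretend the ANSI code page is 123, for which Python has no codec.
fallback = make_ansi_fallback(lambda: 123)

# Registered after the encodings search function, so it acts as a fallback.
codecs.register(fallback)
try:
    fallback_name = codecs.lookup("cp123").name    # no built-in codec: alias used
    builtin_name = codecs.lookup("cp1252").name    # built-in codec wins
finally:
    # codecs.unregister() exists only in Python 3.10 and later.
    unregister = getattr(codecs, "unregister", None)
    if unregister is not None:
        unregister(fallback)

stand_in_name = codecs.lookup("latin-1").name
print(fallback_name, builtin_name)
```

An "oem" fallback along the lines Eryk suggests would follow the same pattern with a second callable for the OEM code page.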
msg412738 - Author: STINNER Victor (vstinner) (Python committer) Date: 2022-02-07 13:00
I don't think that this fallback is needed anymore. Which Windows code page can be used as ANSI code page which is not already implemented as a Python codec?
msg412777 - Author: Eryk Sun (eryksun) (Python triager) Date: 2022-02-07 17:53
> I don't think that this fallback is needed anymore. Which Windows
> code page can be used as ANSI code page which is not already 
> implemented as a Python codec?
Python has full coverage of the ANSI and OEM code pages in the standard Windows locales, but I don't have any experience with custom (i.e. supplemental or replacement) locales.
https://docs.microsoft.com/en-us/windows/win32/intl/custom-locales 
Here's a simple script to check the standard locales.
    import codecs
    import ctypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    LOCALE_ALL = 0
    LOCALE_WINDOWS = 1
    LOCALE_IDEFAULTANSICODEPAGE = 0x1004
    LOCALE_IDEFAULTCODEPAGE = 0x000B # OEM

    EnumSystemLocalesEx = kernel32.EnumSystemLocalesEx
    GetLocaleInfoEx = kernel32.GetLocaleInfoEx
    GetCPInfoExW = kernel32.GetCPInfoExW

    EnumLocalesProcEx = ctypes.WINFUNCTYPE(ctypes.c_int,
        ctypes.c_wchar_p, ctypes.c_ulong, ctypes.c_void_p)

    class CPINFOEXW(ctypes.Structure):
        _fields_ = (('MaxCharSize', ctypes.c_uint),
                    ('DefaultChar', ctypes.c_ubyte * 2),
                    ('LeadByte', ctypes.c_ubyte * 12),
                    ('UnicodeDefaultChar', ctypes.c_wchar),
                    ('CodePage', ctypes.c_uint),
                    ('CodePageName', ctypes.c_wchar * 260))

    def get_all_locale_code_pages():
        result = []
        seen = set()
        info = (ctypes.c_wchar * 100)()

        @EnumLocalesProcEx
        def callback(locale, flags, param):
            for lctype in (LOCALE_IDEFAULTANSICODEPAGE, LOCALE_IDEFAULTCODEPAGE):
                if (GetLocaleInfoEx(locale, lctype, info, len(info)) and
                        info.value not in ('0', '1')):
                    cp = int(info.value)
                    if cp in seen:
                        continue
                    seen.add(cp)
                    cp_info = CPINFOEXW()
                    if not GetCPInfoExW(cp, 0, ctypes.byref(cp_info)):
                        cp_info.CodePage = cp
                        cp_info.CodePageName = str(cp)
                    result.append(cp_info)
            return True

        if not EnumSystemLocalesEx(callback, LOCALE_WINDOWS, None, None):
            raise ctypes.WinError(ctypes.get_last_error())

        result.sort(key=lambda x: x.CodePage)
        return result

    supported = []
    unsupported = []
    for cp_info in get_all_locale_code_pages():
        cp = cp_info.CodePage
        try:
            codecs.lookup(f'cp{cp}')
        except LookupError:
            unsupported.append(cp_info)
        else:
            supported.append(cp_info)

    if unsupported:
        print('Unsupported:\n')
        for cp_info in unsupported:
            print(cp_info.CodePageName)
        print('\nSupported:\n')
    else:
        print('All Supported:\n')

    for cp_info in supported:
        print(cp_info.CodePageName)
Output:
    All Supported:

    437 (OEM - United States)
    720 (Arabic - Transparent ASMO)
    737 (OEM - Greek 437G)
    775 (OEM - Baltic)
    850 (OEM - Multilingual Latin I)
    852 (OEM - Latin II)
    855 (OEM - Cyrillic)
    857 (OEM - Turkish)
    862 (OEM - Hebrew)
    866 (OEM - Russian)
    874 (ANSI/OEM - Thai)
    932 (ANSI/OEM - Japanese Shift-JIS)
    936 (ANSI/OEM - Simplified Chinese GBK)
    949 (ANSI/OEM - Korean)
    950 (ANSI/OEM - Traditional Chinese Big5)
    1250 (ANSI - Central Europe)
    1251 (ANSI - Cyrillic)
    1252 (ANSI - Latin I)
    1253 (ANSI - Greek)
    1254 (ANSI - Turkish)
    1255 (ANSI - Hebrew)
    1256 (ANSI - Arabic)
    1257 (ANSI - Baltic)
    1258 (ANSI/OEM - Viet Nam)
Some locales are Unicode only (e.g. Hindi-India) or have no OEM code page, which the above code skips by checking for "0" or "1" as the code page value. Windows 10+ allows setting the system locale to a Unicode-only locale, for which it uses UTF-8 (65001) for ANSI and OEM.
The OEM code page matters because the console input and output code pages default to OEM, e.g. for os.device_encoding(). The console's I/O code pages are used in Python by low-level os.read() and os.write(). Note that the console doesn't properly implement using UTF-8 (65001) as the input code page. In this case, input read from the console via ReadFile() or ReadConsoleA() has a null byte in place of each non-ASCII character.
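To see what os.device_encoding() reports on a given system, a small cross-platform sketch like the following may help. The describe_fd() helper is hypothetical. The values printed for stdin and stdout depend on the platform and on whether they are attached to a console or terminal; for a descriptor that is not a terminal, such as a pipe, the documented result is None.

```python
import os
import sys

def describe_fd(fd):
    """Summarize a file descriptor (hypothetical helper for this sketch)."""
    return "fd %d: isatty=%s, device_encoding=%r" % (
        fd, os.isatty(fd), os.device_encoding(fd))

for stream in (sys.stdin, sys.stdout):
    try:
        fd = stream.fileno()
    except (AttributeError, OSError, ValueError):
        continue  # stream is missing or has no underlying descriptor
    print(describe_fd(fd))

# A pipe is never a terminal, so no device encoding is reported for it.
read_fd, write_fd = os.pipe()
try:
    pipe_encoding = os.device_encoding(read_fd)
finally:
    os.close(read_fd)
    os.close(write_fd)
print("pipe:", pipe_encoding)
```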
msg412847 - Author: STINNER Victor (vstinner) (Python committer) Date: 2022-02-08 17:09
I created GH-31218 which basically restores Python 3.10 code but enhances the test.
msg413825 - Author: STINNER Victor (vstinner) (Python committer) Date: 2022-02-23 17:14
commit ccbe8045faf6e63d36229ea4e1b9298572cda126
Author: Victor Stinner <vstinner@python.org>
Date: Tue Feb 22 22:04:07 2022 +0100
    bpo-46659: Fix the MBCS codec alias on Windows (GH-31218)
History
Date                 User      Action  Args
2022-04-11 14:59:55  admin     set     github: 90826
2022-02-23 17:14:28  vstinner  set     status: open -> closed; resolution: fixed; messages: + msg413825; stage: patch review -> resolved
2022-02-08 17:09:32  vstinner  set     messages: + msg412847
2022-02-07 17:53:46  eryksun   set     messages: + msg412777
2022-02-07 13:00:07  vstinner  set     messages: + msg412738
2022-02-07 00:10:02  eryksun   set     nosy: + eryksun; messages: + msg412691
2022-02-06 23:17:16  vstinner  set     keywords: + patch; stage: patch review; pull_requests: + pull_request29345
2022-02-06 23:13:19  vstinner  set     messages: + msg412683
2022-02-06 23:10:38  vstinner  set     messages: + msg412680
2022-02-06 23:06:49  vstinner  create
