Issue 6058: Add cp65001 to encodings/aliases.py

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50308

classification

Title:	Add cp65001 to encodings/aliases.py
Type:	enhancement
Stage:	patch review
Components:	Library (Lib), Unicode, Windows
Versions:	Python 3.2

process

Dependencies:
Superseder:
Status:	closed
Resolution:	not a bug
Assigned To:
Nosy List:	David.Sankel, davidsarah, ezio.melotti, lemburg, loewis, pitrou, skrah, tzot, vstinner
Priority:	high
Keywords:	patch

Created on 2009-05-19 00:21 by tzot, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
alias_cp65001.diff	tzot, 2009-05-19 00:24	One-line addition of cp65001 aliased to utf_8
testnetcodecs.py	lemburg, 2009-12-07 22:41
gen65001.c	skrah, 2009-12-22 13:24	Generate multibyte characters with cp65001
check65001.py	skrah, 2009-12-22 13:24	Check output of gen65001.exe
export-encodings.py	lemburg, 2010-01-13 19:15
check-encodings.py	lemburg, 2010-01-13 19:16

Messages (20)
msg88060	Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) *	Date: 2009-05-19 00:21
Add 'cp65001' (Microsoft term for UTF-8) as an alias to 'utf_8'
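For readers unfamiliar with how codec names resolve, the effect of such an alias can be imitated at run time with a codec search function. This is only a sketch: the name `x_cp65001_demo` and the helper `_alias_search` are hypothetical, chosen so as not to collide with any real codec, and the actual patch edited the alias table in encodings/aliases.py rather than registering anything.

```python
import codecs


def _alias_search(name):
    """Resolve the hypothetical alias to the existing UTF-8 codec."""
    # The codec machinery passes the requested name in normalized
    # (lower-case) form; returning None lets other search functions try.
    if name == "x_cp65001_demo":
        return codecs.lookup("utf-8")
    return None


codecs.register(_alias_search)

text = "κόσμος"
encoded = text.encode("x_cp65001_demo")
decoded = encoded.decode("x_cp65001_demo")
print(encoded == text.encode("utf-8"), decoded == text)
```

An alias of this kind makes the two names byte-for-byte interchangeable, which is why the rest of the discussion turns on whether Microsoft's code page 65001 really behaves identically to Python's UTF-8 codec.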
msg96065	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2009-12-07 18:57
Could you provide some official reference defining the alias? Thanks.
msg96066	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2009-12-07 19:07
Never mind, I found this reference:

http://msdn.microsoft.com/en-us/library/system.text.encoding(VS.80).aspx

Looks like we could add a few more aliases for other encodings as well.
msg96076	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2009-12-07 21:19
> http://msdn.microsoft.com/en-us/library/system.text.encoding(VS.80).aspx
>
> Looks like we could add a few more aliases for other encodings as well.

I wouldn't trust this table. Microsoft is on record as implementing the code pages with slight variations compared to other references for some encodings (in particular the Asian ones). So unless there is an actual documented need for a certain alias (and preferably a demonstration that Microsoft's interpretation of the code page is the same as Python's), I would advise against adding such aliases.
msg96077	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2009-12-07 21:41
Martin v. Löwis wrote:
> I wouldn't trust this table. Microsoft is on record of implementing the
> code pages with slight variations compared to other references for some
> encodings (in particular the Asian ones). So unless there is an actual
> documented need for a certain alias (and preferably a demonstration that
> Microsoft's interpretation of the code page is the same as Python's),
> I would advise against adding such aliases.

Fair enough. Could someone with some IronPython/.NET foo check whether the code pages are the same as the Python codecs? The above page has some sample code to get started, and IronPython provides easy access to both the .NET codecs and the Python ones.

Thanks,
--
Marc-Andre Lemburg
msg96080	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2009-12-07 22:41
Here's a script for IronPython 2.6 that checks a few encoders. Since IronPython doesn't appear to come with the full set of Python codecs, and it's also not clear whether the implemented codecs actually match the default Python ones, I'm not sure how reliable this output is. It's probably better to dump the encoded data to a file and compare against a CPython run.

Anyway, here's the output:

    Code Page 65000 vs. encoding 'utf-7'
    0 errors
    Code Page 65001 vs. encoding 'utf-8'
    0 errors
    Code Page 1200 vs. encoding 'utf-16-le'
    0 errors
    Code Page 1201 vs. encoding 'utf-16-be'
    0 errors
    Code Page 28591 vs. encoding 'iso-8859-1'
    0 errors
msg96758	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2009-12-21 16:26
(I tried running your script under IronPython 2.6 with Mono but I got a bunch of errors; since I don't know IronPython at all I can't really investigate)
msg96796	Author: Stefan Krah (skrah) * (Python committer)	Date: 2009-12-22 13:24
I wrote a small C application that converts all possible wchar_t to multibyte strings, using code page 65001.

Usage:

    cl.exe gen65001.c
    python check65001.py

Except for the newline character and a sequence from 55296-57343, this code page matches UTF-8. Note, however, that cp65001 is a pseudo code page:

http://www.postgresql.org/docs/faqs.FAQ_windows.html#2.6

For instance, setlocale will not work:

http://blogs.msdn.com/michkap/archive/2006/03/13/550191.aspx
msg96807	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2009-12-22 18:59
This report is really about the issues reported in #1602 and #7441, i.e. where console output fails if the terminal encoding is 65001. Rather than adding the alias, I would prefer to find out why terminal output fails in that code page.
msg96809	Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) *	Date: 2009-12-22 19:23
Re Martin's question, I can offer the indirect wisdom of Michael Kaplan in this blog post:

http://blogs.msdn.com/michkap/archive/2008/03/18/8306597.aspx

where he mentions that the easiest way to output Unicode text in the Windows console is:

    #include <fcntl.h>  /* _O_U16TEXT */
    #include <io.h>     /* _setmode */
    #include <stdio.h>  /* _fileno, stdout, wprintf */

    int main(void)
    {
        /* Switch stdout to UTF-16 text mode before any wide output. */
        _setmode(_fileno(stdout), _O_U16TEXT);
        wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
        return 0;
    }

with _setmode being the special call needed. I haven't tested with any _O_U8TEXT (if such a thing exists); I don't do Windows anymore, so I can't provide a patch.

It also seems that Python, when stdin/stdout/stderr is under control of a Windows console, doesn't use plain *printf functions. The example code I offered in one of the other issues (dumb stdout doing plain .write as UTF-8) runs and displays fine.
msg96815	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2009-12-22 21:16
I also wonder whether stdin/stdout/stderr should be streams on Windows that use WriteConsole instead of WriteFile. Then the entire issue of console CP would go away for Unicode output.
msg97731	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2010-01-13 19:15
I created two scripts for exporting the IronPython findings and checking them in CPython. These are the results:

    Checking code Page 28591 against encoding 'iso-8859-1' using file 'iso-8859-1.map'
    0 errors
    Checking code Page 28592 against encoding 'iso-8859-2' using file 'iso-8859-2.map'
    0 errors
    Checking code Page 28593 against encoding 'iso-8859-3' using file 'iso-8859-3.map'
    0 errors
    Checking code Page 28594 against encoding 'iso-8859-4' using file 'iso-8859-4.map'
    0 errors
    Checking code Page 28595 against encoding 'iso-8859-5' using file 'iso-8859-5.map'
    0 errors
    Checking code Page 1201 against encoding 'utf-16-be' using file 'utf-16-be.map'
    2048 errors
    Checking code Page 1200 against encoding 'utf-16-le' using file 'utf-16-le.map'
    2048 errors
    Checking code Page 65000 against encoding 'utf-7' using file 'utf-7.map'
    21 errors
    Checking code Page 65001 against encoding 'utf-8' using file 'utf-8.map'
    2048 errors

Result: We can add aliases for the various ISO mappings, but not for the UTF ones. .NET encodes the surrogates differently than Python's codecs, and it also produces different results for UTF-7 than Python's codec.
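The figure of 2048 is exactly the size of the surrogate range U+D800..U+DFFF (55296..57343 in decimal, the same sequence flagged in the gen65001 check), which fits the explanation that the mismatches are the surrogates. The sketch below only illustrates that arithmetic and how a current CPython 3 treats lone surrogates; it does not reproduce the attached scripts, and the Python version used for the original comparison may have behaved differently.

```python
# Surrogate code points: U+D800..U+DFFF, i.e. 55296..57343 in decimal.
SURROGATES = range(0xD800, 0xE000)

rejected = 0
for code_point in SURROGATES:
    try:
        # Error handling is "strict" unless another handler is named.
        chr(code_point).encode("utf-8")
    except UnicodeEncodeError:
        rejected += 1

# The "surrogatepass" handler emits a three-byte sequence instead of raising.
passed = chr(0xD800).encode("utf-8", "surrogatepass")

print(len(SURROGATES), rejected, passed)
```

Any two encoders that disagree on how to handle these code points (reject, replace, or pass through) will differ on all 2048 of them and on nothing else.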
msg97732	Author: Marc-Andre Lemburg (lemburg) * (Python committer)	Date: 2010-01-13 19:18
What we could do is add new codecs based on the .NET tables for cp65000 et al. However, before doing this, I'd like to know where these code page settings can occur and what exact names are used for them. If they only appear in .NET and IronPython, I don't think it's worth adding extra codecs for the MS UTF variants.
msg106274	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-05-22 00:03
Would it be possible to implement a "cp65001" codec in Python using MultiByteToWideChar() / WideCharToMultiByte() with codepage=CP_UTF8?
msg119440	Author: David-Sarah Hopwood (davidsarah)	Date: 2010-10-23 16:10
This problem causes os.getcwdu() to fail when the console code page is set to 65001 (always, I think):

    t:\>ver
    Microsoft Windows [Version 6.0.6002]

    t:\>chcp
    Active code page: 65001

    t:\>python -c "import os; print os.getcwdu()"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    LookupError: unknown encoding: cp65001

    t:\>chcp 1252
    Active code page: 1252

    t:\>python -c "import os; print os.getcwdu()"
    t:\

Incidentally, I don't agree that this codepage needs to be distinguished from UTF-8. The deviations in the Microsoft codec are just their bugs. There is only one correct way to encode/decode UTF-8, and cp65001 is supposed to be UTF-8 according to Microsoft (e.g. http://msdn.microsoft.com/en-us/library/86hf4sb8%28en-US,VS.80%29.aspx ).
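The LookupError in the session above is what CPython raises whenever a codec name cannot be resolved. As a rough illustration of guarding against that in application code, here is a minimal sketch; the helper name `resolve_encoding` and the choice of UTF-8 as a fallback are assumptions made for the example, not something proposed in this issue.

```python
import codecs


def resolve_encoding(name, fallback="utf-8"):
    """Return the canonical codec name, or the fallback if it is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Raised for any encoding name the codec registry cannot resolve.
        return fallback


print(resolve_encoding("utf8"))
print(resolve_encoding("no_such_codec_xyz"))
```

Whether falling back silently is acceptable depends on the application; it hides the configuration problem rather than fixing it.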
msg119441	Author: David-Sarah Hopwood (davidsarah)	Date: 2010-10-23 16:13
I said: "There is only one correct way to encode/decode UTF-8". This is true modulo differences in the treatment of initial byte order marks.
msg119444	Author: David-Sarah Hopwood (davidsarah)	Date: 2010-10-23 16:25
I meant to say that the os.getcwdu() test in msg119440 was done with Windows native Python 2.6.2.
msg119447	Author: David-Sarah Hopwood (davidsarah)	Date: 2010-10-23 16:40
Oops, false alarm. python -c "import os; print repr(os.getcwdu())" works as expected, so the exception is part of issue 1602. (My comment about there being no need to distinguish this codepage from UTF-8 stands.)
msg120712	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-11-08 04:11
Different tests proved that cp65001 cannot be set as an alias to utf-8, and that's why I'm closing this issue.

Anyway, I don't think that cp65001 is configured by default on any Windows setup. It is only set by the user, using the chcp command, to try to display Unicode characters in the Windows console, but it is not possible to display any Unicode character in this console (see issue #1602). The chcp command should also not be used in the Windows console, because it does not only change the ANSI code page: it also changes the console code page, which is wrong (the console still expects text encoded to the previous code page).

It is possible to implement a codec for cp65001 using the existing utf-8 codec in surrogatepass mode, or by using MultiByteToWideChar() / WideCharToMultiByte() with codepage=CP_UTF8. But I don't think that we need cp65001 at all.

If you need cp65001 for a good reason and you would like to implement a cp65001 Python codec, open a new issue.

If you consider that we should use _O_U8TEXT or _O_U16TEXT, open another new issue. _O_U8TEXT or _O_U16TEXT might improve Unicode support if Python output is redirected to a pipe, but I don't think that it would help to display Unicode characters in the Windows console. I also fear that it would break existing code and any function not aware of this special mode.
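The first option mentioned above, reusing the utf-8 codec in surrogatepass mode, might look roughly like the following. This is a sketch under stated assumptions, not the codec that was later added: the name `x_cp65001_surrogatepass` and the helper functions are hypothetical, the caller's error handler is deliberately ignored to keep the example short, and no claim is made that the output matches Windows byte for byte.

```python
import codecs

_UTF8 = codecs.lookup("utf-8")


def _encode(text, errors="strict"):
    # Simplification: always use "surrogatepass", whatever the caller asked for.
    return _UTF8.encode(text, "surrogatepass")


def _decode(data, errors="strict"):
    return _UTF8.decode(data, "surrogatepass")


def _search(name):
    """Codec search function for the hypothetical codec name."""
    if name != "x_cp65001_surrogatepass":
        return None
    return codecs.CodecInfo(name=name, encode=_encode, decode=_decode)


codecs.register(_search)

lone_surrogate = "\ud800"
raw = lone_surrogate.encode("x_cp65001_surrogatepass")
round_trip = raw.decode("x_cp65001_surrogatepass")
print(raw, round_trip == lone_surrogate)
```

Unlike a plain alias, a separate codec of this kind leaves the strict behaviour of utf-8 itself untouched.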
msg146467	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2011-10-26 23:48
I added a cp65001 codec to Python 3.3: see issue #13216.

History
Date	User	Action	Args
2022-04-11 14:56:49	admin	set	github: 50308
2011-10-26 23:48:07	vstinner	set	messages: + msg146467
2010-11-08 04:11:51	vstinner	set	status: open -> closed resolution: not a bug messages: + msg120712
2010-11-04 03:14:06	michael.foord	set	nosy: - michael.foord
2010-11-04 03:08:09	David.Sankel	set	nosy: + David.Sankel
2010-10-23 16:40:48	davidsarah	set	messages: + msg119447
2010-10-23 16:25:09	davidsarah	set	messages: + msg119444
2010-10-23 16:13:06	davidsarah	set	messages: + msg119441
2010-10-23 16:10:54	davidsarah	set	nosy: + davidsarah messages: + msg119440
2010-07-10 05:32:25	terry.reedy	set	versions: - Python 2.6, Python 3.1, Python 2.7
2010-05-22 00:03:13	vstinner	set	nosy: + vstinner messages: + msg106274
2010-01-13 19:18:25	lemburg	set	messages: + msg97732
2010-01-13 19:16:09	lemburg	set	files: + check-encodings.py
2010-01-13 19:15:59	lemburg	set	files: + export-encodings.py
2010-01-13 19:15:18	lemburg	set	messages: + msg97731
2010-01-13 07:34:01	pitrou	set	priority: high stage: patch review
2009-12-22 21:16:33	loewis	set	messages: + msg96815
2009-12-22 19:23:43	tzot	set	messages: + msg96809
2009-12-22 18:59:23	loewis	set	messages: + msg96807
2009-12-22 13:24:58	skrah	set	files: + check65001.py
2009-12-22 13:24:18	skrah	set	files: + gen65001.c nosy: + skrah messages: + msg96796
2009-12-21 16:26:54	pitrou	set	nosy: + pitrou messages: + msg96758
2009-12-07 22:41:41	lemburg	set	files: + testnetcodecs.py messages: + msg96080
2009-12-07 21:57:46	pitrou	set	nosy: + michael.foord
2009-12-07 21:41:50	lemburg	set	messages: + msg96077
2009-12-07 21:19:15	loewis	set	messages: + msg96076
2009-12-07 19:07:45	lemburg	set	messages: + msg96066
2009-12-07 18:58:00	lemburg	set	messages: + msg96065
2009-12-05 11:25:45	flox	set	versions: + Python 2.6, Python 3.1, Python 3.2
2009-05-19 07:52:47	pitrou	set	nosy: + lemburg, loewis
2009-05-19 00:27:45	ezio.melotti	set	nosy: + ezio.melotti
2009-05-19 00:24:04	tzot	set	files: + alias_cp65001.diff
2009-05-19 00:23:03	tzot	set	files: - alias_cp65001.diff
2009-05-19 00:21:57	tzot	set	components: + Windows
2009-05-19 00:21:32	tzot	create
