
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: mimetypes initialization fails on Windows because of non-Latin characters in registry
Type: behavior
Stage: resolved
Components: Library (Lib), Windows
Versions: Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: tim.golden
Nosy List: Daniel.Szoska, Dmitry.Jemerov, Hugo.Lol, Michał.Pasternak, Roman.Evstifeev, Suzumizaki, Vladimir Iofik, aclover, adamhj, brian.curtin, eric.araujo, exarkun, frankoid, jaraco, kaizhu, loewis, me21, python-dev, quick.es, r.david.murray, shimizukawa, tim.golden, vldmit, vstinner
Priority: normal
Keywords: easy, patch

Created on 2010-07-18 11:54 by Dmitry.Jemerov, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name | Uploaded | Description
9291.patch | Dmitry.Jemerov, 2010-07-23 12:17 |
9291a.patch | Vladimir Iofik, 2010-10-22 06:11 | Issue 9291 patch
sitecustomize.py | me21, 2013-12-26 14:23 |
issue9291-key-utf8.ini | Michał.Pasternak, 2014-02-22 18:32 | Offending REG key in Windows Registry file encoded with utf-8
issue9291-key.reg | Michał.Pasternak, 2014-02-22 18:33 |
issue9291.8.patch | tim.golden, 2014-04-20 14:47 |
Messages (32)
msg110637 - Author: Dmitry Jemerov (Dmitry.Jemerov) Date: 2010-07-18 11:54
On Windows, mimetypes initialization reads the list of MIME types from the Windows registry. It assumes that all characters are Latin-1 encoded, and fails when it's not the case, with the following exception:
Traceback (most recent call last):
  File "mttest.py", line 3, in <module>
    mimetypes.init()
  File "c:\Python27\lib\mimetypes.py", line 355, in init
    db.read_windows_registry()
  File "c:\Python27\lib\mimetypes.py", line 260, in read_windows_registry
    for ctype in enum_types(mimedb):
  File "c:\Python27\lib\mimetypes.py", line 250, in enum_types
    ctype = ctype.encode(default_encoding) # omit in 3.x!
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
This can be reproduced, for example, on a Russian Windows XP installation which has QuickTime installed (QuickTime creates the non-Latin entries in the registry). The following line causes the exception to happen:
import mimetypes; mimetypes.init()
msg110760 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2010-07-19 14:30
I'm guessing this problem doesn't occur in 3.x? If so, the quick fix would be to have the registry code catch UnicodeError instead of UnicodeEncodeError. That may be the correct fix anyway.
The "fun" part of this bug is going to be creating a unit test for it.
msg110881 - Author: Dmitry Jemerov (Dmitry.Jemerov) Date: 2010-07-20 10:11
The problem doesn't happen on Python 3.1.2 because it doesn't have the code in mimetypes that accesses the Windows registry. Haven't tried the 3.2 alphas yet.
msg111288 - Author: Dmitry Jemerov (Dmitry.Jemerov) Date: 2010-07-23 12:17
Patch (suggested fix and unittest) attached.
msg111291 - Author: Dmitry Jemerov (Dmitry.Jemerov) Date: 2010-07-23 12:17
And by the way I've verified that the problem doesn't happen in py3k trunk.
msg111318 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2010-07-23 13:35
And just for clarity: py3k trunk does contain the _winreg code path.
msg113662 - Author: kai zhu (kaizhu) Date: 2010-08-12 07:13
Python 3.1.2 mimetypes initialization also fails on Red Hat Linux:
>>> import http.server
Traceback (most recent call last):
  File "/home/public/i386-redhat-linux-gnu/python/lib/python3.1/http/server.py", line 588, in <module>
    class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):
  File "/home/public/i386-redhat-linux-gnu/python/lib/python3.1/http/server.py", line 764, in SimpleHTTPRequestHandler
    mimetypes.init() # try to read system mime.types
  File "/home/public/i386-redhat-linux-gnu/python/lib/python3.1/mimetypes.py", line 305, in init
    db.readfp(open(file))
  File "/home/public/i386-redhat-linux-gnu/python/lib/python3.1/mimetypes.py", line 209, in readfp
    line = fp.readline()
  File "/home/public/i386-redhat-linux-gnu/bin/../python/lib/python3.1/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3921: ordinal not in range(128)
msg119362 - Author: Vladimir Iofik (Vladimir Iofik) Date: 2010-10-22 06:11
Here is a better patch.
msg119364 - Author: Vladimir Iofik (Vladimir Iofik) Date: 2010-10-22 06:43
UnicodeDecodeError is raised because 'ctype' is already a byte string, so it is first implicitly decoded with the default encoding (which is 'ascii') and then re-encoded. I see no reason for any of this, so I simply removed it. I think Antoine Pitrou (who is the author of these lines) can shed some light on this, but I guess it's just a copy-paste of the code below.
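The implicit decode described above can be imitated in Python 3, where the step has to be written out explicitly. This is only a sketch: `py2_style_encode` is a hypothetical helper that mimics what Python 2 did when `.encode()` was called on a byte string, and the sample key names are made up for illustration.

```python
def py2_style_encode(raw, encoding):
    """Mimic Python 2's str.encode(): decode as ASCII first, then encode."""
    # This is the hidden step: Python 2 decoded the byte string with the
    # default encoding ('ascii') before it could apply the requested encoding.
    text = raw.decode("ascii")
    return text.encode(encoding)


# Hypothetical registry key names, as byte strings in an ANSI code page.
ascii_name = b"text/plain"
non_ascii_name = b"\xe0udio/example"  # starts with a non-ASCII byte

# Pure ASCII names survive the round trip unchanged.
print(py2_style_encode(ascii_name, "cp1251"))

# Non-ASCII names fail in the decode step, whatever target encoding is given.
try:
    py2_style_encode(non_ascii_name, "cp1251")
except UnicodeDecodeError as exc:
    print("failed while decoding:", exc)
```

The target encoding never comes into play for the failing name, which is why changing `default_encoding` alone cannot help.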
msg177044 - Author: STINNER Victor (vstinner) (Python committer) Date: 2012-12-06 14:53
> File "c:\Python27\lib\mimetypes.py", line 250, in enum_types
> ctype = ctype.encode(default_encoding) # omit in 3.x!
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
The encoding is wrong. We should read the registry using Unicode, or at least use the correct encoding. The correct encoding is the ANSI code page: sys.getfilesystemencoding().
Can you please try with: default_encoding = sys.getfilesystemencoding() ?
> python 3.1.2 mimetypes initialization also fails in redhat linux: (...)
In Python 3.3, MimeTypes.read() opens files in UTF-8. Issue #13025 explains why UTF-8 is used instead of the locale encoding or another encoding.
I see that read_mime_types() uses the locale encoding. It looks like a bug; it should also use UTF-8.
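A minimal Python 3 sketch of the difference between reading a mime.types file as UTF-8 and as ASCII. The sample file content, the `.exm` extension and the helper `load_mime_file` are hypothetical; only `mimetypes.MimeTypes`, `readfp()` and `guess_type()` are standard library calls.

```python
import mimetypes
import os
import tempfile

# Hypothetical mime.types content with a non-ASCII character in a comment.
SAMPLE = "# types locaux \u00e9\napplication/x-example exm\n"


def load_mime_file(path, encoding="utf-8"):
    """Parse a mime.types-style file, opened with an explicit encoding."""
    db = mimetypes.MimeTypes()
    with open(path, encoding=encoding) as fp:
        db.readfp(fp)
    return db


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mime.types")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(SAMPLE)

    # Reading as UTF-8 works and the extra type is registered.
    db = load_mime_file(path)
    guessed = db.guess_type("report.exm")[0]
    print(guessed)

    # Reading the same file as ASCII fails on the first non-ASCII byte.
    try:
        load_mime_file(path, encoding="ascii")
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True
    print("ASCII read failed:", ascii_failed)
```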
msg202755 - Author: Tim Golden (tim.golden) (Python committer) Date: 2013-11-13 14:54
Only just been reminded of this one; it's possible that it's been superseded by Issue15207. At the least, that issue resulted in a code change in this area of mimetypes. I'll have a look later.
msg202840 - Author: adamhj (adamhj) Date: 2013-11-14 13:26
> The encoding is wrong. We should read the registry using Unicode, or at least use the correct encoding. The correct encoding is the ANSI code page: sys.getfilesystemencoding().
> Can you please try with: default_encoding = sys.getfilesystemencoding() ?
This does not work. In fact, it doesn't matter what default_encoding is. The variable ctype, which is returned by _winreg.EnumKey(), is a byte string (b'blahblah'), at least on my computer (win2k3sp2, Python 2.7.6). Because the interpreter is asked to encode a byte string, it first tries to convert the byte string to a unicode string by implicitly calling decode with the 'ascii' encoding, hence the UnicodeDecodeError.
The variable ctype, which is read from the registry key name, can be decoded correctly with sys.getfilesystemencoding() (which returns 'mbcs'), but what we actually need is a byte string, so there should be neither encoding nor decoding here.
If there is a case where _winreg.EnumKey() returns a unicode string, then a type check should be added before the encode. Or maybe the return type of _winreg.EnumKey() differs between 2.x and 3.x?
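The type check suggested above could look roughly like the following. It is written for Python 3, where `bytes` and `str` play the roles of Python 2's `str` and `unicode`; `as_byte_string` is a hypothetical helper, not code from any of the attached patches.

```python
def as_byte_string(name, encoding):
    """Return name as bytes, encoding it only when it is a text string."""
    if isinstance(name, bytes):
        # Already a byte string: leave it alone, no decode/encode round trip.
        return name
    return name.encode(encoding)


# A byte string with a non-ASCII byte passes through untouched.
print(as_byte_string(b"\xe0bc", "cp1251"))

# A text string is encoded with the requested code page.
print(as_byte_string("BDATuner.Sk\u0142adniki", "cp1250"))
```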
msg206494 - Author: Suzumizaki (Suzumizaki) Date: 2013-12-18 04:56
The installation of setuptools may fail on any Windows machine because of this bug. I would like the priority of this issue to be raised...
Installing setuptools failed with Python 2.7.6 on my machine (Windows 8.1 Pro Japanese Edition, 64-bit), but there was no problem with either Python 2.7.4 or Python 3.3.3.
msg206510 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2013-12-18 12:32
OK, that means the issue 15207 fix didn't fix it, since that's in 2.7.6.
msg206528 - Author: Tim Golden (tim.golden) (Python committer) Date: 2013-12-18 15:58
I'll try to look at this soonish. Thanks for bringing it back to the surface.
msg206540 - Author: STINNER Victor (vstinner) (Python committer) Date: 2013-12-18 16:35
Issue #20017 has been marked as a duplicate of this issue. Copy of the message:
Running Windows 8 (64-bit) and Python 2.7.6 (64-bit).
> python -m SimpleHTTPServer
Traceback (most recent call last):
  File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Python27\lib\SimpleHTTPServer.py", line 27, in <module>
    class SimpleHTTPRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):
  File "C:\Python27\lib\SimpleHTTPServer.py", line 208, in SimpleHTTPRequestHandler
    mimetypes.init() # try to read system mime.types
  File "C:\Python27\lib\mimetypes.py", line 358, in init
    db.read_windows_registry()
  File "C:\Python27\lib\mimetypes.py", line 258, in read_windows_registry
    for subkeyname in enum_types(hkcr):
  File "C:\Python27\lib\mimetypes.py", line 249, in enum_types
    ctype = ctype.encode(default_encoding) # omit in 3.x!
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 2: ordinal not in range(128)
msg206579 - Author: Takayuki SHIMIZUKAWA (shimizukawa) Date: 2013-12-19 04:58
This issue affects Mercurial too.
http://bz.selenic.com/show_bug.cgi?id=3624
msg206938 - Author: Alexandr Zarubkin (me21) Date: 2013-12-26 14:23
An alternative solution, which worked for me, is:
add a file named sitecustomize.py to the Lib\site-packages folder.
The contents of the file:
import sys
sys.setdefaultencoding("cp1251")
The default encoding should be determined for every localized Windows version.
Also, when creating virtual environments, the same file should be placed in site-packages folder of virtual environment being created prior to installing setuptools in it.
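One possible way to avoid hard-coding "cp1251" in such a sitecustomize.py is to ask the locale module for the preferred encoding. This is only a sketch of the workaround, not a recommended fix: `pick_default_encoding` is a hypothetical helper, and `sys.setdefaultencoding` exists only in Python 2 (and only while site initialization is running), so the call is guarded.

```python
import codecs
import locale
import sys


def pick_default_encoding(fallback="ascii"):
    """Return the locale's preferred encoding, or fallback if it is unknown."""
    name = locale.getpreferredencoding(False) or fallback
    try:
        # Normalize the name and make sure Python has a codec for it.
        return codecs.lookup(name).name
    except LookupError:
        return fallback


encoding = pick_default_encoding()

# Python 2 only: the function is absent in Python 3, so this is skipped there.
if hasattr(sys, "setdefaultencoding"):
    sys.setdefaultencoding(encoding)
```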
msg207076 - Author: Jason R. Coombs (jaraco) (Python committer) Date: 2013-12-29 14:58
The bug as reported against setuptools: https://bitbucket.org/pypa/setuptools/issue/127/unicodedecodeerror-when-install-in-windows
msg211921 - Author: Michał Pasternak (Michał.Pasternak) Date: 2014-02-22 12:55
I just hit this bug on 2.7.6, running on Polish WinXP (I need to build some packages there; I hope I'll avoid a nasty py2exe bug). Is there any reason this is not fixed yet? Do you need any assistance?
msg211931 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2014-02-22 17:28
Michał: Can you please report the exact registry key and value that is causing the problem? It's difficult to test a patch if one is not able to reproduce the problem.
Of the patches suggested: do any of them fix the problem for you? If so, which one?
I personally find Vladimir's patch more plausible (EnumKey gives bytes objects in 2.x, so it is pointless to apply .encode to them). The introduction of the count() call is unrelated, though, and should be omitted from a bug fix.
msg211932 - Author: Daniel Szoska (Daniel.Szoska) Date: 2014-02-22 17:53
Martin: I had the same problem after upgrading to 2.7.6.
System here: German XP 32 Bit
I used the solution from Alexandr with sitecustomize.py (with cp1252) and it works fine for me.
msg211934 - Author: Michał Pasternak (Michał.Pasternak) Date: 2014-02-22 18:32
Martin: the problematic key is "[HKEY_CLASSES_ROOT\BDATuner.Składniki]". I am pasting its name because I suppose that, as bugs.python.org is utf-8, the special characters will be pasted properly.
Attached you will find a .REG file (a plaintext Windows Registry Editor file). It is encoded with the CP-1250 charset (I believe). In case of any confusion, I also include the same file encoded with utf-8.
If you add this information to your Windows registry, you should be able to reproduce this bug simply by using "pip install" on anything: "pip install wokkel", for example.
msg211935 - Author: Michał Pasternak (Michał.Pasternak) Date: 2014-02-22 18:33
Another REG file, encoded with CP1250, I believe.
msg211936 - Author: Michał Pasternak (Michał.Pasternak) Date: 2014-02-22 18:34
As for the fix, sitecustomize.py works for me too, but I believe that adding sitecustomize.py for new Python installations would probably do more harm than good. I'll check those 2 patches and let you know.
msg211938 - Author: Michał Pasternak (Michał.Pasternak) Date: 2014-02-22 18:41
9291.patch works for me too, but I am unsure about its idea. Silently ignoring non-ASCII registry entries: does that sound like a good idea? Maybe. Is it pythonic? I doubt it.
I don't exactly understand what 9291a.patch is doing. To me it looks like a re-iteration of the first patch. I have not tested it.
msg216571 - Author: Tim Golden (tim.golden) (Python committer) Date: 2014-04-16 19:58
The attached patch issue9291.7.patch (which is essentially an amalgam of 9291.patch & 9291a.patch with some tweaks of my own) does appear to solve the issue. My Windows setup is UK, so if any of the people still watching this issue could test against a non-English Windows, that would be useful.
Even this fix does leave some room for encoding mismatches between the stored values (mbcs encoded) and any string passed to guess_type. But it's not clear how that should be handled, and at least it doesn't crash out on .init.
msg216903 - Author: Tim Golden (tim.golden) (Python committer) Date: 2014-04-20 14:47
Another version of the patch: this one, in addition to removing the unnecessary encodes, also does the check for extensions before attempting to open the registry key, and narrows down the try-catch block to just the attempt to read the "Content Type" value.
This does mean that if any process is unable to read HKCR or its subkeys the mimetypes.init will fail. Frankly, I can't see how that could happen, but if anyone feels strongly enough I can add extra guards so it fails silently.
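The shape of that change (filter on the extension first, then guard only the value read) can be sketched without touching a real registry. Everything below is hypothetical: the function names, the fake registry dictionary and the use of OSError as the failure type are assumptions for illustration, not the contents of issue9291.8.patch.

```python
def collect_registry_types(subkey_names, read_content_type):
    """Map extensions to MIME types, skipping keys that cannot be read."""
    types = {}
    for name in subkey_names:
        # Check for an extension before doing any registry work at all.
        if not name.startswith("."):
            continue
        # Only the read of the "Content Type" value is guarded.
        try:
            mimetype = read_content_type(name)
        except OSError:
            continue
        types[name] = mimetype
    return types


# Stand-in for HKEY_CLASSES_ROOT: None means "no Content Type value".
FAKE_REGISTRY = {
    ".txt": "text/plain",
    ".bin": None,
    "BDATuner.Sk\u0142adniki": "never read",
}


def fake_read(name):
    """Pretend to query the "Content Type" value of a subkey."""
    value = FAKE_REGISTRY[name]
    if value is None:
        raise OSError("no Content Type value")
    return value


result = collect_registry_types(FAKE_REGISTRY, fake_read)
print(result)
```

With this ordering, a key such as the non-ASCII one above is skipped by the extension check and is never passed to the reader at all.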
msg217137 - Author: stoyanov (quick.es) Date: 2014-04-24 20:07
Alternative temporary solution
def enum_types(mimedb):
    ....
    try:
        ctype = ctype.encode(default_encoding) # omit in 3.x!
    except UnicodeEncodeError:
        pass
    except Exception: #<--
        pass #<--
    else:
        yield ctype
msg217271 - Author: Roundup Robot (python-dev) (Python triager) Date: 2014-04-27 15:37
New changeset 18cfc2a42772 by Tim Golden in branch '2.7':
Issue #9291 Do not attempt to re-encode mimetype data read from registry in ANSI mode. Initial patches by Dmitry Jemerov & Vladimir Iofik
http://hg.python.org/cpython/rev/18cfc2a42772
msg217273 - Author: Roundup Robot (python-dev) (Python triager) Date: 2014-04-27 15:39
New changeset 0c8a7299c7e3 by Tim Golden in branch '2.7':
Issue #9291 Add ACKS & NEWS
http://hg.python.org/cpython/rev/0c8a7299c7e3
msg220175 - Author: Jean-Paul Calderone (exarkun) (Python committer) Date: 2014-06-10 17:01
Please see http://bugs.python.org/issue21652 for a regression introduced by this change.
History
Date | User | Action | Args
2022-04-11 14:57:03 | admin | set | github: 53537
2014-06-10 17:01:14 | exarkun | set | nosy: + exarkun; messages: + msg220175
2014-04-29 17:27:43 | tim.golden | set | status: open -> closed; assignee: tim.golden; resolution: fixed; stage: patch review -> resolved
2014-04-27 15:39:48 | python-dev | set | messages: + msg217273
2014-04-27 15:37:07 | python-dev | set | nosy: + python-dev; messages: + msg217271
2014-04-24 20:07:48 | quick.es | set | nosy: + quick.es; messages: + msg217137
2014-04-20 14:47:11 | tim.golden | set | files: + issue9291.8.patch; messages: + msg216903
2014-04-20 14:43:31 | tim.golden | set | files: - issue9291.7.patch
2014-04-16 19:58:58 | tim.golden | set | files: + issue9291.7.patch; messages: + msg216571
2014-02-22 18:41:36 | Michał.Pasternak | set | messages: + msg211938
2014-02-22 18:34:10 | Michał.Pasternak | set | messages: + msg211936
2014-02-22 18:33:09 | Michał.Pasternak | set | files: + issue9291-key.reg; messages: + msg211935
2014-02-22 18:32:54 | Michał.Pasternak | set | files: + issue9291-key-utf8.ini; messages: + msg211934
2014-02-22 17:53:28 | Daniel.Szoska | set | messages: + msg211932
2014-02-22 17:28:45 | loewis | set | nosy: + loewis; messages: + msg211931
2014-02-22 12:55:23 | Michał.Pasternak | set | nosy: + Michał.Pasternak; messages: + msg211921
2014-01-21 13:07:48 | Daniel.Szoska | set | nosy: + Daniel.Szoska
2013-12-29 14:58:26 | jaraco | set | nosy: + jaraco; messages: + msg207076
2013-12-26 14:23:23 | me21 | set | files: + sitecustomize.py; nosy: + me21; messages: + msg206938
2013-12-19 04:58:04 | shimizukawa | set | nosy: + shimizukawa; messages: + msg206579
2013-12-18 16:35:35 | vstinner | set | nosy: + Hugo.Lol; messages: + msg206540
2013-12-18 16:34:51 | vstinner | link | issue20017 superseder
2013-12-18 15:58:07 | tim.golden | set | messages: + msg206528
2013-12-18 12:32:02 | r.david.murray | set | messages: + msg206510
2013-12-18 04:56:57 | Suzumizaki | set | nosy: + Suzumizaki; messages: + msg206494
2013-11-14 13:26:38 | adamhj | set | messages: + msg202840
2013-11-13 14:54:06 | tim.golden | set | nosy: + tim.golden; messages: + msg202755
2013-11-13 13:46:46 | r.david.murray | set | nosy: + adamhj
2013-11-13 13:46:26 | r.david.murray | link | issue19567 superseder
2012-12-06 14:53:31 | vstinner | set | messages: + msg177044
2012-12-06 13:39:47 | Roman.Evstifeev | set | nosy: + Roman.Evstifeev
2012-12-05 17:23:19 | amaury.forgeotdarc | link | issue16617 superseder
2012-01-29 22:27:56 | r.david.murray | link | issue13906 superseder
2011-08-30 22:53:16 | amaury.forgeotdarc | link | issue12865 superseder
2011-03-08 14:30:56 | pitrou | set | nosy: + vstinner
2011-03-08 13:50:34 | frankoid | set | nosy: + frankoid
2010-11-27 23:12:41 | ned.deily | unlink | issue10551 superseder
2010-11-27 19:34:18 | ned.deily | link | issue10551 superseder
2010-11-22 20:25:22 | r.david.murray | set | nosy: + aclover
2010-11-22 20:24:33 | r.david.murray | link | issue10490 superseder
2010-10-22 06:43:11 | Vladimir Iofik | set | messages: + msg119364
2010-10-22 06:11:04 | Vladimir Iofik | set | files: + 9291a.patch; nosy: + Vladimir Iofik; messages: + msg119362
2010-10-15 13:45:40 | eric.araujo | set | nosy: + eric.araujo
2010-10-15 12:53:03 | r.david.murray | set | nosy: + vldmit
2010-10-15 12:52:38 | r.david.murray | link | issue10113 superseder
2010-08-12 07:13:44 | kaizhu | set | nosy: + kaizhu; messages: + msg113662
2010-07-23 13:36:27 | brian.curtin | set | nosy: + brian.curtin
2010-07-23 13:35:12 | r.david.murray | set | messages: + msg111318; stage: test needed -> patch review
2010-07-23 12:17:43 | Dmitry.Jemerov | set | messages: + msg111291
2010-07-23 12:17:13 | Dmitry.Jemerov | set | files: + 9291.patch; keywords: + patch; messages: + msg111288
2010-07-20 10:11:50 | Dmitry.Jemerov | set | messages: + msg110881
2010-07-19 14:30:25 | r.david.murray | set | nosy: + r.david.murray; messages: + msg110760; keywords: + easy; type: behavior; stage: test needed
2010-07-18 11:54:53 | Dmitry.Jemerov | create |
