
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: zipfile.extractall fails in Posix shell with utf-8 filename
Type: behavior Stage:
Components: Extension Modules Versions: Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Laurent.Mazuel, cheryl.sabella, ncoghlan, r.david.murray, serhiy.storchaka, vstinner
Priority: normal Keywords:

Created on 2014-01-21 15:05 by Laurent.Mazuel, last changed 2022-04-11 14:57 by admin.

Files
File name     Uploaded                          Description
test_ut8.zip  Laurent.Mazuel, 2014-01-21 15:05  Zip where filenames are in UTF-8
Messages (10)
msg208648 - (view) Author: Laurent Mazuel (Laurent.Mazuel) Date: 2014-01-21 15:05
Hello,
Given a zip file that contains UTF-8 filenames (such as the uploaded zip file), the following code fails if launched in a POSIX shell.
>>> import zipfile
>>> with zipfile.ZipFile("test_ut8.zip") as fd:
...     fd.extractall()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/opt/python/3.3/lib/python3.3/zipfile.py", line 1225, in extractall
    self.extract(zipinfo, path, pwd)
  File "/opt/python/3.3/lib/python3.3/zipfile.py", line 1213, in extract
    return self._extract_member(member, path, pwd)
  File "/opt/python/3.3/lib/python3.3/zipfile.py", line 1276, in _extract_member
    open(targetpath, "wb") as target:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-14: ordinal not in range(128)
With shell:
$ locale
LANG=POSIX
...
But the filesystem is not encoding-dependent. On a Unix system, filenames are only bytes, so there is no reason to refuse to unzip a zip file (in fact, the "unzip" command line tool does not fail to unzip the file in a POSIX shell).
Since "open" can take a "bytes" filename, changing line 1276 from
> open(targetpath)
to:
> open(targetpath.encode("utf-8"))
fixes the problem.
zipfile should not care about the encoding of the filename and should use the byte sequence of the filename exactly as extracted from the zip file. Having "ZipInfo.filename" as a string (and not bytes) is great for an API, but is not needed to open or write a file on disk. ZipInfo should therefore store the raw bytes of the filename in a "bytes_filename" field and use it in the "open" call of "extract".
In addition, considering the patch of bug 10614, the right patch could use the new "ZipInfo.encoding" field:
> open(targetpath.encode(member.encoding))
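Until something like this lands in zipfile itself, the same idea can be tried from user code. The following is only a sketch under stated assumptions: extract_with_bytes_names and its name_encoding parameter are hypothetical names, not part of zipfile, and the helper re-encodes ZipInfo.filename rather than reusing the raw bytes stored in the archive, so the result matches the archive bytes only when the names were stored as UTF-8.

```python
import os
import shutil
import zipfile


def extract_with_bytes_names(zip_path, dest_dir=".", name_encoding="utf-8"):
    """Extract all members, building every target path as bytes.

    Hypothetical helper (not a zipfile API). Member names are encoded with
    name_encoding instead of the locale-derived filesystem encoding.
    """
    dest = os.path.abspath(os.fsencode(dest_dir))
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Drop empty, "." and ".." components so members stay under dest.
            parts = [p for p in info.filename.split("/") if p not in ("", ".", "..")]
            if not parts:
                continue
            target = os.path.join(dest, *(p.encode(name_encoding) for p in parts))
            if info.filename.endswith("/"):
                # Directory entry: create it and move on.
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # open() accepts a bytes path, so no locale encoding is involved.
            with archive.open(info) as source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            written.append(target)
    return written
```

Calling extract_with_bytes_names("test_ut8.zip") would then write the files even with LANG=POSIX, because the path never has to be encoded with the ASCII codec.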
msg208655 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-01-21 15:33
If you live in a current-posix world, this might make sense. However, one can also argue that the filename should be *transcoded* from the tarfile encoding to the local FS filename encoding, which I believe is what we are currently doing. Which, if you are using POSIX as the locale, will fail a lot. If you use a sensible modern locale that includes utf-8, you wouldn't have a problem.
Unfortunately, the reality is probably that sometimes you want one behavior and sometimes you want the other :(
Encoding using member.encoding is probably wrong, though. If you are trying to preserve the original bytes, it is probably best to do so, and not assume that the tarfile encoding field is valid.
I'm adding Victor Stinner to nosy: he's thought about these issues much more deeply than I have. The answer may be that we will only support transcoding filenames in our tarfile module...and certainly it looks like doing anything else, even if we want to, would be a new feature.
msg208755 - (view) Author: Laurent Mazuel (Laurent.Mazuel) Date: 2014-01-22 07:39
Thanks for your answer.
I think you can't transcode internal zip filenames to the FS encoding. Actually, on Unix the FS only stores bytes for filenames; there is no "FS encoding". So if you change your locale, the filename printed in your console will change too. If you transcode the filename using the current locale, unzipping the same file twice with two different locales will lead to two different files, which is not (I think) what you intend.
The problem will not arise on Windows (NTFS is UTF-16) nor on Mac OS X (UTF-8).
Moreover, a simple "unzip" works like a charm. It doesn't care about the encoding or the current locale and extracts the file using the initial bytes in the zip. Unzipping twice with the two different locales creates only one file.
An interesting link (even if it is not an official reference):
http://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux 
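The "two different files" argument above can be illustrated with a small sketch. The helper name names_on_disk and the sample filename are made up for the illustration; the point is only that one str name becomes different byte sequences, and therefore different directory entries on a bytes-based filesystem, depending on which encoding is used to transcode it.

```python
def names_on_disk(name, encodings):
    """Map each encoding to the bytes a transcoded filename would get.

    Hypothetical helper for illustration only.
    """
    return {encoding: name.encode(encoding) for encoding in encodings}


sample = "r\u00e9sum\u00e9.txt"
candidates = names_on_disk(sample, ["utf-8", "latin-1"])

# The two byte strings differ, so a Unix filesystem would treat them as
# two unrelated files.
for encoding, raw in candidates.items():
    print(encoding, raw)

# Conversely, the same bytes read back under another encoding display as
# a different string (mojibake), although the file itself is unchanged.
print(candidates["utf-8"].decode("latin-1"))
```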
msg208817 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-01-22 14:56
Believe me, we are *well* aware of the issue that Linux stores filenames as bytes.
I agree that the inability to always transcode is an issue. That's why I'd like the opinion of someone who has studied this problem in more depth.
msg208858 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2014-01-22 22:36
The POSIX locale tells Python 3 to use ASCII for all operating system interfaces, including the standard streams. This is an antiquated behaviour in the POSIX spec that Python 3 doesn't currently work around.
Issue 19977 is a proposal to work around this limitation by default.
As an immediate workaround, it's possible to either set PYTHONIOENCODING explicitly so Python ignores the incorrect encoding claims from the OS, or else to do your own encoding and write directly to the sys.stdout.buffer binary interface.
Python 3.4 also allows setting *just* the default error handler for the streams, while still getting the encoding from the OS.
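The second workaround (doing your own encoding and writing to the binary interface) might look roughly like the sketch below. The function name write_name and the choice of UTF-8 with the "surrogateescape" error handler are assumptions made for the example; it only concerns displaying names, not opening files.

```python
import sys


def write_name(name, stream, encoding="utf-8"):
    """Encode name explicitly and write it to a binary stream.

    Hypothetical helper. "surrogateescape" lets names that came from
    undecodable bytes (for example via os.listdir) round-trip unchanged.
    """
    stream.write(name.encode(encoding, "surrogateescape") + b"\n")


if __name__ == "__main__":
    # Bypasses the text layer, so the locale's claimed encoding is ignored.
    write_name("h\u00e9llo.txt", sys.stdout.buffer)
    sys.stdout.buffer.flush()
```

The other workaround needs no code change: PYTHONIOENCODING is set in the environment before the interpreter starts.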
msg208859 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2014-01-22 22:41
My apologies, I completely misread the issue and thought it was related to displaying file names, rather than opening them.
I believe Python 3.4 includes some changes in this area - are you in a position to retry this on the latest 3.4 beta release?
msg212349 - (view) Author: Laurent Mazuel (Laurent.Mazuel) Date: 2014-02-27 11:06
Thanks for your answer.
Unfortunately, I cannot easily test Python 3.4 for now. But I have downloaded the source code, run "diff" on the "zipfile" module between 3.3 and 3.4, and see no difference relating to this problem. I may be wrong; maybe some core improvement in Python changes something?
msg308345 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2017-12-14 23:28
I created an environment under 3.3.1 in which this error was still occurring, but within that same environment, it is not occurring for 3.7. I believe this can be closed.
msg308350 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-15 00:09
> I created an environment under 3.3.1 in which this error was still occurring, but within that same environment, it is not occurring for 3.7. I believe this can be closed.
Python 3.7 now uses the UTF-8 encoding when the LC_CTYPE locale is POSIX (PEP 538, PEP 540). You should still be able to reproduce the bug with a locale with an encoding different than UTF-8.
Moreover, I understand that Python 3.6 is still affected by the bug.
I don't think that we can fix this bug, sadly. But I'm happy to see that the PEP 538 and PEP 540 are already useful!
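Whether a given environment is still affected can be checked with a short sketch like the following. The helper names can_encode_filename and describe_environment are hypothetical; sys.getfilesystemencoding(), sys.getfilesystemencodeerrors() (Python 3.6+) and sys.flags.utf8_mode (Python 3.7+) are the standard calls it relies on.

```python
import sys


def can_encode_filename(name, encoding=None, errors=None):
    """Return True if name can be encoded for OS calls with this codec.

    Hypothetical helper; defaults to the interpreter's filesystem
    encoding and error handler.
    """
    encoding = encoding or sys.getfilesystemencoding()
    errors = errors or sys.getfilesystemencodeerrors()
    try:
        name.encode(encoding, errors)
    except UnicodeEncodeError:
        return False
    return True


def describe_environment():
    """Collect the settings that decide how str filenames are encoded."""
    return {
        "filesystem_encoding": sys.getfilesystemencoding(),
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


if __name__ == "__main__":
    print(describe_environment())
    print(can_encode_filename("h\u00e9llo.txt"))
```

If the second line printed is False, extracting an archive member with that name through a str path would be expected to raise UnicodeEncodeError as in the original report.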
msg308568 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-18 14:39
> I don't think that we can fix this bug, sadly. But I'm happy to see that the PEP 538 and PEP 540 are already useful!
Oops, I mean "we cannot *close* this bug" (right now). Sorry.
I mean that IMHO we still have to fix the bug.
History
Date                 User              Action  Args
2022-04-11 14:57:57  admin             set     github: 64528
2017-12-18 14:39:09  vstinner          set     messages: + msg308568
2017-12-15 00:09:59  vstinner          set     messages: + msg308350
2017-12-14 23:28:49  cheryl.sabella    set     nosy: + cheryl.sabella
                                               messages: + msg308345
2014-02-27 11:06:41  Laurent.Mazuel    set     messages: + msg212349
2014-01-22 22:41:50  ncoghlan          set     superseder: Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale ->
2014-01-22 22:41:18  ncoghlan          set     status: closed -> open
                                               resolution: duplicate ->
                                               messages: + msg208859
2014-01-22 22:36:58  ncoghlan          set     status: open -> closed
                                               superseder: Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale
                                               resolution: duplicate
                                               messages: + msg208858
2014-01-22 14:56:59  r.david.murray    set     nosy: + ncoghlan
                                               messages: + msg208817
2014-01-22 07:39:33  Laurent.Mazuel    set     messages: + msg208755
2014-01-21 17:09:38  serhiy.storchaka  set     nosy: + serhiy.storchaka
2014-01-21 15:33:31  r.david.murray    set     nosy: + vstinner, r.david.murray
                                               messages: + msg208655
2014-01-21 15:05:53  Laurent.Mazuel    create
