This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: tarfile: use surrogates for undecode fields
Type: Stage:
Components: Library (Lib), Unicode Versions: Python 3.1, Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: lars.gustaebel, loewis, vstinner
Priority: normal Keywords: patch

Created on 2010-04-13 23:53 by vstinner, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
tarfile_surrogates.patch vstinner, 2010-04-13 23:53
tarfile_surrogates.2.diff lars.gustaebel, 2010-05-05 20:23
Messages (8)
msg103099 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-04-13 23:53
When reading a tar archive, tarfile decodes fields using the "replace" error handler by default. As a result, we lose information if there is an undecodable character.
Since PEP 383, undecodable filenames are stored using surrogates in Python 3. I think it's a good idea to use surrogates for tar as well, because undecodable data in a tar archive is a common problem (see the Unicode section of the tarfile documentation).
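To illustrate the difference between the two error handlers, here is a minimal sketch. The file name bytes are a made-up example (Latin-1 encoded, so not valid UTF-8) and are not taken from the patch.

```python
# Hypothetical file name: "café.txt" encoded as Latin-1, which is not valid UTF-8.
raw = b"caf\xe9.txt"

# "replace" substitutes U+FFFD for the bad byte, so the original byte is gone.
replaced = raw.decode("utf-8", "replace")

# "surrogateescape" (PEP 383) maps the bad byte 0xE9 to the lone surrogate U+DCE9.
escaped = raw.decode("utf-8", "surrogateescape")

# Encoding back: only the surrogateescape string restores the original bytes.
lossy_roundtrip = replaced.encode("utf-8", "replace")
exact_roundtrip = escaped.encode("utf-8", "surrogateescape")

print(repr(replaced), repr(escaped))
print(lossy_roundtrip == raw, exact_roundtrip == raw)
```

The surrogate form is not printable as-is, but it can always be turned back into the exact bytes that were stored in the archive.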
msg104606 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-04-30 01:05
lars: Do you have an opinion about this suggestion?
msg104867 - Author: Lars Gustäbel (lars.gustaebel) (Python committer) Date: 2010-05-03 18:40
Yes, I will soon have ;-) Please give me a few days...
msg104870 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-05-03 19:32
A better fix might be to store fields as bytes, but that would break compatibility, and Unicode is practical in Python 3.
msg104872 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2010-05-03 19:51
I think it is helpful to read the pax specification here:
http://www.opengroup.org/onlinepubs/009695399/utilities/pax.html
pax defines (IIUC) that all strings in a pax-compliant tar file are UTF-8 encoded. For the "invalid" option, they offer the alternatives bypass, rename, UTF-8, and write. It may be useful to provide the same options, in some form.
msg104873 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-05-03 19:59
My patch changes test_uname_unicode() of test_tarfile for the GNU and ustar formats (but not PAX). In GNU and ustar formats, the fields can be encoded in any encoding, and may contain invalid byte sequences.
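As a rough illustration of the point about ustar fields, the sketch below writes a user name in ISO-8859-1 and reads it back as UTF-8 with "surrogateescape". The member name, user name and encodings are assumptions chosen for the example; this is not the code of test_uname_unicode().

```python
import io
import tarfile

# Write a ustar archive whose uname field holds ISO-8859-1 bytes.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT,
                  encoding="iso-8859-1") as tar:
    info = tarfile.TarInfo("foo")      # hypothetical member name
    info.uname = "\xe4\xf6\xfc"        # "äöü", hypothetical user name
    tar.addfile(info)

# Read it back with the wrong encoding. The three bytes are invalid UTF-8,
# so each one becomes a lone surrogate instead of being replaced.
with tarfile.open(fileobj=io.BytesIO(buf.getvalue()), mode="r",
                  encoding="utf-8", errors="surrogateescape") as tar:
    uname = tar.getmember("foo").uname

# The original bytes are still recoverable and can be decoded correctly later.
recovered = uname.encode("utf-8", "surrogateescape").decode("iso-8859-1")
print(repr(uname), recovered)
```

If the reader had used the "replace" handler instead, the last step would not be possible, because the original bytes would no longer be present in the string.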
msg105085 - Author: Lars Gustäbel (lars.gustaebel) (Python committer) Date: 2010-05-05 20:23
I think it is a good suggestion to use "surrogateescape" as the default, because (I hope) it produces the fewest errors and is the best choice if tarfile is used in connection with Python's filesystem calls.
- When reading tar headers, undecodable chars in filenames end up as surrogates. This way no information is lost. In principle tarfile is merely a gateway to a filesystem inside an archive, so it feels natural if it treats filenames the same as Python's filesystem calls.
- When writing tar headers, filenames with surrogate chars (e.g. from os.listdir()) will be converted back to bytes in the header (in the case of the gnu and ustar formats). Filenames will remain unchanged, which is exactly what one would expect.
- When writing pax headers, filenames with surrogates will raise a UnicodeError because we may only use strict utf-8 inside a pax header. This is actually no different from the status quo.
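The reading and writing behaviour for the gnu format described in the first two points can be sketched as follows. The file name and payload are made up for the example, and the error handler is passed explicitly so the sketch does not depend on the default. The pax case is left out on purpose, because its behaviour is discussed separately in this thread.

```python
import io
import tarfile

raw_name = b"caf\xe9.txt"                          # hypothetical undecodable name
name = raw_name.decode("utf-8", "surrogateescape")  # as os.listdir() could return it

# Writing: the surrogate is converted back to the original byte in the header.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT,
                  encoding="utf-8", errors="surrogateescape") as tar:
    payload = b"hello"
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
archive = buf.getvalue()

# Reading with "surrogateescape" gives back the same str, no information lost.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r",
                  encoding="utf-8", errors="surrogateescape") as tar:
    names_escaped = tar.getnames()

# Reading with "replace" shows the lossy behaviour this issue is about.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r",
                  encoding="utf-8", errors="replace") as tar:
    names_replaced = tar.getnames()

print(raw_name in archive, names_escaped == [name], names_replaced)
```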
@Martin: As I understand it, the pax "invalid"-option is supposed to deal with the case when strings from a pax header are not representable in the user's encoding. In tarfile's case we don't have this problem when reading the archive until we try to extract it.
Unfortunately, POSIX says nothing about how to store bad filenames in a pax archive. tarfile raises an error. GNU tar fails silently, it just puts the unchanged original filename into the pax header without converting it to utf-8, thus violating the standard.
msg105096 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-05-05 22:15
Thank you for your review. I committed the patch as r80824 (I fixed the documentation, :versionadded => :versionchanged), blocked as r80825 (3.2).
--
> Unfortunately, POSIX says nothing about how to store bad filenames in
> a pax archive. tarfile raises an error. GNU tar fails silently,
> it just puts the unchanged original filename into the pax header
> without converting it to utf-8, thus violating the standard.
Right. I opened a new issue about that: #8333. I consider that it's a different problem.
History
Date | User | Action | Args
2022-04-11 14:56:59 | admin | set | github: 52637
2010-05-07 00:18:51 | vstinner | set | status: open -> closed
2010-05-05 22:15:22 | vstinner | set | resolution: fixed; messages: + msg105096
2010-05-05 20:23:51 | lars.gustaebel | set | files: + tarfile_surrogates.2.diff; messages: + msg105085
2010-05-03 19:59:08 | vstinner | set | messages: + msg104873
2010-05-03 19:51:56 | loewis | set | messages: + msg104872
2010-05-03 19:32:35 | vstinner | set | messages: + msg104870
2010-05-03 18:40:24 | lars.gustaebel | set | messages: + msg104867
2010-04-30 01:05:51 | vstinner | set | messages: + msg104606
2010-04-23 20:37:00 | vstinner | set | nosy: + lars.gustaebel
2010-04-18 23:27:35 | vstinner | link | issue8242 dependencies
2010-04-13 23:53:14 | vstinner | create |