Created on 2010-09-01 22:41 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| encodings.patch | vstinner, 2010-09-01 22:41 | |
| Messages (12) | |||
|---|---|---|---|
| msg115339 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2010-09-01 22:41 | |
Many C functions take a bytes argument (char* type), but the encoding is not documented. It would not be a problem if the encoding were always the same, but it is not. Examples:

- the format of PyUnicode_FromFormat() should be encoded as ISO-8859-1
- the filename of PyParser_ASTFromString() should be encoded as utf-8
- the filename of PyErr_SetFromErrnoWithFilename() should be encoded to the filesystem encoding (with the strict error handler, and not surrogateescape)
- the 's' argument of PyParser_ASTFromString() should be encoded as utf-8 if the PyPARSE_IGNORE_COOKIE flag is set; otherwise the parser checks for a #coding:xxx cookie (if there is no cookie, utf-8 is used)

The attached patch is an attempt to document most low-level functions. I chose to add the names of function arguments in the headers because I consider that a header can be used as quick documentation. I only touched .c files to change argument names.

It is hard to get the right encoding, so I cannot ensure that my patch is correct. My patch is just a draft.

I don't know if "encoded to utf-8" is the right expression. Or should it be "decoded as utf-8"? |
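
To illustrate why an undocumented encoding matters, here is a minimal, self-contained sketch in plain C. It does not use the Python C API; the function names `demo_count_latin1` and `demo_count_utf8` are hypothetical and exist only for this example. It shows that the same `char*` buffer yields a different number of code points depending on whether the callee treats it as ISO-8859-1 or as UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points when the buffer is read as
   ISO-8859-1, where every byte is exactly one code point. */
size_t demo_count_latin1(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != '\0')
        n++;
    return n;
}

/* Hypothetical helper: count code points when the buffer is read as
   UTF-8. Continuation bytes (bit pattern 10xxxxxx) do not start a new
   code point. The input is assumed to be well-formed; nothing is
   validated here. */
size_t demo_count_utf8(const char *s)
{
    const unsigned char *p;
    size_t n = 0;
    assert(s != NULL);
    for (p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the two bytes 0xC3 0xA9, the ISO-8859-1 reading gives two code points (U+00C3 and U+00A9), while the UTF-8 reading gives one (U+00E9). For pure ASCII input both readings agree, which is why the missing documentation often goes unnoticed.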
|||
| msg115404 | Author: Éric Araujo (eric.araujo) (Python committer) | Date: 2010-09-02 20:53 | |
I think either of these is correct:

- a UTF-8-encoded string
- a string encoded in UTF-8 |
|||
| msg115405 | Author: Dave Malcolm (dmalcolm) (Python committer) | Date: 2010-09-02 21:13 | |
> I think either of these is correct:
> - a UTF-8-encoded string
> - a string encoded in UTF-8

Possibly use the word "buffer" here, rather than "string", as "string" may suggest the "str" type. Or even: "NUL-terminated buffer of UTF-8-encoded bytes", or whatnot.

(sorry for bikeshedding) |
|||
| msg115523 | Author: Terry J. Reedy (terry.reedy) (Python committer) | Date: 2010-09-03 22:38 | |
Better specifying requirements is good. A few comments:

    - The second argument is an error message; it is converted to a string object.
    + The second argument is an error message; it is decoded to a string object
    + with ``'utf-8'`` encoding.

I would write the change as

    + The second argument is a utf-8 encoded error message; it is decoded to a string object.

Is the second part (what the function will do with the arg) really needed? I think in the current version, it serves to indirectly specify that the arg is not to be a string, but bytes. If the specific encoding required is specified, that also says bytes, making 'will be decoded' redundant and irrelevant.

-------------------------------

    + a Python exception (class, not an instance). *format* should be a string
    + encoded to ISO-8859-1, containing format codes,

I would write this as

    *format* should be ISO-8859-1 encoded bytes containing format codes,

although I am not clear about the implications of that. Are not all format codes ASCII characters?

--------------------------------

I do not really like 'encoded to', but 'decoded to' is wrong. 'will be decoded from xxx bytes' is better. I think there should be a general discussion somewhere about bytes arguments and the terminology that will be used. |
|||
| msg115543 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2010-09-03 23:53 | |
About PyErr_Format() and PyUnicode_FromFormat*() encoding: it's not exactly ISO-8859-1... there is a bug => issue #9769. |
|||
| msg115942 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2010-09-09 12:47 | |
#6543 changed the encoding of the filename argument of PyRun_SimpleFileExFlags() (and of all functions based on PyRun_SimpleFileExFlags) and of the c_filename attribute of the (private) compiler structure in Python 3.1.3: they now use utf-8 in strict mode instead of the filesystem encoding with surrogateescape. |
|||
| msg123655 | Author: Dave Malcolm (dmalcolm) (Python committer) | Date: 2010-12-08 22:08 | |
A (probably crazy) idea that just occurred to me:

    typedef char utf8_bytes;
    typedef char iso8859_1_bytes;
    typedef char fsenc_bytes;

then specify the encoding in the type signature of the API, e.g.:

    - int PyRun_SimpleFile(FILE *fp, const char *filename)
    + int PyRun_SimpleFile(FILE *fp, const fsenc_bytes *filename) |
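
A minimal sketch of how such typedefs would behave, assuming they are plain aliases of `char` as proposed above. The functions `demo_filename_length` and `demo_pass_utf8_as_fsenc` are hypothetical and are not part of the Python C API. The point is that the aliases document the expected encoding in the signature, but the C compiler treats them as the same type and will not reject a mismatch.

```c
#include <assert.h>
#include <string.h>

/* Aliases as proposed in the message: documentation only, since every
   one of them is the same type as char to the compiler. */
typedef char utf8_bytes;
typedef char fsenc_bytes;

/* Hypothetical API: the signature states that the filename is expected
   in the filesystem encoding. */
size_t demo_filename_length(const fsenc_bytes *filename)
{
    assert(filename != NULL);
    return strlen(filename);
}

/* Passing a utf8_bytes pointer where fsenc_bytes is expected compiles
   without any diagnostic, because both names denote char. */
size_t demo_pass_utf8_as_fsenc(const utf8_bytes *source_name)
{
    return demo_filename_length(source_name);
}
```

So the scheme helps a reader of the header, which matches the goal of using headers as quick documentation, but it should not be mistaken for type checking.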
|||
| msg123659 | Author: Alexander Belopolsky (belopolsky) (Python committer) | Date: 2010-12-08 22:55 | |
> A (probably crazy) idea that just occurred to me:
> typedef char utf8_bytes;
> typedef char iso8859_1_bytes;
> typedef char fsenc_bytes;

I like it! Let's see how far we can get without iso8859_1_bytes, though. (It is likely to be locale_bytes anyway.) There are a few places where we'll need ascii_bytes.

The added benefit is that we can make these typedefs unsigned char and avoid the ambiguity of char signedness.

We will also need to give the typedefs the Py_ prefix.

And an obligatory bikeshedding comment: if we typedef char, we should use the singular form. Or we can typedef char* Py_utf8_bytes. |
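
The signedness point can be sketched as follows. This is an illustration only: `demo_byte` and `demo_is_ascii` are hypothetical names, not proposed API. With plain `char`, whether a byte such as 0xE9 compares as greater than 0x7F depends on whether the platform's `char` is signed; with `unsigned char` the comparison is well defined everywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical byte typedef based on unsigned char, so that values
   0x80..0xFF are always positive. */
typedef unsigned char demo_byte;

/* Return 1 if every byte before the terminating NUL is in the ASCII
   range 0x00..0x7F, else 0. */
int demo_is_ascii(const demo_byte *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Well defined because *s is unsigned: a signed char holding
           0xE9 would be negative and this test would never be true. */
        if (*s > 0x7F)
            return 0;
    }
    return 1;
}
```

One cost of this choice is that string literals have type `char[]`, so callers need a cast such as `demo_is_ascii((const demo_byte *)"abc")`.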
|||
| msg124692 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2010-12-27 01:50 | |
- r87504 documents encodings of error functions.
- r87505 documents encodings of unicode functions.
- r87506 documents encodings of AST, compiler, parser and PyRun functions. |
|||
| msg124696 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2010-12-27 02:07 | |
While documenting encodings, I found two issues: #10778 and #10779. |
|||
| msg125359 | Author: Alexander Belopolsky (belopolsky) (Python committer) | Date: 2011-01-04 19:18 | |
Victor,

Here is an interesting case for your collection: PyDict_GetItemString. Note that it is documented as not setting an error, but in fact it may if decoding the key fails. This is rarely an issue because most uses of PyDict_GetItemString are with an ASCII string literal. |
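
The reason ASCII literals are safe can be sketched without the Python C API: every byte in the range 0x00..0x7F is valid UTF-8 on its own, so a decoding step cannot fail on such input. The checker below, `demo_utf8_structure_ok`, is a hypothetical and deliberately simplified helper. It is not CPython's decoder; it only checks lead-byte and continuation-byte structure and does not reject surrogates or every overlong form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified check: return 1 if the NUL-terminated
   buffer has valid UTF-8 lead/continuation structure, else 0. */
int demo_utf8_structure_ok(const char *s)
{
    const unsigned char *p;
    int extra;
    assert(s != NULL);
    p = (const unsigned char *)s;
    while (*p != '\0') {
        if (*p < 0x80)
            extra = 0;                      /* ASCII: always valid */
        else if (*p >= 0xC2 && *p <= 0xDF)
            extra = 1;                      /* 2-byte sequence */
        else if (*p >= 0xE0 && *p <= 0xEF)
            extra = 2;                      /* 3-byte sequence */
        else if (*p >= 0xF0 && *p <= 0xF4)
            extra = 3;                      /* 4-byte sequence */
        else
            return 0;                       /* stray continuation or invalid lead */
        p++;
        for (; extra > 0; extra--, p++) {
            /* A NUL or any non-continuation byte ends the sequence early. */
            if ((*p & 0xC0) != 0x80)
                return 0;
        }
    }
    return 1;
}
```

A key such as `"__name__"` passes, while a lone 0xE9 byte (the ISO-8859-1 encoding of "é") does not, which is the kind of input for which a function taking a UTF-8 `char*` key could report an error.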
|||
| msg137331 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2011-05-30 21:13 | |
> Here is an interesting case for your collection: PyDict_GetItemString.

It's easier to guess the encoding of such a function: Python 3 always uses UTF-8. But yes, the encoding should be documented.

I documented many functions, directly in the header files, and sometimes also in the reST documentation. I am closing this issue because I consider it done. If you would like to document the encoding of some specific functions, please open new issues. |
|||
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:06 | admin | set | github: 53947 |
| 2011-05-30 21:13:23 | vstinner | set | status: open -> closed; resolution: fixed; messages: + msg137331 |
| 2011-01-04 19:18:35 | belopolsky | set | nosy: terry.reedy, belopolsky, vstinner, eric.araujo, dmalcolm, docs@python; messages: + msg125359 |
| 2010-12-27 02:07:04 | vstinner | set | nosy: terry.reedy, belopolsky, vstinner, eric.araujo, dmalcolm, docs@python; messages: + msg124696 |
| 2010-12-27 01:50:56 | vstinner | set | nosy: terry.reedy, belopolsky, vstinner, eric.araujo, dmalcolm, docs@python; messages: + msg124692 |
| 2010-12-08 22:55:09 | belopolsky | set | messages: + msg123659 |
| 2010-12-08 22:08:28 | dmalcolm | set | messages: + msg123655 |
| 2010-11-17 23:54:56 | belopolsky | set | nosy: + belopolsky |
| 2010-09-09 12:47:26 | vstinner | set | messages: + msg115942 |
| 2010-09-03 23:53:46 | vstinner | set | messages: + msg115543 |
| 2010-09-03 22:38:37 | terry.reedy | set | nosy: + terry.reedy; messages: + msg115523 |
| 2010-09-02 21:13:07 | dmalcolm | set | nosy: + dmalcolm; messages: + msg115405 |
| 2010-09-02 20:53:03 | eric.araujo | set | nosy: + eric.araujo; messages: + msg115404 |
| 2010-09-01 22:41:34 | vstinner | create | |