This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Document the encoding of functions bytes arguments of the C API
Type: Stage:
Components: Documentation, Interpreter Core, Unicode Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: belopolsky, dmalcolm, docs@python, eric.araujo, terry.reedy, vstinner
Priority: normal Keywords: patch

Created on 2010-09-01 22:41 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
encodings.patch vstinner, 2010-09-01 22:41
Messages (12)
msg115339 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-01 22:41
Many C functions have a bytes argument (char* type) but the encoding is not documented. It would not be a problem if the encoding was always the same, but it is not. Examples:
 - format of PyUnicode_FromFormat() should be encoded as ISO-8859-1
 - filename of PyParser_ASTFromString() should be encoded as utf-8
 - filename of PyErr_SetFromErrnoWithFilename() should be encoded to the filesystem encoding (with strict error handler, and not surrogateescape)
 - 's' argument of PyParser_ASTFromString() should be encoded as utf-8 if PyPARSE_IGNORE_COOKIE flag is set, otherwise the parser checks for #coding:xxx cookie (if there is no cookie, utf-8 is used)
The attached patch is an attempt to document most low-level functions. I chose to add the names of function arguments in the headers because I consider that a header can be used as quick documentation. I only touched .c files to change argument names.
It is hard to get the right encoding, so I cannot ensure that my patch is correct. My patch is just a draft.
I don't know if "encoded to utf-8" is the right expression. Or should it be "decoded as utf-8"?
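To make the problem in this report concrete, here is a minimal sketch in plain C (not CPython code) of why the encoding of a char* argument has to be documented: the same bytes yield different text depending on whether the callee treats them as ISO-8859-1 or UTF-8. The function names first_code_point_latin1 and first_code_point_utf8 are hypothetical, and the UTF-8 branch deliberately handles only 1- and 2-byte sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ISO-8859-1: every byte maps to the code point with the same value. */
static uint32_t first_code_point_latin1(const char *s)
{
    assert(s != NULL);
    return (unsigned char)s[0];
}

/* UTF-8, simplified: decodes only 1- and 2-byte sequences.
 * Returns UINT32_MAX for anything this sketch does not handle,
 * including bytes that cannot start a sequence. */
static uint32_t first_code_point_utf8(const char *s)
{
    assert(s != NULL);
    unsigned char b0 = (unsigned char)s[0];
    if (b0 < 0x80)
        return b0;                      /* ASCII */
    if (b0 >= 0xC2 && b0 <= 0xDF) {     /* lead byte of a 2-byte sequence */
        unsigned char b1 = (unsigned char)s[1];
        if ((b1 & 0xC0) == 0x80)        /* continuation byte 10xxxxxx */
            return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
    }
    return UINT32_MAX;
}
```

With this sketch, the two bytes 0xC3 0xA9 are one character (U+00E9) when read as UTF-8 but start with U+00C3 when read as ISO-8859-1, while the single byte 0xE9 is U+00E9 in ISO-8859-1 and not decodable as UTF-8.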
msg115404 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-09-02 20:53
I think either of these is correct:
- a UTF-8-encoded string
- a string encoded in UTF-8
msg115405 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2010-09-02 21:13
> I think either of these is correct:
> - a UTF-8-encoded string
> - a string encoded in UTF-8
Possibly use the word "buffer" here, rather than "string", as "string" may suggest the "str" type.
Or even: "NUL-terminated buffer of UTF-8-encoded bytes", or whatnot.
(sorry for bikeshedding)
msg115523 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-09-03 22:38
Better specifying requirements is good. A few comments:
- The second argument is an error message; it is converted to a string object.
+ The second argument is an error message; it is decoded to a string object
+ with ``'utf-8'`` encoding.
 
I would write the change as
+ The second argument is a utf-8 encoded error message; it is decoded to a string object. 
Is the second part (what the function will do with the arg) really needed? I think in the current version, it serves to indirectly specify that the arg is not to be a string, but bytes. If the specific encoding required is specified, that also says bytes, making 'will be decoded' redundant and irrelevant.
-------------------------------
+ a Python exception (class, not an instance). *format* should be a string
+ encoded to ISO-8859-1, containing format codes, 
*format* should be ISO-8859-1 encoded bytes containing format codes,
although I am not clear about the implications of that. Are not all format codes ASCII characters?
--------------------------------
I do not really like 'encoded to', but 'decoded to' is wrong. 'will be decoded from xxx bytes' is better. I think there should be a general discussion somewhere about bytes arguments and the terminology that will be used.
msg115543 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-03 23:53
About PyErr_Format() and PyUnicode_FromFormat*() encoding: it's not exactly ISO-8859-1... there is a bug => issue #9769.
msg115942 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-09 12:47
#6543 changed the encoding of the filename argument of PyRun_SimpleFileExFlags() (and all functions based on PyRun_SimpleFileExFlags) and of the c_filename attribute of the compiler (private) structure in Python 3.1.3: they now use utf-8 in strict mode instead of the filesystem encoding with surrogateescape.
msg123655 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2010-12-08 22:08
A (probably crazy) idea that just occurred to me:
 typedef char utf8_bytes;
 typedef char iso8859_1_bytes;
 typedef char fsenc_bytes;
then specify the encoding in the type signature of the API e.g.:
- int PyRun_SimpleFile(FILE *fp, const char *filename)
+ int PyRun_SimpleFile(FILE *fp, const fsenc_bytes *filename)
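One property of this idea is worth spelling out with a small sketch: in C a typedef introduces an alias, not a distinct type, so the names below document intent without letting the compiler reject a mismatch. The typedef names come from the message above; the helpers utf8_length and length_of_filename are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef char utf8_bytes;    /* names taken from the proposal above */
typedef char fsenc_bytes;

/* Hypothetical helper: counts code points in a NUL-terminated buffer
 * by counting every byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8. */
static size_t utf8_length(const utf8_bytes *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: passes a filesystem-encoded buffer to a function
 * that expects UTF-8. This compiles without any diagnostic, because
 * both aliases are plain char. */
static size_t length_of_filename(const fsenc_bytes *filename)
{
    return utf8_length(filename);
}
```

So the typedefs would act as self-documenting signatures rather than as a type check; catching a mix-up at compile time would need distinct types (for example wrapper structs), which is a different design from the one proposed here.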
msg123659 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-08 22:55
> A (probably crazy) idea that just occurred to me:
> typedef char utf8_bytes;
> typedef char iso8859_1_bytes;
> typedef char fsenc_bytes;
I like it! Let's see how far we can get without iso8859_1_bytes, though. (It is likely to be locale_bytes anyways.) There are a few places where we'll need ascii_bytes.
The added benefit is that we can make these typedefs unsigned char and avoid the ambiguity of char signedness. We will also need to give the typedefs the Py_ prefix.
And an obligatory bikeshedding comment: if we typedef char, we should use the singular form. Or we can typedef char* Py_utf8_bytes.
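The signedness point can be illustrated with a short sketch. Whether plain char is signed is implementation-defined in C, so widening a byte such as 0xE9 to int gives a platform-dependent result, whereas unsigned char always gives 233. The alias demo_utf8_byte is hypothetical (a real one would carry the Py_ prefix, as noted above), and both function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical alias, not an actual CPython type. */
typedef unsigned char demo_utf8_byte;

/* Widening through plain char: the result for bytes >= 0x80 depends on
 * whether the platform's char is signed. */
static int first_byte_as_int_plain(const char *s)
{
    assert(s != NULL);
    return s[0];
}

/* Widening through unsigned char: always in the range 0..255. */
static int first_byte_as_int_unsigned(const demo_utf8_byte *s)
{
    assert(s != NULL);
    return s[0];
}
```

On a platform where char is signed and 8 bits wide, the plain version typically returns -23 for the byte 0xE9, which breaks comparisons like `c >= 0x80` and table lookups indexed by the byte value; the unsigned version avoids that.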
msg124692 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-12-27 01:50
r87504 documents encodings of error functions.
r87505 documents encodings of unicode functions.
r87506 documents encodings of AST, compiler, parser and PyRun functions.
msg124696 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-12-27 02:07
While documenting encodings, I found two issues: #10778 and #10779.
msg125359 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-01-04 19:18
Victor,
Here is an interesting case for your collection: PyDict_GetItemString. Note that it is documented as not setting an error, but in fact it may if encoding fails. This is rarely an issue because most uses of PyDict_GetItemString are with an ASCII string literal.
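The reason ASCII literals are safe can be sketched in a few lines: every pure-ASCII byte string is also valid UTF-8, so converting such a key cannot fail on encoding grounds, while a key containing bytes of 0x80 or above may not be valid UTF-8. The helper below is hypothetical and does not model what CPython does internally; it only shows the check a caller could make on a key before relying on the "never sets an error" behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the NUL-terminated key contains
 * only ASCII bytes (0x01..0x7F), 0 otherwise. */
static int is_pure_ascii(const char *key)
{
    assert(key != NULL);
    for (; *key != '\0'; key++) {
        if ((unsigned char)*key >= 0x80)
            return 0;
    }
    return 1;
}
```

A key such as "__name__" passes this check, whereas a Latin-1 encoded "caf\xE9" does not, and the lone byte 0xE9 is in fact not valid UTF-8.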
msg137331 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-30 21:13
> Here is an interesting case for your collection: PyDict_GetItemString.
It's easier to guess the encoding of such a function: Python 3 always uses UTF-8, but yes, the encoding should be documented.
I documented many functions, directly in the header files, and sometimes also in the reST documentation.
I am closing this issue because I consider it done. If you would like to document the encoding of some specific functions, please open new issues.
History
Date User Action Args
2022-04-11 14:57:06 admin set github: 53947
2011-05-30 21:13:23 vstinner set status: open -> closed
resolution: fixed
messages: + msg137331
2011-01-04 19:18:35 belopolsky set nosy: terry.reedy, belopolsky, vstinner, eric.araujo, dmalcolm, docs@python
messages: + msg125359
2010-12-27 02:07:04 vstinner set nosy: terry.reedy, belopolsky, vstinner, eric.araujo, dmalcolm, docs@python
messages: + msg124696
2010-12-27 01:50:56 vstinner set nosy: terry.reedy, belopolsky, vstinner, eric.araujo, dmalcolm, docs@python
messages: + msg124692
2010-12-08 22:55:09 belopolsky set messages: + msg123659
2010-12-08 22:08:28 dmalcolm set messages: + msg123655
2010-11-17 23:54:56 belopolsky set nosy: + belopolsky
2010-09-09 12:47:26 vstinner set messages: + msg115942
2010-09-03 23:53:46 vstinner set messages: + msg115543
2010-09-03 22:38:37 terry.reedy set nosy: + terry.reedy
messages: + msg115523
2010-09-02 21:13:07 dmalcolm set nosy: + dmalcolm
messages: + msg115405
2010-09-02 20:53:03 eric.araujo set nosy: + eric.araujo
messages: + msg115404
2010-09-01 22:41:34 vstinner create
