Message 115282 - Python tracker


In-reply-to
Author	vstinner
Recipients	ideasman42, vstinner
Date	2010-08-31 22:29:49
Message-id	<1283293792.55.0.37414088951.issue9713@psf.upfronthosting.co.za>

Content

The problem is not specific to Py_CompileString(): all functions based (indirectly) on PyParser_ASTFromString() and PyParser_ASTFromFile() expect filenames encoded in utf-8 with the strict error handler.
If we choose to use something other than utf-8 in strict mode, here is an incomplete list of functions that would have to be patched:
 - parser:
 * initerr()
 * err_input()
 - ast:
 * ast_error_finish()
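
To illustrate why the error-reporting helpers matter: the filename handed to the parser ends up on the exception raised for a syntax error. This is only a minimal Python-level sketch using the built-in compile(), not the C functions listed above; the filename "example_script.py" and the broken source are made up for illustration.

```python
# Hypothetical filename and deliberately invalid source, for illustration only.
filename = "example_script.py"
source = "def broken(:\n    pass\n"

try:
    compile(source, filename, "exec")
except SyntaxError as exc:
    # The filename passed to the parser is carried on the exception.
    reported = exc.filename
else:
    reported = None

print(reported)
```
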
And the list of impacted functions (parsing functions accepting filenames):
 - PyParser_ParseStringFlagsFilename()
 - PyParser_ParseFile*()
 - PyParser_ASTFromString(), PyParser_ASTFromFile()
 - PyAST_FromNode()
 - PyRun_SimpleFile*()
 - PyRun_AnyFile*()
 - PyRun_InteractiveOneFlags()
 - etc.
All these functions are public, and I don't think it would be a good idea to change the encoding (e.g. to iso-8859-1). We can use a different error handler (especially surrogateescape, as suggested in the initial message) and/or create new functions accepting Unicode filenames.
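
As a rough illustration of the difference between the two error handlers, here is a small Python-level sketch (not the C code path itself). The byte string is a hypothetical filename encoded as ISO-8859-1, so it is not valid utf-8.

```python
# Hypothetical filename: "café.py" encoded as ISO-8859-1, hence invalid utf-8.
raw_name = b"caf\xe9.py"

# The strict handler rejects the undecodable byte.
try:
    raw_name.decode("utf-8", "strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the bad byte 0xE9 to the lone surrogate U+DCE9,
# so the original bytes can be recovered when encoding back.
decoded = raw_name.decode("utf-8", "surrogateescape")
round_trip = decoded.encode("utf-8", "surrogateescape")

# ascii() is used because a lone surrogate cannot be printed directly.
print(strict_ok, ascii(decoded), round_trip == raw_name)
```

The round trip is what makes this handler attractive for filenames: the decoded string can be passed around as text and still be turned back into the exact bytes the operating system knows.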
--
I'm working on undecodable filenames in issues #8611 and #9425, especially on the import machinery part. Once the import machinery is fully Unicode compliant, the last part will be the "parser machinery" (Parser/*.c). Patching the parser is a little more complex because of the bootstrap problem: the parser is compiled twice, once with a small subset of the Python C API (using some mockups) and once with the full API.

History
Date	User	Action	Args
2010-08-31 22:29:52	vstinner	set	recipients: + vstinner, ideasman42
2010-08-31 22:29:52	vstinner	set	messageid: <1283293792.55.0.37414088951.issue9713@psf.upfronthosting.co.za>
2010-08-31 22:29:51	vstinner	link	issue9713 messages
2010-08-31 22:29:49	vstinner	create
