Issue 4008: IDLE: checksyntax() doesn't support Unicode?



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48258

classification

Title:	IDLE: checksyntax() doesn't support Unicode?
Type:	crash	Stage:
Components:	IDLE	Versions:	Python 3.0

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	Nosy List:	geon, loewis, terry.reedy, vstinner
Priority:	release blocker	Keywords:	needs review, patch

Created on 2008-10-01 15:37 by vstinner, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
idle-3.0rc1-quits-when-run.py	vstinner, 2008-10-01 15:37
idle_encoding-3.patch	vstinner, 2008-10-02 21:49	Use tokenize.detect_encoding() to detect Python script encoding
iso.py	vstinner, 2008-10-03 22:37	Example of non-utf8 file (coding: ISO-8859-1)
idle_encoding_4.patch	loewis, 2008-12-29 19:48

Messages (16)
msg74131	Author: STINNER Victor (vstinner) (Python committer)	Date: 2008-10-01 15:37
IDLE's checksyntax() function doesn't support Unicode. Example with idle-3.0rc1-quits-when-run.py in an ASCII terminal:

    $ ./python Tools/scripts/idle
    Exception in Tkinter callback
    Traceback (most recent call last):
      File "/home/haypo/prog/py3k/Lib/tkinter/__init__.py", line 1405, in __call__
        return self.func(*args)
      File "/home/haypo/prog/py3k/Lib/idlelib/ScriptBinding.py", line 124, in run_module_event
        code = self.checksyntax(filename)
      File "/home/haypo/prog/py3k/Lib/idlelib/ScriptBinding.py", line 86, in checksyntax
        source = f.read()
      File "/home/haypo/prog/py3k/Lib/io.py", line 1719, in read
        decoder.decode(self.buffer.read(), final=True))
      File "/home/haypo/prog/py3k/Lib/io.py", line 1294, in decode
        output = self.decoder.decode(input, final=final)
      File "/home/haypo/prog/py3k/Lib/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 87: ordinal not in range(128)

To open an ASCII terminal on Linux, you can for example use xterm with an empty environment (except for the DISPLAY and HOME variables): "env -i DISPLAY=$DISPLAY HOME=$HOME xterm".
msg74134	Author: STINNER Victor (vstinner) (Python committer)	Date: 2008-10-01 16:13
Hum, the problem is that IDLE asks io.open() to detect the charset, whereas open() doesn't know about the #coding: header. So if your locale is ASCII, CP1252 or anything other than UTF-8, reading the file will fail.

I wrote a patch to detect the encoding. The Python code (detect_encoding() function) is based on PyTokenizer_FindEncoding() and get_coding_spec() (from Parser/tokenizer.c).

Is there no existing Python function to detect the encoding of a Python script? (a public function available in a Python script)
msg74160	Author: STINNER Victor (vstinner) (Python committer)	Date: 2008-10-02 14:11
Ah! tokenize already has a function, detect_encoding(). My new patch uses it to avoid code duplication.
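Editorial illustration: a minimal sketch of how tokenize.detect_encoding() can be used to read a Python source file according to its coding cookie. The helper name read_python_source and the demo file are assumptions made for this example; this is not the code from the attached patch.

```python
import os
import tempfile
import tokenize


def read_python_source(path):
    """Return (encoding, text) for a Python source file.

    The file is opened in binary mode so that the locale encoding is
    never consulted; the BOM or coding cookie decides the encoding.
    """
    with open(path, "rb") as f:
        # detect_encoding() calls readline at most twice and returns the
        # encoding name plus the raw lines it consumed.
        encoding, _consumed_lines = tokenize.detect_encoding(f.readline)
        f.seek(0)
        return encoding, f.read().decode(encoding)


# Demo with a hypothetical ISO-8859-1 file, similar in spirit to iso.py.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "iso_demo.py")
    with open(path, "wb") as f:
        f.write("# coding: ISO-8859-1\nname = 'caf\u00e9'\n".encode("iso-8859-1"))
    encoding, text = read_python_source(path)

print(encoding)
print(text)
```

In later Python versions, tokenize.open() wraps the same detection and returns a text-mode file object directly.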
msg74161	Author: Martin v. Löwis (loewis) (Python committer)	Date: 2008-10-02 14:29
Notice that there is also IOBinding.coding_spec. Not sure whether this or the one in tokenize is more correct.
msg74197	Author: STINNER Victor (vstinner) (Python committer)	Date: 2008-10-02 21:49
loewis wrote:
> Notice that there is also IOBinding.coding_spec.
> Not sure whether this or the one in tokenize is more correct.

Oh! IOBinding reimplements many features now available in Python, like universal newlines or a function to write unicode strings to a file. But I don't want to rewrite IDLE, I just want to fix the initial problem: IDLE is unable to open a non-ASCII file using a "#coding:" header.

So IDLE reimplemented coding detection twice: once in IOBinding and once in ScriptBinding. So I wrote a new version of my patch that removes all that code and reuses tokenize.detect_encoding().

I changed IDLE's behaviour: IOBinding._decode() used the locale encoding if it was unable to detect the encoding using the UTF-8 BOM and/or if the #coding: header was missing. Since I also read the comment "Finally, try the locale's encoding. This is deprecated", I prefer to remove it. If you want to keep the current behaviour, use:

-------------------------
    def detect_encoding(filename, default=None):
        with open(filename, 'rb') as f:
            encoding, line = tokenize.detect_encoding(f.readline)
        if (not line) and default:
            return default
        return encoding
    ...
    encoding = detect_encoding(filename, locale_encoding)
-------------------------

Please review and test my patch (which becomes longer and longer) :-)
msg74202	Author: Martin v. Löwis (loewis) (Python committer)	Date: 2008-10-02 22:33
> Oh! IOBinding reimplement many features now available in Python like
> universal new line or function to write unicode strings to a file.

It did not reimplement. The implementation in IOBinding predates all other implementations out there.
msg74207	Author: STINNER Victor (vstinner) (Python committer)	Date: 2008-10-02 23:05
@loewis: Ok, I didn't know. I think that it's better to reuse existing code.

I also compared the implementations of encoding detection, and the code looks the same in IDLE and tokenize, but I prefer tokenize. tokenize.detect_encoding() has longer documentation, returns the line (decoded as Unicode) matching the encoding cookie, and looks more robust.

I saw an interesting test in the IDLE code: it checks the charset. So I wrote a patch raising a SyntaxError for tokenize: issue4021.
msg74210	Author: Martin v. Löwis (loewis) (Python committer)	Date: 2008-10-02 23:19
I can't reproduce the problem. It works fine for me, displaying the box drawing character. In case it matters, locale.getpreferredencoding() returns 'ANSI_X3.4-1968'; this is on Linux, with IDLE started from an xterm, r66761.
msg74280	Author: STINNER Victor (vstinner) (Python committer)	Date: 2008-10-03 22:37
@loewis: I guess that your locale is still UTF-8. On Linux (Ubuntu Gutsy), using "env -i DISPLAY=$DISPLAY HOME=$HOME xterm" to get a new empty environment, I get:

    $ locale
    LANG=
    LC_ALL=
    LC_CTYPE="POSIX"
    LC_NUMERIC="POSIX"
    LC_TIME="POSIX"
    LC_COLLATE="POSIX"
    ...
    $ python3.0
    >>> from idlelib.IOBinding import encoding
    >>> encoding
    'ansi_x3.4-1968'
    >>> import locale
    >>> locale.getdefaultlocale()
    (None, None)
    >>> locale.nl_langinfo(locale.CODESET)
    'ANSI_X3.4-1968'

In this environment, IDLE is unable to detect the encoding of idle-3.0rc1-quits-when-run.py.

IDLE uses open(filename, 'r'): it doesn't specify the charset. In this case, TextIOWrapper uses locale.getpreferredencoding() as the encoding (or ASCII on failure).

To sum up IDLE's behaviour: if your locale is UTF-8, you will be able to open a UTF-8 file. But, for example, if your locale is UTF-8, you won't be able to open an ISO-8859-1 file. Let's try iso.py: IDLE displays the error "Failed to decode" and quits, even though I specified the encoding :-/
msg74303	Author: Martin v. Löwis (loewis) (Python committer)	Date: 2008-10-04 08:00
> @loewis: I guess that your locale is still UTF-8.

To refute this claim, I reported that locale.getpreferredencoding reports 'ANSI_X3.4-1968'. I was following your instructions exactly (on Debian 4.0), and still, it opens successfully (when loaded through File/Open). Should I do something else with it to trigger the error, other than opening it?

When opening iso.py, I get a pop-up window titled "Decoding error", with the message "Failed to Decode". This seems to be correct also.

So I still can't reproduce the problem.

I don't understand why you say that IDLE uses open(filename, 'r'). In IOBinding.IOBinding.loadfile, I see

    # open the file in binary mode so that we can handle
    # end-of-line convention ourselves.
    f = open(filename,'rb')
msg74312	Author: STINNER Victor (vstinner) (Python committer)	Date: 2008-10-04 11:24
IDLE opens the script more than once. There are two cases:
 (1) first open, when IDLE reads the file content to display it
 (2) second open, on pressing the F5 key (Run Module), to check the syntax

(1) uses IOBinding and fails to open an ISO-8859-1 file with a UTF-8 locale.

(2) uses ScriptBinding and fails to open a UTF-8 file with an ASCII locale.

About the initial problem (idle-3.0rc1-quits-when-run.py): yes, I forgot to say that you have to run the module, sorry :-/
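Editorial illustration: the two failure cases above can be imitated without changing the real locale by passing an explicit encoding that stands in for it. This is a sketch under that assumption; the helper try_decode and the two sample sources are invented for the example and are not the attached test files.

```python
import io


def try_decode(raw, assumed_encoding):
    """Read bytes through a text wrapper, as a text-mode open() would.

    assumed_encoding stands in for the locale encoding. Returns the decoded
    text, or the UnicodeDecodeError that was raised.
    """
    try:
        return io.TextIOWrapper(io.BytesIO(raw), encoding=assumed_encoding).read()
    except UnicodeDecodeError as exc:
        return exc


# A UTF-8 source containing a box-drawing character (first byte 0xe2).
utf8_source = "# coding: utf-8\nprint('\u2500')\n".encode("utf-8")
# An ISO-8859-1 source containing an accented letter (byte 0xe9).
latin1_source = "# coding: iso-8859-1\nprint('\u00e9')\n".encode("iso-8859-1")

case_1 = try_decode(latin1_source, "utf-8")   # ISO-8859-1 file, UTF-8 "locale"
case_2 = try_decode(utf8_source, "ascii")     # UTF-8 file, ASCII "locale"
matching = try_decode(utf8_source, "utf-8")   # encodings agree: decodes fine

print(type(case_1).__name__, type(case_2).__name__, type(matching).__name__)
```

In both failing cases the coding cookie is present in the file but is never looked at, because the decoder is chosen before any byte is read.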
msg76052	Author: Martin v. Löwis (loewis) (Python committer)	Date: 2008-11-19 15:38
This patch has two problems:
1. Saving files fails, since there is still a call left to the function coding_spec, but that function is removed.
2. If saving did work, it wouldn't preserve the line endings of the original file when writing it back. If you open files with DOS line endings on Unix, upon saving, they should still have DOS line endings.
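Editorial illustration of the second point: one way to keep the original line-ending convention is to remember it when loading and re-apply it when saving. This is a sketch, not IDLE's implementation; split_eol and restore_eol are hypothetical helpers, and the sketch assumes the first line ending found is representative of the whole file (mixed endings are not preserved).

```python
import re

# Match DOS endings first so that "\r\n" is not seen as a lone "\r".
EOL_RE = re.compile(rb"\r\n|\r|\n")


def split_eol(raw):
    """Return (eol, normalized): the first line ending found in raw
    (b"\\n" if none) and the content with every ending turned into b"\\n"."""
    match = EOL_RE.search(raw)
    eol = match.group(0) if match else b"\n"
    return eol, EOL_RE.sub(b"\n", raw)


def restore_eol(normalized, eol):
    """Re-apply the remembered line-ending convention before writing."""
    return normalized.replace(b"\n", eol)


dos_file = b"a = 1\r\nb = 2\r\n"
eol, normalized = split_eol(dos_file)
saved = restore_eol(normalized, eol)
print(eol, normalized, saved)
```

The file has to be read and written in binary mode for this to work, since text mode would translate the line endings itself.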
msg76579	Author: Terry J. Reedy (terry.reedy) (Python committer)	Date: 2008-11-29 02:15
This is still a problem on my WinXP 3.0rc3 with # -*- coding: utf-8 -*- in a file, but not with the same pasted directly into the shell window.
msg78479	Author: Martin v. Löwis (loewis) (Python committer)	Date: 2008-12-29 19:48
Here is a new patch that fixes this issue and the duplicate issues (#4410 and #4623). It doesn't try to eliminate code duplication, but fixes coding_spec by always decoding as Latin-1 first until the coding is known. It fixes check_syntax by opening the source file in binary mode. It should have fixed tabnanny the same way, except that tabnanny cannot properly process byte tokens.
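Editorial illustration of the "Latin-1 first" idea: Latin-1 maps every byte value to a code point, so decoding the first lines with it cannot fail, and the coding cookie (which is ASCII) can then be searched for safely. The function below is a simplified sketch and not the committed patch; the name coding_spec is reused only for readability, and the pattern is modeled on the one given in PEP 263.

```python
import codecs
import re

# Cookie pattern modeled on PEP 263; re.ASCII keeps \w to [a-zA-Z0-9_].
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)", re.ASCII)


def coding_spec(raw):
    """Return the encoding named in the first two lines of raw, or None.

    raw is the file content as bytes. Raises LookupError if a cookie
    names a codec that Python does not know.
    """
    for line in raw.split(b"\n", 2)[:2]:
        # Latin-1 decoding never raises, whatever the real encoding is.
        match = CODING_RE.match(line.decode("latin-1"))
        if match:
            name = match.group(1)
            codecs.lookup(name)  # validate the codec name
            return name
    return None


source = b"#!/usr/bin/python\n# -*- coding: iso-8859-1 -*-\nname = '\xe9'\n"
name = coding_spec(source)
print(name, source.decode(name))
```

Once the name is known, the whole content can be decoded with the declared codec, as the last line does.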
msg78933	Author: Pavel Kosina (geon)	Date: 2009-01-03 05:00
I vote for fixing this too. This might be a simplified version, or another example, of the issues mentioned above:

    # -*- coding: utf-8 -*-
    print ("ěščřžýáíé")

in IDLE prints this:

    >>> Ä›ĹˇÄŤĹ™ĹľĂ˝ĂˇĂĂ©

When running this script with the python command line from another editor, I get readable output, as expected.
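Editorial illustration: output of this shape is what results when UTF-8 bytes are decoded with a single-byte Central European code page. The choice of cp1250 here is an assumption inferred from the garbled characters; the report does not say which encoding IDLE actually used.

```python
text = "ěščřžýáíé"

# Each of these letters takes two bytes in UTF-8.
raw = text.encode("utf-8")

# Decoding those bytes with a single-byte codec yields two characters
# per original letter instead of one.
garbled = raw.decode("cp1250")

# The damage is reversible as long as no byte was dropped on the way.
recovered = garbled.encode("cp1250").decode("utf-8")

print(len(text), len(raw), len(garbled))
print(recovered == text)
```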
msg80119	Author: Martin v. Löwis (loewis) (Python committer)	Date: 2009-01-18 20:18
Committed as r68730 and r68731.

History
Date	User	Action	Args
2022-04-11 14:56:39	admin	set	github: 48258
2009-01-18 20:18:13	loewis	set	status: open -> closed resolution: fixed messages: + msg80119
2009-01-03 05:00:24	geon	set	nosy: + geon messages: + msg78933
2008-12-29 19:48:27	loewis	set	priority: release blocker keywords: + needs review messages: + msg78479 files: + idle_encoding_4.patch
2008-12-29 19:42:25	loewis	link	issue4410 superseder
2008-12-29 19:41:55	loewis	link	issue4623 superseder
2008-12-04 23:14:22	amaury.forgeotdarc	link	issue4530 superseder
2008-11-29 02:15:41	terry.reedy	set	nosy: + terry.reedy type: crash messages: + msg76579
2008-11-29 01:37:16	amaury.forgeotdarc	link	issue4454 superseder
2008-11-19 15:38:47	loewis	set	messages: + msg76052
2008-10-06 21:58:37	vstinner	set	files: - idle_encoding-2.patch
2008-10-04 11:24:20	vstinner	set	messages: + msg74312
2008-10-04 08:00:53	loewis	set	messages: + msg74303
2008-10-03 22:37:12	vstinner	set	files: + iso.py messages: + msg74280
2008-10-02 23:19:45	loewis	set	messages: + msg74210
2008-10-02 23:05:39	vstinner	set	messages: + msg74207
2008-10-02 22:33:35	loewis	set	messages: + msg74202
2008-10-02 21:49:13	vstinner	set	files: + idle_encoding-3.patch messages: + msg74197
2008-10-02 14:29:28	loewis	set	nosy: + loewis messages: + msg74161
2008-10-02 14:11:18	vstinner	set	files: - idle_encoding.patch
2008-10-02 14:11:13	vstinner	set	files: + idle_encoding-2.patch messages: + msg74160
2008-10-01 16:13:59	vstinner	set	files: + idle_encoding.patch keywords: + patch messages: + msg74134
2008-10-01 15:37:54	vstinner	create
