This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Classification
Title: Implement utf-8-bmp codec
Type: behavior
Stage: resolved
Components: IDLE, Tkinter
Versions: Python 3.7, Python 3.6

Process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To: asvetlov
Nosy List: Arfrever, asvetlov, belopolsky, ezio.melotti, loewis, pitrou, roger.serwy, serhiy.storchaka, terry.reedy
Priority: normal
Keywords: patch

Created on 2012-03-14 21:01 by asvetlov, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name: idle_escape_nonbmp.patch
Uploaded: serhiy.storchaka, 2012-04-16 17:13
Messages (30)
msg155793 - Author: Andrew Svetlov (asvetlov) (Python committer) Date: 2012-03-14 21:01
Tkinter (and IDLE especially) can use only UCS-2 characters.
In PyShell, IDLE tries to escape non-ASCII characters.
For a better result we should escape only non-BMP characters, leaving BMP characters untouched.
msg157235 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2012-03-31 23:07
The solution outlined in the issue title ("utf-8-bmp codec") sounds like a rather dubious idea.
msg157248 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2012-04-01 01:35
pitrou: can you elaborate?
msg157263 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-04-01 07:38
''.join(c if ord(c) < 0x10000 else escape(c) for c in s)
msg158372 - Author: STINNER Victor (vstinner) (Python committer) Date: 2012-04-15 21:59
What is this codec? What do you mean by "escape non-ascii"?
msg158424 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2012-04-16 12:45
This codec is one that is equal to UTF-8, but restricted to the BMP. For non-BMP characters, the error handler is called. It will be the stdout codec for the IDLE interactive shell, causing non-BMP results to be ascii()-escaped.
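To make the description above concrete, here is a minimal sketch of such a codec in pure Python. It is an illustration only and not code from this issue: the function names, the registration under the name "utf-8-bmp", and the choice to decode as plain UTF-8 are all assumptions. BMP runs are encoded as UTF-8, and each run of non-BMP characters is handed to whatever error handler the caller selected.

```python
import codecs

CODEC_NAME = "utf-8-bmp"  # hypothetical name, taken from the issue title


def utf8_bmp_encode(text, errors="strict"):
    """Encode like UTF-8, but route non-BMP runs through the error handler."""
    handler = codecs.lookup_error(errors)
    out = bytearray()
    pos, size = 0, len(text)
    while pos < size:
        # Copy the longest run of BMP characters as ordinary UTF-8.
        end = pos
        while end < size and ord(text[end]) <= 0xFFFF:
            end += 1
        out += text[pos:end].encode("utf-8")
        if end == size:
            break
        # Hand the following run of non-BMP characters to the error handler.
        stop = end
        while stop < size and ord(text[stop]) > 0xFFFF:
            stop += 1
        exc = UnicodeEncodeError(CODEC_NAME, text, end, stop,
                                 "non-BMP character not supported")
        # "strict" raises here; the built-in handlers resume at the run's end.
        replacement, pos = handler(exc)
        if isinstance(replacement, str):
            replacement = replacement.encode("utf-8")
        out += replacement
    return bytes(out), size


def utf8_bmp_decode(data, errors="strict"):
    """Decoding is plain UTF-8 in this sketch (an assumption)."""
    return bytes(data).decode("utf-8", errors), len(data)


def _search(name):
    # Python 3.9+ normalizes hyphens to underscores before calling this.
    if name in ("utf-8-bmp", "utf_8_bmp"):
        return codecs.CodecInfo(utf8_bmp_encode, utf8_bmp_decode,
                                name=CODEC_NAME)
    return None


codecs.register(_search)

# BMP text passes through; the non-BMP character is backslash-escaped.
print("\u0100\U00010000".encode("utf-8-bmp", "backslashreplace"))
```

With errors="strict" the same call raises UnicodeEncodeError, which would mirror what print() does in the console when stdout cannot encode a character.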
msg158426 - Author: Andrew Svetlov (asvetlov) (Python committer) Date: 2012-04-16 12:50
Tkinter (like Tcl itself) has no support for non-BMP characters in any form.
It looks like UTF-16 support without surrogates.
I would like to implement a codec for that, which will handle the different error modes (strict, replace, ignore, etc.) as other codecs do.
It will allow us to support the BMP well and to control the processing of non-BMP characters in IDLE.
About your second question:
IDLE has an interactive shell. This shell, in its REPL, tries to print the expression result. If it contains non-BMP characters, the whole result is converted to ASCII with escaping. This differs from the standard Python console. From my perspective, the expected behavior is to pass BMP characters through and escape only non-BMP ones.
msg158460 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-04-16 14:56
Example:
>>> '\u0100'
'Ā'
>>> '\u0100\U00010000'
'\u0100\U00010000'
>>> print('\u0100')
Ā
>>> print('\u0100\U00010000')
Traceback (most recent call last):
  File "<pyshell#33>", line 1, in <module>
    print('\u0100\U00010000')
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 1-1: Non-BMP character not supported in Tk
But I think that it is too specific a problem and too specific a solution. It would be better if IDLE itself escaped the string in the most appropriate way.
def utf8bmp_encode(s):
    # Keep BMP characters; write anything above U+FFFF as a \UXXXXXXXX escape.
    return ''.join(c if ord(c) <= 0xffff else '\\U%08x' % ord(c) for c in s).encode('utf-8')
or
import re
def utf8bmp_encode(s):
    # Same result, using a regular expression to find the non-BMP characters.
    return re.sub('[^\x00-\uffff]', lambda m: '\\U%08x' % ord(m.group()), s).encode('utf-8')
msg158467 - Author: Andrew Svetlov (asvetlov) (Python committer) Date: 2012-04-16 15:28
That way is called a 'codec'.
msg158470 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2012-04-16 15:35
> But I think that it is too specific a problem and too specific a
> solution. It would be better if IDLE itself escaped the string in the
> most appropriate way.
That is not implementable correctly. If you think otherwise, please
submit a patch. If not, please trust me on that judgment.
msg158486 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-04-16 17:13
Maybe I did not understand the problem correctly, but I assume
that this patch solves it.
'Агов!\U00010000'
msg158487 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-04-16 17:32
Sorry, the mail daemon has eaten a piece of example.
>>> '\u0410\u0433\u043e\u0432!\U00010000'
'Агов!\U00010000'
msg159497 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-04-27 21:48
Andrew, does the patch solve your issue?
msg159530 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2012-04-28 17:43
The patch is incorrect, i.e. it deviates from what the command line interface does. When you try to write to sys.stdout and the characters are not supported, you get a UnicodeError. Only in interactive mode, when it tries to represent some result, does ASCII escaping happen.
msg159531 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-04-28 18:32
I don't see how the patch is worse than the current behavior.
Unpatched:
>>> ''.join(map(chr, [76, 246, 119, 105, 115]))
'Löwis'
>>> ''.join(map(chr, [76, 246, 119, 105, 115, 65536]))
'L\xf6wis\U00010000'
Patched:
>>> ''.join(map(chr, [76, 246, 119, 105, 115]))
'Löwis'
>>> ''.join(map(chr, [76, 246, 119, 105, 115, 65536]))
'Löwis\U00010000'
In the case of the Cyrillic alphabet all text becomes unreadable, if there are some non-bmp characters in it.
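For illustration, the "Patched" output above could be produced by a display hook along the following lines. This is a hypothetical sketch and not the contents of idle_escape_nonbmp.patch; the names escape_non_bmp and bmp_displayhook, and the optional stream parameter, are invented here.

```python
import builtins
import io
import sys


def escape_non_bmp(text):
    """Replace each character above U+FFFF with a \\UXXXXXXXX escape."""
    return ''.join(c if ord(c) <= 0xFFFF else '\\U%08x' % ord(c)
                   for c in text)


def bmp_displayhook(value, stream=None):
    """Like the default display hook, but escape only non-BMP characters."""
    if value is None:
        return
    builtins._ = None  # avoid keeping a stale value if repr() fails
    out = sys.stdout if stream is None else stream
    out.write(escape_non_bmp(repr(value)))
    out.write("\n")
    builtins._ = value


# Demonstration on an in-memory stream instead of the real shell.
buf = io.StringIO()
bmp_displayhook(''.join(map(chr, [76, 246, 119, 105, 115, 65536])), buf)
print(buf.getvalue(), end="")
```

Installing it would be a matter of assigning sys.displayhook = bmp_displayhook. Whether IDLE should install a display hook at all is exactly what the following messages dispute.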
msg159538 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2012-04-28 20:34
> In the case of the Cyrillic alphabet all text becomes unreadable, if 
> there are some non-bmp characters in it.
And indeed, that's the correct, desired behavior, as it models what the
interactive shell does.
If you want to change this, you need to also change the interactive console,
which is an issue independent of this one.
msg159541 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2012-04-28 20:58
I take that back; the interactive shell uses the backslashreplace error handler.
Still, I don't think IDLE should set up a displayhook in the first place. What if an application replaces the displayhook?
msg159543 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-04-28 21:39
> Still, I don't think IDLE should set up a displayhook in the first place. What if an application replaces the displayhook?
IDLE *is* the application.
If another application that uses idlelib replaces the displayhook, it
must itself take care of correct encoding and escaping.
msg159544 - Author: Andrew Svetlov (asvetlov) (Python committer) Date: 2012-04-28 22:07
Serhiy, I would like to fix tkinter itself, not only IDLE.
There are other problems, such as IDLE crashing if a non-BMP character is pasted from the clipboard.
Moreover, non-BMP behavior differs from one Tk widget to another.
I still want to make a codec for it and then try to solve the Tk problems.
Maybe the solution will require extending the tkinter interface to process codec errors, with a reasonable, well-specified default behavior.
Sorry for my silence. I hope to make some progress in the coming weeks.
msg159545 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2012-04-28 22:30
> IDLE *is* the application.
No, IDLE is the development environment. The application is
whatever is being developed with IDLE.
msg159546 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-04-28 22:51
I don't understand how the utf-8-bmp codec will help to fix tkinter. To fix tkinter, you need to fix Tcl/Tk, but that is outside of Python. As long as Tcl does not support non-BMP characters, correct and unambiguous handling of non-BMP characters is not possible. You have to choose a method of encoding non-BMP characters, and these methods will differ between applications.
msg159547 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-04-28 23:04
> No, IDLE is the development environment. The application is
> whatever is being developed with IDLE.
If the application replaces the displayhook, then it is the development
environment too.
msg159582 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-04-29 05:52
Andrew, imagine that the utf-8-bmp codec is already there (I will write it
for you if I see the need for it). How are you going to use it? Show a
patch that fixes IDLE and tkinter using this codec. It seems to me that
any result can be achieved without the codec, and at no higher cost. And
that's not counting the cost of the codec itself.
msg163745 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-06-24 07:27
Any chance to commit the patch before final feature freeze?
msg228168 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2014-10-02 07:05
Pending doing some experiments with current and patched code, and reading the rpc code, I believe I would like to see the patch applied. I don't care about whether the patch defines a 'codec' or what its name would be. What I do want is for the IDLE Shell to display Unicode strings produced by Python code as faithfully as possible, without raising an exception, given the limitations of Tk and the selected font.
msg228175 - Author: STINNER Victor (vstinner) (Python committer) Date: 2014-10-02 07:39
> Tkinter (and IDLE especially) can use only UCS-2 characters.
Is it always the case, or does it depend on a compilation flag of Tcl or Tk?
msg228182 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2014-10-02 08:00
In theory Tcl/Tk can be built with 32-bit Tcl_Char. But I doubt that this option is well tested. In any case on Linux Python depends on system Tcl/Tk.
msg228183 - Author: STINNER Victor (vstinner) (Python committer) Date: 2014-10-02 08:01
> In theory Tcl/Tk can be built with 32-bit Tcl_Char.
Would it make sense to compile Tcl/Tk with 32-bit Tcl_Char on Windows? I think that we embed our own build of Tcl/Tk, right?
msg296687 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2017-06-23 07:39
In 3.6, Python's use of the Windows console was changed to work much better with unicode. As a result, IDLE is now worse rather than better than the console on Windows. I plan to do something before 3.7.0.
msg370920 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2020-06-07 22:58
In October 2019, Serhiy solved the display issue with a _tkinter patch for #13153.
bpo-13153: Use OS native encoding for converting between Python and Tcl. (GH-16545)
https://github.com/python/cpython/commit/06cb94bc8419b9a24df6b0d724fcd8e40c6971d6
In Windows IDLE Shell
>>> ''.join(map(chr, [76, 246, 119, 105, 115, 0x1F40D]))
'Löwis🐍'
except that the snake is black and white. (Many astral chars have no glyph and appear as a box.) In console REPL, the snake shows as box box space box.
Pasting astral characters into edited code 'works' except that editing following code is messy because the astral char is multiple chars internally and the visible cursor no longer matches the internal index. (But pasting such no longer crashes IDLE.)
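The index mismatch described above can be illustrated by counting UTF-16 code units, on the assumption (not stated in this message) that the Tk build in question stores text as 16-bit units, so each astral character occupies two of them. The helper names below are hypothetical.

```python
def utf16_length(text):
    """Number of UTF-16 code units needed to represent text."""
    # "surrogatepass" keeps the helper usable on strings with lone surrogates.
    return len(text.encode("utf-16-le", "surrogatepass")) // 2


def utf16_offset(text, index):
    """Translate a Python string index into a UTF-16 code-unit offset."""
    return utf16_length(text[:index])


s = "a\U0001F40Db"  # 'a', the snake, 'b'
print(len(s), utf16_length(s))   # code points versus code units
print(utf16_offset(s, 2))        # offset of 'b' counted in code units
```

Every character after an astral character sits one unit further along in the 16-bit view than in Python's view, which is consistent with the cursor drifting away from the internal index.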
History
Date User Action Args
2022-04-11 14:57:28 admin set github: 58512
2020-06-07 23:13:21 vstinner set nosy: - vstinner
2020-06-07 22:58:51 terry.reedy set status: open -> closed; resolution: out of date; messages: + msg370920; stage: patch review -> resolved
2017-06-23 07:39:16 terry.reedy set messages: + msg296687; components: + IDLE; versions: + Python 3.6, Python 3.7, - Python 3.3
2014-10-27 19:32:21 belopolsky set nosy: + belopolsky
2014-10-02 08:01:52 vstinner set messages: + msg228183
2014-10-02 08:00:57 serhiy.storchaka set messages: + msg228182
2014-10-02 07:39:01 vstinner set messages: + msg228175
2014-10-02 07:05:31 terry.reedy set nosy: + terry.reedy; messages: + msg228168; type: behavior; stage: patch review
2012-06-24 07:27:40 serhiy.storchaka set messages: + msg163745
2012-04-30 13:38:59 Arfrever set nosy: + Arfrever
2012-04-29 05:52:56 serhiy.storchaka set messages: + msg159582
2012-04-28 23:04:53 serhiy.storchaka set messages: + msg159547
2012-04-28 22:51:12 serhiy.storchaka set messages: + msg159546
2012-04-28 22:30:47 loewis set messages: + msg159545
2012-04-28 22:07:04 asvetlov set messages: + msg159544
2012-04-28 21:39:22 serhiy.storchaka set messages: + msg159543
2012-04-28 20:58:21 loewis set messages: + msg159541
2012-04-28 20:34:28 loewis set messages: + msg159538
2012-04-28 18:32:32 serhiy.storchaka set messages: + msg159531
2012-04-28 17:43:25 loewis set messages: + msg159530
2012-04-27 21:48:08 serhiy.storchaka set messages: + msg159497
2012-04-16 17:32:21 serhiy.storchaka set messages: + msg158487
2012-04-16 17:13:51 serhiy.storchaka set files: + idle_escape_nonbmp.patch; keywords: + patch; messages: + msg158486
2012-04-16 15:35:35 loewis set messages: + msg158470
2012-04-16 15:28:10 asvetlov set messages: + msg158467
2012-04-16 14:56:47 serhiy.storchaka set messages: + msg158460
2012-04-16 12:50:53 asvetlov set messages: + msg158426
2012-04-16 12:45:14 loewis set messages: + msg158424
2012-04-15 21:59:22 vstinner set nosy: + vstinner; messages: + msg158372
2012-04-14 04:31:06 ezio.melotti set nosy: + ezio.melotti
2012-04-01 07:38:47 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg157263
2012-04-01 01:35:44 loewis set messages: + msg157248
2012-03-31 23:07:43 pitrou set nosy: + loewis, pitrou; messages: + msg157235
2012-03-14 22:38:55 roger.serwy set nosy: + roger.serwy
2012-03-14 21:01:18 asvetlov set components: + Tkinter
2012-03-14 21:01:10 asvetlov create
