Created on 2008-03-18 05:22 by ocean-city, last changed 2022-04-11 14:56 by admin.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| py3k_adjust_cursor_at_syntax_error.patch | ocean-city, 2008-09-21 23:04 | |
| traceback_adjust_cursor.patch | amaury.forgeotdarc, 2008-09-30 23:52 | |
| py3k_adjust_cursor_at_syntax_error_v2.patch | ocean-city, 2008-10-06 06:49 | |
| issue2382.patch | vstinner, 2009-03-17 21:40 | |
| unicode_utf8size.patch | vstinner, 2009-03-17 23:23 | |
| unicode_width.patch | vstinner, 2009-03-17 23:23 | |
| adjust_offset.patch | vstinner, 2009-03-17 23:23 | |
| print_exception.patch | vstinner, 2009-03-17 23:24 | |
| test.py | belopolsky, 2013-06-10 20:54 | |
| adjust_offset_2.patch | serhiy.storchaka, 2013-09-25 20:59 | |
| Messages (29) | |||
|---|---|---|---|
| msg63895 | Author: Hirokazu Yamamoto (ocean-city) (Python committer) | Date: 2008-03-18 05:22 | |
Hello. I found another problem related to issue2301. The SyntaxError cursor "^" is shifted when multibyte characters appear in the line before the "^". I think this is because err->text is stored as UTF-8, which requires 3 bytes for each of these characters, whereas cp932 (my console encoding) requires only 2 bytes. So the "^" is shifted 5 bytes to the right because there are 5 multibyte characters.

C:\Documents and Settings\WhiteRabbit>py3k x.py
push any key....
File "x.py", line 3
print "あいうえお"
^
SyntaxError: invalid syntax
[22567 refs]

Sorry, I didn't know what PyTokenizer_RestoreEncoding was really doing. That function adjusted err_ret->offset for this encoding conversion, so Python 2.5 can output the cursor in the right place. (Of course, if the source encoding is not compatible with the console encoding, a broken string is printed. Anyway, the cursor is right.)

C:\Documents and Settings\WhiteRabbit>py a.py
File "a.py", line 2
x "、「、、、ヲ、ィ、ェ"
^
SyntaxError: invalid syntax
[8728 refs]

I tried to fix this problem, but I'm not sure how to fix it.
|
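The byte-count mismatch described in this report can be reproduced with a minimal Python sketch. It assumes the line from the report and Python's built-in `cp932` codec; the variable names are made up for illustration and do not come from any patch.

```python
# Line from the report: 8 ASCII characters plus 5 kana.
line = 'print "あいうえお"'

utf8_len = len(line.encode("utf-8"))    # each kana takes 3 bytes in UTF-8
cp932_len = len(line.encode("cp932"))   # each kana takes 2 bytes in cp932

# One extra byte per kana, which matches the 5-position shift in the report.
shift = utf8_len - cp932_len
print(utf8_len, cp932_len, shift)
```

An offset counted in UTF-8 bytes therefore overshoots by one position per kana on a cp932 console that shows each byte pair as one double-width glyph.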
|||
| msg63904 | Author: Hirokazu Yamamoto (ocean-city) (Python committer) | Date: 2008-03-18 07:15 | |
> I tried to fix this problem, but I'm not sure how to fix this.

Quick observation...

///////////////////////////////////
// Possible Solution

1. Convert err->text to a console-compatible encoding (not to the source encoding as in Python 2.x) where PyTokenizer_RestoreEncoding is called.
2. err->text is UTF-8 and the actual output is done in Python/pythonrun.c (print_error_text), so adjust the offset there.

///////////////////////////////////
// Solution requires...

1. - PyUnicode_DecodeUTF8 in Python/pythonrun.c (err_input) should be changed to some kind of "bytes" API.
   - A way to write "bytes" to a File object directly is needed.
2. - A way to know the actual byte length of a given unicode string + encoding.

////////////////////////////////////////////////////
// Experimental patch

Attached as an experimental patch for solution 2. It looks ugly, but seems to work in my environment. (I assumed get_length_in_bytes(f, " ", 1) == 1 but I'm not sure this is always true on other platforms. A nicer and more general solution probably exists.)
|
|||
| msg64156 | Author: Hirokazu Yamamoto (ocean-city) (Python committer) | Date: 2008-03-20 05:47 | |
> (I assumed get_length_in_bytes(f, " ", 1) == 1 but I'm not sure
> this is always true in other platforms. Probably nicer and more
> general solution may exist)

This assumption still stands, but I cannot find a better solution. I now think the attached patch is good enough.
|
|||
| msg73539 | Author: Hirokazu Yamamoto (ocean-city) (Python committer) | Date: 2008-09-21 23:04 | |
Patch revised. |
|||
| msg74106 | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) | Date: 2008-09-30 23:52 | |
I think that your patch works only for terminals where one byte of the encoded text is displayed as one character on the terminal. This is not true for utf-8 terminals, for example. In the attached patch, I tried to write some unit tests, (I had to adapt the traceback module as well), and one test still fails because the captured stderr has a utf-8 encoding. I think that it's better to count unicode characters. |
|||
| msg74114 | Author: Hirokazu Yamamoto (ocean-city) (Python committer) | Date: 2008-10-01 04:09 | |
You are right, this issue is more difficult than I thought... I found wcswidth(3). If this function is available we can use it, but unfortunately there is no such function in VC6 and it is meaningless on cygwin, so I cannot test it. ;-(

Maybe we can use

import unicodedata
unicodedata.east_asian_width()

but I need to investigate more.
|
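As a quick look at what `unicodedata.east_asian_width()` reports, the sketch below queries a few sample characters. The choice of samples is an assumption made for illustration; the function itself is part of the standard library.

```python
import unicodedata

# Sample characters: ASCII, hiragana, fullwidth Latin, halfwidth katakana.
samples = ["a", "\u3042", "\uff21", "\uff71"]

# Map each character to its East Asian Width property value.
widths = {ch: unicodedata.east_asian_width(ch) for ch in samples}

for ch, prop in widths.items():
    print(repr(ch), prop)
```

Characters classified as "W" (wide) or "F" (fullwidth) usually occupy two terminal columns, while "Na" (narrow) and "H" (halfwidth) occupy one. The property says nothing definite about "A" (ambiguous) characters, whose width depends on the terminal.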
|||
| msg74119 | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) | Date: 2008-10-01 08:13 | |
For the moment, I'd suggest that one unicode character has the same
width as the space character, assuming that stdout.encoding correctly
matches the terminal.
Then the C implementation could do something similar to the statements I
added in traceback.py:
offset = len(line.encode('utf-8')[:offset].decode('utf-8'))
|
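The one-line conversion above can be wrapped in a small helper to see what it does on Western text. This is only a sketch: the name `char_offset` and the sample line are assumptions for illustration, not code from the traceback patch.

```python
def char_offset(line, byte_offset):
    """Convert an offset counted in UTF-8 bytes into one counted in characters."""
    # Same expression as in the message: encode, cut at the byte offset, decode, count.
    return len(line.encode("utf-8")[:byte_offset].decode("utf-8"))


line = "ä äää"                     # 5 characters, 9 bytes in UTF-8
print(char_offset(line, 9))
```

Note that a byte offset falling in the middle of a multibyte character makes the strict decode raise `UnicodeDecodeError`, so the helper assumes offsets that land on character boundaries.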
|||
| msg74129 | Author: Hirokazu Yamamoto (ocean-city) (Python committer) | Date: 2008-10-01 14:05 | |
Amaury, if doing so, the cursor will shift left by 5 columns on my
environment like this, no? ("あ" requires 2 columns for example)
print "あいうえお"
^
|
|||
| msg74148 | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) | Date: 2008-10-01 22:31 | |
This seems to be a difficult problem. Doesn't the exact width depend on the terminal capabilities? And fonts, and combining diacritics...

An easy way to put the caret at exactly the same position is to repeat the beginning of the line up to the offending offset:

print "あいうえお"
print "あいうえお^<------------------

But I don't know how to make it look less ugly. At least my "one unicode char is one space" suggestion corrects the case of Western languages, and all messages with single-width characters.
|
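The "repeat the beginning of the line" idea can be sketched in a couple of lines of Python. The helper name `caret_by_repeating_prefix` is hypothetical, and the offset is assumed to be counted in characters.

```python
def caret_by_repeating_prefix(line, char_offset):
    """Build a marker line by repeating the text before the offset, then a caret."""
    # The terminal renders the repeated prefix exactly as wide as the original,
    # so no width calculation is needed.
    return line[:char_offset] + "^"


line = 'print "あいうえお"'
print(line)
print(caret_by_repeating_prefix(line, len(line)))
```

The trade-off is the one already noted in the message: the caret lands in the right column on any terminal, but the output duplicates the source text.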
|||
| msg74149 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2008-10-02 01:21 | |
See also a related issue: issue3975. |
|||
| msg74361 | Author: Hirokazu Yamamoto (ocean-city) (Python committer) | Date: 2008-10-06 06:49 | |
>At least my "one unicode char is one space" suggestion corrects the case
>of Western languages, and all messages with single-width characters.

I'm not happy with this solution. ;-(

>Doesn't the exact width depend on
>the terminal capabilities? and fonts, and combining diacritics...

I have to admit you are right. Nevertheless, I got coLinux (Debian), which has a localized wcswidth(3), so I created another experimental patch (py3k_adjust_cursor_at_syntax_error_v2.patch).

The strategy is...

1. Try to convert to unicode. If that fails, the offset is left unchanged.
2. If the system has wcswidth(3), try that function.
3. If the system is Windows, try WideCharToMultiByte with CP_ACP.
4. If steps 2/3 fail or the system is something else, use the unicode length as the offset (Amaury's suggestion).

This patch ignores the file encoding. Again, this patch is experimental and best effort, but maybe better than the current state.

P.S. I tested this patch on coLinux with the ja_JP.UTF-8 locale and a manual

#define HAVE_WCSWIDTH 1

because I don't know how to change the configure script.
|
|||
| msg74362 | Author: Hirokazu Yamamoto (ocean-city) (Python committer) | Date: 2008-10-06 07:20 | |
The experimental patch was just that, experimental: wcswidth(3) returns 1 for East Asian Ambiguous characters.

debian:~/python-dev/py3k# ./python /mnt/windows/a.py
File "/mnt/windows/a.py", line 3
"♪xÅx" abc
^

should point 'c'. And another one

debian:~/python-dev/py3k# export LANG=C
debian:~/python-dev/py3k# ./python /mnt/windows/a.py
File "/mnt/windows/a.py", line 3
"\u266ax\u212bx" abc
^
SyntaxError: invalid syntax

Please forget my patch. :-(
|
|||
| msg83636 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2009-03-15 14:56 | |
This issue is a problem of units. The error text is a UTF-8 *byte* string and the offset is a number of *bytes*. The goal is to get the text *width* of a *character* string. We have to:

1- convert the offset from a number of bytes to a number of characters
2- get the error message as (unicode) characters
3- get the width of text[:offset]

It's already possible to get (2) from the utf8 string, and code from ocean-city's patch (py3k_adjust_cursor_at_syntax_error_v2.patch) can be used for (3). The most difficult point is (1). I will try to implement that.
|
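The three steps can be strung together in a short Python sketch. This is not the C patch discussed in the thread: `caret_column` is a hypothetical name, the width rule (two columns for East Asian "W" and "F" characters, one for everything else) is a simplifying assumption, and a partial trailing character is silently dropped.

```python
import unicodedata


def caret_column(line, byte_offset):
    """Estimate the terminal column for an offset given in UTF-8 bytes."""
    # Steps 1 and 2: cut the UTF-8 bytes at the offset and decode back to text,
    # ignoring a character that was cut in half.
    prefix = line.encode("utf-8")[:byte_offset].decode("utf-8", errors="ignore")
    # Step 3: sum an assumed per-character width (wide/fullwidth = 2, else 1).
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in prefix
    )


line = 'print "あいうえお"'
print(caret_column(line, 23))      # 23 is the UTF-8 length of the whole line
```

Ambiguous-width, combining, and zero-width characters are not handled here, so the result is only an estimate for such text.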
|||
| msg83659 | Author: David W. Lambert (LambertDW) | Date: 2009-03-16 05:29 | |
Resolution of this may be applicable to Issue3446 as well. "center, ljust and rjust are inconsistent with unicode parameters" |
|||
| msg83700 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2009-03-17 21:40 | |
Proof of concept of patch fixing this issue:
- parse_syntax_error() reads the text line into a PyUnicodeObject*
instead of a "const char**"
- create utf8_to_unicode_offset(): convert byte offset to a number of
characters. The Python version should be something like:
    def utf8_to_unicode_offset(text, byte_offset):
        # Encode, cut at the byte offset, then count the decoded characters.
        utf8 = text.encode("utf-8")
        utf8 = utf8[:byte_offset]
        text = str(utf8, "utf-8")
        return len(text)
- reuse adjust_offset() from
py3k_adjust_cursor_at_syntax_error_v2.patch, but force the use of
wcswidth() because HAVE_WCSWIDTH is not defined by configure
- print_error_text() works on unicode characters and not on bytes!
The patch should be refactored:
- move adjust_offset(), utf8_to_unicode_offset(), utf8_len() in
unicodeobject.c. You might create a new method "width()" for the
unicode type. This method can be used to fix center(), ljust() and
rjust() unicode methods (see issue #3446).
|
|||
| msg83712 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2009-03-17 23:23 | |
For easier review, I split my patch into multiple small patches:

- unicode_utf8size.patch: create _PyUnicode_UTF8Size() function: number of bytes needed to encode the unicode character as UTF-8
- unicode_width.patch: create PyUnicode_Width(): number of columns needed to represent the string in the current locale. -1 is returned in case of an error.
- adjust_offset.patch: change the unit of SyntaxError.offset, convert utf8 offset to unicode offset
- print_exception.patch: process the error text as a unicode string (instead of a byte string), convert offset from characters to "columns"

Dependencies:

- adjust_offset.patch depends on unicode_utf8size.patch
- print_exception.patch depends on unicode_width.patch

Changes since issue2382.patch:

- PyUnicode_Width() doesn't change the locale
- PyUnicode_Width() uses WideCharToMultiByte() on MS_WINDOWS, and wcswidth() otherwise (before: do nothing if HAVE_WCSWIDTH is not defined)
- the offset was converted from utf8 index to unicode index only in print_error_text(), not on SyntaxError creation
- _PyUnicode_UTF8Size() and PyUnicode_Width() are public
|
|||
| msg83714 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2009-03-17 23:40 | |
Comments about my own patches.
unicode_width.patch:
* error messages should be improved:
ValueError("Unable to compute string width") for Windows
IOError(strerror(errno)) otherwise
adjust_offset.patch:
* format_exception_only() from Lib/traceback.py may need a fix
* about the documentation: it looks like the unit of SyntaxError.offset is
not documented in exceptions.rst (should it be documented, or
left unchanged?)
print_exception.patch:
* I'm not sure of the reference counts (ref leak?)
* in case of PyUnicode_FromUnicode(text, textlen) error,
>>PyFile_WriteObject(textobj, f, Py_PRINT_RAW);
PyFile_WriteString("\n", f);<< is used to display the line but textobj
may already end with \n.
* format_exception_only() from Lib/traceback.py should do the same job
as the fixed print_exception(): get the string width (to fix this
issue!)
|
|||
| msg140377 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2011-07-14 22:44 | |
I just created the issue #12568 for unicode_width.patch. |
|||
| msg149960 | Author: Petri Lehtinen (petri.lehtinen) (Python committer) | Date: 2011-12-21 07:04 | |
What's the status of this issue? FWIW, this is not only a problem with East Asian characters:

>>> ä äää
File "<stdin>", line 1
ä äää
^
SyntaxError: invalid syntax
|
|||
| msg172500 | Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) | Date: 2012-10-09 18:40 | |
Here is a patch upgraded to Python 3.3. It uses a slightly different approach and works with invalidly encoded data. unicode_utf8size.patch is not needed. This patch fixes half of the issue: working with non-ASCII, non-wide characters. That's enough for many people. Let's commit it and go further. |
|||
| msg172525 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2012-10-09 21:02 | |
> This patch fixes a half of the issue - working with non-ascii
> non-wide characters.

The purpose of this issue is to handle CJK characters taking 2 columns instead of 1 in a terminal, or did I misunderstand it?
|
|||
| msg190933 | Author: Alexander Belopolsky (belopolsky) (Python committer) | Date: 2013-06-10 20:45 | |
haypo> The purpose of this issue is to handle CJK characters taking 2
haypo> columns instead of 1 in a terminal, or did I misunderstand it?

That's the other half of the problem, but the more common issue is a misplaced caret when non-ASCII characters are present:

>>> ¡TM£¢∞§¶•ao
File "<stdin>", line 1
¡TM£¢∞§¶•ao
^
SyntaxError: invalid character in identifier

With Serhiy's patch:

>>> ¡TM£¢∞§¶•ao
File "<stdin>", line 1
¡TM£¢∞§¶•ao
^
SyntaxError: invalid character in identifier
|
|||
| msg190934 | Author: Alexander Belopolsky (belopolsky) (Python committer) | Date: 2013-06-10 20:54 | |
Serhiy's patch is lacking tests, but it passes the test I proposed at #10382, which I am attaching here. |
|||
| msg198419 | Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) | Date: 2013-09-25 20:59 | |
Added tests. I think it is worth applying this patch, which fixes the issue for most Europeans, and then continuing to work on the issue of wide characters. |
|||
| msg208115 | Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) | Date: 2014-01-14 21:20 | |
If no one complains, I'll commit the last patch tomorrow. |
|||
| msg208697 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2014-01-21 20:30 | |
New changeset eb7565c212f1 by Serhiy Storchaka in branch '3.3':
Issue #2382: SyntaxError cursor "^" now is written at correct position in most
http://hg.python.org/cpython/rev/eb7565c212f1

New changeset ea34b2b0b8ae by Serhiy Storchaka in branch 'default':
Issue #2382: SyntaxError cursor "^" now is written at correct position in most
http://hg.python.org/cpython/rev/ea34b2b0b8ae
|
|||
| msg228027 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2014-09-30 23:13 | |
Issue #10384 has been marked as a duplicate of this issue: it's a similar problem, an identifier which contains an invisible character. |
|||
| msg228034 | Author: Alexander Belopolsky (belopolsky) (Python committer) | Date: 2014-09-30 23:24 | |
The original problem is still present:

Python 3.5.0a0 (default:5313b4c0bb6c, Sep 30 2014, 18:55:45)
>>> A_I_U_E_O$ = None
File "<stdin>", line 1
A_I_U_E_O$ = None
^
SyntaxError: invalid syntax

Replace A_I_U_E_O above with the Japanese script. I get a codec error from the server when I try to paste my session as is. (Note that the invalid character is $ above and not the Japanese AIUEO.)

Another outstanding issue is with zero-width characters. See #10384.
|
|||
| msg323734 | Author: Terry J. Reedy (terry.reedy) (Python committer) | Date: 2018-08-18 21:23 | |
IDLE avoids the problem of calculating a location for a '^' below the bad line by instead asking tk to give the marked character (and maybe more) an 'ERROR' tag, which shows as a red background. So it marks the '$' of 'A_I_U_E_O$' and the 'alid' slice of 'inv\u200balid' (from duplicate #10384). When the marked character is '\n', the space following the line is tagged. Is it possible to do something similar with any of the major system consoles? |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:32 | admin | set | github: 46635 |
| 2018-08-18 21:23:35 | terry.reedy | set | nosy: + terry.reedy; messages: + msg323734 |
| 2014-09-30 23:24:28 | belopolsky | set | messages: + msg228034 |
| 2014-09-30 23:13:11 | vstinner | set | messages: + msg228027 |
| 2014-09-30 23:12:39 | vstinner | link | issue10384 superseder |
| 2014-01-21 20:31:40 | serhiy.storchaka | set | assignee: serhiy.storchaka ->; stage: patch review -> needs patch |
| 2014-01-21 20:30:33 | python-dev | set | nosy: + python-dev; messages: + msg208697 |
| 2014-01-14 21:20:45 | serhiy.storchaka | set | versions: + Python 3.4, - Python 3.2; messages: + msg208115; assignee: serhiy.storchaka; type: behavior; stage: patch review |
| 2013-09-25 21:02:28 | serhiy.storchaka | set | files: - adjust_offset-3.3.patch |
| 2013-09-25 20:59:44 | serhiy.storchaka | set | files: + adjust_offset_2.patch; messages: + msg198419 |
| 2013-06-10 20:54:17 | belopolsky | set | files: + test.py; messages: + msg190934 |
| 2013-06-10 20:45:38 | belopolsky | set | nosy: + belopolsky; messages: + msg190933 |
| 2013-06-10 20:37:57 | belopolsky | link | issue10382 superseder |
| 2012-10-10 04:05:00 | ezio.melotti | link | issue16173 superseder |
| 2012-10-09 21:02:11 | vstinner | set | messages: + msg172525 |
| 2012-10-09 18:40:10 | serhiy.storchaka | set | files: + adjust_offset-3.3.patch; messages: + msg172500 |
| 2012-10-09 17:13:10 | serhiy.storchaka | set | nosy: + serhiy.storchaka |
| 2012-10-09 15:31:31 | Arfrever | set | nosy: + Arfrever |
| 2011-12-21 07:04:45 | petri.lehtinen | set | nosy: + petri.lehtinen; messages: + msg149960; versions: + Python 3.2, Python 3.3, - Python 3.0 |
| 2011-07-14 22:44:30 | vstinner | set | messages: + msg140377 |
| 2010-07-09 17:14:05 | ezio.melotti | set | nosy: + ezio.melotti |
| 2009-03-17 23:40:59 | vstinner | set | messages: + msg83714 |
| 2009-03-17 23:24:01 | vstinner | set | files: + print_exception.patch |
| 2009-03-17 23:23:51 | vstinner | set | files: + adjust_offset.patch |
| 2009-03-17 23:23:45 | vstinner | set | files: + unicode_width.patch |
| 2009-03-17 23:23:35 | vstinner | set | files: + unicode_utf8size.patch; messages: + msg83712 |
| 2009-03-17 21:40:27 | vstinner | set | files: + issue2382.patch; messages: + msg83700 |
| 2009-03-16 05:29:34 | LambertDW | set | nosy: + LambertDW; messages: + msg83659 |
| 2009-03-15 14:56:07 | vstinner | set | messages: + msg83636 |
| 2008-10-06 07:20:34 | ocean-city | set | messages: + msg74362 |
| 2008-10-06 06:49:49 | ocean-city | set | files: + py3k_adjust_cursor_at_syntax_error_v2.patch; messages: + msg74361 |
| 2008-10-02 01:21:16 | vstinner | set | nosy: + vstinner; messages: + msg74149 |
| 2008-10-01 22:31:39 | amaury.forgeotdarc | set | messages: + msg74148 |
| 2008-10-01 14:05:41 | ocean-city | set | messages: + msg74129 |
| 2008-10-01 08:14:00 | amaury.forgeotdarc | set | messages: + msg74119 |
| 2008-10-01 04:09:01 | ocean-city | set | messages: + msg74114 |
| 2008-09-30 23:52:07 | amaury.forgeotdarc | set | files: + traceback_adjust_cursor.patch; nosy: + amaury.forgeotdarc; messages: + msg74106 |
| 2008-09-21 23:04:33 | ocean-city | set | files: + py3k_adjust_cursor_at_syntax_error.patch; messages: + msg73539; components: + Interpreter Core, - None |
| 2008-09-21 23:03:45 | ocean-city | set | files: - fix.patch |
| 2008-03-20 05:48:01 | ocean-city | set | files: - experimental.patch |
| 2008-03-20 05:47:48 | ocean-city | set | files: + fix.patch; messages: + msg64156 |
| 2008-03-18 07:15:57 | ocean-city | set | files: + experimental.patch; keywords: + patch; messages: + msg63904 |
| 2008-03-18 05:22:30 | ocean-city | create | |