Created on 2010-11-10 19:34 by belopolsky, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| issue10382.diff | belopolsky, 2010-11-11 00:04 | |
| issue10382a.diff | belopolsky, 2010-11-11 23:06 | |
| Messages (5) | |||
|---|---|---|---|
| msg120930 | Author: Alexander Belopolsky (belopolsky) (Python committer) | Date: 2010-11-10 19:34 | |
>>> ¡™£¢∞§¶•ªº
  File "<stdin>", line 1
    ¡™£¢∞§¶•ªº
                          ^
SyntaxError: invalid character in identifier
It looks like strlen() is used instead of the number of characters in the decoded string. |
|||
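The diagnosis above can be illustrated with a minimal sketch. It is not taken from either attached patch; the sample string is an assumption built from some of the non-ASCII characters in the report. It compares the UTF-8 byte length (what a C `strlen()` would return for the encoded buffer) with the number of characters in the decoded string.

```python
# Illustrative only: sample characters chosen from the report, not the full line.
sample = "¡£¢∞§¶•"

encoded = sample.encode("utf-8")

# What strlen() would see: one count per byte of the UTF-8 buffer.
byte_len = len(encoded)

# What the caret position should be based on: one count per code point.
char_len = len(sample)

print(f"bytes: {byte_len}, characters: {char_len}")
```

Because every character in the sample takes two or three bytes in UTF-8, an offset measured in bytes lands well to the right of the intended column.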
| msg120933 | Author: Alexander Belopolsky (belopolsky) (Python committer) | Date: 2010-11-11 00:04 | |
I am attaching a patch that seems to fix the issue. Note that I considered fixing the problem in parsetok.c, where the offset is originally computed, but that file is part of pgen, which has to be compiled without Unicode support.
A test case suitable for inclusion in the unit tests is:
try:
    eval(b'\xc2\xa1'.decode('utf-8'))
except SyntaxError as err:
    assert err.offset == 1
|
|||
| msg120941 | Author: STINNER Victor (vstinner) (Python committer) | Date: 2010-11-11 08:53 | |
See also #2382: I wrote patches two years ago for this issue. |
|||
| msg120982 | Author: Alexander Belopolsky (belopolsky) (Python committer) | Date: 2010-11-11 23:05 | |
haypo> See also #2382: I wrote patches two years ago for this issue.
Yes, this is the same issue. I don't want to close this one as a duplicate because #2382 contains a much more ambitious set of patches. What I am trying to achieve here is similar to adjust_offset.patch there.
I am attaching a patch that takes an alternative approach and computes the number of characters in the parser. I strongly believe that the buffer in the tokenizer always contains UTF-8 encoded text. If it does not already, I would consider making it so by replacing the call to _PyUnicode_AsDefaultEncodedString() with a call to PyUnicode_AsUTF8String(), if that matters.
The patch still needs unit tests and possibly has some off-by-one issues, but first I would like to reach agreement that this is the right level at which to fix the problem. |
|||
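The idea of counting characters directly in a UTF-8 buffer can be sketched as follows. This is a hedged illustration in Python, not the code in issue10382a.diff: the function name `byte_offset_to_char_offset` is hypothetical, and it assumes the buffer really is valid UTF-8, as the message above argues. It counts every byte that is not a UTF-8 continuation byte (bit pattern `10xxxxxx`), which is the kind of loop that can be written in C without any Unicode support.

```python
def byte_offset_to_char_offset(buf: bytes, byte_offset: int) -> int:
    """Return how many characters start within the first byte_offset bytes.

    Assumes buf holds valid UTF-8. Continuation bytes (0b10xxxxxx) never
    start a character, so they are skipped when counting.
    """
    prefix = buf[:byte_offset]
    return sum(1 for b in prefix if (b & 0xC0) != 0x80)


# Usage: "¡" occupies two bytes (0xC2 0xA1), so byte offset 2 is character offset 1.
line = "¡x".encode("utf-8")
print(byte_offset_to_char_offset(line, 2))
```

Counting lead bytes avoids decoding the whole line, at the cost of trusting that the buffer is well-formed UTF-8.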
| msg190931 | Author: Alexander Belopolsky (belopolsky) (Python committer) | Date: 2013-06-10 20:37 | |
The latest patch at #2382 is simpler than mine, so I am closing this as a duplicate. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:08 | admin | set | github: 54591 |
| 2013-06-10 20:37:57 | belopolsky | set | status: open -> closed; superseder: [Py3k] SyntaxError cursor shifted if multibyte character is in line.; resolution: duplicate; messages: + msg190931 |
| 2010-11-11 23:06:14 | belopolsky | set | files: + issue10382a.diff |
| 2010-11-11 23:05:52 | belopolsky | set | messages: + msg120982 |
| 2010-11-11 08:53:41 | vstinner | set | messages: + msg120941 |
| 2010-11-11 01:37:09 | belopolsky | link | issue10384 dependencies |
| 2010-11-11 00:17:27 | belopolsky | set | nosy: + loewis |
| 2010-11-11 00:04:06 | belopolsky | set | files: + issue10382.diff; messages: + msg120933; assignee: belopolsky; keywords: + patch; stage: needs patch -> patch review |
| 2010-11-10 20:57:44 | belopolsky | set | nosy: + lemburg, vstinner, ezio.melotti |
| 2010-11-10 19:34:23 | belopolsky | create | |