Issue 10382: Command line error marker misplaced on unicode entry

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54591

classification

Title:	Command line error marker misplaced on unicode entry
Type:	behavior	Stage:	patch review
Components:	Interpreter Core	Versions:	Python 3.2

process

Status:	closed	Resolution:	duplicate
Dependencies:	Superseder:	[Py3k] SyntaxError cursor shifted if multibyte character is in line. View: 2382
Assigned To:	belopolsky	Nosy List:	belopolsky, ezio.melotti, lemburg, loewis, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-11-10 19:34 by belopolsky, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
issue10382.diff	belopolsky, 2010-11-11 00:04
issue10382a.diff	belopolsky, 2010-11-11 23:06

Messages (5)
msg120930	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2010-11-10 19:34
>>> ¡TM£¢∞§¶•ao
  File "<stdin>", line 1
    ¡TM£¢∞§¶•ao
                        ^
SyntaxError: invalid character in identifier

It looks like strlen() is used instead of number of characters in the decoded string.
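As a rough illustration of the observation above (not part of the original message), the sketch below compares the UTF-8 byte length of a short string, which is what C's strlen() would report for the encoded buffer, with the number of characters in the decoded string. The three-character sample is chosen only for the example.

```python
# Illustration only: UTF-8 byte length (what strlen() sees) versus the
# number of code points in the decoded string.
line = "\u00a1\u00a3\u00a2"  # three non-ASCII characters, each 2 bytes in UTF-8

encoded = line.encode("utf-8")
byte_length = len(encoded)   # length of the encoded buffer in bytes
char_length = len(line)      # number of characters in the decoded string

print(byte_length, char_length)  # 6 3
```

A marker positioned by byte count therefore lands further right than one positioned by character count whenever the line contains multibyte characters.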
msg120933	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2010-11-11 00:04
I am attaching a patch that seems to fix the issue. Note that I considered fixing the problem in parsetok.c where offset is originally computed, but this is part of pgen which has to be compiled without unicode support.

The test case suitable to be included in unittests is:

    try:
        eval(b'\xc2\xa1'.decode('utf-8'))
    except SyntaxError as err:
        assert(err.offset == 1)
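The test case above passes silently if no SyntaxError is raised at all. One possible way to make the check explicit is sketched below; the helper name `syntax_error_offset` is hypothetical and does not come from either attached patch.

```python
def syntax_error_offset(source):
    """Return the offset of the SyntaxError raised by eval(source).

    Returns None when the source evaluates without a SyntaxError, so a
    caller can tell "no error" apart from "error at some offset".
    """
    try:
        eval(source)
    except SyntaxError as err:
        return err.offset
    return None


# U+00A1 is not a valid identifier character, so eval() should reject it.
offset = syntax_error_offset(b"\xc2\xa1".decode("utf-8"))
# The message above expects this to be 1 once the offset counts characters.
print("reported offset:", offset)
```

In a unittest this could be wrapped as `self.assertEqual(syntax_error_offset(...), 1)`, which fails both when the offset is wrong and when no error is raised.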
msg120941	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2010-11-11 08:53
See also #2382: I wrote patches two years ago for this issue.
msg120982	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2010-11-11 23:05
haypo> See also #2382: I wrote patches two years ago for this issue.

Yes, this is the same issue. I don't want to close this as a duplicate because #2382 contains a much more ambitious set of patches. What I am trying to achieve here is similar to the adjust_offset.patch there.

I am attaching a patch that takes an alternative approach and computes the number of characters in the parser. I strongly believe that the buffer in the tokenizer always contains UTF-8 encoded text. If it is not so already, I would consider making it so by replacing a call to _PyUnicode_AsDefaultEncodedString() with a call to PyUnicode_AsUTF8String(). (if that matters)

The patch still needs unittests and possibly has some off-by-one issues, but I would like to get to an agreement that this is the right level at which the problem should be fixed first.
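To make the idea of counting characters in a UTF-8 buffer concrete, here is a minimal sketch in Python. It is not the code from issue10382a.diff, and the function name `utf8_char_offset` is hypothetical. It assumes, as the message does, that the buffer holds valid UTF-8, and relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def utf8_char_offset(buf: bytes, byte_offset: int) -> int:
    """Count the characters that start within buf[:byte_offset].

    Assumes buf is valid UTF-8. Every byte that is not a continuation
    byte (10xxxxxx) begins a new character, so counting those bytes
    converts a byte offset into a character offset.
    """
    return sum(1 for b in buf[:byte_offset] if (b & 0xC0) != 0x80)


# U+221E takes 3 bytes and U+00A7 takes 2 bytes in UTF-8.
buf = "\u221e\u00a7".encode("utf-8")
print(utf8_char_offset(buf, len(buf)))  # 2
```

Whether the resulting value should be zero-based or one-based is exactly the kind of off-by-one question the message mentions, and this sketch does not settle it.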
msg190931	Author: Alexander Belopolsky (belopolsky) * (Python committer)	Date: 2013-06-10 20:37
The latest patch at #2382 is simpler than mine, so I am closing this as duplicate.

History
Date	User	Action	Args
2022-04-11 14:57:08	admin	set	github: 54591
2013-06-10 20:37:57	belopolsky	set	status: open -> closed superseder: [Py3k] SyntaxError cursor shifted if multibyte character is in line. resolution: duplicate messages: + msg190931
2010-11-11 23:06:14	belopolsky	set	files: + issue10382a.diff
2010-11-11 23:05:52	belopolsky	set	messages: + msg120982
2010-11-11 08:53:41	vstinner	set	messages: + msg120941
2010-11-11 01:37:09	belopolsky	link	issue10384 dependencies
2010-11-11 00:17:27	belopolsky	set	nosy: + loewis
2010-11-11 00:04:06	belopolsky	set	files: + issue10382.diff messages: + msg120933 assignee: belopolsky keywords: + patch stage: needs patch -> patch review
2010-11-10 20:57:44	belopolsky	set	nosy: + lemburg, vstinner, ezio.melotti
2010-11-10 19:34:23	belopolsky	create