Issue 38755: Long unicode string causes SyntaxError: Non-UTF-8 code starting with '\xe2' in file ..., but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/82936

classification

Title:	Long unicode string causes SyntaxError: Non-UTF-8 code starting with '\xe2' in file ..., but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Type:	behavior	Stage:	needs patch
Components:	Interpreter Core	Versions:	Python 3.9, Python 3.8

process

Dependencies:	Superseder:
Status:	open	Resolution:
Assigned To:	Nosy List:	Andrew Ushakov, eryksun, serhiy.storchaka, terry.reedy, vstinner
Priority:	normal	Keywords:

Created on 2019-11-09 12:26 by Andrew Ushakov, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description
tst112.py	Andrew Ushakov, 2019-11-09 12:26

Messages (7)
msg356298	Author: Andrew Ushakov (Andrew Ushakov)	Date: 2019-11-09 12:26
Not very long unicode comment #, space and then 170 or more repetitions of the utf8 symbol ░ (b'\xe2\x96\x91'.decode()) # ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ causes syntax error: SyntaxError: Non-UTF-8 code starting with '\xe2' in file tst112.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details Python file is attached. Second example is similar, but here unicode string with similar length is used as an argument of a print function. print('\n░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░') Similar Issue34979 was submitted one year ago...
msg356709	Author: Terry J. Reedy (terry.reedy) * (Python committer)	Date: 2019-11-15 19:45
I think that this should be closed as a duplicate of #34979 and this example posted there, with the OS and Python version included. On Windows, with 3.7, 3.8.0, and master, neither the posted comment, the one in the file, nor the initial statement in #34979 gives the SyntaxError.
msg356715	Author: Andrew Ushakov (Andrew Ushakov)	Date: 2019-11-15 20:16
> On Windows, with 3.7, 3.8.0, and master, neither the posted comment, the one in the file, nor the initial statement in #34979 gives the SyntaxError.

Just tried again on my corporate laptop with the downloaded file from this site:

Microsoft Windows [Version 10.0.16299.1451]
(c) 2017 Microsoft Corporation. All rights reserved.

D:\Downloads>py
Python 3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:37:50) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()

D:\Downloads>py tst112.py
  File "tst112.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xe2' in file tst112.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

d:\Downloads>py -3.7
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()

d:\Downloads>py -3.7 tst112.py
  File "tst112.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xe2' in file tst112.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
msg390931	Author: Andrew Ushakov (Andrew Ushakov)	Date: 2021-04-13 07:09
Just tested again:

D:\Downloads>py
Python 3.9.4 (tags/v3.9.4:1f2e308, Apr 4 2021, 13:27:16) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()

D:\Downloads>py tst112.py
SyntaxError: Non-UTF-8 code starting with '\xe2' in file D:\Downloads\tst112.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

P.S. No problems with Python 3.8.5 and Ubuntu 20.04.2 LTS.
msg390942	Author: Eryk Sun (eryksun) * (Python triager)	Date: 2021-04-13 09:37
> P.S. No problems with Python 3.8.5 and Ubuntu 20.04.2 LTS.

The issue is that the line length is limited to BUFSIZ, which ends up splitting the UTF-8 sequence b'\xe2\x96\x91'. BUFSIZ is only 512 bytes in Windows. It's 8192 bytes in Linux, in which case you need a line that's 16 times longer in order to reproduce the error. For example:

$ stat -c "%s" test.py
8194
$ python3.9 test.py
SyntaxError: Non-UTF-8 code starting with '\xe2' in file /home/someone/test.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

This has been fixed in a rewrite of the tokenizer (bpo-25643), for which the PR was recently merged into the main branch for 3.10a7+. Maybe a minimal backport to keep reading up to "\n" can be applied to 3.8 and 3.9.
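
The splitting described above can be illustrated with a small simulation. This is only a sketch, not the tokenizer's actual code: it assumes the old tokenizer reads each line in chunks of at most BUFSIZ - 1 bytes (in the manner of C's fgets) and checks each chunk for valid UTF-8 on its own. The helper names make_line and first_split_chunk are hypothetical and exist only for this illustration.

```python
def make_line(repeats: int) -> bytes:
    """Build a comment line like the one in tst112.py: '# ' plus N copies of U+2591."""
    return ("# " + "\u2591" * repeats + "\n").encode("utf-8")


def first_split_chunk(line: bytes, bufsiz: int):
    """Return the index of the first chunk that is not valid UTF-8 by itself.

    Assumption: the reader takes at most bufsiz - 1 bytes per chunk, as an
    fgets-style call would (one byte is reserved for the terminating NUL).
    Returns None when every chunk decodes cleanly.
    """
    chunk_len = bufsiz - 1
    for start in range(0, len(line), chunk_len):
        chunk = line[start:start + chunk_len]
        try:
            chunk.decode("utf-8")
        except UnicodeDecodeError:
            # A multi-byte sequence was cut at the chunk boundary.
            return start // chunk_len
    return None


if __name__ == "__main__":
    for repeats, bufsiz in [(169, 512), (170, 512), (170, 8192), (2730, 8192)]:
        line = make_line(repeats)
        print(repeats, bufsiz, len(line), first_split_chunk(line, bufsiz))
```

Under this model, 169 repetitions fit in one 511-byte chunk, while 170 repetitions push the third byte of the last character past the boundary, which is consistent with the "170 or more" threshold in the original report. With a buffer of 8192 bytes the same line passes, and a split only appears once the line is roughly 16 times longer.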
msg391018	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2021-04-13 23:52
The bpo-14811 issue was fixed in Python 3.10 by bpo-25643, but is not fixed in Python 3.8 and 3.9.
msg391019	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2021-04-13 23:54
In 2012, I wrote detect_truncate.patch in bpo-14811. Does someone want to convert it to a PR for Python 3.9?

History
Date	User	Action	Args
2022-04-11 14:59:23	admin	set	github: 82936
2021-04-13 23:54:10	vstinner	set	messages: + msg391019
2021-04-13 23:52:57	vstinner	set	nosy: + vstinner messages: + msg391018
2021-04-13 09:37:59	eryksun	set	stage: test needed -> needs patch versions: - Python 3.7
2021-04-13 09:37:26	eryksun	set	nosy: + eryksun messages: + msg390942
2021-04-13 07:09:58	Andrew Ushakov	set	messages: + msg390931 versions: + Python 3.7, Python 3.9
2019-11-15 20:16:55	Andrew Ushakov	set	messages: + msg356715
2019-11-15 19:45:45	terry.reedy	set	nosy: + terry.reedy messages: + msg356709 type: behavior stage: test needed
2019-11-09 12:43:01	serhiy.storchaka	set	nosy: + serhiy.storchaka
2019-11-09 12:26:48	Andrew Ushakov	create
