This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Long unicode string causes SyntaxError: Non-UTF-8 code starting with '\xe2' in file ..., but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Type: behavior
Stage: needs patch
Components: Interpreter Core
Versions: Python 3.9, Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Andrew Ushakov, eryksun, serhiy.storchaka, terry.reedy, vstinner
Priority: normal
Keywords:

Created on 2019-11-09 12:26 by Andrew Ushakov, last changed 2022-04-11 14:59 by admin.

Files
File name Uploaded Description
tst112.py Andrew Ushakov, 2019-11-09 12:26
Messages (7)
msg356298 - Author: Andrew Ushakov (Andrew Ushakov) Date: 2019-11-09 12:26
A not very long Unicode comment (#, a space, and then 170 or more repetitions of the UTF-8 character ░, i.e. b'\xe2\x96\x91'.decode())
# ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
causes a syntax error:
SyntaxError: Non-UTF-8 code starting with '\xe2' in file tst112.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
The Python file is attached. The second example is similar, but here a Unicode string of similar length is passed as an argument to the print function.
print('\n░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░')
A similar issue, Issue34979, was submitted one year ago...
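To regenerate a file of the shape described above, a minimal sketch follows. It is not the attached tst112.py itself: the function names and the output file name repro_long_comment.py are hypothetical, and the default of 170 repetitions is simply the threshold quoted in this report.

```python
from pathlib import Path

# The block character from the report; it encodes to 3 bytes in UTF-8.
BLOCK = b"\xe2\x96\x91".decode()


def make_repro_source(repeats: int = 170) -> bytes:
    """Return UTF-8 source: a comment made of '# ' plus `repeats` block characters."""
    return ("# " + BLOCK * repeats + "\n").encode("utf-8")


def write_repro(path: str, repeats: int = 170) -> int:
    """Write the source to `path` (hypothetical name) and return the byte count."""
    return Path(path).write_bytes(make_repro_source(repeats))


if __name__ == "__main__":
    size = write_repro("repro_long_comment.py")
    print(f"wrote {size} bytes")
```

With 170 repetitions the comment line is 512 bytes long before its newline (2 bytes for "# " plus 170 × 3). Whether running the generated file raises the SyntaxError depends on the Python version and platform, as the rest of this thread shows.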
msg356709 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2019-11-15 19:45
I think that this should be closed as a duplicate of #34979 and this example posted there, with the OS and python version included.
On Windows, with 3.7, 3.8.0, and master, neither the posted comment, the one in the file, nor the initial statement in #34979 gives the SyntaxError.
msg356715 - Author: Andrew Ushakov (Andrew Ushakov) Date: 2019-11-15 20:16
> On Windows, with 3.7, 3.8.0, and master, neither the posted comment, the one in the file, nor the initial statement in #34979 gives the SyntaxError.
Just tried again on my corporate laptop with the downloaded file from this site:
Microsoft Windows [Version 10.0.16299.1451]
(c) 2017 Microsoft Corporation. All rights reserved.
D:\Downloads>py
Python 3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:37:50) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()
D:\Downloads>py tst112.py
 File "tst112.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xe2' in file tst112.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
d:\Downloads>py -3.7
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()
d:\Downloads>py -3.7 tst112.py
 File "tst112.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xe2' in file tst112.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
msg390931 - Author: Andrew Ushakov (Andrew Ushakov) Date: 2021-04-13 07:09
Just tested again:
D:\Downloads>py
Python 3.9.4 (tags/v3.9.4:1f2e308, Apr 4 2021, 13:27:16) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()
D:\Downloads>py tst112.py
SyntaxError: Non-UTF-8 code starting with '\xe2' in file D:\Downloads\tst112.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
P.S. No problems with Python 3.8.5 and Ubuntu 20.04.2 LTS.
msg390942 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-04-13 09:37
> P.S. No problems with Python 3.8.5 and Ubuntu 20.04.2 LTS.
The issue is that the line length is limited to BUFSIZ, which ends up splitting the UTF-8 sequence b'\xe2\x96\x91'. BUFSIZ is only 512 bytes in Windows. It's 8192 bytes in Linux, in which case you need a line that's 16 times longer in order to reproduce the error. For example:
 $ stat -c "%s" test.py 
 8194
 $ python3.9 test.py
 SyntaxError: Non-UTF-8 code starting with '\xe2' in file 
 /home/someone/test.py on line 1, but no encoding declared; see 
 http://python.org/dev/peps/pep-0263/ for details
This has been fixed in a rewrite of the tokenizer (bpo-25643), for which the PR was recently merged into the main branch for 3.10a7+.
Maybe a minimal backport to keep reading up to "\n" can be applied to 3.8 and 3.9.
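The truncation described above can be illustrated in pure Python. This is only a sketch under a stated assumption: that the old tokenizer reads each line the way C's fgets() does, taking at most BUFSIZ - 1 bytes, and then validates that chunk as UTF-8. The helper names are hypothetical and this is not the tokenizer's actual code; the buffer sizes 512 and 8192 are the ones quoted in this message.

```python
import io


def read_chunk(stream: io.BytesIO, bufsize: int) -> bytes:
    """Mimic C fgets(): return at most bufsize - 1 bytes, stopping after a newline."""
    return stream.readline(bufsize - 1)


def first_chunk_decodes(line: bytes, bufsize: int) -> bool:
    """Report whether the first fgets-style chunk of `line` is valid UTF-8."""
    chunk = read_chunk(io.BytesIO(line), bufsize)
    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


BLOCK = "\u2591"  # the block character, encoded as b'\xe2\x96\x91'
line = ("# " + BLOCK * 170 + "\n").encode("utf-8")

print(len(line))                        # 513
print(first_chunk_decodes(line, 512))   # False: chunk ends inside the 170th character
print(first_chunk_decodes(line, 8192))  # True: the whole line fits in one chunk
```

With a 512-byte buffer the first chunk is 511 bytes: "# ", 169 complete characters, and the first two bytes of the 170th, so it cannot be decoded on its own. A reader that keeps going until "\n" before validating would see the complete sequence.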
msg391018 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-04-13 23:52
The bpo-14811 issue was fixed in Python 3.10 by bpo-25643, but is not fixed in Python 3.8 and 3.9.
msg391019 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-04-13 23:54
In 2012, I wrote detect_truncate.patch in bpo-14811. Does someone want to convert it to a PR for Python 3.9?
History
Date                 User              Action  Args
2022-04-11 14:59:23  admin             set     github: 82936
2021-04-13 23:54:10  vstinner          set     messages: + msg391019
2021-04-13 23:52:57  vstinner          set     nosy: + vstinner; messages: + msg391018
2021-04-13 09:37:59  eryksun           set     stage: test needed -> needs patch; versions: - Python 3.7
2021-04-13 09:37:26  eryksun           set     nosy: + eryksun; messages: + msg390942
2021-04-13 07:09:58  Andrew Ushakov    set     messages: + msg390931; versions: + Python 3.7, Python 3.9
2019-11-15 20:16:55  Andrew Ushakov    set     messages: + msg356715
2019-11-15 19:45:45  terry.reedy       set     nosy: + terry.reedy; messages: + msg356709; type: behavior; stage: test needed
2019-11-09 12:43:01  serhiy.storchaka  set     nosy: + serhiy.storchaka
2019-11-09 12:26:48  Andrew Ushakov    create