This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: trace.py and unicode in Python 3
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.1, Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: belopolsky Nosy List: belopolsky, doerwalter, ncoghlan, pitrou, vstinner
Priority: normal Keywords: needs review, patch

Created on 2010-11-05 15:01 by doerwalter, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
trace.diff doerwalter, 2010-11-05 15:01
issue10329.diff belopolsky, 2010-11-06 04:09
Messages (12)
msg120506 - Author: Walter Dörwald (doerwalter) (Python committer) Date: 2010-11-05 15:01
It seems that on Python 3 (i.e. the py3k branch) trace.py cannot handle source that includes non-ASCII characters. Running the test suite with code coverage info via
 ./python Lib/test/regrtest.py -T -N -uurlfetch,largefile,network,decimal
sometimes fails with the following exception:
Traceback (most recent call last):
 File "Lib/test/regrtest.py", line 1500, in <module>
 main()
 File "Lib/test/regrtest.py", line 696, in main
 r.write_results(show_missing=True, summary=True, coverdir=coverdir)
 File "/home/coverage/python/Lib/trace.py", line 319, in write_results
 lnotab, count)
 File "/home/coverage/python/Lib/trace.py", line 369, in write_results_file
 outfile.write(line.expandtabs(8))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in
position 30: ordinal not in range(128)
The script that generates the code coverage reports at http://coverage.livinglogic.de/ relies on this feature.
Applying the attached patch (i.e. specifying an explicit encoding when opening the output file) fixes the problem.
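
A minimal sketch of the failure and of the fix described above, assuming Python 3. The file names and the sample line are hypothetical, and the ASCII default is simulated by passing encoding="ascii" explicitly; this is not the attached patch.

```python
import os
import tempfile

# Hypothetical annotated line containing a non-ASCII character ('\xe4').
line = "    1: name = 'M\xe4rz'\n"

tmpdir = tempfile.mkdtemp()
ascii_path = os.path.join(tmpdir, "ascii.cover")
utf8_path = os.path.join(tmpdir, "utf8.cover")

# Simulate a system whose default file encoding is ASCII.
failed = False
try:
    with open(ascii_path, "w", encoding="ascii") as outfile:
        outfile.write(line.expandtabs(8))
except UnicodeEncodeError:
    failed = True

# An explicit encoding that can represent the character succeeds.
with open(utf8_path, "w", encoding="utf-8") as outfile:
    outfile.write(line.expandtabs(8))

with open(utf8_path, encoding="utf-8") as infile:
    roundtrip = infile.read()
```

The reader of the output file has to know which encoding was chosen, which is the question the rest of this thread discusses.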
msg120520 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-05 18:22
I don't think trace.diff is proposed for commit. I see it more as a supporting file for diagnosing the problem.
I see two problems here:
1. Apparently OP's system opens files with encoding set to 'ascii' by default. This is not the case on any of the systems I have access to (OSX and Linux). I will try to reproduce this issue by setting LANG="en_US.ascii".
2. Regrtest attempts to write a non-ASCII character into the trace results file. I suspect this comes from test cases that test import from modules with non-ASCII names or with non-ASCII identifiers.
I am not sure there is anything we need to change here other than possibly skipping tests that use non-ASCII identifiers on systems with the default encoding set to ASCII. I would be +0 on adding errors='replace' or 'backslashreplace' to the open() call in write_results_file(), but hardcoding encoding="utf-8" is definitely not the right thing to do.
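
For illustration, a small sketch of how the two error handlers mentioned above behave when the output encoding is ASCII. The sample line and the helper write_ascii() are hypothetical and are not part of trace.py.

```python
import os
import tempfile

# Hypothetical source line with a non-ASCII identifier ('\xe9').
line = "caf\xe9 = 1\n"

def write_ascii(errors):
    """Write `line` to an ASCII-encoded file and return the raw bytes."""
    fd, path = tempfile.mkstemp(suffix=".cover")
    os.close(fd)
    try:
        # newline="" disables newline translation so the bytes are portable.
        with open(path, "w", encoding="ascii", errors=errors, newline="") as f:
            f.write(line)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

replaced = write_ascii("replace")          # unencodable character becomes '?'
escaped = write_ascii("backslashreplace")  # unencodable character becomes '\xe9' escape text
```

With 'replace' the original character is lost, while 'backslashreplace' keeps it recoverable as an escape sequence.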
msg120522 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2010-11-05 18:43
> I would be +0 on adding errors='replace' or 'backslashreplace' to the 
> open() call in write_results_file(), but hardcoding encoding="utf-8"
> is definitely not the right thing to do.
Who are the consumers of the trace files? Is there a formal specification or is Python the primary consumer?
If the former, then follow the specification (and/or amend it ;-)).
If the latter, you have the right to be creative; then utf-8 sounds like the most reasonable choice (possibly with an error handler such as "ignore" or "replace" to avoid barfing on lone surrogates).
Relying on the default encoding is not really a good idea, though. This is good for quick scripts or in the rare cases where it is by definition the expected behaviour. But in more elaborate cases you certainly want to decide the encoding by yourself.
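
A short sketch of the lone-surrogate case mentioned above, assuming Python 3. The sample string is hypothetical; it shows that strict UTF-8 encoding rejects a lone surrogate while the "replace" and "ignore" handlers do not.

```python
# A string containing a lone surrogate, which UTF-8 cannot encode strictly.
lone = "x\udc80y"

try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

replaced = lone.encode("utf-8", errors="replace")  # surrogate becomes b"?"
ignored = lone.encode("utf-8", errors="ignore")    # surrogate is dropped
```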
msg120574 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-06 03:03
On Fri, Nov 5, 2010 at 2:43 PM, Antoine Pitrou <report@bugs.python.org> wrote:
..
> Who are the consumers of the trace files? Is there a formal specification
> or is Python the primary consumer?
The trace files contain annotated python source code. There is no
formal specification that I am aware of as these files are intended
for human consumption.
..
> Relying on the default encoding is not really a good idea, though. This is good for quick scripts or
> in the rare cases where it is by definition the expected behaviour. But in more elaborate cases you
> certainly want to decide the encoding by yourself.
I agree and the correct encoding seems to be the encoding of the
original source file that trace annotates.
msg120579 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-06 04:09
The attached patch, issue10329.diff, fixes the issue by setting the encoding of the coverage file to that of the source file. I am not 100% happy with this patch for the following reasons:
1. It opens the source file one more time. This is probably acceptable because existing code already opens it at least four times when the -m (show missing) option is selected (twice in find_executable_linenos() and twice in linecache.getlines()). Fixing that would require refactoring of linecache code.
2. This will not work for source code not stored in a file, but provided by a __loader__.get_source() method. However it looks like trace will not work at all in this case, so fixing that is a separate issue.
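
The approach described above can be sketched roughly as follows. This is not the content of issue10329.diff; the helper source_encoding(), the file names, and the "    1: " count prefix are assumptions made for illustration. It relies only on tokenize.detect_encoding(), which reads the PEP 263 coding cookie.

```python
import os
import tempfile
import tokenize

def source_encoding(path):
    """Return the encoding declared by a Python source file (PEP 263)."""
    with open(path, "rb") as f:
        encoding, _ = tokenize.detect_encoding(f.readline)
    return encoding

# Hypothetical Latin-1 encoded module.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "mod.py")
with open(src, "w", encoding="latin-1") as f:
    f.write("# -*- coding: latin-1 -*-\nname = 'D\xf6rwald'\n")

# Write the annotated copy using the same encoding as the source.
enc = source_encoding(src)
cover = os.path.join(tmpdir, "mod.cover")
with open(src, encoding=enc) as infile, open(cover, "w", encoding=enc) as outfile:
    for line in infile:
        outfile.write("    1: " + line.expandtabs(8))

with open(cover, "rb") as f:
    raw = f.read()
```

Note that this opens the source file an extra time just to find its encoding, which is the first concern listed above.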
msg120601 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-11-06 10:51
> 1. It opens the source file one more time. This is probably acceptable
> because existing code already opens it at least four times when -m (show
> missing) option is selected. (Twice in find_executable_linenos() and
> twice in linecache.getlines(). Fixing that would require refactoring of
> linecache code.)
Creating a function like linecache.getencoding() seems to be overkill.
I created issue #10335 to add a function tokenize.open_python(): open a Python 
script in read mode without opening the file twice and get the encoding with 
detect_encoding(). This issue is more generic than trying to optimize the 
trace module.
> 2. This will not work for source code not stored in a file, but provided by
> a __loader__.get_source() method. However it looks like trace will not
> work at all in this case, so fixing that is a separate issue.
For this case, I think that we can add a try/except IOError with a fallback to 
encoding = 'utf-8'.
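
A minimal sketch of such a fallback, assuming Python 3.3 or later (where IOError is an alias of OSError). The helper name encoding_for() and the path are hypothetical.

```python
import tokenize

def encoding_for(path, default="utf-8"):
    """Return the source encoding of `path`, or `default` if it cannot be opened."""
    try:
        with open(path, "rb") as f:
            encoding, _ = tokenize.detect_encoding(f.readline)
        return encoding
    except OSError:  # IOError is an alias of OSError since Python 3.3
        return default

# A path that does not exist on disk, e.g. source provided only by a loader.
fallback = encoding_for("/nonexistent/path/for/illustration.py")
```

This only covers the case where the file cannot be opened; an invalid coding cookie raises SyntaxError, which this sketch deliberately does not catch.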
msg120687 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2010-11-07 14:31
For the record, the test failure can be reproduced by the following:
$ LANG=C ./python -m test.regrtest test_imp test_trace
[1/2] test_imp
[2/2] test_trace
/home/antoine/py3k/__svn__/Lib/unittest/case.py:402: ResourceWarning: unclosed file <_io.TextIOWrapper name='@test_11986_tmp/os.cover' encoding='ANSI_X3.4-1968'>
 result.addError(self, sys.exc_info())
test test_trace failed -- Traceback (most recent call last):
 File "/home/antoine/py3k/__svn__/Lib/test/test_trace.py", line 296, in test_coverage
 self._coverage(tracer)
 File "/home/antoine/py3k/__svn__/Lib/test/test_trace.py", line 291, in _coverage
 r.write_results(show_missing=True, summary=True, coverdir=TESTFN)
 File "/home/antoine/py3k/__svn__/Lib/trace.py", line 334, in write_results
 lnotab, count)
 File "/home/antoine/py3k/__svn__/Lib/trace.py", line 384, in write_results_file
 outfile.write(line.expandtabs(8))
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 5: ordinal not in range(128)
1 test OK.
1 test failed:
 test_trace
There's a strange interaction between test_imp and test_trace, it seems. Not sure why.
msg120690 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-11-07 15:54
$ LANG=C ./python -m test.regrtest test_imp test_trace
[1/2] test_imp
[2/2] test_trace
...
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 5: ordinal not in range(128)
issue10329.diff fixes this failure. The failure comes from a non-breaking space that I introduced by mistake in Lib/os.py, which is the only non-ASCII character in this file. r86302 removes it.
I committed issue10329.diff to Python 3.2 as r86303: thanks Alex ;-)
msg120694 - Author: Alexander Belopolsky (belopolsky) (Python committer) Date: 2010-11-07 23:56
Reopening as a reminder to add a unit test for this case.
msg120725 - Author: Walter Dörwald (doerwalter) (Python committer) Date: 2010-11-08 10:42
Using the original encoding of the Python source file might be the politically correct thing to do, but it complicates handling of the output of trace.py. For each file you have to do the encoding detection dance again. It would be great if I could specify which encoding trace.py uses (with the file's encoding being the default).
msg120726 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-11-08 10:48
> ... it complicates handling of the output of trace.py. 
> For each file you have to do the encoding detection dance again ...
What? You just have to call one function! tokenize.open() :-) Well, ok, it's not committed yet, but it looks like most people agree: #10335.
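
For reference, a sketch of tokenize.open() as it exists in Python 3.2 and later: it opens a Python source file in text mode using the encoding found by detect_encoding(). The sample file is hypothetical, and the sketch reads a plain source file; whether the coding cookie is still detected in an annotated .cover file is not verified here.

```python
import os
import tempfile
import tokenize

# Hypothetical Latin-1 encoded source file with a PEP 263 coding cookie.
fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w", encoding="latin-1") as f:
    f.write("# -*- coding: latin-1 -*-\ns = 'D\xf6rwald'\n")

# One call both detects the encoding and opens the file for reading.
with tokenize.open(path) as f:
    detected = f.encoding
    text = f.read()

os.remove(path)
```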
msg120883 - Author: Walter Dörwald (doerwalter) (Python committer) Date: 2010-11-09 18:04
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> ... it complicates handling of the output of trace.py. 
>> For each file you have to do the encoding detection dance again ...
> 
> What? You just have to call one function! tokenize.open() :-) Well, ok, 
> it's not committed yet, but it looks like most people agree: #10335.
The problem is that the script that downloads and builds the Python
source and generates the HTML for http://coverage.livinglogic.de/ isn't
ported to Python 3 yet (and can't be ported easily). However *running*
the test suite of course uses the current Python checkout, so an option
that lets me specify which encoding trace.py/regrtest.py should output
would be helpful.
History
Date User Action Args
2022-04-11 14:57:08 admin set github: 54538
2013-08-04 20:05:32 belopolsky set status: open -> closed; stage: test needed -> resolved
2010-11-09 18:04:48 doerwalter set messages: + msg120883
2010-11-08 10:48:16 vstinner set messages: + msg120726
2010-11-08 10:42:13 doerwalter set messages: + msg120725
2010-11-07 23:56:00 belopolsky set status: closed -> open; messages: + msg120694; stage: needs patch -> test needed
2010-11-07 15:54:57 vstinner set status: open -> closed; resolution: fixed; messages: + msg120690
2010-11-07 14:31:35 pitrou set messages: + msg120687; stage: test needed -> needs patch
2010-11-06 10:51:01 vstinner set messages: + msg120601
2010-11-06 04:09:55 belopolsky set keywords: + needs review; assignee: belopolsky; messages: + msg120579; files: + issue10329.diff
2010-11-06 03:03:06 belopolsky set messages: + msg120574
2010-11-05 18:43:08 pitrou set nosy: + pitrou; messages: + msg120522
2010-11-05 18:22:08 belopolsky set messages: + msg120520; stage: patch review -> test needed
2010-11-05 15:25:10 ncoghlan set nosy: + ncoghlan
2010-11-05 15:07:07 pitrou set versions: + Python 3.1, Python 3.2; nosy: + belopolsky; components: + Library (Lib); type: behavior; stage: patch review
2010-11-05 15:01:34 doerwalter create
