Author gdr@garethrees.org
Recipients daniel.urban, eric.snow, ezio.melotti, gdr@garethrees.org, meador.inge, sandro.tosi, serhiy.storchaka, takluyver, yselivanov
Date 2014-02-05.01:52:46
Message-id <1391565172.86.0.464570242122.issue12691@psf.upfronthosting.co.za>
In-reply-to
Content
Yury, let me see if I can move this issue forward. I clearly haven't
done a good job of explaining these problems, how they are related,
and why it makes sense to solve them together, so let me have a go
now.
1. tokenize.untokenize() raises AssertionError if you pass it a
 sequence of tokens output from tokenize.tokenize(). This was my
 original problem report, and it's still not fixed in Python 3.4:
 Python 3.4.0b3 (default, Jan 27 2014, 02:26:41) 
 [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import tokenize, io
 >>> t = list(tokenize.tokenize(io.BytesIO('1+1'.encode('utf8')).readline))
 >>> tokenize.untokenize(t)
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 317, in untokenize
     out = ut.untokenize(iterable)
   File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 246, in untokenize
     self.add_whitespace(start)
   File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 232, in add_whitespace
     assert row <= self.prev_row
 AssertionError
 This defeats any attempt to use the sequence:
 input code -> tokenize -> transform -> untokenize -> output code
 to transform Python code. But this ought to be the main use case
 for the untokenize function! That's how I came across the problem
 in the first place, when I was starting to write Minipy
 <https://github.com/gareth-rees/minipy>.
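 For concreteness, here is a minimal sketch of that pipeline (with a
 do-nothing "transform" as a placeholder); on the affected versions the
 final untokenize() call fails with the AssertionError shown above:

    import io
    import tokenize

    def transform_source(source_bytes):
        # input code -> tokenize -> transform -> untokenize -> output code
        tokens = list(tokenize.tokenize(io.BytesIO(source_bytes).readline))
        # ... a real tool would rewrite the token stream here ...
        return tokenize.untokenize(tokens)

    print(transform_source(b'1+1'))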
2. Fixing problem #1 is easy (just swap <= for >=), but it raises the
 question: why wasn't this mistake caught by test_tokenize? There's
 a test function roundtrip() whose docstring says:
 Test roundtrip for `untokenize`. `f` is an open file or a
 string. The source code in f is tokenized, converted back to
 source code via tokenize.untokenize(), and tokenized again from
 the latter. The test fails if the second tokenization doesn't
 match the first.
 If I don't fix roundtrip() so that it exercises this code path,
 then how can I be sure I have fixed problem #1? Clearly it's
 necessary to fix the test case first and establish that it
 provokes the assertion.
 So why doesn't roundtrip() detect the error? Well, it turns out
 that tokenize.untokenize() has two modes of operation and
 roundtrip() only tests one of them.
 The documentation for tokenize.untokenize() is rather cryptic, and
 all it says is:
 Each element returned by the [input] iterable must be a token
 sequence with at least two elements, a token number and token
 value. If only two tokens are passed, the resulting output is
 poor.
 Reverse-engineering the implementation shows that it has two modes
 of operation.
 In the first mode (which I have called "compatibility" mode after
 the method Untokenizer.compat() that implements it) you pass it
 tokens in the form of 2-element tuples (type, text). These must
 have exactly 2 elements.
 In the second mode (which I have called "full" mode based on the
 description "full input" in the docstring) you pass it tokens in
 the form of tuples with 5 elements (type, text, start, end, line).
 These are compatible with the namedtuples returned from
 tokenize.tokenize().
 The "full" mode has the buggy assertion, but
 test_tokenize.roundtrip() only tests the "compatibility" mode.
 So I must (i) fix roundtrip() so that it tests both modes; (ii)
 improve the documentation for tokenize.untokenize() so that
 programmers have some chance of figuring this out in future!
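 To make the distinction concrete, here is a small sketch of the two
 calling conventions (the mode names "compatibility" and "full" are
 mine, not the documentation's):

    import io
    import tokenize

    source = b'x = 1\n'
    tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

    # "Full" mode: pass the 5-tuples from tokenize() straight through.
    # This is the path that reaches the assertion in add_whitespace().
    full = tokenize.untokenize(tokens)

    # "Compatibility" mode: pass only (type, text) pairs, so untokenize()
    # has to guess the spacing and the output need not match the input.
    compat = tokenize.untokenize((tok.type, tok.string) for tok in tokens)

    print(full)
    print(compat)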
3. As soon as I make roundtrip() test both modes it provokes the
 assertion failure. Good, so I can fix the assertion. Problem #1
 solved.
 But now there are test failures in "full" mode:
 $ ./python.exe -m test test_tokenize
 [1/1] test_tokenize
 **********************************************************************
 File "/Users/gdr/hg.python.org/cpython/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
 Failed example:
 for testfile in testfiles:
 if not roundtrip(open(testfile, 'rb')):
 print("Roundtrip failed for file %s" % testfile)
 break
 else: True
 Expected:
 True
 Got:
 Roundtrip failed for file /Users/gdr/hg.python.org/cpython/Lib/test/test_platform.py
 **********************************************************************
 1 items had failures:
 1 of 73 in test.test_tokenize.__test__.doctests
 ***Test Failed*** 1 failures.
 test test_tokenize failed -- 1 of 78 doctests failed
 1 test failed:
 test_tokenize
 Examination of the failed tokenization shows that if the source
 contains a backslash-newline, then in "full" mode
 Untokenizer.add_whitespace() fails to leave any space between the
 token before the backslash-newline and the token after it.
 The trouble is that backslash-newlines don't result in a token from
 tokenize(). But in an updated Untokenizer.add_whitespace() I can
 deduce that there must have been a backslash-newline whenever I see
 the first token on a new line (other than a DEDENT or ENDMARKER)
 and the previous token was not a NEWLINE or NL.
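 Here is a sketch of that failure mode: once the problem #1 assertion
 is fixed, round-tripping source containing a backslash-newline through
 tokenize()/untokenize() in "full" mode mishandles the whitespace
 around the continuation, so the exact round trip fails:

    import io
    import tokenize

    # tokenize() emits no token for the backslash-newline between + and 2.
    source = b'x = 1 + \\\n    2\n'
    tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
    result = tokenize.untokenize(tokens)

    print(repr(source))
    print(repr(result))
    print('exact round trip:', result == source)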
4. OK, with problems 1-3 fixed, shouldn't I now actually test the
 documented round-trip property fully? The property that roundtrip()
 currently tests is:
 tokenize(untokenize(tokenize(code))) == tokenize(code)
 which is all that is guaranteed in "compatibility" mode. But the
 docstring for tokenize.untokenize() says:
 Round-trip invariant for full input:
 Untokenized source will match input source exactly
 So in "full" mode it seems that the following is guaranteed:
 untokenize(tokenize(code)) == code
 But if I add a test for this stronger round-trip property, it
 immediately fails. The docstring is not telling the truth!
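 Expressed as plain checks (and assuming an untokenize() that accepts
 full 5-tuples without tripping the problem #1 assertion), the two
 properties look like this:

    import io
    import tokenize

    def toks(source_bytes):
        # Compare tokens by (type, text) only, ignoring positions.
        return [tok[:2] for tok in
                tokenize.tokenize(io.BytesIO(source_bytes).readline)]

    source = b'if x:\n    y = 1\n'
    output = tokenize.untokenize(tokenize.tokenize(io.BytesIO(source).readline))

    # Weak property: what roundtrip() currently tests.
    print('weak: ', toks(output) == toks(source))
    # Strong property: what the docstring promises for "full" input.
    print('strong:', output == source)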
5. Why doesn't the "full" round-trip property hold? It seems that
 there are (at least) two separate problems:
 i. tokenize() doesn't output a token for a backslash-newline so
 although untokenize() can deduce that there must have been one,
 it can't know what whitespace there was before it.
 ii. In compatibility mode, the first token is always discarded. If
 the tokens came from tokenize.tokenize(), then this is the
 ENCODING token, so the encoding is never applied.
 (Benjamin Peterson had previously discovered this problem but
 just suppressed the failing test case instead of reporting it:
 <http://hg.python.org/cpython/rev/c13abed5d764>)
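 Here is a sketch of point (ii): feed untokenize() the (type, text)
 pairs produced by tokenize(), so that the first pair is the ENCODING
 token. On the affected versions that first token is silently dropped,
 so the declared encoding is never applied to the output:

    import io
    import tokenize

    source = "# -*- coding: latin-1 -*-\ns = 'héllo'\n".encode('latin-1')
    pairs = [tok[:2] for tok in tokenize.tokenize(io.BytesIO(source).readline)]

    # Compatibility mode: the first pair is the ENCODING token.
    result = tokenize.untokenize(pairs)
    print(type(result), repr(result))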
Looking through this sequence of woes again, it's clear that I can
save problems 4 and 5 for another issue. But I am confident that
problems 1-3 must be fixed together. So although Meador Inge said
above that "the current patch fixes way too many issues", I hope that
if he reads the argument above he will see why I did what I did: this
is the minimum sensible change I know how to make.
I have prepared a revised patch that applies to the current head.
Please review it: I'll do my best to fix all the problems you find.