
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Classification
Title: Doctest sees directives in strings when it should only see them in comments
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.3
Process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Devin Jeanpierre, benjamin.peterson, petri.lehtinen, r.david.murray, tim.peters
Priority: normal Keywords: patch

Created on 2011-04-22 19:31 by Devin Jeanpierre, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description
comments.diff	Devin Jeanpierre, 2011-04-22 19:31	patch to tip
comments2.diff	Devin Jeanpierre, 2011-07-03 23:15
comments3.diff	Devin Jeanpierre, 2011-07-04 03:45
Messages (8)
msg134278 - (view) Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-04-22 19:31
From the doctest source:
'Option directives are comments starting with "doctest:". Warning: this may give false positives for string-literals that contain the string "#doctest:". Eliminating these false positives would require actually parsing the string; but we limit them by ignoring any line containing "#doctest:" that is *followed* by a quote mark.'
This isn't a huge deal, but it is a bit annoying. Beyond being confusing, it contradicts the doctest documentation, which states:
'Doctest directives are expressed as a special Python comment following an example's source code'
No mention is made of this corner case where the regexp breaks.
As per the comment in the source, the patched version parses the source using the tokenize module and runs a modified directive regex on all comment tokens to find directives.
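The approach can be illustrated with a small sketch (the regex and the function name here are hypothetical illustrations, not the actual patch): tokenize the example source and apply the directive pattern only to COMMENT tokens, so a "#doctest:" inside a string literal can never produce a false positive.

```python
import io
import re
import tokenize

# Hypothetical directive pattern, in the spirit of doctest's: a comment
# beginning with "doctest:" followed by the option text.
DIRECTIVE_RE = re.compile(r'#\s*doctest:\s*([^\n\'"]*)')

def find_directives(source):
    """Return the option text of every doctest directive comment."""
    directives = []
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok_type, tok_string, _, _, _ in tokens:
        # Only COMMENT tokens are inspected; STRING tokens are skipped,
        # which is exactly what eliminates the false positives.
        if tok_type == tokenize.COMMENT:
            m = DIRECTIVE_RE.match(tok_string)
            if m:
                directives.append(m.group(1).strip())
    return directives

src = 's = "text with #doctest: +SKIP inside"  # doctest: +ELLIPSIS\n'
print(find_directives(src))  # only the real comment's directive is found
```

The line-based regex in the original code cannot make this distinction, because by the time it runs, a "#doctest:" inside a string and one inside a comment look identical.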
msg138780 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2011-06-21 11:10
The patch looks good to me. It passes the old doctests tests and adds a new test case for what it's fixing.
msg138886 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-24 01:45
For the most part the patch looks good to me, too. My one concern is the encoding: tokenize detects the encoding itself, so is it possible for a doctest fragment to be detected as some encoding other than utf-8?
msg138893 - (view) Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-06-24 08:40
You're right, and good catch. If a doctest starts with a "#coding:XXX" line, this would break.
One option is to replace the call to tokenize.tokenize with a call to tokenize._tokenize and pass 'utf-8' as a parameter. Downside: that's a private and undocumented API. The alternative is to manually prepend a coding line that specifies UTF-8, so that any coding line in the doctest would be ignored.
My preferred option would be to add the ability to read unicode to the tokenize API, and then use that. I can file a separate ticket if that sounds good, since it's probably useful to others too.
One other thing to worry about: I'm not sure how doctest treats tests with leading "coding:XXX" lines. I'd hope it ignores them; if it doesn't, this is more complicated and the above approaches wouldn't work.
I'll see if I have the time to play around with this (and add corresponding test cases to the patch) this weekend.
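The encoding concern can be demonstrated directly (a sketch of the behavior, not the patch itself): the bytes-based tokenize.tokenize() honours a PEP 263 coding cookie via tokenize.detect_encoding(), while the str-based tokenize.generate_tokens() performs no sniffing at all and sees the cookie as an ordinary comment.

```python
import io
import tokenize

# The bytes API sniffs the coding cookie, so this fragment would be
# decoded as Latin-1 rather than UTF-8.
fragment = b"# coding: latin-1\nx = 1\n"
encoding, _ = tokenize.detect_encoding(io.BytesIO(fragment).readline)
print(encoding)  # detect_encoding normalizes "latin-1" to "iso-8859-1"

# The str API does no sniffing: the cookie is just a COMMENT token.
tokens = list(tokenize.generate_tokens(io.StringIO(fragment.decode("ascii")).readline))
print(tokens[0].type == tokenize.COMMENT)  # True
```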
msg138948 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-24 14:16
I agree that having a unicode API for tokenize seems to make sense, and that would indeed require a separate issue.
That's a good point about doctest not otherwise supporting coding cookies. Those really only apply to source files, so no doctest fragment ought to contain a coding cookie at the start, and your patch ought to be fine. But I'm not familiar with the doctest internals, so having some tests to prove everything is fine would be great.
Your code could use the tokenize sniffer to make sure the fragment reads as utf-8 and throw an error otherwise. But using a unicode interface to tokenize would probably be cleaner, since I suspect it would mimic what doctest does otherwise (ignore coding cookies). But I don't *know* the latter, so your checking it would be appreciated.
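The "sniff and reject" alternative mentioned above could look roughly like this (check_fragment_encoding is a hypothetical name for illustration, not part of doctest or the patch):

```python
import io
import tokenize

def check_fragment_encoding(source_bytes):
    """Raise if the fragment declares an encoding other than UTF-8."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    if encoding not in ("utf-8", "utf-8-sig"):
        raise ValueError("doctest fragment declares a non-UTF-8 encoding: %r"
                         % encoding)
    return encoding

# With no cookie and no BOM, detect_encoding defaults to UTF-8.
print(check_fragment_encoding(b"x = 1\n"))
# An explicit UTF-8 cookie is also accepted.
print(check_fragment_encoding(b"# coding: utf-8\nx = 1\n"))
```

A unicode interface to tokenize would avoid this check entirely, since a str input has no encoding left to sniff.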
msg139715 - (view) Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-07-03 23:15
Updated the patch to the newest revision; it now uses the _tokenize function and includes a test case verifying that the encoding directive is ignored during the tokenization (and every other) step.
I'll file a tokenize bug separately.
msg139721 - (view) Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-07-04 00:40
Erp, I forgot to run this against the rest of the tests. Disregard; I'll fix it up a bit later.
msg139732 - (view) Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-07-04 03:45
Updated.
History
Date	User	Action	Args
2022-04-11 14:57:16	admin	set	github: 56118
2011-07-04 03:45:34	Devin Jeanpierre	set	files: + comments3.diff; messages: + msg139732
2011-07-04 00:40:33	Devin Jeanpierre	set	messages: + msg139721
2011-07-03 23:15:50	Devin Jeanpierre	set	files: + comments2.diff; messages: + msg139715
2011-06-24 14:16:25	r.david.murray	set	messages: + msg138948
2011-06-24 08:40:44	Devin Jeanpierre	set	messages: + msg138893
2011-06-24 01:45:14	r.david.murray	set	nosy: + r.david.murray, benjamin.peterson; messages: + msg138886
2011-06-21 11:10:42	petri.lehtinen	set	nosy: + tim.peters, petri.lehtinen; messages: + msg138780
2011-04-22 19:47:07	r.david.murray	set	stage: patch review; type: enhancement; versions: + Python 3.3
2011-04-22 19:31:22	Devin Jeanpierre	create
