Issue 23144: html.parser.HTMLParser: setting 'convert_charrefs = True' leads to dropped text

➜

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67333

classification

Title:	html.parser.HTMLParser: setting 'convert_charrefs = True' leads to dropped text
Type:	behavior	Stage:	resolved
Components:	Documentation, Library (Lib)	Versions:	Python 3.6, Python 3.4, Python 3.5

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	ezio.melotti	Nosy List:	docs@python, ezio.melotti, larry, martin.panter, python-dev, r.david.murray, rbcollins, serhiy.storchaka, xkjq
Priority:	normal	Keywords:	patch

Created on 2015年01月01日 18:47 by xkjq, last changed 2022年04月11日 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue23144.diff	ezio.melotti, 2015年03月07日 17:50	Patch against 3.5/default	review

Messages (14)
msg233291 - (view)	Author: Ross (xkjq)	Date: 2015年01月01日 18:47
If convert_charrefs is set to true the final data section is not return by feed(). It is held until the next tag is encountered. --- from html.parser import HTMLParser class MyHTMLParser(HTMLParser): def __init__(self): HTMLParser.__init__(self, convert_charrefs=True) self.fed = [] def handle_starttag(self, tag, attrs): print("Encountered a start tag:", tag) def handle_endtag(self, tag): print("Encountered an end tag :", tag) def handle_data(self, data): print("Encountered some data :", data) parser = MyHTMLParser() parser.feed("foo <a>link</a> bar") print("") parser.feed("spam <a>link</a> eggs") --- gives Encountered some data : foo Encountered a start tag: a Encountered some data : link Encountered an end tag : a Encountered some data : barspam Encountered a start tag: a Encountered some data : link Encountered an end tag : a With 'convert_charrefs = False' it works as expected.
msg233292 - (view)	Author: Martin Panter (martin.panter) * (Python committer)	Date: 2015年01月01日 19:32
You "forgot" to call close(): >>> parser.close() Encountered some data : eggs Perhaps this is a documentation bug, since there is a lot of example code given, but none of the examples call close().
msg233295 - (view)	Author: Ross (xkjq)	Date: 2015年01月01日 20:36
That would make sense. Might also be worth mentioning the difference in behaviour with convert_charrefs = True/False as that was what led me to think this was a bug.
msg237457 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2015年03月07日 16:41
Here is a patch that fixes the problem. Even though calling .close() is the correct solution, I preferred to restore the previous behavior and call handle_data as soon as possible. There is a corner case in which a charref might be cut in half while feeding chunks to the parser -- in that case the parser will wait and it might still be necessary to call .close() if an incomplete charref is at the end of the string. Adding context manager support to HTMLParser might also help solving the problem, but that's a separate issue. (Also thanks to Serhiy for the feedback he provided me on IRC.)
msg237480 - (view)	Author: Martin Panter (martin.panter) * (Python committer)	Date: 2015年03月07日 23:14
I still think it would be worthwhile adding close() calls to the examples in the documentation (Doc/library/html.parser.rst). BTW I haven’t tested this, and maybe it is not a concern, but even with this patch it looks like the parser will buffer unlimited data and output nothing until close() if each string it is fed ends with an ampersand (and otherwise contains only plain text, no tags etc).
msg237483 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2015年03月08日 00:08
> I still think it would be worthwhile adding close() calls to > the examples in the documentation (Doc/library/html.parser.rst). If I add context manager support to HTMLParser I can update the examples to use it, but otherwise I don't think it's worth changing them now. > BTW I haven’t tested this, and maybe it is not a concern, but even with > this patch it looks like the parser will buffer unlimited data and > output nothing until close() if each string it is fed ends with an > ampersand (and otherwise contains only plain text, no tags etc). This is true, but I don't think it's a realistic case. For this to be a problem you would need: 1) Someone feeding the parser with arbitrary chunks. Text files are usually fed to the parser whole, or line by line -- arbitrary chunks are uncommon. 2) A file that contains lot of entities. In most documents charrefs are not very common, and so the chances that a chunk will split one in the middle is low. Chances that several consecutive charrefs are split in the middle is even lower. 3) A file that is very big. Even if all the file is buffered until a call to close(), it shouldn't be a concern, since most files have relatively small size. It is true that this has a quadratic complexity, but I would expect the parsing to complete in a reasonable time for average sizes.
msg237503 - (view)	Author: Martin Panter (martin.panter) * (Python committer)	Date: 2015年03月08日 02:33
A context manager here would seem a bit strange. Is there any precedent for using context managers with feed parsers? The two others that come to mind are ElementTree.XMLParser and email.parser.FeedParser. These two build an object while parsing, and close() returns that object, so a context manager would be unhelpful. If an exception is raised inside the context manager, should close() be called (like for file objects), or not?
msg237529 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2015年03月08日 11:32
> A context manager here would seem a bit strange. I still haven't thought this through, but I can't see any problem with it right now. This would be similar to: from contextlib import closing with closing(MyHTMLParser()) as parser: parser.feed(html) and this already seems to work fine, including with OP's case. > If an exception is raised inside the context manager, > should close() be called (like for file objects), or not? The parser is guaranteed to never raise parsing-related errors during parsing, so this shouldn't be an issue. I will open a new issue after fixing this so we can keep discussing there.
msg247617 - (view)	Author: Robert Collins (rbcollins) * (Python committer)	Date: 2015年07月29日 20:16
@ezio I think you should commit what you have so far. LGTM.
msg249165 - (view)	Author: Robert Collins (rbcollins) * (Python committer)	Date: 2015年08月25日 22:18
@ezio - you seem busy, so I'll commit this next week if its still pending.
msg249745 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2015年09月04日 08:55
I'll try to take care of this during the weekend. Feel free to ping me if I don't.
msg250006 - (view)	Author: Roundup Robot (python-dev) (Python triager)	Date: 2015年09月06日 18:55
New changeset ef82131d0c93 by Ezio Melotti in branch '3.4': #23144: Make sure that HTMLParser.feed() returns all the data, even when convert_charrefs is True. https://hg.python.org/cpython/rev/ef82131d0c93 New changeset 1f6155ffcaf6 by Ezio Melotti in branch '3.5': #23144: merge with 3.4. https://hg.python.org/cpython/rev/1f6155ffcaf6 New changeset 48ae9d66c720 by Ezio Melotti in branch 'default': #23144: merge with 3.5. https://hg.python.org/cpython/rev/48ae9d66c720
msg250007 - (view)	Author: Ezio Melotti (ezio.melotti) * (Python committer)	Date: 2015年09月06日 18:56
Fixed, thanks for the report!
msg250311 - (view)	Author: Larry Hastings (larry) * (Python committer)	Date: 2015年09月09日 13:51
The Misc/NEWS entry for this was added under Python 3.5.0rc3. But, since no pull request has been made for this change, this change hasn't been merged into 3.5.0. It will ship as part of Python 3.5.1. I've moved the Misc/NEWS entry accordingly.

History
Date	User	Action	Args
2022年04月11日 14:58:11	admin	set	github: 67333
2015年09月09日 13:51:04	larry	set	nosy: + larry messages: + msg250311
2015年09月06日 18:56:30	ezio.melotti	set	status: open -> closed versions: + Python 3.6, - Python 2.7 messages: + msg250007 resolution: fixed stage: commit review -> resolved
2015年09月06日 18:55:36	python-dev	set	nosy: + python-dev messages: + msg250006
2015年09月04日 08:55:47	ezio.melotti	set	messages: + msg249745
2015年08月25日 22:18:17	rbcollins	set	messages: + msg249165
2015年07月29日 20:16:54	rbcollins	set	nosy: + rbcollins messages: + msg247617
2015年03月08日 11:32:12	ezio.melotti	set	messages: + msg237529
2015年03月08日 02:33:55	martin.panter	set	messages: + msg237503
2015年03月08日 00:08:10	ezio.melotti	set	messages: + msg237483
2015年03月07日 23:14:08	martin.panter	set	messages: + msg237480
2015年03月07日 17:50:40	ezio.melotti	set	files: + issue23144.diff
2015年03月07日 17:50:10	ezio.melotti	set	files: - issue23144.diff
2015年03月07日 16:41:34	ezio.melotti	set	files: + issue23144.diff versions: + Python 2.7, Python 3.5 keywords: + patch nosy: + serhiy.storchaka messages: + msg237457 stage: commit review
2015年01月10日 06:23:42	ezio.melotti	set	assignee: docs@python -> ezio.melotti
2015年01月01日 20:36:29	xkjq	set	messages: + msg233295
2015年01月01日 19:32:46	martin.panter	set	nosy: + docs@python, martin.panter messages: + msg233292 assignee: docs@python components: + Documentation
2015年01月01日 18:57:09	r.david.murray	set	nosy: + ezio.melotti, r.david.murray
2015年01月01日 18:47:08	xkjq	create

homepage