I'm trying to parse a very simple HTML5 document and serialize it again using html5lib:
import html5lib
html = '''<!DOCTYPE html>
<html lang="en">
<head>
<title>Hi</title>
</head>
<body>
<script src="a.js"></script>
<script src="b.js"></script>
</body>
</html>
'''
parser = html5lib.HTMLParser(tree = html5lib.treebuilders.getTreeBuilder("lxml"))
walker = html5lib.treewalkers.getTreeWalker("lxml")
serializer = html5lib.serializer.htmlserializer.HTMLSerializer()
document = parser.parse(html)
stream = walker(document)
theHTML = serializer.render(stream)
print theHTML
The output looks like:
<!DOCTYPE html><html lang=en><head>
<title>Hi</title>
</head>
<body>
<script src=a.js></script>
<script src=b.js></script>
Yup. It just cuts off midway. Changing the tree builder from lxml to dom makes no difference. Tweaking the HTML changes the output, but it still comes out pretty corrupt.
asked Feb 2, 2012 at 5:35
schwa
1 Answer
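The key seems to be omit_optional_tags=False. Without it, the serializer appears to leave out the closing </body> and </html> tags, which HTML5 treats as optional, so the output looks as if its end has been eaten.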
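import html5lib

# html is the document string from the question.
parser = html5lib.HTMLParser(tree = html5lib.treebuilders.getTreeBuilder("lxml"))
document = parser.parse(html)
walker = html5lib.treewalkers.getTreeWalker("lxml")
stream = walker(document)
# Keep optional tags such as </body> and </html> in the output.
s = html5lib.serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False)
output_generator = s.serialize(stream)
# serialize() yields the output piece by piece, so each piece prints on its own line.
for item in output_generator:
    print item
Output: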
<!DOCTYPE html>
<html lang=en>
<head>
<title>
Hi
</title>
</head>
<body>
<script src=a.js>
</script>
<script src=b.js>
</script>
</body>
</html>
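As a rough cross-check of the explanation above, the sketch below compares the tags in the original document with the tags in the shorter output quoted in the question. It does not use html5lib at all: it assumes Python 3 and the standard library's html.parser, and the names TagCollector and collect_tags are made up for this illustration. It only lists which tags are absent from the shorter output, and makes no claim about how html5lib decides what to omit.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start and end tag seen, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tags.append("</%s>" % tag)


def collect_tags(markup):
    # Hypothetical helper: feed the markup through the collector.
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


# The input document from the question.
original = '''<!DOCTYPE html>
<html lang="en">
<head>
<title>Hi</title>
</head>
<body>
<script src="a.js"></script>
<script src="b.js"></script>
</body>
</html>
'''

# The serializer output quoted in the question.
serialized = '''<!DOCTYPE html><html lang=en><head>
<title>Hi</title>
</head>
<body>
<script src=a.js></script>
<script src=b.js></script>
'''

original_tags = collect_tags(original)
serialized_tags = collect_tags(serialized)

# Tags present in the input but absent from the serialized output.
missing = [tag for tag in original_tags if tag not in serialized_tags]
print(missing)  # ['</body>', '</html>']
```

Every element, including both script tags, is still present in the shorter output; only the two closing tags at the very end are gone, which is consistent with optional tags being omitted rather than content being lost.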
answered Feb 2, 2012 at 5:51
RanRag