It seems that lxml/etree is generally imported as from lxml import etree. Why is that? It keeps the code tidier, and while the potential namespace ambiguity might not be a concern, I don't have any incentive to do this, as I understood it to be generally frowned upon.
I know for a script of this size it doesn't matter much, but I'm going to be using these modules for a lot more. I'm also curious about what others have to say.
#!/usr/bin/python
# Stuart Powers http://sente.cc/
import sys
import urllib
import lxml.html
from cStringIO import StringIO
""" This script parses HTML and extracts the div with an id of 'search-results':
ex: <div id='search-results'>...</div>
$ python script.py "http://www.youtube.com/result?search_query=python+stackoverflow&page=1"
The output, if piped to a file would look like: http://c.sente.cc/E4xR/lxml_results.html
"""
parser = lxml.html.HTMLParser()
filecontents = urllib.urlopen(sys.argv[1]).read()
tree = lxml.etree.parse(StringIO(filecontents), parser)
node = tree.xpath("//div[@id='search-results']")[0]
print lxml.etree.tostring(node, pretty_print=True)
2 Answers
You might be confusing from lxml import etree, which is a legitimate (even preferred) form of absolute import, with implicit relative imports for intra-package imports, which are what PEP 8 discourages: http://www.python.org/dev/peps/pep-0008/ (see the "Imports" section)
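To make the point about absolute imports concrete, here is a minimal sketch showing that the from-import form and the fully qualified form bind the very same module object. It uses the standard library's xml.etree.ElementTree as a stand-in (an assumption made only so the snippet runs without lxml installed); the same import semantics apply to lxml.etree.

```python
import sys

# Absolute import, "from package import module" form.
from xml.etree import ElementTree

# Absolute import, fully qualified form.
import xml.etree.ElementTree

# Both statements bind the same module object, which Python caches
# in sys.modules under its full dotted name.
same_module = ElementTree is xml.etree.ElementTree
cached = sys.modules["xml.etree.ElementTree"] is ElementTree

print(same_module, cached)
```

The choice between the two forms is therefore about readability at the call site, not about which module gets loaded.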
In your case, and in most of the cases I have had while working with lxml.etree or lxml.html, there was only a need for parsing and dumping, which for string input and output can be achieved with the fromstring() and tostring() functions:
from lxml.html import fromstring, tostring
Which would transform your code to:
import sys
import urllib
from lxml.html import fromstring, tostring
data = urllib.urlopen(sys.argv[1]).read()
tree = fromstring(data)
node = tree.xpath("//div[@id='search-results']")[0]
print(tostring(node, pretty_print=True))
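For readers on Python 3, where urllib.urlopen no longer exists, the following is a rough sketch of how the shortened script might be ported. The helper names extract_div and fetch_and_extract are hypothetical, not part of lxml, and the sketch assumes the intent is to serialize only the matched div, as the script's docstring describes. It also returns None instead of raising IndexError when the div is missing.

```python
from urllib.request import urlopen

from lxml.html import fromstring, tostring


def extract_div(html, div_id="search-results"):
    """Return the serialized <div> with the given id, or None if absent.

    Hypothetical helper: accepts the page source as str or bytes.
    """
    tree = fromstring(html)
    # XPath variables ($div_id) avoid building the expression by hand.
    matches = tree.xpath("//div[@id=$div_id]", div_id=div_id)
    if not matches:
        return None
    # encoding="unicode" yields str rather than bytes; with_tail=False
    # drops any text that follows the closing </div>.
    return tostring(
        matches[0], pretty_print=True, encoding="unicode", with_tail=False
    )


def fetch_and_extract(url, div_id="search-results"):
    """Hypothetical helper: download a page and extract the div from it."""
    with urlopen(url) as response:
        return extract_div(response.read(), div_id)
```

A command-line wrapper could then be as small as print(fetch_and_extract(sys.argv[1])), after importing sys.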