Extracting a div from parsed HTML

Question 1

It seems that lxml's etree is generally imported as from lxml import etree. Why is that? It keeps the code tidier, and the potential namespace ambiguity might not be a concern, but I see no incentive to do it, since this style of import is generally frowned upon.

I know for a script of this size it doesn't matter much, but I'm going to be using these modules for a lot more. I'm also curious about what others have to say.

#!/usr/bin/python
# Stuart Powers http://sente.cc/
""" This script parses HTML and extracts the div with an id of 'search-results':
 ex: <div id='search-results'>...</div>
$ python script.py "http://www.youtube.com/result?search_query=python+stackoverflow&page=1"
The output, if piped to a file would look like: http://c.sente.cc/E4xR/lxml_results.html
"""
import sys
import urllib
import lxml.etree
import lxml.html
from cStringIO import StringIO

parser = lxml.html.HTMLParser()
filecontents = urllib.urlopen(sys.argv[1]).read()
tree = lxml.etree.parse(StringIO(filecontents), parser)
node = tree.xpath("//div[@id='search-results']")[0]
# Serialize only the extracted div, not the whole document
print lxml.etree.tostring(node, pretty_print=True)

jfs · Answer 1 · 2012-01-04 16:54:34Z

You might be confusing two different things. from lxml import etree is a legitimate (even preferred) form of absolute import. What is discouraged is the use of relative imports for intra-package imports: http://www.python.org/dev/peps/pep-0008/ (see the "Imports" section).
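To make the point concrete, here is a minimal sketch showing that the common absolute-import spellings all bind the same module object, so the choice between them is about readability rather than behavior. The alias ET and the sample XML string are illustrative assumptions, not part of the original post.

```python
# Three spellings of an absolute import of the same module.
import lxml.etree               # fully qualified: lxml.etree.tostring(...)
from lxml import etree          # the form most lxml code uses
import lxml.etree as ET         # "ET" is just an illustrative alias

# All three names refer to one and the same module object.
assert etree is lxml.etree
assert ET is lxml.etree

# Hypothetical sample document, only to show the shorter call site.
root = etree.fromstring("<root><child/></root>")
print(etree.tostring(root))
```

With from lxml import etree the call sites stay short (etree.tostring) while still showing where the name comes from, which is probably why that form is so widespread.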

alecxe · Answer 2 · 2017-03-05 04:47:13Z

In your case, and in most cases I have had while working with lxml.etree or lxml.html, all that was needed was parsing and dumping. For string input and output, that can be done with the fromstring() and tostring() functions:

from lxml.html import fromstring, tostring

That would transform your code to:

import sys
import urllib
from lxml.html import fromstring, tostring

data = urllib.urlopen(sys.argv[1]).read()
tree = fromstring(data)
node = tree.xpath("//div[@id='search-results']")[0]
# Serialize only the extracted div, not the whole document
print(tostring(node, pretty_print=True))
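For readers on Python 3, the same idea can be sketched against an inline sample so that it runs without network access. The SAMPLE_HTML string and the extract_div helper are hypothetical names introduced for illustration. On Python 3 the page itself would be downloaded with urllib.request.urlopen rather than urllib.urlopen.

```python
from lxml.html import fromstring, tostring

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<html><body>
  <div id="header">ignored</div>
  <div id="search-results"><p>first hit</p><p>second hit</p></div>
</body></html>
"""


def extract_div(html, div_id):
    """Return the serialized div with the given id, or None if it is absent."""
    tree = fromstring(html)
    # Pass the id as an XPath variable instead of formatting it into the query.
    matches = tree.xpath("//div[@id=$div_id]", div_id=div_id)
    if not matches:
        return None
    # encoding="unicode" yields str instead of bytes; with_tail=False drops
    # the whitespace that follows the element in the source document.
    return tostring(matches[0], pretty_print=True,
                    encoding="unicode", with_tail=False)


result = extract_div(SAMPLE_HTML, "search-results")
print(result)
```

Returning None when nothing matches avoids the IndexError that indexing [0] on an empty result list would raise when the page has no such div.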
