python ElementTree write function

Question 1

I am using Python's ElementTree to read and modify some content in my HTML files. When I am done with the changes and call the ElementTree.write function:

1) It adds an extra html: prefix in front of all the tags. How can I avoid that?

2) It also adds & where I have special characters. How can I avoid that?

Thank you, Divya.

Question 2

Might this be of some help? stackoverflow.com/questions/780334/…


Devin Jeanpierre · Accepted Answer · 2011-09-07 14:47:50Z

You can't. ElementTree works by loading the XML, parsing it, and storing only an abstract representation. It writes that out by walking the abstract representation, but it doesn't remember things like which characters were escaped as entities, or whether an empty element was written as <foo/> or <foo></foo> (in HTML: <foo> or <foo></foo>).
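To make that concrete, here is a small round trip through Python 3's xml.etree.ElementTree. This is only a sketch: the sample document is made up, and it assumes the input is well-formed XHTML. It shows the built-in html: prefix for the XHTML namespace, non-ASCII text turning into numeric character references under the default ASCII output, and <br></br> coming back in the short form. The second half shows two settings that appear to reduce the symptoms from the question (register_namespace with an empty prefix, and a Unicode or UTF-8 output encoding). Neither brings back the original spelling of the input.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up sample input: well-formed XHTML with a non-ASCII character,
# an escaped ampersand, and an empty element written in long form.
SOURCE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>caf\u00e9 &amp; bar</p><br></br></body></html>"
)

root = ET.fromstring(SOURCE)

# Default serialization: ASCII bytes, and the XHTML namespace is written
# with ElementTree's built-in "html" prefix.
default_out = ET.tostring(root).decode("ascii")

# Map the XHTML namespace to the empty prefix (this changes module-wide
# state) and ask for a str result so non-ASCII text is written as-is.
ET.register_namespace("", XHTML_NS)
clean_out = ET.tostring(root, encoding="unicode")

print(default_out)
print(clean_out)
```

Note that &amp; stays escaped in both outputs, since a bare & would not be valid XML, and that the empty element is written in short form either way.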

Now, since ElementTree only works with XML (not HTML), I'm guessing you're working with lxml.html. In that case, it in fact automatically corrects certain forms of erroneous HTML, because otherwise it wouldn't be able to store it correctly.

The right way to handle HTML that you want preserved exactly, apart from your own changes, is to read it as tokens that remember their original representation. I've done this using sgmllib, but it is imperfect: for example, there's a get_starttag_text method for getting the exact text of a start tag, but no corresponding method for end tags. It might be good enough anyway.

For example, to write out HTML where all the bold (<b>) tags are removed, one might write the function like this:

import sgmllib
from cStringIO import StringIO

class SGMLModifier(sgmllib.SGMLParser):
    """Echo HTML back out as it is parsed, dropping <b> and </b> tags."""

    def __init__(self, *args, **kwargs):
        sgmllib.SGMLParser.__init__(self, *args, **kwargs)
        self._file = StringIO()

    def getvalue(self):
        return self._file.getvalue()

    def start_b(self, attributes):
        # skip it
        pass

    def end_b(self):
        # skip it
        pass

    def unknown_starttag(self, tag, attributes):
        # write the start tag exactly as it appeared in the input
        self._file.write(self.get_starttag_text())

    def unknown_endtag(self, tag):
        # we can't get this verbatim.
        self._file.write('</%s>' % tag)

    def handle_comment(self, comment):
        # no verbatim here either; the comment text keeps its own whitespace
        self._file.write('<!--%s-->' % comment)

    def handle_data(self, data):
        self._file.write(data)

    def convert_entityref(self, ref):
        # keep entity references such as &amp; escaped instead of decoding them
        return '&' + ref + ';'

def remove_bold(html):
    parser = SGMLModifier()
    parser.feed(html)
    parser.close()  # flush any data still buffered by the parser
    return parser.getvalue()

This might need a bit more work to not mangle the input. Check the documentation for details on everything.
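sgmllib was removed in Python 3, so the class above only runs on Python 2. Below is a rough sketch of the same token-echoing idea on top of html.parser.HTMLParser from the Python 3 standard library. The names TagStripper and remove_tags are made up for this example, and it has the same limitation as the sgmllib version: start tags are written verbatim via get_starttag_text, but end tags are rebuilt from the (lowercased) tag name.

```python
from html.parser import HTMLParser
from io import StringIO


class TagStripper(HTMLParser):
    """Echo HTML back out while parsing, dropping the tags in skip_tags."""

    def __init__(self, skip_tags=("b",)):
        # convert_charrefs=False keeps &amp; and &#233; as separate tokens
        # instead of decoding them into the surrounding text.
        super().__init__(convert_charrefs=False)
        self._skip = set(skip_tags)
        self._out = StringIO()

    def getvalue(self):
        return self._out.getvalue()

    def handle_starttag(self, tag, attrs):
        if tag not in self._skip:
            # exact source text of the start tag
            self._out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/>; written once, verbatim
        if tag not in self._skip:
            self._out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        # not verbatim: the parser only reports the lowercased name
        if tag not in self._skip:
            self._out.write("</%s>" % tag)

    def handle_data(self, data):
        self._out.write(data)

    def handle_entityref(self, name):
        self._out.write("&%s;" % name)

    def handle_charref(self, name):
        self._out.write("&#%s;" % name)

    def handle_comment(self, data):
        self._out.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._out.write("<!%s>" % decl)


def remove_tags(html, skip_tags=("b",)):
    parser = TagStripper(skip_tags)
    parser.feed(html)
    parser.close()  # flush anything still buffered
    return parser.getvalue()


print(remove_tags('<p class="x">a &amp; <b>bold</b> &#233;<br/></p>'))
```

Only the tags themselves are dropped; the text between <b> and </b> is kept. Processing instructions and unusual markup are not handled here, so this would likely need more work before being trusted on arbitrary pages.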

Thank you so much for the reply. Yes, after a lot of study I also found that I can't use ElementTree to complete my work.

Can you please explain how you used sgmllib to get the text between tags in an HTML file? Please explain with some code so that I can understand. I am new to this library, so please help me out.

Hi, thank you so much for that. Just one more question. I have an HTML file. I want to give that as the input file, parse it, and then write back to that file. How should I do that? Any code example that works with your code above, please.
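For the file round trip asked about in the last comment, one possible shape is sketched below. It is not from the answer: rewrite_file is a hypothetical helper, the UTF-8 default is an assumption about the file's encoding, and transform stands for any function that takes the HTML as a string and returns the modified string (for instance the remove_bold function from the answer).

```python
def rewrite_file(path, transform, encoding="utf-8"):
    """Read path, apply transform to its text, and write the result back."""
    # newline="" turns off newline translation so line endings survive.
    with open(path, "r", encoding=encoding, newline="") as f:
        original = f.read()

    # Compute the whole result before reopening the file for writing, so a
    # failure inside transform leaves the file untouched.
    updated = transform(original)

    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(updated)
    return updated
```

The file is read completely before it is opened for writing, which matters because opening in "w" mode truncates it immediately.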
