I am using Python's ElementTree to read and modify some content in my HTML files. When I am done with the changes and call ElementTree.write:

1) It adds an extra html: prefix in front of all the tags. How can I avoid that?

2) It also adds & where I have special characters. How can I avoid that?

Thank you, Divya.

Rupesh Yadav
asked Sep 7, 2011 at 14:33

1 Answer

You can't. ElementTree works by loading the XML, parsing it, and only storing an abstract representation. It writes that out to a string by walking the abstract representation, but it doesn't remember things like which characters were escaped as entities, or whether an element was stored as <foo/> or <foo></foo> (HTML: <foo> or <foo></foo>).
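To make the loss of representation concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree on Python 3. The sample markup and the names XHTML_NS, SOURCE and round_trip are made up for illustration. The prefix is a partial exception to the above: ElementTree ships with a built-in mapping from the XHTML namespace to the prefix html, which is likely where the html: in the question comes from, and register_namespace can replace that mapping. The escaping and the empty-tag form are still normalized either way.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; any XHTML input behaves the same way.
XHTML_NS = "http://www.w3.org/1999/xhtml"
SOURCE = (
    '<html xmlns="%s"><body>'
    '<p>Fish &#38; chips</p><div></div>'
    '</body></html>' % XHTML_NS
)


def round_trip(markup):
    """Parse the markup and serialize it again without modifying the tree."""
    return ET.tostring(ET.fromstring(markup), encoding="unicode")


# By default the XHTML namespace is written with the "html" prefix.
before = round_trip(SOURCE)

# Registering the namespace with an empty prefix makes it the default one.
# Note: this changes module-level state for every later serialization.
ET.register_namespace("", XHTML_NS)
after = round_trip(SOURCE)

print(before)
print(after)
```

In both outputs the numeric reference &#38; comes back as &amp; and the empty div is no longer written as <div></div>, because the tree only stores the decoded text and the element structure.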

Now, since ElementTree only works with XML (not HTML), I'm guessing you're working with lxml.html -- in this case, it in fact automatically corrects certain forms of erroneous HTML, because otherwise it wouldn't be able to store it correctly.

The right way to handle HTML whose data you want to be completely preserved except how you alter it, is to grab it in tokens that remember their original representation. I've done this using sgmllib, but this is imperfect -- e.g. there's a get_starttag_text method for getting the exact content of a start tag, but no corresponding method for end tags. It might be good enough anyway.

For example, to write out HTML with all the bold (<b>) tags removed, one might write something like this (Python 2 only, since sgmllib was removed in Python 3):

import sgmllib
from cStringIO import StringIO


class SGMLModifier(sgmllib.SGMLParser):
    def __init__(self, *args, **kwargs):
        sgmllib.SGMLParser.__init__(self, *args, **kwargs)
        # Collects the rewritten document as it is parsed.
        self._file = StringIO()

    def getvalue(self):
        return self._file.getvalue()

    def start_b(self, attributes):
        # skip it
        pass

    def end_b(self):
        # skip it
        pass

    def unknown_starttag(self, tag, attributes):
        # Start tags can be copied exactly as they appeared in the input.
        self._file.write(self.get_starttag_text())

    def unknown_endtag(self, tag):
        # we can't get this verbatim.
        self._file.write('</%s>' % tag)

    def handle_comment(self, comment):
        # no verbatim here either.
        self._file.write('<!-- %s -->' % comment)

    def handle_data(self, data):
        self._file.write(data)

    def convert_entityref(self, ref):
        # Keep entity references such as &amp; instead of decoding them.
        return '&' + ref + ';'


def remove_bold(html):
    parser = SGMLModifier()
    parser.feed(html)
    parser.close()  # flush any buffered input
    return parser.getvalue()

This might need a bit more work to not mangle the input. Check the documentation for details on everything.
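On Python 3, where sgmllib no longer exists, roughly the same idea can be sketched with html.parser.HTMLParser from the standard library, which also offers get_starttag_text(). This is a swapped-in technique rather than the code above, and the names TagStripper, strip_tags and strip_tags_in_file are hypothetical. As with sgmllib, only start tags are copied verbatim; end tags, comments and references are rebuilt, so unusual input may still come out slightly different.

```python
from html.parser import HTMLParser
from io import StringIO


class TagStripper(HTMLParser):
    """Re-emit HTML as parsed, leaving out the tags named in `drop`."""

    def __init__(self, drop=("b",)):
        # convert_charrefs=False keeps &amp; and &#169; as separate events
        # so they can be written back instead of being decoded.
        super().__init__(convert_charrefs=False)
        self._drop = set(drop)
        self._out = StringIO()

    def getvalue(self):
        return self._out.getvalue()

    def handle_starttag(self, tag, attrs):
        if tag not in self._drop:
            # Exact source text of the start tag.
            self._out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>; write it once, verbatim.
        if tag not in self._drop:
            self._out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Not verbatim: the parser only gives us the lowercased name.
        if tag not in self._drop:
            self._out.write("</%s>" % tag)

    def handle_data(self, data):
        self._out.write(data)

    def handle_entityref(self, name):
        self._out.write("&%s;" % name)

    def handle_charref(self, name):
        self._out.write("&#%s;" % name)

    def handle_comment(self, data):
        self._out.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._out.write("<!%s>" % decl)

    def handle_pi(self, data):
        self._out.write("<?%s>" % data)


def strip_tags(html, drop=("b",)):
    parser = TagStripper(drop)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.getvalue()


def strip_tags_in_file(path, drop=("b",), encoding="utf-8"):
    """Read a file, strip the tags, and write the result back in place."""
    # newline="" avoids translating line endings in either direction.
    with open(path, encoding=encoding, newline="") as handle:
        original = handle.read()
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(strip_tags(original, drop))


sample = '<p class="x">Fish &amp; <b>chips</b> &#169; 2011<br/></p><!-- note -->'
result = strip_tags(sample)
print(result)
```

The file helper assumes the whole document fits in memory and that the encoding is known; it overwrites the input, so keeping a backup copy while experimenting is sensible.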

answered Sep 7, 2011 at 14:47

3 Comments

Thank you so much for the reply. Yes, after a lot of study I also found that I can't use ElementTree to complete my work.
Can you please explain how you used sgmllib to get the text between tags in an HTML file? Please explain with some code so that I can understand. I am new to this library, so please help me out.
Hi, thank you so much for that. Just one more question. I have an HTML file. I want to give it as the input file, parse it, and then write the result back to the same file. How should I do that? Any code example that works with your code above would help.
