0

I'm trying to parse a malformed XHTML page in Python. I just want to get a few tags of the same type from it, but it seems impossible. Normal XHTML parsers don't tolerate the malformed markup, and BeautifulSoup won't work because of syntax errors in its code. What would be the best way to parse malformed XHTML and get the content of a couple of tags of the same type?

asked Dec 12, 2011 at 10:40

3 Answers

2

"Normal" parsers? lxml usually deals fine with malformed HTML, even though it's quite "normal". :-)
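For illustration, here is a minimal sketch of that approach using lxml's forgiving HTML parser (lxml.html) rather than its strict XML parser. The sample markup and the variable names are made up for the example, and it assumes lxml is installed.

```python
from lxml import html

# Hypothetical malformed input: unclosed <p> and <div>, missing </html>.
malformed = "<html><body><p>first<p>second</p><div>other</body>"

# lxml.html.fromstring uses a recovering HTML parser, so it builds a tree
# instead of raising on the broken markup.
root = html.fromstring(malformed)

# Collect the text of every <p> element, in document order.
texts = [p.text_content().strip() for p in root.iter("p")]
print(texts)
```

The same tree also supports root.xpath("//p") if you prefer XPath over iter().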

answered Dec 12, 2011 at 13:00

0

You can try pyquery.

I'm not sure how malformed your XHTML is, but it's worth a try.

answered Dec 12, 2011 at 10:46


0

Thanks for the help! "Unfortunately" I solved it myself by using the standard library's html.parser module and creating the parser with html.parser.HTMLParser(strict=False). That made it read malformed XHTML quite well.
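A rough sketch of how that can look, for anyone landing here later. The class name TagTextCollector, the helper extract_tag_text and the sample markup are my own for this example, not part of the standard library. The sketch does not pass strict at all, since the argument no longer exists on current Python 3 and the parser is lenient without it.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()  # no strict argument: the parser is lenient as-is
        self.tag = tag
        self.results = []
        self._buf = None  # None means "not currently inside the target tag"

    def _flush(self):
        # Store whatever text was gathered for the tag that just ended.
        if self._buf is not None:
            self.results.append("".join(self._buf).strip())
            self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._flush()  # a new start tag implicitly closes an unclosed one
            self._buf = []

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # the last tag may never have been closed


def extract_tag_text(markup, tag):
    parser = TagTextCollector(tag)
    parser.feed(markup)
    parser.close()
    return parser.results


# Hypothetical malformed input: unquoted attribute, unclosed <p> and <b>.
malformed = "<html><body><p class=x>first<p>second <b>bold</p><div>other</div><p>third"
texts = extract_tag_text(malformed, "p")
print(texts)
```

Treating a repeated start tag as an implicit close is a simplification: it suits tags like p or li that don't nest, but it would split the text of tags that legitimately nest, such as div.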

answered Dec 13, 2011 at 8:33

1 Comment

Keep in mind that strict=False is already the default value. The strict argument has been deprecated since Python 3.3 and will be removed in Python 3.5.
