6

I want to scrape some information off a football (soccer) web page using simple Python regexps. The problem is that players such as the first chap, ÄÄRITALO, come out as &#196;&#196;RITALO!
That is, the HTML uses escaped markup for the special characters, such as &#196;

Is there a simple way of reading the HTML into the correct Python string? If it were XML/XHTML it would be easy: the parser would do it.

lillq
asked Sep 10, 2008 at 0:30

3 Answers

7

I would recommend BeautifulSoup for HTML scraping. You also need to tell it to convert HTML entities to the corresponding Unicode characters, like so:

>>> from BeautifulSoup import BeautifulSoup 
>>> html = "<html>&#196;&#196;RITALO!</html>"
>>> soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> print soup.contents[0].string
ÄÄRITALO!

(It would be nice if the standard codecs module included a codec for this, such that you could do "some_string".decode('html_entities') but unfortunately it doesn't!)
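For readers on Python 3.4 or later, the standard library has since gained `html.unescape`, which handles this case without a third-party dependency. A minimal sketch, assuming the page text has already been read into a `str`; the sample markup and the regex are made up for illustration:

```python
import html
import re

# Hypothetical fragment standing in for the scraped page.
page = "<td>&#196;&#196;RITALO!</td><td>&#xC4;&Auml;</td>"

# Pull the cell contents with a simple regex, then decode the entities.
# html.unescape accepts decimal, hex and named character references.
cells = [html.unescape(cell) for cell in re.findall(r"<td>(.*?)</td>", page)]
print(cells)
```

Decoding after the regex match, rather than before, keeps escaped markup characters such as `&lt;` from being mistaken for real tags.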

EDIT: Another solution: Python developer Fredrik Lundh (author of ElementTree, among other things) has a function to unescape HTML entities on his website, which works with decimal, hex and named entities (BeautifulSoup will not work with the hex ones).
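The general idea behind such a function can be sketched with `re.sub` and the standard entity table. This is not Lundh's actual code, only an illustration written for Python 3; the name `unescape_entities` is hypothetical:

```python
import re
from html.entities import name2codepoint


def unescape_entities(text):
    """Replace decimal, hex and named HTML entities with their characters."""

    def replace(match):
        body = match.group(1)
        try:
            if body[:2].lower() == "#x":
                # Hexadecimal reference, e.g. &#xC4;
                return chr(int(body[2:], 16))
            if body.startswith("#"):
                # Decimal reference, e.g. &#196;
                return chr(int(body[1:]))
            # Named reference, e.g. &Auml;
            return chr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            # Unknown or malformed entity: leave the text as it was.
            return match.group(0)

    return re.sub(r"&(#?\w+);", replace, text)


print(unescape_entities("&#196;&#xC4;&Auml;RITALO!"))
```

Unknown entities are passed through unchanged, so text that merely happens to contain an ampersand and a semicolon is not damaged.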

answered Sep 10, 2008 at 0:50

2

Try using BeautifulSoup. It should do the trick and give you a nicely formatted DOM to work with as well.

This blog entry seems to have had some success with it.

answered Sep 10, 2008 at 0:48


0

I haven't tried it myself, but have you looked at http://zesty.ca/python/scrape.html?

It seems to have a method htmldecode(text) which would do what you want.

answered Sep 10, 2008 at 0:32
