Getting international characters from a web page? [duplicate]

I want to scrape some information off a football (soccer) web page using simple Python regexps. The problem is that players such as the first chap, ÄÄRITALO, come out as &#196;&#196;RITALO!
That is, the HTML uses escaped markup for the special characters, such as &#196;.

Is there a simple way of reading the HTML into the correct Python string? If it were XML/XHTML it would be easy, since the parser would do it.

edited Sep 26, 2008 at 0:53

lillq

asked Sep 10, 2008 at 0:30

Nick Fortescue

3 Answers

I would recommend BeautifulSoup for HTML scraping. You also need to tell it to convert HTML entities to the corresponding Unicode characters, like so:

>>> from BeautifulSoup import BeautifulSoup 
>>> html = "<html>&#196;&#196;RITALO!</html>"
>>> soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> print soup.contents[0].string
ÄÄRITALO!

(It would be nice if the standard codecs module included a codec for this, such that you could do "some_string".decode('html_entities') but unfortunately it doesn't!)
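For readers on Python 3.4 or later, the standard library does cover this case, though as a function in the `html` module rather than as a codec. A minimal sketch, assuming Python 3 (the variable names `raw` and `decoded` are only for illustration):

```python
from html import unescape

# The escaped player name as it appears in the page source.
raw = "&#196;&#196;RITALO!"

# unescape() converts character references to the corresponding
# Unicode characters.
decoded = unescape(raw)
print(decoded)
```

The same call also accepts hex and named references, such as `&#xC4;` and `&Auml;`.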

EDIT: Another solution: Python developer Fredrik Lundh (author of ElementTree, among other things) has a function to unescape HTML entities on his website, which works with decimal, hex and named entities (BeautifulSoup will not work with the hex ones).
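That function is not reproduced here. As a rough illustration of the same idea, and not Fredrik Lundh's actual code, a regex-based unescaper for decimal, hex and named entities might look like the sketch below. It assumes Python 3; the name `unescape_entities` and the pattern `_ENTITY_RE` are hypothetical.

```python
import re
from html.entities import name2codepoint

# Matches &#xHH; (hex), &#NNN; (decimal) and &name; (named) references.
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);")


def unescape_entities(text):
    """Replace decimal, hex and named character references in text.

    Hypothetical helper for illustration only.
    """
    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                return chr(int(body[2:], 16))    # hex reference
            if body.startswith("#"):
                return chr(int(body[1:]))        # decimal reference
            return chr(name2codepoint[body])     # named reference
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: leave as-is.
            return match.group(0)

    return _ENTITY_RE.sub(replace, text)


print(unescape_entities("&#196;&#xC4;RITALO!"))
```

Leaving unrecognised references untouched is a design choice in this sketch; it keeps stray ampersands in scraped text from being mangled.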

edited Sep 12, 2008 at 1:44

answered Sep 10, 2008 at 0:50

dF.
