Decoding HTML Entities With Python

Asked 16 years, 10 months ago

Viewed 7k times

The following Python code uses BeautifulStoneSoup to fetch the LibraryThing API information for Tolkien's "The Children of Húrin".

# Python 2 code using BeautifulSoup 3 (BeautifulStoneSoup is its XML parser).
import urllib2
from BeautifulSoup import BeautifulStoneSoup

URL = ("http://www.librarything.com/services/rest/1.0/"
       "?method=librarything.ck.getwork&id=1907912"
       "&apikey=2a2e596b887f554db2bbbf3b07ff812a")

# Parse the API response and convert XML/HTML entities to characters.
soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                          convertEntities=BeautifulStoneSoup.ALL_ENTITIES)

# Find the canonical title field and print its value.
title_field = soup.find('field', attrs={'name': 'canonicaltitle'})
print title_field.find('fact').string

Unfortunately, instead of 'Húrin', it prints 'HÃºrin'. This is clearly an encoding issue, but I can't work out what I need to do to get the expected output. Help would be greatly appreciated.


asked Mar 9, 2009 at 22:47

Daniel Watkins


2 Answers


In the source of the web page it looks like this: The Children of HÃºrin. So the encoding is already broken somewhere on their side before it even gets converted to XML...

If it's a general issue with all the books and you need to work around it, this seems to work:

unicode(title_field.find('fact').string).encode("latin1").decode("utf-8")
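The same round trip can be sketched in Python 3, where `str` is already Unicode and `unicode()` no longer exists. This is only an illustration: `fix_mojibake` is a hypothetical helper name, and it assumes the damage came from UTF-8 bytes being decoded as Latin-1, as described above.

```python
# Python 3 sketch; the one-liner above targets Python 2.

def fix_mojibake(text):
    """Re-encode as Latin-1, then decode as UTF-8 (hypothetical helper)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so the text was not damaged this way;
        # return it unchanged.
        return text

# How the damage arises: UTF-8 bytes read as if they were Latin-1.
broken = "H\u00farin".encode("utf-8").decode("latin-1")
fixed = fix_mojibake(broken)
print(repr(broken), repr(fixed))
```

The `try`/`except` is a design choice for the "all the books" case: titles that were never damaged usually fail the round trip and pass through untouched, though a correct string that happens to look like mis-decoded UTF-8 would still be altered.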


edited May 8, 2012 at 12:17

answered Mar 9, 2009 at 23:05


sth


1 Comment

Daniel Watkins

Yup, I guess that's it. I've contacted LibraryThing about sorting it out. Thanks. :)

2009-03-09T23:21:06.507Z

The web page may be lying about its encoding. The output looks like UTF-8. If you ended up with a str, you'll need to decode it as UTF-8. If you have a unicode object instead, you'll need to encode it as Latin-1 first and then decode the result as UTF-8.
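The two cases can be sketched as follows in Python 3 terms, where Python 2's `str` corresponds to `bytes` and `unicode` corresponds to `str`. The function name `repair_title` is hypothetical, and the sketch assumes the underlying bytes really are UTF-8.

```python
def repair_title(value):
    """Return correctly decoded text for either kind of input."""
    if isinstance(value, bytes):
        # Raw bytes (a Python 2 str): decode them as UTF-8.
        return value.decode("utf-8")
    # Already-decoded text (a Python 2 unicode) that was read as Latin-1:
    # recover the original bytes first, then decode them as UTF-8.
    return value.encode("latin-1").decode("utf-8")

print(repair_title(b"H\xc3\xbarin"))
print(repair_title("H\u00c3\u00barin"))
```

Unlike a defensive wrapper, this version raises an exception when the input does not fit the assumption, which may be preferable while diagnosing where the wrong decoding happens.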


answered Mar 9, 2009 at 22:53


Ignacio Vazquez-Abrams
