I've been banging my head against the wall with this for a while. I'm trying to parse an RSS feed with Python's BeautifulSoup, and every now and then I get errors like:
I don&#39;t know what I am talking about
I can't seem to find any Python library that will replace those characters with what they should be, so that the resulting string looks like this:
I don't know what I am talking about
The closest I've gotten was
urllib.unquote(post_content).decode('utf-8')
But that still does not replace the url encoded character with a '. Does anyone know a good way to replace those urlencoded characters with the ASCII characters they represent? There are also other errors, like ( and ) appearing as &#40; and &#41;
This question is more suited to Stack Overflow. Programmers SE is about program design issues, not specific questions about source code. – logc, Mar 16, 2015 at 14:01
1 Answer
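Those weird strings are called HTML entities (character references); they are not URL encoding, which is why urllib.unquote leaves them untouched. You can decode them as described in the Stack Overflow question "Decode HTML entities in Python string?". On Python 3.4 and later, use the function unescape from the html module (html.unescape). On Python 2, HTMLParser.HTMLParser().unescape does the same job.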
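As a minimal sketch of that approach, assuming Python 3.4 or later: the sample string below is made up for illustration and uses numeric entities for the apostrophe and parentheses, which is only a guess at what the feed actually contains.

```python
import html

# Hypothetical sample text, standing in for content pulled from the RSS feed
post_content = "I don&#39;t know what I am talking about &#40;really&#41;"

# html.unescape converts named and numeric character references
# (e.g. &#39;, &#40;, &amp;) into the characters they represent
decoded = html.unescape(post_content)

print(decoded)  # I don't know what I am talking about (really)
```

If the text has also been URL-encoded, that is a separate step (urllib.parse.unquote on Python 3); unescape only handles the HTML entities.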