I want to convert, in Python, special characters like "%$!&@á é ©" and not only '<&">', which is all that the documentation and references I've found so far show. cgi.escape doesn't solve the problem.
For example, the string "á ê ĩ &" should be converted to "&aacute; &ecirc; &itilde; &amp;".
Does anybody know how to solve it? I'm using Python 2.6.
- Be aware of two things: (1) named entities may cause problems, so you should probably use numeric entities instead. (2) Why use entities at all? In most cases, a better solution is to UTF-8-encode the document so that it can contain the letters, and not use entities. – Konrad Rudolph, Mar 8, 2012 at 11:30
- wiki.python.org/moin/EscapingHtml – Quentin, Mar 8, 2012 at 11:32
- I agree with you @KonradRudolph. I don't like using entities, but the system I'm working on uses them, so I have no choice. =/ – Jayme Tosi Neto, Mar 8, 2012 at 11:35
- @Jayme No problem, sometimes you have no choice. Just wanted to make sure you were aware of this. – Konrad Rudolph, Mar 8, 2012 at 11:38
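For reference, the numeric-entity route suggested in the comments can be sketched with the standard `xmlcharrefreplace` error handler. This is only an illustration: `to_numeric_entities` is a made-up name, and the handler only touches non-ASCII characters, so "&", "<" and ">" would still need separate escaping.

```python
def to_numeric_entities(text):
    # text is assumed to be a unicode string (str on Python 3, unicode on Python 2).
    # Every character that cannot be encoded as ASCII becomes a decimal
    # character reference such as &#225;. ASCII characters pass through unchanged.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


# \u00e1 is "á", \u00ea is "ê", \u0129 is "ĩ"
print(to_numeric_entities(u"\u00e1 \u00ea \u0129"))  # &#225; &#234; &#297;
```

Numeric references do not depend on the entity name being defined, which is why they are the safer choice for characters that have no name in the HTML 4 entity table.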
2 Answers
You could build your own loop using the dictionaries you can find in http://docs.python.org/library/htmllib.html#module-htmlentitydefs
The one you're looking for is htmlentitydefs.codepoint2name
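A minimal sketch of such a loop, assuming the input is already a unicode string. The function name `escape_named` is made up for illustration, and the import falls back to `html.entities`, which is where the same dictionary lives on Python 3.

```python
try:
    from htmlentitydefs import codepoint2name   # Python 2
except ImportError:
    from html.entities import codepoint2name    # Python 3


def escape_named(text):
    # Walk the string one character at a time and swap in the named entity
    # when codepoint2name has one; characters without a name are kept as-is.
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        out.append(u'&%s;' % name if name else ch)
    return u''.join(out)


# \u00e1 is "á", \u00ea is "ê"
print(escape_named(u"\u00e1 \u00ea &"))  # &aacute; &ecirc; &amp;
```

Because each character is looked at exactly once, the "&" in an entity that was just inserted is never escaped a second time, so no special ordering for "&" is needed.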
I found a solution while searching for the htmlentitydefs.codepoint2name that @Ruben Vermeersch mentioned in his answer. I found it here: http://bytes.com/topic/python/answers/594350-convert-unicode-chars-html-entities
Here's the function:
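    from htmlentitydefs import codepoint2name

    def htmlescape(text):
        # Python 2: decode the UTF-8 byte string to unicode first
        text = text.decode('utf-8')
        # Map each character to its named entity; "&" (code point 38) is handled separately
        d = dict((unichr(code), u'&%s;' % name)
                 for code, name in codepoint2name.iteritems() if code != 38)
        # Escape "&" first so the "&" in the entities inserted below is not escaped again
        text = text.replace(u"&", u"&amp;")
        for key, value in d.iteritems():
            if key in text:
                text = text.replace(key, value)
        return text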
Thank you all for helping! ;)