I am trying to write Portuguese to an HTML file but I am getting some funny characters. How do I fix this?
first = """<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i])
f.write(first)
Expected Output: Hoje, nós nos unimos ao povo...
Actual Output in browser (Firefox on Ubuntu): Hoje, nÃ³s nos unimos ao povo...
I tried doing this:
first = """<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i])
f.write(first.encode('utf8'))
Output in terminal: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 65: ordinal not in range(128)
Why am I getting this error and also how can I write other languages to an HTML doc without the funny characters?
Or, is there a different file type that I can write to with the above font formatting?
stackoverflow.com/questions/21129020/… – liuzhidong, Mar 17, 2015 at 14:42
3 Answers
Your format string should be a Unicode string too:
first = u"""<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i])
f.write(first)
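A minimal sketch of why the `.encode('utf8')` attempt in the question fails, assuming `sentences1[i]` is a UTF-8 byte string (the `raw` value below is a stand-in for it, not the asker's real data). In Python 2, calling `.encode()` on a byte string first decodes it implicitly as ASCII, and that hidden step is what raises the `UnicodeDecodeError`. The sketch performs the same decode explicitly, then shows the working order: decode with the real encoding, and only then format into the Unicode template.

```python
# Byte string standing in for one entry of sentences1 (assumed: UTF-8 bytes read from a file).
raw = u"Hoje, n\u00f3s nos unimos ao povo...".encode("utf-8")

# Python 2 performs this ASCII decode implicitly when .encode() is called on a
# byte string; doing it explicitly reproduces the UnicodeDecodeError.
try:
    raw.decode("ascii")
    ascii_decode_failed = False
except UnicodeDecodeError:
    ascii_decode_failed = True

# Decode with the real encoding first, then format into a Unicode template.
template = u"""<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>"""
first = template.format(raw.decode("utf-8"))
```

`first` is now a Unicode string, so it still has to be encoded (or written through a file object that encodes for you) before it reaches the file.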
Where does the sentences1 list come from? Can you post the code for it too?
This is what happens when you call .format on a Unicode template and pass it text read from a file that contains non-ASCII characters:
>>> mystrf = u'special text here >> {} << special text'
>>> g = open('u.txt','r')
>>> lines = g.readlines()
>>> mystrf.format(lines[0])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>>
The template is a Unicode string and lines[0] is a byte string, so Python 2 tries to decode the bytes implicitly as ASCII and fails on the first non-ASCII byte. The fix is to tell Python the proper encoding:
>>> line = mystrf.format(lines[0].decode('utf-8'))
>>> print line
special text here >> ß << special text
But when we try to write the result to a file, it fails again:
>>> towrite.write(line)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 21: ordinal not in range(128)
The file object expects bytes, so Python tries to encode the Unicode string as ASCII. Encode the line as UTF-8 yourself before writing it:
>>> towrite.write(line.encode('utf-8'))
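An alternative sketch that avoids the manual `.decode()` and `.encode()` calls: `io.open` takes an `encoding` argument and converts between bytes and Unicode at the file boundary, on both Python 2 and Python 3. The temporary directory and the `out.html` name are assumptions made for the example, and the input file is created here only so the snippet is self-contained.

```python
import io
import os
import tempfile

template = u"special text here >> {} << special text"

# Hypothetical paths in a scratch directory, used only for this example.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "u.txt")
dst = os.path.join(workdir, "out.html")

# Create a small UTF-8 input file containing a non-ASCII character.
with io.open(src, "w", encoding="utf-8") as g:
    g.write(u"\u00df\n")

# Reading with an explicit encoding yields Unicode strings, not bytes.
with io.open(src, "r", encoding="utf-8") as g:
    lines = g.readlines()

line = template.format(lines[0].strip())

# Writing with an explicit encoding accepts Unicode and encodes it as UTF-8.
with io.open(dst, "w", encoding="utf-8") as towrite:
    towrite.write(line)

# Read the raw bytes back to confirm what ended up on disk.
with io.open(dst, "rb") as check:
    written = check.read()
```

With this approach the program handles Unicode strings everywhere in between, and the encoding is named exactly once per file.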
It appears that you're working with a string that is already UTF-8 encoded, so that's OK. The problem is that the meta tag in the HTML output is identifying the text as something other than UTF-8. For example, you may have <meta charset="ISO-8859-1">; you need to change it to <meta charset="UTF-8">.
The term for this kind of character set confusion is Mojibake.
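A short sketch of how this particular Mojibake arises, assuming the browser falls back to ISO-8859-1: the UTF-8 bytes for "ó" are read back as two separate Latin-1 characters.

```python
original = u"n\u00f3s"                      # the word "nós"
utf8_bytes = original.encode("utf-8")       # the accented letter becomes the two bytes C3 B3
misread = utf8_bytes.decode("iso-8859-1")   # each byte is shown as its own character

# Reversing the wrong decode recovers the original text.
recovered = misread.encode("iso-8859-1").decode("utf-8")
```

Here `misread` is the three-letter word with "Ã³" in the middle, which matches what the browser showed.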
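P.S. Your string starts with a Byte Order Mark (BOM): the byte 0xef at position 65 in your traceback is the first byte of the inserted sentence, and the UTF-8 BOM is EF BB BF. You might want to remove it before working with the string.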
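Putting both points together, here is a hedged sketch of writing a complete page that declares UTF-8 and strips a leading BOM from each sentence. The helper names `strip_bom` and `build_page`, the sample sentence, and the output path are all hypothetical, and the sketch does not HTML-escape the sentences.

```python
import io
import os
import tempfile

BOM = u"\ufeff"  # what the UTF-8 bytes EF BB BF decode to

def strip_bom(text):
    """Return text without a leading byte order mark, if one is present."""
    return text[len(BOM):] if text.startswith(BOM) else text

def build_page(sentences):
    """Wrap each sentence in a styled <p> inside a page that declares UTF-8."""
    paragraph = u"""<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>"""
    body = u"\n".join(paragraph.format(strip_bom(s)) for s in sentences)
    return (u"<!DOCTYPE html>\n<html>\n<head>\n"
            u'<meta charset="UTF-8">\n'
            u"</head>\n<body>\n" + body + u"\n</body>\n</html>\n")

# Sample input with a BOM at the start, as in the question (assumed data).
sentences1 = [u"\ufeffHoje, n\u00f3s nos unimos ao povo..."]
page = build_page(sentences1)

# Hypothetical output location; io.open encodes the Unicode page as UTF-8.
out_path = os.path.join(tempfile.mkdtemp(), "out.html")
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(page)
```

If the sentences come from a file, opening it with `encoding="utf-8-sig"` removes a BOM at the start of the file during decoding, so `strip_bom` is only needed for text that has already been decoded.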