Python - Change string to utf8

Question 1

I am trying to write Portuguese to an HTML file but I am getting some funny characters. How do I fix this?

first = """<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i]) 
f.write(first)

Expected Output: Hoje, nós nos unimos ao povo...

Actual Output in browser (Firefox on Ubuntu): ï»¿Hoje, nÃ³s nos unimos ao povo...

I tried doing this:

first = """<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i]) 
f.write(first.encode('utf8'))

Output in terminal: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 65: ordinal not in range(128)

Why am I getting this error, and how can I write text in other languages to an HTML document without the funny characters?
Or is there a different file type that I can write to while keeping the font formatting above?

Question 2

stackoverflow.com/questions/21129020/…

Question 3

Your format string should be a Unicode string too:

first = u"""<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i]) 
f.write(first)

Question 4

I am still getting this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

Question 5

Where does your sentences1 list come from? Can you post the code for it too?

Question 6

@kedar
Traceback (most recent call last):
  File "p.py", line 80, in <module>
    first = u"""<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

Question 7

@selcuk The sentences1 list is read from a file: each sentence is read and stored in the list. My code works perfectly on English text. If I try to write text in a different language, I get funny symbols. So I tried to change the codec, and then I got the errors.

Question 8

Do you decode according to the encoding of the file when reading from it?

Question 9

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

^ Read it!

This is what happens in Python 2 when you call .format on a Unicode string and pass it a line read from a file that contains special characters.

>>> mystrf = u'special text here >> {} << special text'
>>> g = open('u.txt','r')
>>> lines = g.readlines()
>>> mystrf.format(lines[0])
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>>
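To see where that message comes from, here is a minimal sketch that performs the ASCII decode by hand on the UTF-8 bytes of ß. The byte values match the traceback above; the variable names are only illustrative.

```python
# UTF-8 encoding of U+00DF (the sharp s used in this answer): two bytes.
raw = b'\xc3\x9f'

try:
    # ASCII only covers bytes 0x00-0x7f, so the first byte already fails.
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The printed message names byte 0xc3 at position 0, the same failure that the .format call triggers behind the scenes.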

Python implicitly tries to decode the bytes read from the file as ASCII, which fails on the first non-ASCII byte. So how do we fix that?

We simply tell Python the proper encoding by decoding the line explicitly.

>>> line = mystrf.format(lines[0].decode('utf-8'))
>>> print line
special text here >> ß << special text

But when we try to write the result to a file (towrite is a file object opened for writing), it fails again, this time while encoding.

>>> towrite.write(line)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 21: ordinal not in range(128)

So we encode the line back to UTF-8 bytes before writing it to the file.

>>> towrite.write(line.encode('utf-8'))
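As an alternative to calling .decode and .encode by hand, io.open can do both at the file boundary. This is only a sketch, assuming the input really is UTF-8; the temporary directory and the file names are hypothetical stand-ins for u.txt and the output file, and the code is written to run on Python 3 as well as Python 2.7.

```python
import io
import os
import tempfile

template = u'special text here >> {} << special text'

# Hypothetical scratch files standing in for u.txt and the output file.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'u.txt')
dst_path = os.path.join(workdir, 'out.txt')

# Create a small UTF-8 input file containing U+00DF (sharp s).
with io.open(src_path, 'w', encoding='utf-8') as src:
    src.write(u'\xdf\n')

# io.open decodes on read, so each line is already a Unicode string.
with io.open(src_path, 'r', encoding='utf-8') as src:
    lines = src.readlines()

line = template.format(lines[0].strip())

# io.open encodes on write, so no explicit .encode('utf-8') is needed.
with io.open(dst_path, 'w', encoding='utf-8') as dst:
    dst.write(line)

# Read the raw bytes back to confirm what landed on disk.
with io.open(dst_path, 'rb') as check:
    written = check.read()
```

The point of this design is that the program only handles Unicode strings internally, and bytes exist only inside the file objects.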

Question 10

It appears that you're working with a string that is already UTF-8 encoded, so that's OK. The problem is that the meta tag in the HTML output is identifying the text as something other than UTF-8. For example, you may have <meta charset="ISO-8859-1">; you need to change it to <meta charset="UTF-8">.

The term for this kind of character set confusion is Mojibake.
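The garbled output in the question can be reproduced in a few lines. This sketch assumes the browser read the UTF-8 bytes as ISO-8859-1 (Latin-1), as in the meta tag example above; the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
text = u'n\xf3s'                          # "nós"

# Correct bytes on disk: U+00F3 becomes the two bytes C3 B3 in UTF-8.
utf8_bytes = text.encode('utf-8')

# Reading those bytes with the wrong codec turns each byte into its own character.
garbled = utf8_bytes.decode('latin-1')    # "nÃ³s"

# The UTF-8 byte order mark (EF BB BF) misread the same way gives "ï»¿".
bom_garbled = u'\ufeff'.encode('utf-8').decode('latin-1')

# The damage is reversible as long as the bytes themselves are untouched.
repaired = garbled.encode('latin-1').decode('utf-8')
```

This is why the browser shows both the "ï»¿" prefix and "nÃ³s": the bytes in the file are fine, only the declared encoding is wrong.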

P.S. Your string starts with a Byte Order Mark (BOM); you might want to remove it before working with the string.
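Both points can be combined in one sketch: decode with the 'utf-8-sig' codec, which drops a leading BOM if there is one, and declare UTF-8 in the page itself. The sample bytes, the page skeleton, and the output path are assumptions for illustration, not the asker's actual file.

```python
import io
import os
import tempfile

# Hypothetical raw input: a UTF-8 BOM followed by the sentence from the question.
raw = b'\xef\xbb\xbfHoje, n\xc3\xb3s nos unimos ao povo...'

# 'utf-8-sig' decodes UTF-8 and removes a leading BOM when present.
sentence = raw.decode('utf-8-sig')

# A minimal page that tells the browser the bytes are UTF-8.
page = (
    u'<!DOCTYPE html>\n'
    u'<html>\n'
    u'<head><meta charset="UTF-8"></head>\n'
    u'<body>\n'
    u'<p style="color: red; font-family: \'Liberation Sans\',sans-serif">{}</p>\n'
    u'</body>\n'
    u'</html>\n'
).format(sentence)

# Hypothetical output location; io.open encodes the text as UTF-8 on write.
out_path = os.path.join(tempfile.mkdtemp(), 'out.html')
with io.open(out_path, 'w', encoding='utf-8') as f:
    f.write(page)

# Read the raw bytes back to check that no BOM was written.
with io.open(out_path, 'rb') as check:
    written = check.read()
```

Plain 'utf-8' would keep the BOM as a U+FEFF character at the start of the string, which is what ended up inside the paragraph in the question.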

Accepted answer: Selcuk, 2015-03-17 13:55:53Z (the Unicode format string suggestion under Question 3)


