Convert characters using Python


I received a text file, but some of its characters are garbled.

One example is the text below:

ApresentaÃ§Ã£o/ divulgaÃ§Ã£o do curso

But the correct text is

Apresentação/ divulgação do curso

The PHP function utf8_decode fixes it, as in the example below:

echo utf8_decode("ApresentaÃ§Ã£o/ divulgaÃ§Ã£o do curso");
result Apresentação/ divulgação do curso

But I can't make it work in Python. I tried:

my_str = 'ApresentaÃ§Ã£o/ divulgaÃ§Ã£o do curso'
print( my_str.decode("utf-8") )

But I got the following error:

AttributeError: 'str' object has no attribute 'decode'

How can I make this work in Python?


asked Mar 21, 2019 at 13:55


fabiobh


Can you show the code where you obtain this string, e.g. by opening the text file?
– lenz, commented Mar 21, 2019 at 14:04

I get the text from a CSV file.
– fabiobh, commented Mar 21, 2019 at 14:43


2 Answers

The string is the result of decoding the raw UTF-8 bytes as latin-1. So just re-encode them as latin-1, then decode as utf-8:

>>> my_str = 'ApresentaÃ§Ã£o/ divulgaÃ§Ã£o do curso'
>>> print( my_str.encode('latin-1').decode("utf-8") )
Apresentação/ divulgação do curso
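
The round trip above can be wrapped in a small helper. This is only a sketch: the function name fix_mojibake and its wrong_encoding parameter are made up for illustration, and it assumes the text really is UTF-8 that was decoded with a single-byte encoding. If the round trip fails, it returns the input unchanged rather than guessing.

```python
def fix_mojibake(text, wrong_encoding="latin-1"):
    """Undo UTF-8 text that was mistakenly decoded with wrong_encoding.

    Hypothetical helper: returns the input unchanged if the round trip fails.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside wrong_encoding, or the
        # recovered bytes are not valid UTF-8, so assume it was not garbled.
        return text


garbled = "ApresentaÃ§Ã£o/ divulgaÃ§Ã£o do curso"
fixed = fix_mojibake(garbled)
print(fixed)
```

If the file was decoded as cp1252 rather than latin-1, passing wrong_encoding="cp1252" may be needed, since the two encodings differ for some characters (for example curly quotes and the euro sign).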

If this string came from a file you opened in Python, open most likely fell back to a default encoding of latin-1 (or the similar cp1252). In that case, the better fix is to pass the correct encoding to open so the text is decoded correctly in the first place, changing something like:

with open('myfile.txt') as f:
    my_str = f.read()

to:

with open('myfile.txt', encoding='utf-8') as f:
    my_str = f.read()

so no additional encode or decode steps are required.
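
Since the question's text comes from a CSV file, the same idea can be sketched with the csv module. The file name sample.csv, its contents, and the read_rows helper are all assumptions made up for this example. It writes a UTF-8 file, then reads it once with the wrong encoding and once with the right one.

```python
import csv
import os
import tempfile

# Write a small UTF-8 CSV to a temporary file (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerow(["1", "Apresentação/ divulgação do curso"])


def read_rows(path, encoding):
    # newline="" is what the csv module documentation recommends for files.
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))


wrong = read_rows(path, "latin-1")  # reproduces the garbled text
right = read_rows(path, "utf-8")    # decodes correctly in the first place

print(wrong[0][1])
print(right[0][1])
```

Passing encoding explicitly also makes the script behave the same way regardless of the machine's locale settings.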


answered Mar 21, 2019 at 14:05


ShadowRanger


I think the initial text is in iso-8859-1. This will fix it:

>>> s = 'ApresentaÃ§Ã£o/ divulgaÃ§Ã£o do curso'
>>> bytes(s, 'iso-8859-1').decode('utf-8')
'Apresentação/ divulgação do curso'


answered Mar 21, 2019 at 14:05


Bogsan


The text was initially in UTF-8, but somebody (the OP's code?) wrongly decoded it as ISO-8859-1.
– lenz, commented Mar 21, 2019 at 14:05
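
To illustrate the comment above, here is a minimal sketch of how the garbling arises and why the question's my_str.decode call failed. The variable names are chosen for illustration only. In Python 3, str has no decode method; only bytes objects do, so a str must first be turned back into bytes.

```python
original = "ção"

raw = original.encode("utf-8")       # the bytes as stored in the file
garbled = raw.decode("iso-8859-1")   # the wrong decoding step

print(raw)      # b'\xc3\xa7\xc3\xa3o'
print(garbled)  # Ã§Ã£o

# Undo the wrong step: recover the bytes, then decode them correctly.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True

# str has no decode method in Python 3; bytes does.
print(hasattr(str, "decode"), hasattr(bytes, "decode"))  # False True
```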
