utf-8 unicode error python

Question 1

new_text = text.decode('utf-8').replace('\u00a0', ' ').replace('\u00ad', ' ').replace('Â', ' ').replace(' ',' ').replace(' ', ' ').replace(' ', ' ').replace('\u20b9',' ').replace('\ufffd',' ').replace('\u037e',' ').replace('\u2022',' ').replace('\u200b',' ').replace('0xc3',' ')

This is the error produced by the code:

new_text = text.decode('utf-8').replace('\u00a0', ' ').replace('\u00ad', ' ').replace('Â', ' ').replace(' ',
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
127.0.0.1 - - [29/Aug/2017 15:22:00] "GET / HTTP/1.1" 500 -

I have tried decoding ascii from unicode.

Question 2

What is text?

Question 3

The text was generated by converting a PDF document (using Watson document converter). This is a part of the text: [ no title Bajaj Allianz General Insurance Company Ltd. GE Plaza, Airport Road, Yerwada, Pune - 411006(India) CERTIFICATE CUM POLICY SCHEDULE Policy Servicing Off: Bajaj Finserv Building, 1st Floor, Behind Weikfield IT-Park, Viman Nagar, Pune-411014 Phone No :1800-209-0144 Product Private Car - Liability Only Policy Period Of Insurance From: 27-May-2017 Policy issued on 25-May-2017 - To: 26-May-2018 Midnight Cover Note No / Insured Name SANJAY SINGH ]

Question 4

Do the replacements one at a time instead of all at once and figure out which one is causing the error. If on Python 2, it is probably .replace('Â', ' ') and you need to use Unicode strings everywhere (u'\u00a0', etc.).
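One way to follow this advice is to run the replacements in a loop and record which ones raise. This is only a sketch: `find_failing_replacements` and the sample data are hypothetical, not from the original post. On Python 2 a non-ASCII byte-string argument raises UnicodeDecodeError, while on Python 3 any bytes argument raises TypeError, so both are caught.

```python
# -*- coding: utf-8 -*-
def find_failing_replacements(text, targets, replacement=u' '):
    """Apply each replacement separately; return (cleaned_text, failed_targets)."""
    failed = []
    for target in targets:
        try:
            text = text.replace(target, replacement)
        except (UnicodeDecodeError, TypeError):
            # Python 2: implicit ASCII decoding of a non-ASCII byte string fails.
            # Python 3: str.replace refuses a bytes argument outright.
            failed.append(target)
    return text, failed


# Hypothetical sample: decoded text, with one byte-string target mixed in by mistake.
sample = b'Policy\xc2\xa0No'.decode('utf-8')
targets = [u'\u00a0', b'\xc3\x82', u'\u00ad']
cleaned, failed = find_failing_replacements(sample, targets)
```

Here `failed` ends up holding only the byte-string target, which is the kind of argument that triggers the error in the question.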

Question 5

You are calling .replace on a unicode object but giving str arguments to it. The arguments are converted to unicode using the default ASCII encoding, which will fail for bytes not in range(128).

To avoid this problem do not mix str and unicode. Either pass unicode arguments to unicode methods:

new_text = text.decode('utf-8').replace(u'\u00a0', u' ').replace(u'\u00ad', u' ')...

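or do the replacements on the str object itself, assuming text is a str. In a Python 2 byte string '\u00a0' is not an escape sequence, so the targets have to be spelled as UTF-8 byte sequences:

new_text = text.replace('\xc2\xa0', ' ').replace('\xc2\xad', ' ')...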
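As a sketch of the first option (decode once, then work only with unicode), the replacements can be collected in a translation table instead of a long `.replace` chain. `UNWANTED`, `TABLE`, and `clean` are hypothetical names, and the code points are the ones that are legible in the question's code.

```python
# -*- coding: utf-8 -*-
# Code points from the question's .replace chain; each one maps to a plain space.
# 0x00c2 is the character 'Â' that the question also replaces.
UNWANTED = [0x00a0, 0x00ad, 0x00c2, 0x20b9, 0xfffd, 0x037e, 0x2022, 0x200b]
TABLE = dict((code_point, u' ') for code_point in UNWANTED)


def clean(raw_bytes):
    """Decode UTF-8 bytes once, then replace unwanted characters in unicode."""
    text = raw_bytes.decode('utf-8')
    return text.translate(TABLE)


# Made-up sample input, encoded to bytes to mimic what the converter returns.
raw = u'Pune\u00a0-\u200b411006 \u2022 \u20b9'.encode('utf-8')
result = clean(raw)
```

Because every key and value in the table is unicode, no implicit ASCII conversion takes place, and the same code runs on Python 2.7 and Python 3.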

Question 6

The last piece of your chained replaces seems to be the problem.

text.replace('0xc3', ' ')

This will try to replace the bytes 0xc3 with a space. In your code snippet it effectively reads

text.decode('utf-8').replace('0xc3', ' ')

which means that you first decode bytes to a (unicode-)string in python and then want to replace the wrong bytes. It should work if you replace the bytes before decoding:

text.replace('0xc3', ' ').decode('utf-8')

Question 7

Is there any way to convert UTF-8 encoding directly to text?

Question 8

Type casting doesn't work either. Also, I am working on Python 2.7.

Question 9

@AryanSingh what do you mean by "utf-8 encoding" and "text"? Those are not types in Python.

Question 10

The last piece is not the problem. That replaces the 4-character text 0xc3 with a space, which, while not what the OP wants, is still valid code.
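The difference between the four-character text and the actual byte can be checked directly. A minimal sketch with made-up sample data:

```python
# -*- coding: utf-8 -*-
literal = '0xc3'       # four ordinary ASCII characters: '0', 'x', 'c', '3'
single_byte = b'\xc3'  # one byte with the value 0xC3

# Hypothetical input: UTF-8 bytes that begin with C3 82 (the character U+00C2)
# and also happen to contain the four-character text "0xc3".
raw = u'\u00c2 and the text 0xc3'.encode('utf-8')

# Replacing the four-character literal leaves the real 0xC3 byte untouched.
after_literal = raw.replace(b'0xc3', b' ')
```

So `.replace('0xc3', ' ')` is valid code that would only match the literal characters `0xc3`. It does not touch the byte named in the error message.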

Accepted answer (the text under Question 5 above) by Stop harming Monica · 2017-08-29 12:19:46Z

