new_text = text.decode('utf-8').replace('\u00a0', ' ').replace('\u00ad', ' ').replace('Â', ' ').replace(' ',' ').replace(' ', ' ').replace(' ', ' ').replace('\u20b9',' ').replace('\ufffd',' ').replace('\u037e',' ').replace('\u2022',' ').replace('\u200b',' ').replace('0xc3',' ')
This is the error produced by the code:
new_text = text.decode('utf-8').replace('\u00a0', ' ').replace('\u00ad', ' ').replace('Â', ' ').replace(' ',
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
127.0.0.1 - - [29/Aug/2017 15:22:00] "GET / HTTP/1.1" 500 -
I have tried decoding ascii from unicode.
2 Answers
You are calling .replace on a unicode object but giving str arguments to it. The arguments are converted to unicode using the default ASCII encoding, which will fail for bytes not in range(128).
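The implicit conversion can be made explicit to see where the message comes from. This is only a sketch, and it assumes the 'Â' literal in the source file was saved as the UTF-8 bytes C3 82, which would match "byte 0xc3 in position 0" in the traceback:

```python
# Assumption: the 'Â' literal was stored as the UTF-8 bytes C3 82.
arg = b'\xc3\x82'

try:
    # Python 2 does this step silently when a str is passed to a unicode method.
    arg.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The byte strings and the exception type are the same in Python 2 and Python 3, so the snippet runs on either.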
To avoid this problem do not mix str and unicode. Either pass unicode arguments to unicode methods:
new_text = text.decode('utf-8').replace(u'\u00a0', u' ').replace(u'\u00ad', u' ')...
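or do the replacements on the str object itself, assuming text is a UTF-8 encoded str. A plain str literal does not interpret \u escapes in Python 2, so spell out the UTF-8 byte sequences instead:
new_text = text.replace('\xc2\xa0', ' ').replace('\xc2\xad', ' ')...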
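A possible way to apply the first option without a long chain of .replace calls is a translation table. This is a minimal sketch, not the OP's code: clean_text, UNWANTED and TO_SPACE are hypothetical names, and the character list is taken from the \u escapes in the question plus U+00C2 for 'Â'.

```python
# Hypothetical helper names; only unicode literals are used, so no
# implicit ASCII conversion can occur.
UNWANTED = u'\u00a0\u00ad\u00c2\u20b9\ufffd\u037e\u2022\u200b'

# Map each unwanted code point to a single space.
TO_SPACE = dict.fromkeys(map(ord, UNWANTED), u' ')


def clean_text(raw):
    """Decode UTF-8 bytes, then turn every unwanted character into a space."""
    return raw.decode('utf-8').translate(TO_SPACE)


# UTF-8 bytes for u'price:\u00a0\u20b9100'
raw = b'price:\xc2\xa0\xe2\x82\xb9100'
print(clean_text(raw) == u'price:  100')
```

The u'' and b'' prefixes are accepted by Python 2.6+ and Python 3.3+, so the same sketch should behave identically on both.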
The last piece of your chained replaces seems to be the problem.
text.replace('0xc3', ' ')
This will try to replace the bytes 0xc3 with a space. In your code snippet it effectively reads
text.decode('utf-8').replace('0xc3', ' ')
which means that you first decode the bytes to a unicode string in Python and then try to replace the wrong bytes. It should work if you replace the bytes before decoding:
text.replace('0xc3', ' ').decode('utf-8')
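One caveat worth checking before relying on this: the literal '0xc3' is four ordinary characters, while the single byte is written with the \xc3 escape. The sketch below uses made-up sample data to show the difference, and also that removing only the lead byte C3 leaves a stray continuation byte behind:

```python
literal = b'0xc3'    # four ASCII characters: '0', 'x', 'c', '3'
single = b'\xc3'     # one byte with the value 0xC3

print(len(literal))
print(len(single))

# Sample data (an assumption for illustration): the UTF-8 bytes C3 82,
# a space, then the text "0xc3".
data = b'\xc3\x82 0xc3'

# Replacing the four-character literal leaves the real byte untouched.
print(data.replace(literal, b' ') == b'\xc3\x82  ')

# Replacing the single byte leaves the continuation byte 0x82 behind,
# which is no longer valid UTF-8 on its own.
print(data.replace(single, b' ') == b' \x82 0xc3')
```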
4 Comments
0xc3 with a space, which, while not what the OP wants, is still valid code.
text?.replace('Â', ' ') and you need to use Unicode strings everywhere (u'\u00a0', etc.).