I have a string in a document that contains Unicode escape sequences. All I want is to remove these characters or replace them with a space (" "). For example:
doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
How do I convert it to the following?
doc = "Hello my name is Ruth! I really like swimming and dancing"
I already tried this: https://stackoverflow.com/a/20078869/5505608, but nothing happens. I'm using Python 3.
1 Answer
You can encode to ASCII and ignore errors (i.e. code points that cannot be converted to an ASCII character).
>>> doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
>>> doc.encode('ascii', errors='ignore')
b'Hello my name is Ruth ! I really like swimming and dancing '
If the trailing whitespace bothers you, strip it off. Note that encode returns a bytes object, so decode the result as ASCII if you need a str again. Chaining everything would look like this:
>>> doc.encode('ascii', errors='ignore').strip().decode('ascii')
'Hello my name is Ruth ! I really like swimming and dancing'
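The chained call still leaves "Ruth !" with a space before the exclamation mark, whereas the question asks for "Ruth!". Below is a minimal sketch of one way to close that gap, assuming it is acceptable to collapse whitespace and to drop spaces before common punctuation. The helper name strip_non_ascii and the punctuation set are illustrative choices, not part of the original answer.

```python
import re


def strip_non_ascii(text: str) -> str:
    """Drop every non-ASCII code point and tidy the leftover whitespace."""
    # encode() returns a new bytes object; the original string is untouched.
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse runs of whitespace left behind by the removed characters.
    collapsed = re.sub(r"\s+", " ", ascii_only).strip()
    # Remove a space left dangling before punctuation, e.g. "Ruth !" -> "Ruth!".
    # The punctuation set here is an assumption; adjust it to your data.
    return re.sub(r"\s+([!?.,;:])", r"\1", collapsed)


doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
doc = strip_non_ascii(doc)  # assign the result back; strings are immutable
print(doc)
```

Because Python strings are immutable, the cleaned value only exists in the function's return value, so it has to be assigned back to doc (or to a new name) to have any visible effect.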
answered May 16, 2017 at 20:29
timgeb
5 Comments
Fregy
I've already tried to encode. The code runs, but still nothing changes. Thanks for your reply.
Fregy
My purpose is to clean the Unicode characters from the tweets that I've streamed. I tried the code on my tweet.txt, which contains 10 tweets.
Fregy
which one? @timgeb
timgeb
the one in the answer.
Fregy
The Unicode characters still appear after using tweet.encode('ascii', errors='ignore')
re.sub(r'[^\x00-\x7F]+',' ', text). The code works, but nothing changed. @MarkRansom
re.sub and assign it back to text.
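The point about assigning the result back can be illustrated with a short sketch. It assumes the "nothing changed" symptom comes from discarding the return value of re.sub. The sample string and the name still_dirty are made up for illustration.

```python
import re

text = "dancing \u2026 today"

# re.sub returns a new string; calling it without using the result
# leaves `text` exactly as it was.
re.sub(r"[^\x00-\x7F]+", " ", text)
still_dirty = text  # same object, still contains the ellipsis character

# Assigning the return value back is what updates the variable.
text = re.sub(r"[^\x00-\x7F]+", " ", text)

print("\u2026" in still_dirty)  # whether the discarded call left the ellipsis in place
print(text.isascii())           # whether the reassigned string is pure ASCII
```

The same applies to tweet.encode('ascii', errors='ignore'): it returns a new bytes object and never modifies tweet in place.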