Remove Unicode code (\uxxx) in string Python

Asked 8 years, 7 months ago

Viewed 5k times

I have a string in a document that contains Unicode escape codes. All I want is to remove these codes or replace them with a space (" "). Example:

doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"

How do I convert it to the following?

doc = "Hello my name is Ruth! I really like swimming and dancing"

I already tried this: https://stackoverflow.com/a/20078869/5505608, but nothing happens. I'm using Python 3.

edited May 23, 2017 at 11:54 by Community Bot

asked May 16, 2017 at 20:10 by Fregy

If the answer you linked didn't work, there's something you're not telling us.
– Mark Ransom, May 16, 2017 at 21:32

I already tried re.sub(r'[^\x00-\x7F]+',' ', text). The code runs, but nothing changed. @MarkRansom
– Fregy, May 17, 2017 at 5:38

That's because strings don't update in place; they're immutable. You need to take the return value of re.sub and assign it back to text.
– Mark Ransom, May 17, 2017 at 14:00
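
The point in the last comment can be shown with a minimal sketch. The variable names `text` and `cleaned` are only illustrative; the pattern is the one quoted in the comment above.

```python
import re

text = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"

# re.sub returns a new string; calling it without using the result
# leaves the original untouched because str objects are immutable.
re.sub(r'[^\x00-\x7F]+', ' ', text)
assert '\u2026' in text  # the ellipsis is still there

# Keep the return value to actually get the cleaned string.
cleaned = re.sub(r'[^\x00-\x7F]+', ' ', text)
```

Here each run of non-ASCII characters becomes a single space, so `cleaned` still contains doubled spaces where a character was removed next to an existing space.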


1 Answer

You can encode to ASCII and ignore errors (i.e. code points that cannot be converted to an ASCII character).

>>> doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
>>> doc.encode('ascii', errors='ignore')
b'Hello my name is Ruth ! I really like swimming and dancing '

Note that the result is a bytes object. If the trailing whitespace bothers you, strip it off. Depending on your use case, you can decode the result with ASCII to get a str again. Chaining everything would look like this:

>>> doc.encode('ascii', errors='ignore').strip().decode('ascii')
'Hello my name is Ruth ! I really like swimming and dancing'
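
The chained call above still leaves a space before the exclamation mark, whereas the question asks for "Ruth!". One way to close that gap, as a sketch only: the helper name `strip_non_ascii` and the punctuation set are assumptions for illustration, not part of the original answer.

```python
import re

def strip_non_ascii(text):
    """Drop non-ASCII code points, then tidy the leftover whitespace."""
    ascii_only = text.encode('ascii', errors='ignore').decode('ascii')
    # Collapse runs of whitespace left behind by the removed characters.
    collapsed = re.sub(r'\s+', ' ', ascii_only).strip()
    # Remove a space stranded in front of punctuation ("Ruth !" -> "Ruth!").
    return re.sub(r' ([!?.,;:])', r'\1', collapsed)

doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
result = strip_non_ascii(doc)
print(result)
```

Whether to touch spacing around punctuation at all depends on the text being cleaned; for tweets it may be safer to stop after collapsing whitespace.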

answered May 16, 2017 at 20:29 by timgeb

5 Comments

I've already tried to encode. The code runs, but still nothing changes. Thanks for your reply.
– Fregy, May 17, 2017 at 5:33

My purpose is to clean the Unicode codes from tweets that I've streamed. I tried the code on my tweet.txt, which contains 10 tweets.
– Fregy, May 17, 2017 at 5:48

Which one? @timgeb
– Fregy, May 17, 2017 at 6:15

The one in the answer.
– timgeb, May 17, 2017 at 6:15

The Unicode code still appears after using tweet.encode('ascii', errors='ignore').
– Fregy, May 17, 2017 at 6:26
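
The thread never establishes why the codes survive. One possible explanation, offered here only as an assumption, is that tweet.txt stores the escapes as literal text (a backslash, the letter u, and four hex digits) rather than as the characters themselves. In that case every character is already ASCII and encoding removes nothing, as this sketch shows:

```python
import re

# Assumption: the escapes were saved as literal text, so the string holds
# a backslash, 'u', and four hex digits instead of the actual character.
tweet = r"Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"

# Every character is already ASCII, so the encode round trip changes nothing.
unchanged = tweet.encode('ascii', errors='ignore').decode('ascii')
assert unchanged == tweet

# Match the literal six-character escape sequences instead.
cleaned = re.sub(r'\\u[0-9a-fA-F]{4}', '', tweet)
```

If the file really does hold the characters themselves, this pattern matches nothing and the approach in the answer applies.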
