Replace UTF-8 encoded Twitter emojis in python

Question 1

I have a data set of Tweets encoded in UTF-8. The data is loaded into a pandas DataFrame. It contains Twitter emojis as well (e.g. \xf0\x9f\x8e\xb5). I want to replace these emojis with a space or an empty string using a regex in Python. Below is what I tried, but my regex pattern is marked in red and I get a warning. How can I remove these Twitter emojis?

>>> preparedData.head(5).to_dict()
{'id': {0: 1042616899408945154, 1: 1042592536769044487, 2: 1042587702040903680, 3: 1042587263643930626, 4: 1042586780292276230}, 'date': {0: '2018-09-20', 1: '2018-09-20', 2: '2018-09-20', 3: '2018-09-20', 4: '2018-09-20'}, 'time': {0: '03:30:14', 1: '01:53:25', 2: '01:34:13', 3: '01:32:28', 4: '01:30:33'}, 'text': {0: "b'\\xf0\\x9f\\x8c\\xb9 are red, violets are blue, if you want to buy us \\xf0\\x9f\\x92\\x90, here is a CLUE \\xf0\\x9f\\x98\\x89 Our #flowerpowered eye &amp; cheek palette is AL\\xe2\\x80\\xa6 ", 1: "b'\\xf0\\x9f\\x8e\\xb5Is it too late now to say sorry\\xf0\\x9f\\x8e\\xb5 #tartetalk #memes ", 2: "b'@JillianJChase Oh no! Please email your order # to [email protected] &amp; we can help \\xf0\\x9f\\x92\\x95'", 3: 'b"@Danikins__ It\'s best applied with our buffer brush! \\xf0\\x9f\\x92\\x9c\\xc2\\xa0', 4: "b'@AdelaineMorin DEAD \\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3'"}, 'hasMedia': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}, 'hasHashtag': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}, 'followers_count': {0: 801745, 1: 801745, 2: 801745, 3: 801745, 4: 801745}, 'retweet_count': {0: 17, 1: 94, 2: 0, 3: 0, 4: 0}, 'favourite_count': {0: 181, 1: 408, 2: 0, 3: 0, 4: 14}}

What I tried

preparedData['url removed'] = re.sub(r''/[\x{1F600}-\x{1F64F}]/u'', ' ', preparedData['text'])
print(preparedData.head(5).to_dict())
Warning
This inspection detects names that should resolve but don't. Due to dynamic dispatch and duck typing, this is possible in a limited but useful number of cases. Top-level and class-level items are supported better than instance items

Question 2

Possible duplicate of stackoverflow.com/questions/49207552/… or stackoverflow.com/questions/39536390/…

Question 3

Possible duplicate of Decoding and Encoding in Python
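On the decoding point: the sample rows in the question show the text column holding the printed form of bytes objects (a literal backslash, then x, then two hex digits), not decoded emoji characters. A minimal sketch, assuming the goal is to blank out the escaped text of every 4-byte UTF-8 sequence. The names ESCAPED_4BYTE and strip_escaped_emoji are hypothetical. Note the assumption cuts both ways: 4-byte sequences cover all code points from U+10000 up, which includes most emoji but also other characters, and emoji encoded in 3 bytes are left alone.

```python
import re

# Escaped text of one 4-byte UTF-8 sequence: a lead byte \xf0..\xf4
# followed by three continuation bytes \x80..\xbf, all as literal text.
ESCAPED_4BYTE = re.compile(r"\\xf[0-4](?:\\x[89ab][0-9a-f]){3}", re.IGNORECASE)

def strip_escaped_emoji(text: str, repl: str = " ") -> str:
    """Replace each escaped 4-byte sequence in `text` with `repl`."""
    return ESCAPED_4BYTE.sub(repl, text)

# Raw string, so the backslashes are literal characters as in the DataFrame.
row = r"b'@AdelaineMorin DEAD \xf0\x9f\xa4\xa3\xf0\x9f\xa4\xa3\xf0\x9f\xa4\xa3'"
print(strip_escaped_emoji(row))
```

Shorter escaped sequences such as \xe2\x80\xa6 (the ellipsis in the first tweet) are not touched by this pattern. For a whole column, something like preparedData['text'].apply(strip_escaped_emoji) should work.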

Question 4

Looking at your regular expression: the quoting is wrong (double single quotation marks) and Python regular expressions are not enclosed by slashes.
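To illustrate the syntax point: a minimal sketch of the same character range written as a Python pattern, with no slashes, no /u flag, and \UXXXXXXXX escapes instead of \x{...}. It assumes the text has already been decoded to a str containing real emoji characters; the names EMOTICONS and strip_emoticons are hypothetical.

```python
import re

# Python writes code points above U+FFFF as \U followed by eight hex digits.
# This range is the Emoticons block that the question's pattern targets.
EMOTICONS = re.compile("[\U0001F600-\U0001F64F]")

def strip_emoticons(text: str, repl: str = " ") -> str:
    """Replace every character in U+1F600..U+1F64F with `repl`."""
    return EMOTICONS.sub(repl, text)

# U+1F609 is the decoded form of \xf0\x9f\x98\x89 from the first tweet.
sample = "here is a CLUE \U0001F609 Our"
print(strip_emoticons(sample))
```

Two caveats. This range is narrow: the musical note \xf0\x9f\x8e\xb5 decodes to U+1F3B5, which lies outside it and would not be removed. Also, re.sub works on a single string, so for a pandas column a per-row call such as preparedData['text'].str.replace(pattern, ' ', regex=True) is likely what is needed.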

Question 5

@Klaus D, in this version the hyphen in the middle is highlighted and the message says 'illegal character range (to < from)'. How can I correct this? re.sub(r'[x{1F600}-x{1F64F}]', ' ', preparedData['text'])
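A likely explanation for that message, sketched below: Python's re module has no x{...} syntax, so every character inside the brackets is taken literally and the hyphen forms a range from '}' to 'x'. Since '}' has a higher code point than 'x', the range is reversed. The variable names here are hypothetical.

```python
import re

broken = r"[x{1F600}-x{1F64F}]"

# Inside the brackets, "}-x" is read as a range from '}' to 'x'.
try:
    re.compile(broken)
    error = None
except re.error as exc:
    error = exc
print("compile error:", error)

# '}' sorts after 'x', so the range runs backwards.
print(ord("}"), ord("x"))

# Using \U escapes gives the intended code point range.
fixed = re.compile("[\U0001F600-\U0001F64F]")
print(fixed.sub(" ", "a\U0001F600b"))
```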

Question 6

You can use the smart_str function of django.utils

# smart_unicode dropped from the import: unused here, and not importable on Python 3
from django.utils.encoding import smart_str
cleaned_up_text = smart_str(your_text_with_encoding)

Question 7

Let me know if it helps.

Ankur Gulati · Accepted Answer · 2018-09-22 10:09:12Z
