I have a data set of tweets encoded as UTF-8, loaded into a pandas DataFrame. The text also contains Twitter emojis (e.g. \xf0\x9f\x8e\xb5). I want to replace these emojis with a space or an empty string using a regex in Python. My attempt is below, but my regex pattern is marked in red and I get a warning. How can I remove these Twitter emojis?

>>> preparedData.head(5).to_dict()
{'id': {0: 1042616899408945154, 1: 1042592536769044487, 2: 1042587702040903680, 3: 1042587263643930626, 4: 1042586780292276230}, 'date': {0: '2018-09-20', 1: '2018-09-20', 2: '2018-09-20', 3: '2018-09-20', 4: '2018-09-20'}, 'time': {0: '03:30:14', 1: '01:53:25', 2: '01:34:13', 3: '01:32:28', 4: '01:30:33'}, 'text': {0: "b'\\xf0\\x9f\\x8c\\xb9 are red, violets are blue, if you want to buy us \\xf0\\x9f\\x92\\x90, here is a CLUE \\xf0\\x9f\\x98\\x89 Our #flowerpowered eye & cheek palette is AL\\xe2\\x80\\xa6 ", 1: "b'\\xf0\\x9f\\x8e\\xb5Is it too late now to say sorry\\xf0\\x9f\\x8e\\xb5 #tartetalk #memes ", 2: "b'@JillianJChase Oh no! Please email your order # to [email protected] & we can help \\xf0\\x9f\\x92\\x95'", 3: 'b"@Danikins__ It\'s best applied with our buffer brush! \\xf0\\x9f\\x92\\x9c\\xc2\\xa0', 4: "b'@AdelaineMorin DEAD \\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3'"}, 'hasMedia': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}, 'hasHashtag': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}, 'followers_count': {0: 801745, 1: 801745, 2: 801745, 3: 801745, 4: 801745}, 'retweet_count': {0: 17, 1: 94, 2: 0, 3: 0, 4: 0}, 'favourite_count': {0: 181, 1: 408, 2: 0, 3: 0, 4: 14}}

What I tried

preparedData['url removed'] = re.sub(r''/[\x{1F600}-\x{1F64F}]/u'', ' ', preparedData['text'])
print(preparedData.head(5).to_dict())
Warning
This inspection detects names that should resolve but don't. Due to dynamic dispatch and duck typing, this is possible in a limited but useful number of cases. Top-level and class-level items are supported better than instance items
Jens
asked Sep 22, 2018 at 9:33
  • Possible duplicate of stackoverflow.com/questions/49207552/… or stackoverflow.com/questions/39536390/… Commented Sep 22, 2018 at 9:37
  • Possible duplicate of Decoding and Encoding in Python Commented Sep 22, 2018 at 9:43
  • Looking at your regular expression: the quoting is wrong (double single quotation marks) and Python regular expressions are not enclosed by slashes. Commented Sep 22, 2018 at 9:55
  • @Klaus D: with this version the hyphen in the middle is highlighted with the message 'illegal character range (to < from)'. How can I correct this? re.sub(r'[x{1F600}-x{1F64F}]', ' ', preparedData['text']) Commented Sep 22, 2018 at 10:01
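
To illustrate the syntax points raised in the comments, here is a minimal sketch of the same character range written the way Python's re module expects it: no surrounding slashes, no /u flag, and \UXXXXXXXX escapes instead of the PCRE-style \x{...}. The names EMOTICONS and strip_emoticons are hypothetical, and the sketch assumes the text holds real decoded emoji characters, not escaped byte sequences.

```python
import re

# Python spells code points above U+FFFF as \UXXXXXXXX (eight hex digits).
# The range below is the Emoticons block only, as in the question.
EMOTICONS = re.compile("[\U0001F600-\U0001F64F]")


def strip_emoticons(text, replacement=" "):
    """Replace every character in the Emoticons block with `replacement`."""
    return EMOTICONS.sub(replacement, text)


if __name__ == "__main__":
    # U+1F609 is inside the range and is replaced;
    # U+1F923 lies outside the range and is left untouched.
    print(strip_emoticons("a clue \U0001F609 DEAD \U0001F923"))
```

Note that re.sub works on a single string, not on a whole DataFrame column; applying the function per row, for example with preparedData['text'].apply(strip_emoticons), is one way to use it. Many emojis fall outside this one block, so the range would likely need widening for real data.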

1 Answer

You can use the smart_str function from django.utils.encoding:

from django.utils.encoding import smart_str
cleaned_up_text = smart_str(your_text_with_encoding)
answered Sep 22, 2018 at 10:09

1 Comment

Let me know if it helps.
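
As a dependency-free alternative, the sketch below works directly on the data as it appears in the question's to_dict() output. It assumes the text column holds the string form of a bytes object, so each emoji shows up as literal characters such as \xf0\x9f\x8e\xb5 and not as one decoded character. Under that assumption, a four-byte UTF-8 sequence appears as \xf0 followed by three escaped continuation bytes. The names ESCAPED_4BYTE and strip_escaped_emoji are hypothetical.

```python
import re

# Matches a literal backslash-escaped 4-byte UTF-8 sequence with lead byte
# 0xF0: "\xf0" followed by three continuation bytes (0x80-0xBF).
# This covers code points U+10000-U+3FFFF, which includes most emojis,
# but it is a heuristic and also hits non-emoji characters in that range.
ESCAPED_4BYTE = re.compile(r"\\xf0(?:\\x[89ab][0-9a-f]){3}")


def strip_escaped_emoji(text, replacement=""):
    """Remove escaped 4-byte sequences from a stringified bytes value."""
    return ESCAPED_4BYTE.sub(replacement, text)


if __name__ == "__main__":
    # Raw string, so the backslashes are literal characters as in the data.
    sample = r"b'\xf0\x9f\x8e\xb5Is it too late now to say sorry\xf0\x9f\x8e\xb5 #tartetalk"
    print(strip_escaped_emoji(sample))
```

This leaves shorter escaped sequences such as \xe2\x80\xa6 (an ellipsis) in place. It could be applied per row with preparedData['text'].apply(strip_escaped_emoji). If the tweets can be reloaded as properly decoded strings, matching real code points is the more robust route.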
