Removing escaped entities from a String in Python [duplicate]

Question 1

I've a huge csv file of tweets. I read them both into the computer and stored them in two separate dictionaries - one for negative tweets, one for positive. I wanted to read the file in and parse it to a dictionary whilst removing any punctuation marks. I've used this code:

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    shortenedText = [e.lower() and e.translate(string.maketrans("",""), string.punctuation) for e in text.split() if len(e) >= 3 and not e.startswith('http')]
print shortenedText

It's all worked well barring one minor problem. The huge csv file I've downloaded has unfortunately changed some of the punctuation. I'm not sure what this is called so can't really google it, but effectively some sentence might begin:

"ampampFightin"
"&quot;The truth is out there"
"&altThis is the way I feel"

Is there a way to get rid of all these? I notice the latter two begin with an ampersand. Would a simple search for that get rid of them? (The only reason I'm asking rather than trying is that there are too many tweets for me to check manually.)

Comments

&quot; is an HTML escaped entity. You are looking to un-escape these.

Anything that is missing the & or ; characters is malformed and is not likely to be recoverable.

Here is a list of all of them in HTML 4.0: htmlhelp.com/reference/html40/entities/special.html
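
To illustrate the comments above, here is a minimal sketch in Python 3 (the question itself uses Python 2). It assumes the standard-library html.unescape function, available since Python 3.4, and the samples list is simply the three strings quoted in the question. The well-formed entity is converted, while the two malformed strings come back unchanged.

```python
import html

samples = [
    "&quot;The truth is out there",   # well-formed entity
    "ampampFightin",                  # no '&' or ';' left, so nothing to unescape
    "&altThis is the way I feel",     # '&alt' is not a known entity name
]

cleaned = [html.unescape(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Strings like the last two would need a hand-written cleanup rule, since the information needed to restore them is no longer in the text.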

Accepted Answer · alecxe · 2013-08-09 12:26:04Z

First, unescape HTML entities, then remove punctuation chars:

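import string
import HTMLParser

# Python 2. Assumes each tweet is a UTF-8 encoded byte string.
parser = HTMLParser.HTMLParser()
# unicode.translate() takes a single mapping; mapping to None deletes the character.
remove_punct = dict.fromkeys(map(ord, string.punctuation))

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    # Decode to unicode first, then unescape entities such as &quot;
    text = parser.unescape(text.decode('utf8'))
    shortenedText = [e.lower().translate(remove_punct) for e in text.split() if len(e) >= 3 and not e.startswith('http')]
print shortenedText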

Here's an example of how unescape works:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape("&quot;The truth is out there")
u'"The truth is out there'

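UPD: the solution to the UnicodeDecodeError problem is to decode the byte string before unescaping it: use text.decode('utf8'). Without this, Python 2 implicitly decodes the byte string as ASCII when it is mixed with unicode, which fails on non-ASCII bytes such as 0xc3.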
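
For readers on Python 3, a rough equivalent of the approach above might look like the following. This is a sketch under assumptions rather than the answer's original code: it swaps in html.unescape (standard library, Python 3.4+) for HTMLParser().unescape, and clean_tweet is a hypothetical helper name. Byte strings are decoded as UTF-8 first, which is the Python 3 counterpart of text.decode('utf8').

```python
import html
import string

# Translation table that deletes every ASCII punctuation character.
_REMOVE_PUNCT = str.maketrans("", "", string.punctuation)


def clean_tweet(raw):
    """Return lowercased, punctuation-free tokens from one tweet.

    `raw` may be bytes (assumed to be UTF-8) or str.
    """
    text = raw.decode("utf8") if isinstance(raw, bytes) else raw
    text = html.unescape(text)  # e.g. &quot; becomes "
    tokens = []
    for word in text.split():
        # Same filters as the question: skip short words and links.
        if len(word) < 3 or word.startswith("http"):
            continue
        token = word.lower().translate(_REMOVE_PUNCT)
        if token:  # drop words that consisted only of punctuation
            tokens.append(token)
    return tokens


print(clean_tweet(b"&quot;The truth is out there&quot; caf\xc3\xa9 http://t.co/x"))
```

Unescaping happens before punctuation is stripped, so entity names such as quot never end up glued to neighbouring words.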

edited May 23, 2017 at 11:43 by Community Bot

9 Comments

Andrew Martin · 2013-08-09T12:26:42.817Z

And to unescape them, should I do a search for anything beginning with an ampersand?

alecxe · 2013-08-09T12:28:20.457Z

Nope, just give it the text and it'll unescape any entities it finds.

Andrew Martin · 2013-08-09T12:30:37.16Z

Thanks for this, but when I run it I get this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 28: ordinal not in range(128)

rlms · 2013-08-09T12:31:49.01Z

import html.parser; html.parser.HTMLParser().unescape(text) for Python 3.

Mayur Patel · 2013-08-09T12:33:00.757Z

@Andrew: you can force a string into a particular encoding with str.encode() -- in your case maybe text.encode('us-ascii') ?
