39

I'm writing a web crawler in python, and it involves taking headlines from websites.

One of the headlines should've read : And the Hip's coming, too

But instead it said: And the Hipâ€™s coming, too

What's going wrong here?

Zero Piraeus
asked Oct 28, 2012 at 16:22
1
  • 4
    It would be easier to help you if you included the relevant code, and the particular website you're parsing. Commented Oct 28, 2012 at 16:27

2 Answers

66

It's an encoding error: the page's UTF-8 bytes were decoded as Windows-1252. So if it's a unicode string, this ought to fix it:

text.encode("windows-1252").decode("utf-8")

If it's a plain byte string (a Python 2 str), you'll need an extra step:

text.decode("utf-8").encode("windows-1252").decode("utf-8")

Both of these will give you a unicode string.
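For Python 3, where every str is already unicode, the round trip above could be sketched roughly as follows. The helper name fix_mojibake is hypothetical, and the sketch assumes the only damage is UTF-8 text that was decoded as Windows-1252; text that doesn't fit that pattern is returned unchanged.

```python
def fix_mojibake(text: str) -> str:
    """Hypothetical helper: undo UTF-8 text that was wrongly decoded as Windows-1252."""
    try:
        # Recover the original bytes, then decode them with the right codec.
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Assumption: anything that fails the round trip was not mangled this way.
        return text


# Reproduce the problem from the question, then reverse it.
original = "And the Hip\u2019s coming, too"
mangled = original.encode("utf-8").decode("windows-1252")
repaired = fix_mojibake(mangled)

print(mangled)
print(repaired)
```

This only patches up text after the fact; decoding the page with the correct encoding in the first place avoids the problem entirely.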

By the way - to discover how a piece of text like this has been mangled due to encoding issues, you can use chardet:

>>> import chardet
>>> chardet.detect(u"And the Hipâ€™s coming, too")
{'confidence': 0.5, 'encoding': 'windows-1252'}
answered Oct 28, 2012 at 16:36

2 Comments

Small warning: chardet is LGPL-licensed, so that's a consideration if it's going in something that's distributed to end users.
In Python 3 a str can't be decoded, so the second line of code you posted needs updating.
15

You need to properly decode the source text. Most likely the source text is in UTF-8 format, not ASCII.

Because you do not provide any context or code for your question it is not possible to give a direct answer.

I suggest you study how Unicode and character encodings work in Python:

http://docs.python.org/2/howto/unicode.html
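
Since the question doesn't show how the crawler fetches pages, here is only a rough Python 3 sketch of decoding the source text properly. The function decode_body and its UTF-8 fallback are assumptions for illustration: it takes the raw response bytes plus the Content-Type header value, uses the declared charset if there is one, and otherwise assumes UTF-8.

```python
from email.message import Message


def decode_body(raw: bytes, content_type: str, fallback: str = "utf-8") -> str:
    """Hypothetical helper: decode response bytes using the charset in Content-Type."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the header declares no charset.
    charset = msg.get_content_charset() or fallback
    return raw.decode(charset)


# Stand-in for bytes fetched from a page; a real crawler would read these from the response.
raw = "And the Hip\u2019s coming, too".encode("utf-8")

good = decode_body(raw, "text/html; charset=utf-8")
bad = decode_body(raw, "text/html; charset=windows-1252")

print(good)
print(bad)
```

The second call shows how the garbled headline arises when the same bytes are decoded with the wrong charset.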

answered Oct 28, 2012 at 16:26

1 Comment

Yes, it's UTF-8 treated like Windows 1252: u'\N{RIGHT SINGLE QUOTATION MARK}'.encode('utf-8').decode('cp1252').
