I am using Scrapy for scraping a Persian website.

title = response.xpath('//*[@id="news"]/div/div[2]/div[2]/div[2]/div[2]/div[2]/h1/a/text()').extract()

When I extract the title from the site, it gives me an encoded-looking string like this:

[u' \t\t\u0628\u06cc\u0645\u0647 10 \u0633\u0627\u0644\u0647\u200c \u062f\u0631 \u062e\u0637 \u062d\u0645\u0644\u0647\u200c\u06cc \u062a\u06cc\u0645 \u0645\u0644\u06cc \t']

After searching for how to decode a string in Python, I found this approach:

title = response.xpath('//*[@id="news"]/div/div[2]/div[2]/div[2]/div[2]/div[2]/h1/a/text()').extract()
print(title[0].decode('utf-8'))

When I run this code it shows me this:

 print(title[0].decode('utf-8'))
 File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
 return codecs.utf_8_decode(input, errors, True)

What is the problem?

Chris Martin
asked Sep 28, 2015 at 9:15

1 Answer

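Your string is already fine. It is a unicode string, and it is only displayed with Unicode escapes rather than actual glyphs because you are looking at the repr of the list, which is written that way so it can be shown in ASCII consoles as well. There is nothing left to decode. Try printing the element itself: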

>>> x = [u' \t\t\u0628\u06cc\u0645\u0647 10 \u0633\u0627\u0644\u0647\u200c \u062f\u0631 \u062e\u0637 \u062d\u0645\u0644\u0647\u200c\u06cc \u062a\u06cc\u0645 \u0645\u0644\u06cc \t']
>>> print x[0]
 بیمه 10 ساله‌ در خط حمله‌ی تیم ملی
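
As for the failing call: in Python 2, calling .decode() on a unicode object first encodes it implicitly with the ASCII codec, which is most likely what breaks here, since the Persian characters are not ASCII. A minimal sketch of the usual pattern follows, assuming the list from the question is what extract() returned. The helper name clean_title is hypothetical and only for illustration: keep the value as text, strip the surrounding whitespace, and encode to UTF-8 only where bytes are actually required.

```python
from __future__ import unicode_literals

# Assumption: this is the list returned by extract() in the question.
# It is already decoded text, so it must not be decoded again.
titles = [' \t\t\u0628\u06cc\u0645\u0647 10 \u0633\u0627\u0644\u0647\u200c '
          '\u062f\u0631 \u062e\u0637 \u062d\u0645\u0644\u0647\u200c\u06cc '
          '\u062a\u06cc\u0645 \u0645\u0644\u06cc \t']


def clean_title(raw):
    """Hypothetical helper: trim the tabs and spaces around the title."""
    return raw.strip()


title = clean_title(titles[0])

# Encode only at the boundary where bytes are needed (file, socket, ...).
title_bytes = title.encode('utf-8')

# Decoding those bytes gives back the same text.
round_trip = title_bytes.decode('utf-8')
```

Printing title on a UTF-8 terminal should then show the Persian text directly, as in the transcript above.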
answered Sep 28, 2015 at 9:30