I am using Scrapy for scraping a Persian website.
title = response.xpath('//*[@id="news"]/div/div[2]/div[2]/div[2]/div[2]/div[2]/h1/a/text()').extract()
When I extract the title from the site, it gives me an escaped string like this:
[u' \t\t\u0628\u06cc\u0645\u0647 10 \u0633\u0627\u0644\u0647\u200c \u062f\u0631 \u062e\u0637 \u062d\u0645\u0644\u0647\u200c\u06cc \u062a\u06cc\u0645 \u0645\u0644\u06cc \t']
After searching for how to decode a string in Python, I found this approach:
title = response.xpath('//*[@id="news"]/div/div[2]/div[2]/div[2]/div[2]/div[2]/h1/a/text()').extract()
print(title[0].decode('utf-8'))
When I run this code, I get this traceback:
print(title[0].decode('utf-8'))
File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
What is the problem?
Chris Martin
asked Sep 28, 2015 at 9:15
user1086010
1 Answer
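Your string is already fine: it is a unicode object that has already been decoded. Displaying the list shows each element's repr, which uses Unicode escapes rather than actual glyphs so that it can be shown on ASCII consoles as well. Try printing the element itself: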
>>> x = [u' \t\t\u0628\u06cc\u0645\u0647 10 \u0633\u0627\u0644\u0647\u200c \u062f\u0631 \u062e\u0637 \u062d\u0645\u0644\u0647\u200c\u06cc \u062a\u06cc\u0645 \u0645\u0644\u06cc \t']
>>> print x[0]
بیمه 10 ساله در خط حملهی تیم ملی
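As for why the `.decode('utf-8')` call fails: in Python 2, calling `.decode()` on a unicode object first encodes it implicitly with the ASCII codec, and that step cannot handle Persian characters. That is the likely source of the traceback in the question. The rough sketch below shows the direction that does work: text is encoded to bytes, and only bytes are decoded. The names `text`, `raw`, and `roundtrip` are illustrative, and the `strip()` call is an optional extra for dropping the surrounding tabs and spaces.

```python
# -*- coding: utf-8 -*-
# Same value as the extract() result shown in the question.
title = [u' \t\t\u0628\u06cc\u0645\u0647 10 \u0633\u0627\u0644\u0647\u200c '
         u'\u062f\u0631 \u062e\u0637 \u062d\u0645\u0644\u0647\u200c\u06cc '
         u'\u062a\u06cc\u0645 \u0645\u0644\u06cc \t']

# The element is already decoded text; just trim the surrounding whitespace.
text = title[0].strip()

# encode() turns text into bytes, e.g. for writing to a file or a socket.
raw = text.encode('utf-8')

# decode() is the inverse and belongs on bytes, not on text.
roundtrip = raw.decode('utf-8')
```

If you only need to display the title, `print` on the element is enough. Encode explicitly only when something downstream expects bytes.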
answered Sep 28, 2015 at 9:30
Stefano Sanfilippo