I've written a script that parses the names and prices of items listed on craigslist. Such a script usually throws an error when a listing has no name or no price, because the XPath query then returns an empty list and indexing into it fails. I've fixed that, and the script now fetches the results successfully. I hope I did it properly.
import requests
from lxml import html
page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
rows = tree.xpath('//li[@class="result-row"]')
for row in rows:
    link = row.xpath('.//a[contains(@class,"hdrlnk")]/text()')[0] if len(row.xpath('.//a[contains(@class,"hdrlnk")]/text()'))>0 else ""
    price = row.xpath('.//span[@class="result-price"]/text()')[0] if len(row.xpath('.//span[@class="result-price"]/text()'))>0 else ""
    print(link, price)
2 Answers
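It is usually easier to ask forgiveness than permission. Instead of checking the length of each result before indexing it, you could just surround the statements with try..except blocks and catch the IndexError raised by an empty result: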
import requests
from lxml import html
page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
for row in tree.xpath('//li[@class="result-row"]'):
    try:
        link = row.xpath('.//a[contains(@class,"hdrlnk")]/text()')[0]
    except IndexError:
        link = ""
    try:
        price = row.xpath('.//span[@class="result-price"]/text()')[0]
    except IndexError:
        price = ""
    print(link, price)
If you have many such actions, you could put it into a function:
def get_if_exists(row, path, index=0, default=""):
    """
    Gets the object at `index` from the xpath `path` from `row`.
    Returns the `default` if it does not exist.
    """
    try:
        return row.xpath(path)[index]
    except IndexError:
        return default
Which you could use here like this:
for row in tree.xpath('//li[@class="result-row"]'):
    # Using the defined default values for index and default:
    link = get_if_exists(row, './/a[contains(@class,"hdrlnk")]/text()')
    # Manually setting them instead:
    price = get_if_exists(row, './/span[@class="result-price"]/text()', 0, "")
    print(link, price)
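To try the pattern without hitting the network, here is a minimal sketch that works on plain lists, since `xpath()` returns a list and an empty list is what triggers the IndexError. The name `first_or_default` and the sample rows are made up for illustration; they are not part of the original script.

```python
def first_or_default(results, index=0, default=""):
    """Return results[index], or `default` when that position does not exist."""
    try:
        return results[index]
    except IndexError:
        return default


# Hypothetical stand-ins for what row.xpath(...) might return for two rows.
rows = [
    {"title": ["2 BHK flat"], "price": ["15000"]},
    {"title": ["Office space"], "price": []},  # no price element was found
]

summary = [
    (first_or_default(row["title"]), first_or_default(row["price"]))
    for row in rows
]
print(summary)
```

The second row has no price, so the helper falls back to the empty string instead of raising.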
- Thanks sir Graipher, for your suggestion. If I pursue your second method where it is written for many similar actions, it will save me a lot of hard work cause you know the way i have written the for loop in my script is tedious. I can't dovetail the script with your function, though! – SIM, May 26, 2017 at 12:07
- @SMth80 I don't understand what you mean with "I can't dovetail the script". Do you mean run? It works fine for me, does it throw any error? – Graipher, May 26, 2017 at 12:33
- Nope sir, I meant, I can't rearrange my script with your function in it. I'm little behind in applying function suggested by you. – SIM, May 26, 2017 at 12:52
- You just replace the for loop with the one I wrote? If you have more code where you use this, then you should have included it in the question. You can always ask a new question with more context included. – Graipher, May 26, 2017 at 13:24
- @SMth80 Yes, it is. It also seems to work for me, does it not work for you? – Graipher, May 26, 2017 at 13:34
I recently learned about the findtext method, which makes it easy to extract an element's text without indexing into an XPath result list. Its most useful feature is that it returns None by default when the expected element is not present, so no length check or exception handling is needed. It also keeps the code concise and clean. Note that findtext accepts only the ElementPath subset of XPath, so functions such as contains() are not available; the selectors below use plain attribute predicates only. Anyone who runs into the same problem may want to give this a try as well.
import requests
from lxml import html
page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
for row in tree.xpath('//li[@class="result-row"]'):
    link = row.findtext('.//a[@data-id]')
    price = row.findtext('.//span[@class="result-price"]')
    print(link, price)
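The behaviour for missing elements can be checked offline. The sketch below is an assumption-laden stand-in: it uses the standard library's `xml.etree.ElementTree`, whose `findtext` has the same `default` parameter, rather than lxml, and the `SAMPLE` markup is invented to mimic two result rows, one of them without a price.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; the real craigslist page is not fetched here.
SAMPLE = """
<ul>
  <li class="result-row">
    <a data-id="1">2 BHK flat</a>
    <span class="result-price">15000</span>
  </li>
  <li class="result-row">
    <a data-id="2">Office space</a>
  </li>
</ul>
""".strip()

root = ET.fromstring(SAMPLE)

results = []
for row in root.findall('.//li[@class="result-row"]'):
    title = row.findtext('.//a[@data-id]')
    # Pass `default` to get something other than None for a missing element.
    price = row.findtext('.//span[@class="result-price"]', default="")
    results.append((title, price))

print(results)

# Without `default`, a path that matches nothing yields None.
print(root.findtext('.//span[@class="no-such-class"]'))
```

Passing `default=""` keeps the output consistent with the earlier versions of the script, which fall back to an empty string rather than None.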