Craigslist parser

Question 1

I've written a script that parses the name and price of items from Craigslist. A script like this usually throws an error when the name or the price of an item is missing. I've fixed that, and it now fetches the results successfully. I hope I did it correctly.

import requests
from lxml import html
page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
rows = tree.xpath('//li[@class="result-row"]')
for row in rows:
    link = row.xpath('.//a[contains(@class,"hdrlnk")]/text()')[0] if len(row.xpath('.//a[contains(@class,"hdrlnk")]/text()'))>0 else ""
    price = row.xpath('.//span[@class="result-price"]/text()')[0] if len(row.xpath('.//span[@class="result-price"]/text()'))>0 else ""
    print (link,price)

Question 2

It is usually easier to ask forgiveness than permission. You could just surround the statements with try..except blocks:

import requests
from lxml import html
page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
for row in tree.xpath('//li[@class="result-row"]'):
    try:
        link = row.xpath('.//a[contains(@class,"hdrlnk")]/text()')[0]
    except IndexError:
        link = ""
    try:
        price = row.xpath('.//span[@class="result-price"]/text()')[0]
    except IndexError:
        price = ""
    print(link, price)

If you have many such actions, you could put it into a function:

def get_if_exists(row, path, index=0, default=""):
    """
    Gets the object at `index` from the xpath `path` from `row`.
    Returns the `default` if it does not exist.
    """
    try:
        return row.xpath(path)[index]
    except IndexError:
        return default

Which you could use here like this:

for row in tree.xpath('//li[@class="result-row"]'):
    # Using the defined default values for index and default:
    link = get_if_exists(row, './/a[contains(@class,"hdrlnk")]/text()')
    # Manually setting them instead:
    price = get_if_exists(row, './/span[@class="result-price"]/text()', 0, "")
    print(link, price)
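
To see how the helper behaves without hitting the network, here is a minimal sketch. `FakeRow` is a hypothetical stand-in, not part of lxml: it only mimics the one thing the helper relies on, namely that `.xpath()` returns a list that is empty when nothing matches. The listing titles and prices are made-up sample data.

```python
class FakeRow:
    """Hypothetical stand-in for an lxml element (an assumption for this demo).

    Maps XPath strings to canned result lists.
    """

    def __init__(self, results):
        self._results = results

    def xpath(self, path):
        # Like lxml's xpath(), return an empty list when nothing matches.
        return self._results.get(path, [])


def get_if_exists(row, path, index=0, default=""):
    """Return item `index` of `row.xpath(path)`, or `default` if it is missing."""
    try:
        return row.xpath(path)[index]
    except IndexError:
        return default


TITLE = './/a[contains(@class,"hdrlnk")]/text()'
PRICE = './/span[@class="result-price"]/text()'

rows = [
    FakeRow({TITLE: ["2 BHK flat"], PRICE: ["15000"]}),
    FakeRow({TITLE: ["Plot for sale"]}),  # this listing has no price
]

results = [
    (get_if_exists(row, TITLE), get_if_exists(row, PRICE, default="n/a"))
    for row in rows
]
for link, price in results:
    print(link, price)
```

Passing `default` by keyword, as in the second call, avoids having to spell out `index` just to reach the last parameter.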

Question 3

Thanks, Graipher, for your suggestion. Your second method, the one written for many similar actions, would save me a lot of work, because the way I have written the for loop in my script is tedious. I can't dovetail the script with your function, though!

Question 4

@SMth80 I don't understand what you mean by "I can't dovetail the script". Do you mean run? It works fine for me. Does it throw any error?

Question 5

No, I meant that I can't rearrange my script to use your function. I'm a little behind in applying the function you suggested.

Question 6

You just replace the for loop with the one I wrote? If you have more code where you use this, then you should have included it in the question. You can always ask a new question with more context included.

Question 7

@SMth80 Yes, it is. It also seems to work for me. Does it not work for you?

Question 8

I recently learned about the findtext method, which makes it easy to get the text of the first element matching a path expression without indexing into a result list. Its most useful feature is that it returns None by default when the expected element is not present. It also keeps the code concise and clean. Note that findtext accepts the simpler ElementPath syntax rather than full XPath, so functions such as contains() are not available. Anyone who runs into the problem above might want to give this a try.

import requests
from lxml import html
page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
for row in tree.xpath('//li[@class="result-row"]'):
    link = row.findtext('.//a[@data-id]')
    price = row.findtext('.//span[@class="result-price"]')
    print(link, price)
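
The behavior for a missing element can be checked offline. The sketch below swaps in the standard library's `xml.etree.ElementTree` for lxml so that it runs without third-party packages or network access. Its `findtext(path, default=None)` has the same signature as the lxml method used above. The markup is a made-up, well-formed sample, not real Craigslist output.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup (an assumption for this demo); the second row has no price.
SNIPPET = """
<ul>
  <li class="result-row">
    <a data-id="1" class="result-title hdrlnk">2 BHK flat</a>
    <span class="result-price">15000</span>
  </li>
  <li class="result-row">
    <a data-id="2" class="result-title hdrlnk">Plot for sale</a>
  </li>
</ul>
"""

root = ET.fromstring(SNIPPET.strip())

listings = []
for row in root.findall('.//li[@class="result-row"]'):
    link = row.findtext('.//a[@data-id]')
    # Pass `default` to get an empty string instead of None for a missing price.
    price = row.findtext('.//span[@class="result-price"]', default="")
    listings.append((link, price))

for link, price in listings:
    print(link, price)
```

Without the `default` argument, the second row's price would come back as None, which matches the behavior described above.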

Graipher · Accepted Answer · 2017-05-26 10:28:27Z

