Scraping through product pages

Question 1

I'm working through a scraping function where pages of results lead to product pages. I've added a default maximum number of results pages, and pages per set of results, to prevent a simple mistake of crawling the whole site.

Here's what I have so far. Does the way I'm implementing the maximums with the for loops make sense? Is there a more "pythonic" way? I'm coming at this from a completely learning perspective.

def my_crawler(url, max_pages = 1, max_items = 1):
 for page_number in range(1, max_pages + 1):
 source_code = requests.get(url + str(page_number)).text
 products = SoupStrainer(class_ = 'productTags')
 soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)
 for item_number, a in enumerate(soup.find_all('a')):
 print(str(item_number) + ': ' + a['href'])
 if item_number == max_items - 1: break
my_crawler('http://www.thesite.com/productResults.aspx?&No=')

Question 2

There's not a lot to say about such a simple program, mainly nitpicks.

You are violating PEP8 at a few places. The code reformatted to follow PEP8 would look like this:

def my_crawler(url, max_pages=1, max_items=1):
 for page_number in range(1, max_pages + 1):
 source_code = requests.get(url + str(page_number)).text
 products = SoupStrainer(class_='productTags')
 soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)
 for item_number, a in enumerate(soup.find_all('a')):
 print(str(item_number) + ': ' + a['href'])
 if item_number == max_items - 1:
 break

In particular:

Write method parameters compactly as max_pages=1 instead of max_pages = 1
Break line after :

Instead of range(1, max_pages + 1) I would find range(max_pages) simpler if you only need to do max_pages + 1 in the code. So I'd rewrite the above as:

for page_number in range(max_pages):
 source_code = requests.get(url + str(page_number + 1)).text

And as a mild case of paranoia, I'd prefer the if statement for breaking out of the loop this way:

if item_number >= max_items - 1:
 break

And yes, it's fine to break out of the loop this way. I don't think there's a better way.

When you print the item number and the link, I think this is slightly more readable:

print('{}: {}'.format(item_number, a['href']))

Finally, don't execute code in global scope, it's recommended to wrap code in an if __name__ == '__name__': guard:

if __name__ == '__name__':
 my_crawler('http://www.thesite.com/productResults.aspx?&No=')

These are mainly just nitpicks though. You're doing fine.

Question 3

Awesome. Thanks @janos. This is exactly the kind of feedback I was looking for. Is the paranoia for item_number >= max_items from experiencing a bug or just good practice?

Question 4

I'm not aware if it's a known "good practice". I can't recall experiencing a bug, but then again, I've always been this way so maybe that's why it never happened to me.

janos janos 113k15 gold badges154 silver badges396 bronze badges · Accepted Answer · 2015-02-21 19:40:32Z

There's not a lot to say about such a simple program, mainly nitpicks.

You are violating PEP8 at a few places. The code reformatted to follow PEP8 would look like this:

def my_crawler(url, max_pages=1, max_items=1):
 for page_number in range(1, max_pages + 1):
 source_code = requests.get(url + str(page_number)).text
 products = SoupStrainer(class_='productTags')
 soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)
 for item_number, a in enumerate(soup.find_all('a')):
 print(str(item_number) + ': ' + a['href'])
 if item_number == max_items - 1:
 break

In particular:

Write method parameters compactly as max_pages=1 instead of max_pages = 1
Break line after :

Instead of range(1, max_pages + 1) I would find range(max_pages) simpler if you only need to do max_pages + 1 in the code. So I'd rewrite the above as:

for page_number in range(max_pages):
 source_code = requests.get(url + str(page_number + 1)).text

And as a mild case of paranoia, I'd prefer the if statement for breaking out of the loop this way:

if item_number >= max_items - 1:
 break

And yes, it's fine to break out of the loop this way. I don't think there's a better way.

When you print the item number and the link, I think this is slightly more readable:

print('{}: {}'.format(item_number, a['href']))

Finally, don't execute code in global scope, it's recommended to wrap code in an if __name__ == '__name__': guard:

if __name__ == '__name__':
 my_crawler('http://www.thesite.com/productResults.aspx?&No=')

These are mainly just nitpicks though. You're doing fine.

Awesome. Thanks @janos. This is exactly the kind of feedback I was looking for. Is the paranoia for item_number >= max_items from experiencing a bug or just good practice?
I'm not aware if it's a known "good practice". I can't recall experiencing a bug, but then again, I've always been this way so maybe that's why it never happened to me.

Stack Exchange Network

Scraping through product pages

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Scraping through product pages

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions