2
\$\begingroup\$

I'm working through a scraping function where pages of results lead to product pages. I've added a default maximum number of results pages, and pages per set of results, to prevent a simple mistake of crawling the whole site.

Here's what I have so far. Does the way I'm implementing the maximums with the for loops make sense? Is there a more "pythonic" way? I'm coming at this from a completely learning perspective.

def my_crawler(url, max_pages = 1, max_items = 1):
 for page_number in range(1, max_pages + 1):
 source_code = requests.get(url + str(page_number)).text
 products = SoupStrainer(class_ = 'productTags')
 soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)
 for item_number, a in enumerate(soup.find_all('a')):
 print(str(item_number) + ': ' + a['href'])
 if item_number == max_items - 1: break
my_crawler('http://www.thesite.com/productResults.aspx?&No=')
Simon Forsberg
59.7k9 gold badges157 silver badges311 bronze badges
asked Feb 21, 2015 at 19:04
\$\endgroup\$

1 Answer 1

1
\$\begingroup\$

There's not a lot to say about such a simple program, mainly nitpicks.

You are violating PEP8 at a few places. The code reformatted to follow PEP8 would look like this:

def my_crawler(url, max_pages=1, max_items=1):
 for page_number in range(1, max_pages + 1):
 source_code = requests.get(url + str(page_number)).text
 products = SoupStrainer(class_='productTags')
 soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)
 for item_number, a in enumerate(soup.find_all('a')):
 print(str(item_number) + ': ' + a['href'])
 if item_number == max_items - 1:
 break

In particular:

  • Write method parameters compactly as max_pages=1 instead of max_pages = 1
  • Break line after :

Instead of range(1, max_pages + 1) I would find range(max_pages) simpler if you only need to do max_pages + 1 in the code. So I'd rewrite the above as:

for page_number in range(max_pages):
 source_code = requests.get(url + str(page_number + 1)).text

And as a mild case of paranoia, I'd prefer the if statement for breaking out of the loop this way:

if item_number >= max_items - 1:
 break

And yes, it's fine to break out of the loop this way. I don't think there's a better way.

When you print the item number and the link, I think this is slightly more readable:

print('{}: {}'.format(item_number, a['href']))

Finally, don't execute code in global scope, it's recommended to wrap code in an if __name__ == '__name__': guard:

if __name__ == '__name__':
 my_crawler('http://www.thesite.com/productResults.aspx?&No=')

These are mainly just nitpicks though. You're doing fine.

answered Feb 21, 2015 at 19:40
\$\endgroup\$
2
  • 1
    \$\begingroup\$ Awesome. Thanks @janos. This is exactly the kind of feedback I was looking for. Is the paranoia for item_number >= max_items from experiencing a bug or just good practice? \$\endgroup\$ Commented Feb 21, 2015 at 20:06
  • 1
    \$\begingroup\$ I'm not aware if it's a known "good practice". I can't recall experiencing a bug, but then again, I've always been this way so maybe that's why it never happened to me. \$\endgroup\$ Commented Feb 21, 2015 at 20:15

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.