I'm working through a scraping function where pages of results lead to product pages. I've added a default maximum number of results pages, and pages per set of results, to prevent a simple mistake of crawling the whole site.
Here's what I have so far. Does the way I'm implementing the maximums with the for
loops make sense? Is there a more "pythonic" way? I'm coming at this from a completely learning perspective.
def my_crawler(url, max_pages = 1, max_items = 1):
for page_number in range(1, max_pages + 1):
source_code = requests.get(url + str(page_number)).text
products = SoupStrainer(class_ = 'productTags')
soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)
for item_number, a in enumerate(soup.find_all('a')):
print(str(item_number) + ': ' + a['href'])
if item_number == max_items - 1: break
my_crawler('http://www.thesite.com/productResults.aspx?&No=')
1 Answer 1
There's not a lot to say about such a simple program, mainly nitpicks.
You are violating PEP8 at a few places. The code reformatted to follow PEP8 would look like this:
def my_crawler(url, max_pages=1, max_items=1):
for page_number in range(1, max_pages + 1):
source_code = requests.get(url + str(page_number)).text
products = SoupStrainer(class_='productTags')
soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)
for item_number, a in enumerate(soup.find_all('a')):
print(str(item_number) + ': ' + a['href'])
if item_number == max_items - 1:
break
In particular:
- Write method parameters compactly as
max_pages=1
instead ofmax_pages = 1
- Break line after
:
Instead of range(1, max_pages + 1)
I would find range(max_pages)
simpler if you only need to do max_pages + 1
in the code. So I'd rewrite the above as:
for page_number in range(max_pages):
source_code = requests.get(url + str(page_number + 1)).text
And as a mild case of paranoia,
I'd prefer the if
statement for breaking out of the loop this way:
if item_number >= max_items - 1:
break
And yes, it's fine to break out of the loop this way. I don't think there's a better way.
When you print the item number and the link, I think this is slightly more readable:
print('{}: {}'.format(item_number, a['href']))
Finally, don't execute code in global scope,
it's recommended to wrap code in an if __name__ == '__name__':
guard:
if __name__ == '__name__':
my_crawler('http://www.thesite.com/productResults.aspx?&No=')
These are mainly just nitpicks though. You're doing fine.
-
1\$\begingroup\$ Awesome. Thanks @janos. This is exactly the kind of feedback I was looking for. Is the paranoia for item_number >= max_items from experiencing a bug or just good practice? \$\endgroup\$GollyJer– GollyJer2015年02月21日 20:06:56 +00:00Commented Feb 21, 2015 at 20:06
-
1\$\begingroup\$ I'm not aware if it's a known "good practice". I can't recall experiencing a bug, but then again, I've always been this way so maybe that's why it never happened to me. \$\endgroup\$janos– janos2015年02月21日 20:15:54 +00:00Commented Feb 21, 2015 at 20:15