I started learning Python recently and I really like it, so I decided to share one of my first projects, mainly in the hope that someone can tell me how to make it run faster (threading/multiprocessing?).
from requests import get
from bs4 import BeautifulSoup
from time import time
from re import compile

print('***PYTHON LEAGUE OF LEGENDS USERNAME SCRAPER***')
print('This script scrapes usernames from lolprofile.net')
region = input('Enter the region for scraping(eune/euw/na/br/tr/kr/jp/lan/las/oce/ru)\n')
numStart = input('What page to start on? Min 0\n')
numEnd = input('What page to end on? Min starting page + 1\n')
size = [] #for logging
#count = -1 #for logging

def setUrl(pageNum, region):
    global url
    url = 'http://lolprofile.net/leaderboards/'+region+'/'+pageNum

def is_ascii(i):
    return all(ord(c) < 128 for c in i)

setUrl(numStart, region)
start = time()

while int(numStart) != int(numEnd):
    print(len(size))
    page = get(url)
    soup = BeautifulSoup(page.text, "lxml")
    userName = [a.string for a in soup.findAll(href=compile('http://lolprofile.net/summoner/*'))]
    with open('usernames1.txt', 'a') as file:
        for i in userName:
            if is_ascii(i) and (' ' in i) == False:
                file.write('%s\n' % i.lower())
                size.append('0')
    numStart = int(numStart)
    numStart += 1
    setUrl(str(numStart), region)
    #count += 1
    #if count % 250 == 0: #every n iterations print progress
    #    print(len(size))

end = time()
print(len(size),'usernames scraped in a total of',end-start,'seconds')
2 Answers
If you're after speed, I'd suggest scrapy. I was looking for an excuse to try it out and saw your question. When I ran your code on the first 10 pages of the NA leaderboard, it took a little over 4 seconds. Running the below takes about 0.3 seconds, presumably due to initiating all the HTTP requests in parallel:
test.py:
import scrapy


class LolSpider(scrapy.Spider):
    name = 'lolspider'
    start_urls = ['http://lolprofile.net/leaderboards/na/{}'.format(page) for page in range(10)]

    def parse(self, response):
        for name in response.xpath('//a[re:test(@href, "http://lolprofile.net/summoner/")]//text()').extract():
            yield { 'name': name }
Running:
$ scrapy runspider test.py -o names.json
names.json:
[
{"name": "<first name here>"},
{"name": "<second name here>"},
...
]
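If you'd rather keep the interactive region and page-range choices from your original script, scrapy can also take spider arguments from the command line with -a. Here's a rough sketch of what that could look like (the argument names region, start, and end are my own choice, and the end page is treated as inclusive):

import scrapy


class LolSpider(scrapy.Spider):
    name = 'lolspider'

    def __init__(self, region='na', start=0, end=9, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Build the start URLs from the command-line arguments
        self.start_urls = [
            'http://lolprofile.net/leaderboards/{}/{}'.format(region, page)
            for page in range(int(start), int(end) + 1)
        ]

    def parse(self, response):
        for name in response.xpath('//a[re:test(@href, "http://lolprofile.net/summoner/")]//text()').extract():
            yield {'name': name}

and run it with, for example:

$ scrapy runspider test.py -a region=euw -a start=0 -a end=9 -o names.json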
To actually provide some code review feedback:
import requests  # I prefer this and then requests.get over "from requests import get", since "get" is too common a word
from bs4 import BeautifulSoup
import time  # ditto here
import re  # and here

print('***PYTHON LEAGUE OF LEGENDS USERNAME SCRAPER***')
print('This script scrapes usernames from lolprofile.net')
region = input('Enter the region for scraping(eune/euw/na/br/tr/kr/jp/lan/las/oce/ru)\n')
num_start = int(input('What page to start on? Min 0\n'))  # cast to int once here
num_end = int(input('What page to end on? Min starting page + 1\n'))  # ditto

size = 0  # use a simple count rather than a list

# Python style dictates snake case
# get the URL rather than set a global variable
def get_url(page_num, region):
    # use string formatting rather than concatenation
    return 'http://lolprofile.net/leaderboards/{}/{}'.format(region, page_num)

def is_ascii(i):
    return all(ord(c) < 128 for c in i)

start = time.time()

# for loop instead of while avoids the need to increment by hand
for page_num in range(num_start, num_end + 1):
    url = get_url(page_num, region)
    print(size)
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")  # html.parser
    # /.* (slash and then anything) rather than /* (any number of slashes) in the regular expression
    user_names = [a.string for a in soup.findAll(href=re.compile('http://lolprofile.net/summoner/.*'))]
    with open('usernames1.txt', 'a') as file:
        for i in user_names:
            if is_ascii(i) and ' ' not in i:  # not in
                file.write('%s\n' % i.lower())
                size += 1

end = time.time()
print('{} usernames scraped in a total of {} seconds.'.format(size, end - start))
- edsheeran (Aug 26, 2016 at 7:01): I wanted to use scrapy at first, but bs seemed simpler to me; now I'll definitely switch to scrapy, thanks a lot!
I assume the slowest part by far of this scraper is fetching each page. I agree with @smarx's great answer that using scrapy would be fastest and easiest. But, for educational purposes, let's parallelize your scraper.
To do this cleanly, it really helps to break your code into a few functions. This is also a good habit for organizing larger programs, or really code of any size, even short scripts like this.
Define one function that you can then apply to all (or many) of the pages concurrently:
import re
import requests
from bs4 import BeautifulSoup

# Compile the regex once, instead of on every function call
USERNAME_PATTERN = re.compile('http://lolprofile.net/summoner/.+')

def fetch_and_parse_names(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    return (a.string for a in soup.findAll(href=USERNAME_PATTERN))
Now, one option for actually making the concurrent requests is concurrent.futures in the standard library.
import concurrent.futures
import itertools

def get_names(urls):
    # Create a concurrent executor
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        # Apply the fetch-and-parse function concurrently with executor.map,
        # and join the results together
        return itertools.chain.from_iterable(executor.map(fetch_and_parse_names, urls))
The executor can fire off a bunch of requests in a short amount of time, many more than you have physical CPUs, because waiting for requests.get() is an I/O bound problem.
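To make that concrete, a variant of get_names could size the pool to the batch of URLs rather than to your core count; the cap of 32 below is just an arbitrary limit chosen for this sketch, not something the standard library requires:

def get_names(urls):
    # The threads spend most of their time waiting on the network, so running
    # far more of them than you have cores is fine for this I/O-bound work.
    workers = min(32, max(1, len(urls)))
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        return itertools.chain.from_iterable(executor.map(fetch_and_parse_names, urls))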
Your main function then just has to generate the URLs you want, call the concurrent scraper, and write the resulting names.
import time

def get_url(region, page):
    return 'http://lolprofile.net/leaderboards/%s/%d' % (region, page)

# `s` or `string` are more idiomatic names for a string than `i`
def is_ascii(s):
    return all(ord(c) < 128 for c in s)

def is_valid_name(name):
    return is_ascii(name) and ' ' not in name

def main():
    region = input('Enter the region to scrape (eune/euw/na/br/tr/kr/jp/lan/las/oce/ru)\n')
    start = int(input('What page to start on? '))
    end = int(input('What page to end on? '))

    start_time = time.time()

    urls = [get_url(region, i) for i in range(start, end + 1)]
    names = (name.lower() for name in get_names(urls) if is_valid_name(name))

    size = 0
    with open('usernames1.txt', 'a') as out:
        for name in names:
            out.write(name + '\n')
            size += 1

    end_time = time.time()
    print('%d usernames scraped in %.4f seconds.' % (size, end_time - start_time))
Also consider what timing you want to measure -- do you want to include writing the names to file? Processing time? etc.
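For example, one rough way to time the two phases separately inside main(), reusing the functions above (note that this variant collects every name into a list first, so it uses more memory than the generator pipeline):

    fetch_start = time.time()
    all_names = list(get_names(urls))  # network requests + HTML parsing
    fetch_seconds = time.time() - fetch_start

    write_start = time.time()
    with open('usernames1.txt', 'a') as out:
        for name in all_names:
            if is_valid_name(name):
                out.write(name.lower() + '\n')
    write_seconds = time.time() - write_start

    print('fetch/parse: %.4f s, filter/write: %.4f s' % (fetch_seconds, write_seconds))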
- Quentin Pradet (Aug 26, 2016 at 6:07): So, how fast is it? And I'd love a third answer with asyncio. :)
- BenC (Aug 26, 2016 at 6:52): @QuentinPradet Performs like scrapy, maybe a bit faster depending on how you measure :) I'm getting ~1.67s to fetch and process 20 pages here, vs ~1.75s with scrapy.
- edsheeran (Aug 26, 2016 at 6:57): Thanks a lot for the suggestion! Could I use this with scrapy to make it even faster?
- BenC (Aug 26, 2016 at 7:02): @edsheeran Not usefully. Scrapy has its own concurrency mechanisms and this code can't fundamentally buy you anything extra -- just an explanation of how to build the logic yourself, if you wanted. Scrapy or similar is the practical answer :)
- edsheeran (Aug 26, 2016 at 11:57): @BenC After running the code for about an hour I'm getting memory allocation errors. I tried getting the names on the first page -> outputting them to a file -> 2nd page -> etc., but that didn't work. How would I do something like that?