I'm fetching some data by using bs4, and want to know why access speed gets slower as my program progresses.
I'm using thread, and I know threading isn't so fast because of GIL.
My problem is that my program gets slower gradually. When the number of word is over 300,it become much slower than that of word is about 100 words.
GIL causes this?
(Multiprocessing can solve GIL problem,however,I want to find other causes,if any.)
What is the better web-scraping code to download at once?
Thanks.
import os,gc,queue
import requests, bs4
from threading import Thread, BoundedSemaphore
class add_line:
def __init__(self,semaphore,q):
self.q = q # queue
self.semaphore = semaphore
def put_queue(self,li): #insert
self.q.put(li)
def fetch_word(self,w_list):
fetched_text = ""# targeted text
path = "https://ejje.weblio.jp/content/"+ w_list[1]
res = requests.get(path)
res.raise_for_status()
no_starch_soup = bs4.BeautifulSoup(res.text)
fetched_text = no_starch_soup.select('.conjugateRowR > table > tr > td')
if len(fetched_text)== 0: #if get nothing
fetched_text = no_starch_soup.select('.conjugateRowR > a')
if len(fetched_text) ==0:# if get nothing too
self.put_queue(w_list)# return original conjurate
else :
li_ = list(map(lambda x:x.contents[0], fetched_text))
li_.insert(0,w_list[0])
self.put_queue(li_)
else:
li_=list(map(lambda x:x.contents[0], fetched_text))
li_.insert(0,w_list[0])
self.put_queue(li_)
gc.collect()
self.semaphore.release()
if __name__ == "__main__":
# creating thread
maxconnections = 5
path = "./target_index1900.txt"
"""
the part of this text data is below(the number of words is over 1000)
510 abandon
510 abandonment
1807 abide
1318 abolish
1318 abolition
1167 abortion
1167 abortive
1096 abound
1683 abrupt
1683 abruptly
1199 absolute
1199 absolutely
507 absorb
507 absorption
899 abstract
899 abstraction
"""
#set semaphore
pool_sema = BoundedSemaphore(value=maxconnections)
q = queue.Queue()# put word in queue
al = add_line(pool_sema,q)
path_w = "./threading_test_result.txt"
with open(path) as f:
while True:
lines = f.readline()
if not lines:
break
pool_sema.acquire()
Thread(target=al.fetch_word, args=(lines.split(" "),)).start()
with open(path_w, mode='a') as f:
while not q.empty():
for x in q.get():
f.write(x)
f.write(" ")
f.write("\n")
f.close()
gc.collect()
print("Done!")
-
1\$\begingroup\$ Why do you think it’s the GIL? Have you tried running your code without threads? That’s always a good thing to do before actually implementing threads :) \$\endgroup\$Grajdeanu Alex– Grajdeanu Alex2019年09月16日 13:42:47 +00:00Commented Sep 16, 2019 at 13:42
-
\$\begingroup\$ The server may be throttling your requests. \$\endgroup\$Janne Karila– Janne Karila2019年09月17日 05:08:26 +00:00Commented Sep 17, 2019 at 5:08
-
\$\begingroup\$ Why manually garbage collect right as your function is exiting? That's just adding a process that's handled by the operating system \$\endgroup\$C.Nivs– C.Nivs2019年09月17日 14:01:02 +00:00Commented Sep 17, 2019 at 14:01
3 Answers 3
Whitespace formatting
Apply a linter that will give you PEP8 suggestions. Among other things, it will suggest the following:
import os,gc,queue
There should be spaces after those commas. It'll also suggest that there be one line per import.
class add_line:
should be
class AddLine:
and have a couple of newlines before it.
else :
shouldn't have a space before the colon.
f-strings
"https://ejje.weblio.jp/content/"+ w_list[1]
can be
f'https://ejje.weblio.jp/content/{w_list[1]}'
Falsiness
if len(fetched_text)== 0:
can be
if not fetched_text:
Don't repeat yourself
li_ = list(map(lambda x:x.contents[0], fetched_text))
li_.insert(0,w_list[0])
self.put_queue(li_)
appears twice. Put it in a function or restructure your if
blocks.
File iteration
with open(path) as f:
while True:
lines = f.readline()
if not lines:
break
First of all, your variable name is confusing. This isn't lines
, it's line
.
Also: just change your iteration to
with open(path) as f:
for line in f:
As for this:
with open(path_w, mode='a') as f:
# ...
f.close()
Get rid of the close
; the file is already closed at the end of the with
.
main
You do check __name__
, but you don't give all of that code function scope. You should move it into a main
function. Otherwise, every one of those variables will be visible from other functions.
-
\$\begingroup\$ Thank you, Reinderien! I write codes referencing to "The Art of Readable Code", however, it' hard to live up to this! Anyway, I learned a lot of new things from you! Thank you! \$\endgroup\$the_commander_x– the_commander_x2019年09月20日 00:22:04 +00:00Commented Sep 20, 2019 at 0:22
Ignoring threading, it's usually best to benchmark your code on a single thread first, then move to multiple threads as necessary. Some glaring things that could yield slower performance:
- You are not using a
Session
for requests. There is not a guarantee that sessions are thread-safe, but you could allocate a session for each thread, giving each thread it's own connection pool. This prevents the allocation of a Session under the hood when calling a barerequests.get
method. From the source:
def request(method, url, **kwargs):
"""
Snipping out docstring for brevity
"""
# By using the 'with' statement we are sure the session is closed, thus we
# avoid leaving sockets open which can trigger a ResourceWarning in some
# cases, and look like a memory leak in others.
with sessions.Session() as session:
return session.request(method=method, url=url, **kwargs)
def get(url, params=None, **kwargs):
r"""Sends a GET requeste
"""
kwargs.setdefault('allow_redirects', True)
return request('get', url, params=params, **kwargs)
So refactor to use a single session:
import requests
url = 'http://google.com'
s = requests.Session()
r = s.get(url)
- You are using lots of
list.insert(0, item)
, which is an O(N) operation, where N is the length of the list. This will naturally incur overhead as your list grows. A refactor might be better here using a data structure meant for left-append, like adeque
:
from collections import deque
li_ = deque(map(lambda x: x.contents[0], fetched_text))
li_.appendleft(w_list[0])
This is much faster than list.insert(0, object)
:
(base) ➜ ~ python -m timeit -s 'from collections import deque; x = deque(range(10000))' 'x.appendleft(1)'
10000000 loops, best of 3: 0.06 usec per loop
(base) ➜ ~ python -m timeit -s 'from collections import deque; x = list(range(10000))' 'x.insert(0, 1)'
100000 loops, best of 3: 26.2 usec per loop
These still retain order and can be iterated over just like lists
- Explicitly checking if the length of a
list
is 0 is not pythonic and slower:
(base) ➜ ~ python -m timeit -s 'x = []' 'len(x)==0'
10000000 loops, best of 3: 0.0552 usec per loop
(base) ➜ ~ python -m timeit -s 'x = []' 'not x'
100000000 loops, best of 3: 0.0186 usec per loop
Use if not fetched_text
instead
Your
self.put_queue
method is unnecessary. It's more readable to just useself.q.put(value)
While iterating files, you don't need to check that the line is empty, this is an extra
if
statement that's implicitly handled by iterating over the file directly:
with open(somefile) as fh:
for line in fh:
# do things
This for
loop will terminate when the file handle is exhausted
-
\$\begingroup\$ Thank you, C.Nivs! I wasn't cautious about a memory leak when using bs4 ever. I'm reading Session Objects page. \$\endgroup\$the_commander_x– the_commander_x2019年09月20日 00:12:10 +00:00Commented Sep 20, 2019 at 0:12
Threading
I'd like to join the others in recommending you drop your threading. You're also doing it wrong - you're only grabbing the semaphore for creating the thread, but you never acquire it inside of the thread when executing.
If you use a lock of semaphore, do it with a context manager:
with my_semaphore:
# code...
And if you can't, you're doing something wrong.
Garbage Collection
Drop all gc.collect()
lines. Python does this automatically at a frequency and at moments that are plenty sufficient for your application.
Lets have a look at your class.
class add_line: # Class names should be TitleCase, so make this AddLine
def __init__(self,semaphore,q): # q is to short. Just use queue here.
self.q = q # queue # Same here, of course. self.queue is perfectly fine.
self.semaphore = semaphore
def put_queue(self,li): #insert # This function serves no purpose. Just use
self.q.put(li) # self.queue.put(li) directly elsewhere.
def fetch_word(self,w_list):
fetched_text = ""# targeted text # Remove this line - it's created below, and not used inbetween.
path = "https://ejje.weblio.jp/content/"+ w_list[1]
res = requests.get(path) # If you expect to use this more than ten times, use a session like @C.Nivs explains clearly.
res.raise_for_status()
no_starch_soup = bs4.BeautifulSoup(res.text)
# See below for how to change this bit.
# [Skipped Code]
self.semaphore.release() # This thread never acquired it.
Take a good look at what we're doing in that double conditional statements there. We select something from the soup, and if it's empty, from elsewhere. This is a great place to use a simple or
and reduce it to this:
fetched_text = no_starch_soup.select('.conjugateRowR > table > tr > td') or \
no_starch_soup.select('.conjugateRowR > a')
if not fetched_text: #if get nothing
self.put_queue(w_list) # return original conjurate
else:
li_=list(map(lambda x:x.contents[0], fetched_text))
li_.insert(0,w_list[0])
self.put_queue(li_)
The \
is a a line continuation token - it allows you to split a single line of python code over two lines if it's to long. It's rarely a good idea, but in this case, I consider it warranted. The statement is simply - it tries no_starch_soup.select('.conjugateRowR > table > tr > td')
first, and if that results in a False value (like an empty string, empty list, False
, None
, 0, empty dict... you get the point), then it will evaluate no_starch_soup.select('.conjugateRowR > a')
and put the result in the variable.
Now lets have a look at your appended list:
li_=list(map(lambda x:x.contents[0], fetched_text))
li_.insert(0,w_list[0])
self.put_queue(li_)
This could just as easily be:
self.put_queue(w_list[0] + list(map(lambda x:x.contents[0], fetched_text)))
At least, if we're allowed to make the assumption that w_list is small, which it seems to be as it's the result of a string.split() operation, which is used on a string read from a file opened for text reading.
However, I would refactor this class into a function, as it has no state you're modifying anyway. Here I'll assume you're really determined to keep threading, despite threading only having advantages if you're doing a lot of work outside python code, like lengthy file I/O or complicated math in numpy.
def add_line(queue, semaphore, w_list):
with semaphore: # This single line takes care of all threading headaches for the entire function
path = "https://ejje.weblio.jp/content/"+ w_list[1]
no_starch_soup = bs4.BeautifulSoup(requests.get(path).text)
fetched_text = no_starch_soup.select('.conjugateRowR > table > tr > td') or \
no_starch_soup.select('.conjugateRowR > a')
if fetched_text: #if get nothing
queue.put(w_list) # return original conjurate
else:
queue.put(w_list[0] + list(map(lambda x:x.contents[0], fetched_text)))
And then lets go into your script Thread generation:
from queue import Queue # I like this type of import, makes you need less dots.
# [...]
sema = BoundedSemaphore(value=maxconnections)
queue = Queue() # Words to write to output file
path_w = "./threading_test_result.txt"
with open(path) as f:
for line in f: # lines is plural, line is singular.
Thread(target=add_line, args=(queue, sema, lines.split(" "))).start()
Keep in mind that this will spawn a Thread for every line in the file. Unless this is somewhere lower than around 30 lines, it's very likely that this is what causes your slowdowns. Therefore I'd just forget about the threads if I were you.
And your output generation:
with open(path_w, mode='a') as f:
while not q.empty():
for x in q.get():
f.write(x)
f.write(" ")
f.write("\n")
f.close()
Can very easily be replaced by:
with open(path_w, mode='a') as f:
while not queue.empty():
print(*queue.get(), file=f)
Wait what, isn't that the super-basic simple output-to-console command for starters with python? Why yes it is. It's a very powerful function. With the *
before the entry
, we unpack the list and feed it's elements as seperate arguments to the function. These objects will all be converted to strings and then printed, seperated by the sep=
keyword argument, which happens to be a string with a single space by default. Then it's finished by the end=
keyword argument, which by default is your system specific line ending, so we don't need to put any thought in that either. And then we can tell it where to print, and the file=
keyword argument perfectly understands file handles.
You should probably check to make sure your program doesn't empty the queue faster than the other threads can fill it, which is another reason to drop threading here. If it does, you can wait and restart processing after a second or so.
I've got the feeling that dependent on what sort of data you expect in fetched_text
in the function, there's more improvements possible, but I must confess to not knowing the BeautifulSoup module at all, so I'll pass on that.
-
\$\begingroup\$ I'd watch out,
queue.Queue
does not support iteration, so yourfor entry in queue
loop will not work \$\endgroup\$C.Nivs– C.Nivs2019年09月20日 17:04:49 +00:00Commented Sep 20, 2019 at 17:04 -
\$\begingroup\$ Whoops. Good one. lemme fix. I really should check the more obscure data containers for iteration and such. \$\endgroup\$Gloweye– Gloweye2019年09月20日 19:42:13 +00:00Commented Sep 20, 2019 at 19:42
-
\$\begingroup\$ Gloweye, I created about 10 threads per 30 lines, as you said, " This will spawn a Thread for every line in the file". It works three times faster than before. Thread worked correctly if a code to output is put after "for loop" makes thread[i].join. Some sets of threads being run one by one, GC seems to throw previous thread away. Next time, I try to do this with multiprocessing. Thank you! \$\endgroup\$the_commander_x– the_commander_x2019年09月21日 01:28:46 +00:00Commented Sep 21, 2019 at 1:28
-
\$\begingroup\$ @Gloweye
queue.Queue
also doesn't have falsey behavior when empty. The following:from queue import Queue; q = Queue(); x = 'not empty' if q else ''; print(x)' prints
'not empty', so
while q` will be an infinite loop. Take a look at the docs \$\endgroup\$C.Nivs– C.Nivs2019年09月27日 13:37:25 +00:00Commented Sep 27, 2019 at 13:37 -
1\$\begingroup\$ I'm starting to dislike queue's....weird that queue's don't have a boolean meaning in the inverse of empty(). Fixed the issue, though. \$\endgroup\$Gloweye– Gloweye2019年09月27日 13:48:53 +00:00Commented Sep 27, 2019 at 13:48
Explore related questions
See similar questions with these tags.