
I'm fetching some data using bs4, and I want to know why the access speed gets slower as my program progresses.

I'm using threads, and I know threading isn't very fast because of the GIL.

My problem is that my program gradually gets slower. Once the number of words is over 300, it becomes much slower than it was at around 100 words.

Does the GIL cause this?

(Multiprocessing can work around the GIL problem; however, I want to find other causes, if any.)

What would be better web-scraping code for downloading everything at once?

Thanks.

import os,gc,queue
import requests, bs4
from threading import Thread, BoundedSemaphore


class add_line:
    def __init__(self,semaphore,q):
        self.q = q # queue
        self.semaphore = semaphore

    def put_queue(self,li): #insert
        self.q.put(li)

    def fetch_word(self,w_list):
        fetched_text = ""# targeted text
        path = "https://ejje.weblio.jp/content/"+ w_list[1]
        res = requests.get(path)
        res.raise_for_status()
        no_starch_soup = bs4.BeautifulSoup(res.text)
        fetched_text = no_starch_soup.select('.conjugateRowR > table > tr > td')
        if len(fetched_text)== 0: #if get nothing
            fetched_text = no_starch_soup.select('.conjugateRowR > a')
            if len(fetched_text) ==0:# if get nothing too
                self.put_queue(w_list)# return original conjugate
            else :
                li_ = list(map(lambda x:x.contents[0], fetched_text))
                li_.insert(0,w_list[0])
                self.put_queue(li_)
        else:
            li_=list(map(lambda x:x.contents[0], fetched_text))
            li_.insert(0,w_list[0])
            self.put_queue(li_)
        gc.collect()
        self.semaphore.release()


if __name__ == "__main__":
    # creating threads
    maxconnections = 5
    path = "./target_index1900.txt"
    """
    Part of this text data is shown below (the total number of words is over 1000):
    510 abandon
    510 abandonment
    1807 abide
    1318 abolish
    1318 abolition
    1167 abortion
    1167 abortive
    1096 abound
    1683 abrupt
    1683 abruptly
    1199 absolute
    1199 absolutely
    507 absorb
    507 absorption
    899 abstract
    899 abstraction
    """
    # set semaphore
    pool_sema = BoundedSemaphore(value=maxconnections)
    q = queue.Queue()# put words in this queue
    al = add_line(pool_sema,q)
    path_w = "./threading_test_result.txt"
    with open(path) as f:
        while True:
            lines = f.readline()
            if not lines:
                break
            pool_sema.acquire()
            Thread(target=al.fetch_word, args=(lines.split(" "),)).start()
    with open(path_w, mode='a') as f:
        while not q.empty():
            for x in q.get():
                f.write(x)
                f.write(" ")
            f.write("\n")
        f.close()
    gc.collect()
    print("Done!")
asked Sep 16, 2019 at 8:12
\$\endgroup\$
3 comments:

  • Why do you think it's the GIL? Have you tried running your code without threads? That's always a good thing to do before actually implementing threads :) – Commented Sep 16, 2019 at 13:42
  • The server may be throttling your requests. – Commented Sep 17, 2019 at 5:08
  • Why manually garbage collect right as your function is exiting? That's just adding work that's already handled automatically. – Commented Sep 17, 2019 at 14:01

3 Answers


Whitespace formatting

Apply a linter that will give you PEP8 suggestions. Among other things, it will suggest the following:

import os,gc,queue

There should be spaces after those commas. It'll also suggest that there be one line per import.
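Applied to that line, the linter-preferred form would be one import per line:

import gc
import os
import queue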

class add_line:

should be

class AddLine:

and have a couple of newlines before it.

else :

shouldn't have a space before the colon.

f-strings

"https://ejje.weblio.jp/content/"+ w_list[1]

can be

f'https://ejje.weblio.jp/content/{w_list[1]}'

Falsiness

if len(fetched_text)== 0:

can be

if not fetched_text:

Don't repeat yourself

li_ = list(map(lambda x:x.contents[0], fetched_text))
li_.insert(0,w_list[0])
self.put_queue(li_)

appears twice. Put it in a function or restructure your if blocks.
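For example, one possible restructuring that makes the common branch appear only once (a sketch, reusing the selectors from the original code):

fetched_text = no_starch_soup.select('.conjugateRowR > table > tr > td')
if not fetched_text:
    # Fall back to the second selector only when the first found nothing.
    fetched_text = no_starch_soup.select('.conjugateRowR > a')
if not fetched_text:
    self.put_queue(w_list)  # nothing found; return the original word
else:
    li_ = [w_list[0]] + [x.contents[0] for x in fetched_text]
    self.put_queue(li_)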

File iteration

with open(path) as f:
    while True:
        lines = f.readline()
        if not lines:
            break

First of all, your variable name is confusing. This isn't lines, it's line.

Also: just change your iteration to

with open(path) as f:
    for line in f:

As for this:

with open(path_w, mode='a') as f:
    # ...
    f.close()

Get rid of the close; the file is already closed at the end of the with.

main

You do check __name__, but you don't give all of that code function scope. You should move it into a main function. Otherwise, every one of those variables will be visible from other functions.
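For example (a sketch; the body is elided):

def main():
    maxconnections = 5
    path = "./target_index1900.txt"
    # ... the rest of the current module-level code ...


if __name__ == "__main__":
    main()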

answered Sep 17, 2019 at 14:39
1 comment:

  • Thank you, Reinderien! I write code with reference to "The Art of Readable Code"; however, it's hard to live up to! Anyway, I learned a lot of new things from you. Thank you! – Commented Sep 20, 2019 at 0:22

Ignoring threading, it's usually best to benchmark your code on a single thread first, then move to multiple threads as necessary. Some glaring things that could yield slower performance:

  1. You are not using a Session for requests. There is no guarantee that sessions are thread-safe, but you could allocate a session for each thread, giving each thread its own connection pool. This prevents the allocation of a Session under the hood on every call to the bare requests.get function. From the source:
def request(method, url, **kwargs):
    """
    Snipping out docstring for brevity
    """
    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)


def get(url, params=None, **kwargs):
    r"""Sends a GET request.
    """
    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)

So refactor to use a single session:

import requests
url = 'http://google.com'
s = requests.Session()
r = s.get(url)
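If you do want one session per thread, as suggested above, one common pattern is threading.local (a sketch of my own, not from the original code):

import threading

import requests

thread_local = threading.local()


def get_session():
    # threading.local gives each thread its own attribute namespace,
    # so every worker lazily creates and then reuses exactly one Session.
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session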
  2. You are using lots of list.insert(0, item), which is an O(N) operation, where N is the length of the list. This will naturally incur overhead as your list grows. A refactor might be better here, using a data structure meant for left-appends, like a deque:
from collections import deque
li_ = deque(map(lambda x: x.contents[0], fetched_text))
li_.appendleft(w_list[0])

This is much faster than list.insert(0, object):

(base) ➜ ~ python -m timeit -s 'from collections import deque; x = deque(range(10000))' 'x.appendleft(1)'
10000000 loops, best of 3: 0.06 usec per loop
(base) ➜ ~ python -m timeit -s 'from collections import deque; x = list(range(10000))' 'x.insert(0, 1)'
100000 loops, best of 3: 26.2 usec per loop

These still retain order and can be iterated over just like lists.

  3. Explicitly checking whether the length of a list is 0 is not Pythonic, and it's slower:
(base) ➜ ~ python -m timeit -s 'x = []' 'len(x)==0'
10000000 loops, best of 3: 0.0552 usec per loop
(base) ➜ ~ python -m timeit -s 'x = []' 'not x'
100000000 loops, best of 3: 0.0186 usec per loop

Use if not fetched_text instead.

  4. Your self.put_queue method is unnecessary. It's more readable to just use self.q.put(value) directly.

  5. While iterating over a file, you don't need to check whether the line is empty; that extra if statement is handled implicitly by iterating over the file directly:

with open(somefile) as fh:
    for line in fh:
        # do things

This for loop will terminate when the file handle is exhausted.

answered Sep 17, 2019 at 14:58
1 comment:

  • Thank you, C.Nivs! I was never cautious about memory leaks when using bs4. I'm reading the Session Objects page. – Commented Sep 20, 2019 at 0:12

Threading

I'd like to join the others in recommending you drop your threading. You're also doing it wrong - you only grab the semaphore when creating the thread, but you never acquire it inside the thread when executing.

If you use a lock or semaphore, do it with a context manager:

with my_semaphore:
 # code...

And if you can't, you're doing something wrong.

Garbage Collection

Drop all gc.collect() lines. Python does this automatically at a frequency and at moments that are plenty sufficient for your application.

Let's have a look at your class.

class add_line:  # Class names should be TitleCase, so make this AddLine
    def __init__(self, semaphore, q):  # q is too short. Just use queue here.
        self.q = q  # queue. Same here, of course - self.queue is perfectly fine.
        self.semaphore = semaphore

    def put_queue(self, li):  # insert. This function serves no purpose. Just use
        self.q.put(li)        # self.queue.put(li) directly elsewhere.

    def fetch_word(self, w_list):
        fetched_text = ""  # targeted text. Remove this line - it's created below, and not used in between.
        path = "https://ejje.weblio.jp/content/" + w_list[1]
        res = requests.get(path)  # If you expect to use this more than ten times, use a session like @C.Nivs explains clearly.
        res.raise_for_status()
        no_starch_soup = bs4.BeautifulSoup(res.text)
        # See below for how to change this bit.
        # [Skipped Code]
        self.semaphore.release()  # This thread never acquired it.

Take a good look at what we're doing in that double conditional statement there. We select something from the soup, and if it's empty, we select from elsewhere. This is a great place to use a simple or, reducing it to this:

fetched_text = no_starch_soup.select('.conjugateRowR > table > tr > td') or \
               no_starch_soup.select('.conjugateRowR > a')
if not fetched_text:  # if we got nothing
    self.put_queue(w_list)  # return the original conjugate
else:
    li_ = list(map(lambda x: x.contents[0], fetched_text))
    li_.insert(0, w_list[0])
    self.put_queue(li_)

The \ is a line continuation token - it allows you to split a single line of Python code over two lines if it's too long. It's rarely a good idea, but in this case I consider it warranted. The statement is simple - it tries no_starch_soup.select('.conjugateRowR > table > tr > td') first, and if that results in a falsy value (like an empty string, empty list, False, None, 0, an empty dict... you get the point), then it evaluates no_starch_soup.select('.conjugateRowR > a') and puts that result in the variable.

Now let's have a look at your appended list:

li_=list(map(lambda x:x.contents[0], fetched_text))
li_.insert(0,w_list[0])
self.put_queue(li_)

This could just as easily be:

self.put_queue([w_list[0]] + list(map(lambda x: x.contents[0], fetched_text)))

At least, if we're allowed to assume that w_list is small, which it seems to be, as it's the result of a str.split() operation on a line read from a file opened in text mode.

However, I would refactor this class into a function, as it has no state you're modifying anyway. Here I'll assume you're really determined to keep threading, even though threading only has advantages if you're doing a lot of work outside Python code, like lengthy file I/O or complicated math in numpy.

def add_line(queue, semaphore, w_list):
    with semaphore:  # This single line takes care of all threading headaches for the entire function
        path = "https://ejje.weblio.jp/content/" + w_list[1]
        no_starch_soup = bs4.BeautifulSoup(requests.get(path).text)
        fetched_text = no_starch_soup.select('.conjugateRowR > table > tr > td') or \
                       no_starch_soup.select('.conjugateRowR > a')
        if not fetched_text:  # if we got nothing
            queue.put(w_list)  # return the original conjugate
        else:
            queue.put([w_list[0]] + list(map(lambda x: x.contents[0], fetched_text)))

And then let's look at your script's Thread creation:

from queue import Queue  # I like this type of import; it makes you need fewer dots.

# [...]
    sema = BoundedSemaphore(value=maxconnections)
    queue = Queue()  # Words to write to the output file
    path_w = "./threading_test_result.txt"
    with open(path) as f:
        for line in f:  # lines is plural, line is singular.
            Thread(target=add_line, args=(queue, sema, line.split(" "))).start()

Keep in mind that this will spawn a Thread for every line in the file. Unless that's somewhere below around 30 lines, it's very likely that this is what causes your slowdowns. Therefore I'd just forget about the threads if I were you - or at least cap the number of workers, as sketched below.
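A minimal sketch of a bounded alternative, using the standard library's ThreadPoolExecutor (my suggestion, not part of the original answer):

from concurrent.futures import ThreadPoolExecutor

# The pool caps the number of live threads at max_workers and queues
# the remaining work itself, so the semaphore becomes redundant
# (add_line could then drop its `with semaphore:` entirely).
with ThreadPoolExecutor(max_workers=maxconnections) as pool:
    with open(path) as f:
        for line in f:
            pool.submit(add_line, queue, sema, line.split(" "))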

And your output generation:

with open(path_w, mode='a') as f:
    while not q.empty():
        for x in q.get():
            f.write(x)
            f.write(" ")
        f.write("\n")
    f.close()

Can very easily be replaced by:

with open(path_w, mode='a') as f:
    while not queue.empty():
        print(*queue.get(), file=f)

Wait, what - isn't that the super-basic output-to-console command for Python beginners? Why yes, it is. It's a very powerful function. With the * before the entry, we unpack the list and feed its elements as separate arguments to the function. These objects will all be converted to strings and then printed, separated by the sep= keyword argument, which happens to be a single space by default. Then it's finished with the end= keyword argument, which defaults to a newline (translated to your system-specific line ending by text-mode files), so we don't need to put any thought into that either. And we can tell it where to print: the file= keyword argument perfectly understands file handles.
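A tiny demonstration of those keyword arguments (my own example):

import sys

# sep joins the unpacked items; end replaces the default newline;
# file directs the output to any writable handle.
words = ["alpha", "beta", "gamma"]
print(*words, sep="-", end="!\n", file=sys.stdout)  # prints "alpha-beta-gamma!"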

You should probably check to make sure your program doesn't empty the queue faster than the other threads can fill it, which is another reason to drop threading here. If it does, you can wait and retry after a second or so, as in the sketch below.
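One way to implement that wait-and-retry (a sketch, assuming the queue and path_w names from above; queue.Empty is the standard exception that Queue.get raises on timeout):

from queue import Empty

with open(path_w, mode='a') as f:
    while True:
        try:
            # Block up to two seconds per item instead of trusting queue.empty();
            # a brief lull no longer ends the loop while workers are still producing.
            print(*queue.get(timeout=2), file=f)
        except Empty:
            break  # nothing arrived within the timeout; assume the workers are done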

I've got the feeling that, depending on what sort of data you expect in fetched_text in the function, there are more improvements possible, but I must confess to not knowing the BeautifulSoup module at all, so I'll pass on that.

answered Sep 17, 2019 at 21:21
6 comments:

  • I'd watch out: queue.Queue does not support iteration, so your for entry in queue loop will not work. – Commented Sep 20, 2019 at 17:04
  • Whoops. Good one. Lemme fix. I really should check the more obscure data containers for iteration and such. – Commented Sep 20, 2019 at 19:42
  • Gloweye, I created about 10 threads per 30 lines, as you said: "This will spawn a Thread for every line in the file". It works three times faster than before. The threads worked correctly once the output code was placed after a for loop calling thread[i].join(). With sets of threads being run one by one, GC seems to throw the previous threads away. Next time, I'll try to do this with multiprocessing. Thank you! – Commented Sep 21, 2019 at 1:28
  • @Gloweye queue.Queue also doesn't have falsey behavior when empty. The following: from queue import Queue; q = Queue(); x = 'not empty' if q else ''; print(x) prints 'not empty', so while q will be an infinite loop. Take a look at the docs. – Commented Sep 27, 2019 at 13:37
  • I'm starting to dislike queues... weird that queues don't have a boolean meaning as the inverse of empty(). Fixed the issue, though. – Commented Sep 27, 2019 at 13:48
