
I needed a lot of data for a TensorFlow project, so I made a web scraper to get all of the text and links off of a website, then repeat the process at each of those links.

I left it on overnight and it did not get much done, so I spent the day optimizing it. I can't find any more ways to optimize it (if anyone knows one, I would like to hear it). I used it on CNN and now I have a 9 GB text file.

BTW: I used faster_than_requests and selectolax because they are faster than urllib3 and bs4, and you should check them out.

    import cython  # helps speed up code
    from selectolax.parser import HTMLParser  # bs4 but faster
    import faster_than_requests  # urllib but faster
    import _pickle as pickle  # saving code
    from colorama import init  # just makes error messages stand out
    from colorama import Fore, Back, Style

    init()  # colorama thing

    # cdef is a Cython thing, helping speed up code
    cdef int i = 0
    cdef list urls
    cdef list txt
    cdef set visits  # is a set for efficiency
    cdef str mainsite = "https://www.cnn.com"  # the main site keeps the scraper
                                               # from straying too far from its
                                               # original site
    cdef str source
    parsing = True

    try:
        with open('visits.pickle', 'rb') as f:
            visload = pickle.load(f)
        visits = visload[1]
        i = visload[0]
    except Exception as e:
        print(Back.RED + "Error loading visits: " + str(e))
        visits = set()
        i = 0

    try:
        with open('txt.pickle', 'rb') as f:
            txt = pickle.load(f)
    except Exception as e:
        print(Back.RED + "Error loading txt: " + str(e))
        txt = []

    try:
        with open('links.pickle', 'rb') as f:
            urls = pickle.load(f)
    except Exception as e:
        print(Back.RED + "Error loading urls: " + str(e))
        urls = ["https://www.cnn.com"]

    while parsing:
        try:
            if urls[0][0] == "/":  # checks whether it can go to the site directly
                                   # or needs to prepend the main site
                source = faster_than_requests.get2str(mainsite + urls[0])
                dom = HTMLParser(source)
                print(Back.BLACK + mainsite + urls[0])
            else:
                source = faster_than_requests.get2str(urls[0])
                dom = HTMLParser(source)
                print(Back.BLACK + urls[0])
            for tag in dom.tags('p'):
                txt.append(str(tag.text()))  # finds text and saves it
            for tag in dom.tags('a'):
                attrs = tag.attributes
                if 'href' in attrs:
                    urls.append(attrs['href'])  # finds links and saves them
        except:
            print(Back.RED + f"Error: {urls[0]}")  # it will throw an error if it
                                                   # tries to go to a sub-page of
                                                   # another site, but this is an
                                                   # intended feature
        visits.add(urls[0])  # visits keeps track of visited web pages
        i = i + 1
        clean = True  # clean makes sure that it does not repeat a web page
        while clean:
            if urls[0] in visits:
                del urls[0]
            else:
                clean = False
        print(Back.BLACK + f"urls: {len(urls)}, i: {i}, text len: {len(txt)}")
        if i % 10000 == 0:
            # save every 10000 web pages
            with open('txt.pickle', 'wb') as f:
                pickle.dump(txt, f, pickle.HIGHEST_PROTOCOL)
            with open('links.pickle', 'wb') as f:
                pickle.dump(urls, f, pickle.HIGHEST_PROTOCOL)
            with open('visits.pickle', 'wb') as f:
                pickle.dump([i, visits], f, pickle.HIGHEST_PROTOCOL)
        if len(urls) == 0:
            parsing = False

    print(txt)
    with open('txt.pickle', 'wb') as f:
        pickle.dump(txt, f, pickle.HIGHEST_PROTOCOL)
    with open('links.pickle', 'wb') as f:
        pickle.dump(urls, f, pickle.HIGHEST_PROTOCOL)
    with open('visits.pickle', 'wb') as f:
        pickle.dump([i, visits], f, pickle.HIGHEST_PROTOCOL)
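One fragile spot in the save-every-10000-pages step above: if the process dies mid-dump, the half-written pickle clobbers the old checkpoint. A hedged refinement (my own sketch, not part of the original code; `safe_dump`/`safe_load` are illustrative names) is to write to a temp file and atomically rename it:

    import os
    import pickle
    import tempfile

    def safe_dump(obj, path):
        # write to a sibling temp file first, then atomically replace the
        # target, so a crash mid-write never leaves a truncated pickle behind
        tmp = path + '.tmp'
        with open(tmp, 'wb') as f:
            pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
        os.replace(tmp, path)

    def safe_load(path):
        with open(path, 'rb') as f:
            return pickle.load(f)

    # demo in a throwaway directory, mirroring the [i, visits] checkpoint shape
    path = os.path.join(tempfile.mkdtemp(), 'visits.pickle')
    safe_dump([0, {"https://www.cnn.com"}], path)

`os.replace` is atomic on POSIX filesystems, so readers always see either the old checkpoint or the new one, never a mix.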
asked Sep 29, 2019 at 16:39
  • Is there a good alternative to pickle that (a) does not take three hours to save huge files, and (b) compacts files better? Commented Sep 30, 2019 at 1:38
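No answer to this comment appears in the thread, but one candidate worth trying (a sketch of mine, not from the post; `save_compressed`/`load_compressed` are hypothetical helper names) is wrapping the pickle stream in gzip, trading a little CPU for much smaller files:

    import gzip
    import os
    import pickle
    import tempfile

    def save_compressed(obj, path):
        # compresslevel=1 favors speed; higher levels shrink more but cost CPU
        with gzip.open(path, 'wb', compresslevel=1) as f:
            pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

    def load_compressed(path):
        with gzip.open(path, 'rb') as f:
            return pickle.load(f)

    # demo in a throwaway directory with a visits-style set
    path = os.path.join(tempfile.mkdtemp(), 'visits.pickle.gz')
    save_compressed({"https://www.cnn.com", "/world"}, path)

Since the scraped text is highly repetitive HTML prose, it should compress well; whether it beats three hours depends on the disk, so this is only a direction to test.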

1 Answer


If you want absolute performance:

  1. You could avoid printing anything at all, since printing can be slow in some cases. If you absolutely need to know what happened, you could buffer the messages and flush them at the end of the process.

  2. I know this is delicate in Python, but you could try visiting multiple pages at the same time, with threads or the Python equivalent.
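Point 2 might look something like the following sketch. It uses the standard-library `concurrent.futures` and `urllib` rather than faster_than_requests (whose API differs), and `fetch`/`fetch_all` are illustrative names, not anything from the question:

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url):
        # network I/O releases the GIL, so threads genuinely overlap the waiting
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, resp.read()

    def fetch_all(urls, workers=8):
        # download a batch of pages concurrently, returning {url: body}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(fetch, urls))

    # demo with data: URLs so the sketch runs without touching the network
    pages = fetch_all(["data:text/plain,hello", "data:text/plain,world"])

The scraper could pop a batch of unvisited URLs from its queue, call something like `fetch_all` on them, then parse the results serially; parsing is CPU-bound, so the win comes from overlapping the network waits.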

answered Sep 29, 2019 at 20:11
  • I was wondering if print was slowing it down. I might only do it when the file saves. Commented Sep 29, 2019 at 20:42
