
I needed a lot of data for a TensorFlow project, so I made a web scraper to get all of the text and links off of a website, then repeat the process at each of those links.

I left it on overnight and it did not get much done, so I spent the day optimizing it. I can't find any more ways to optimize it (if anyone knows one, I would like to hear it). I used it on CNN and now I have a 9 GB text file.

BTW: I used faster_than_requests and selectolax because they are faster than urllib3 and bs4, and you should check them out.

    import cython  # helps speed up code
    from selectolax.parser import HTMLParser  # bs4 but faster
    import faster_than_requests  # urllib but faster
    import _pickle as pickle  # saving code
    from colorama import init  # just makes error messages stand out
    from colorama import Fore, Back, Style

    init()  # colorama thing

    # cdef is a Cython thing, helping speed up code
    cdef int i = 0
    cdef list urls
    cdef list txt
    cdef set visits  # is a set for efficiency
    cdef str mainsite = "https://www.cnn.com"  # the main site keeps the scraper
                                               # from straying too far from its
                                               # original site
    cdef str source
    parsing = True

    try:
        with open('visits.pickle', 'rb') as f:
            visload = pickle.load(f)
        visits = visload[1]
        i = visload[0]
    except Exception as e:
        print(Back.RED + "Error loading visits: " + str(e))
        visits = set()
        i = 0

    try:
        with open('txt.pickle', 'rb') as f:
            txt = pickle.load(f)
    except Exception as e:
        print(Back.RED + "Error loading txt: " + str(e))
        txt = []

    try:
        with open('links.pickle', 'rb') as f:
            urls = pickle.load(f)
    except Exception as e:
        print(Back.RED + "Error loading urls: " + str(e))
        urls = ["https://www.cnn.com"]

    while parsing:
        try:
            if urls[0][0] == "/":  # checks whether it can go to the site directly
                                   # or needs to prepend the main site
                source = faster_than_requests.get2str(mainsite + urls[0])
                dom = HTMLParser(source)
                print(Back.BLACK + mainsite + urls[0])
            else:
                source = faster_than_requests.get2str(urls[0])
                dom = HTMLParser(source)
                print(Back.BLACK + urls[0])
            for tag in dom.tags('p'):
                txt.append(str(tag.text()))  # finds text and saves it
            for tag in dom.tags('a'):
                attrs = tag.attributes
                if 'href' in attrs:
                    urls.append(attrs['href'])  # finds links and saves them
        except:
            print(Back.RED + f"Error: {urls[0]}")  # it will throw an error if it
                                                   # tries to go to a sub-page of
                                                   # another site, but this is an
                                                   # intended feature
        visits.add(urls[0])  # visits keeps track of visited web pages
        i = i + 1
        clean = True  # clean makes sure that it does not repeat a web page
        while clean:
            if urls[0] in visits:
                del urls[0]
            else:
                clean = False
        print(Back.BLACK + f"urls: {len(urls)}, i: {i}, text len: {len(txt)}")
        if i % 10000 == 0:
            # save every 10000 web pages
            with open('txt.pickle', 'wb') as f:
                pickle.dump(txt, f, pickle.HIGHEST_PROTOCOL)
            with open('links.pickle', 'wb') as f:
                pickle.dump(urls, f, pickle.HIGHEST_PROTOCOL)
            with open('visits.pickle', 'wb') as f:
                pickle.dump([i, visits], f, pickle.HIGHEST_PROTOCOL)
        if len(urls) == 0:
            parsing = False

    print(txt)
    with open('txt.pickle', 'wb') as f:
        pickle.dump(txt, f, pickle.HIGHEST_PROTOCOL)
    with open('links.pickle', 'wb') as f:
        pickle.dump(urls, f, pickle.HIGHEST_PROTOCOL)
    with open('visits.pickle', 'wb') as f:
        pickle.dump([i, visits], f, pickle.HIGHEST_PROTOCOL)
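One fragile spot in the save-every-10000-pages step above: if the process dies mid-dump, the half-written pickle clobbers the old checkpoint. A hedged refinement (my own sketch, not part of the original code; `safe_dump`/`safe_load` are illustrative names) is to write to a temp file and atomically rename it:

    import os
    import pickle
    import tempfile

    def safe_dump(obj, path):
        # write to a sibling temp file first, then atomically replace the
        # target, so a crash mid-write never leaves a truncated pickle behind
        tmp = path + '.tmp'
        with open(tmp, 'wb') as f:
            pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
        os.replace(tmp, path)

    def safe_load(path):
        with open(path, 'rb') as f:
            return pickle.load(f)

    # demo in a throwaway directory, mirroring the [i, visits] checkpoint shape
    path = os.path.join(tempfile.mkdtemp(), 'visits.pickle')
    safe_dump([0, {"https://www.cnn.com"}], path)

`os.replace` is atomic on POSIX filesystems, so readers always see either the old checkpoint or the new one, never a mix.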
asked Sep 29, 2019 at 16:39
  • Is there a good alternative to pickle that (a) does not take three hours to save huge files, and (b) compacts files better? Commented Sep 30, 2019 at 1:38
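No answer to this comment appears in the thread, but one candidate worth trying (a sketch of mine, not from the post; `save_compressed`/`load_compressed` are hypothetical helper names) is wrapping the pickle stream in gzip, trading a little CPU for much smaller files:

    import gzip
    import os
    import pickle
    import tempfile

    def save_compressed(obj, path):
        # compresslevel=1 favors speed; higher levels shrink more but cost CPU
        with gzip.open(path, 'wb', compresslevel=1) as f:
            pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

    def load_compressed(path):
        with gzip.open(path, 'rb') as f:
            return pickle.load(f)

    # demo in a throwaway directory with a visits-style set
    path = os.path.join(tempfile.mkdtemp(), 'visits.pickle.gz')
    save_compressed({"https://www.cnn.com", "/world"}, path)

Since the scraped text is highly repetitive HTML prose, it should compress well; whether it beats three hours depends on the disk, so this is only a direction to test.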

1 Answer


If you want absolute performance:

  1. You could avoid printing anything at all, since printing can be slow in some cases. If you absolutely need to know what happened, you could buffer the messages and flush them at the end of the process.

  2. I know this is delicate in Python, but you could try visiting multiple pages at the same time, with threads or the Python equivalent.
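Point 2 might look something like the following sketch. It uses the standard-library `concurrent.futures` and `urllib` rather than faster_than_requests (whose API differs), and `fetch`/`fetch_all` are illustrative names, not anything from the question:

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url):
        # network I/O releases the GIL, so threads genuinely overlap the waiting
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, resp.read()

    def fetch_all(urls, workers=8):
        # download a batch of pages concurrently, returning {url: body}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(fetch, urls))

    # demo with data: URLs so the sketch runs without touching the network
    pages = fetch_all(["data:text/plain,hello", "data:text/plain,world"])

The scraper could pop a batch of unvisited URLs from its queue, call something like `fetch_all` on them, then parse the results serially; parsing is CPU-bound, so the win comes from overlapping the network waits.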

answered Sep 29, 2019 at 20:11
  • I was wondering if print was slowing it down. I might only do it when the file saves. Commented Sep 29, 2019 at 20:42
