
I am using requests and BeautifulSoup to scrape 20,000+ URLs, each of which contains a table of information. Essentially, each web page represents a combo, and each combo has several items, each with a description. I am scraping two elements, item and description, from each row of the table for all 20,000+ combos.

I then write this information to an Excel file. Each row represents a combo, with the first cell of the row containing the combo's URL. The header row of the file contains the item names, so for a particular combo (a particular row) and a particular item (a particular column) there is a description. Any two combos may have some items in common and others not in common, so I want the header of my Excel file to hold an exhaustive list of all items.

So for each row in a web page, I first check whether the item already exists in my Excel header:

have = False  # boolean to check if the header already contains the name
item_position = 1  # if found, find out its column number
for cell in list(ws1.rows)[0]:
    value = cell.value
    val = value.encode('utf-8')
    if item == val:
        have = True
        break
    else:
        item_position += 1

If it is found, I note its column number and write the description into that column; if not, I append the item to the next empty header cell (tracked by empty_header_cell), note that column number, and write the description there.

However, after fewer than a hundred URLs have been scraped, the process becomes drastically slower. I think this is due both to requesting the web pages and to searching the header for existing items. Is there any improvement I can make to the code to speed up the process? Here is my complete code:

from lxml import html
from bs4 import BeautifulSoup
import requests
import csv
import openpyxl
from openpyxl.workbook import Workbook

wb = openpyxl.load_workbook('Destination.xlsx')
ws1 = wb.get_sheet_by_name('Sheet1')
empty_header_cell = 2
# maintains the column number of the next empty cell in the excel file header
with open('urls.csv') as f:
    f_csv = csv.reader(f)
    header = next(f_csv)
    row_number = 2  # maintains a row number which increments after each url is scraped
    for row in f_csv:
        url = row[0]
        ws1.cell(row=row_number, column=1).value = url
        wb.save(filename="Destination.xlsx")
        try:
            page = requests.get(url)
            web = page.text
            soup = BeautifulSoup(web, 'lxml')
            table = soup.find('table', {'class': "tc_table"})  # find the table in each web page that I am going to scrape
            trs = table.find_all('tr')
            for tr in trs:
                ls = []
                for td in tr.find_all('td'):
                    ls.append(td.text)
                ls = [x.encode('utf-8') for x in ls]
                try:
                    item = ls[1]
                    description = ls[2]
                    have = False  # boolean to check if the header already contains the name
                    item_position = 1  # if found, find out its column number
                    for cell in list(ws1.rows)[0]:
                        value = cell.value
                        val = value.encode('utf-8')
                        if item == val:
                            have = True
                            break
                        else:
                            item_position += 1
                    if have == True:  # if item found
                        ws1.cell(row=row_number, column=item_position).value = description
                        wb.save(filename='Destination.xlsx')
                    elif have == False:  # if item not found
                        ws1.cell(row=1, column=empty_header_cell).value = item  # append item to the next empty header cell
                        ws1.cell(row=row_number, column=empty_header_cell).value = description
                        empty_header_cell += 1  # update next empty header cell
                        wb.save(filename='Destination.xlsx')
                except IndexError:
                    print("i am an IndexError")
            row_number += 1  # start scraping the next url
        except IndexError:  # to skip those webpages that have slightly different format so data cannot be located
            print("skipping this website")
            row_number += 1
        except AttributeError:
            print("attribute error")
            row_number += 1
asked May 26, 2017 at 12:57

1 Answer


Let me look at this from the performance perspective.

The main bottleneck

You are scraping the pages sequentially, in a blocking manner - processing URLs one at a time and not proceeding to the next URL until you are done with the current one.

There are a number of tools that can help you switch to an asynchronous strategy. Look into the Scrapy web-scraping framework. There is also aiohttp, which is built on top of asyncio.
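
As an illustration only (not part of the original answer), here is a minimal sketch of concurrent fetching with aiohttp; the table and cell structure mirror the tc_table markup from the question, while the URL list, concurrency limit, and error handling are assumptions left for you to adapt:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch(session, semaphore, url):
    # Cap the number of simultaneous requests so the target site is not hammered
    async with semaphore:
        async with session.get(url) as response:
            return url, await response.text()

async def scrape_all(urls, max_concurrency=20):
    semaphore = asyncio.Semaphore(max_concurrency)
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))
    results = {}
    for url, html_text in pages:
        soup = BeautifulSoup(html_text, 'lxml')
        table = soup.find('table', {'class': 'tc_table'})
        if table is None:
            continue  # page has a different layout, skip it
        items = {}  # item name -> description for this combo
        for tr in table.find_all('tr'):
            tds = tr.find_all('td')
            if len(tds) >= 3:
                items[tds[1].get_text(strip=True)] = tds[2].get_text(strip=True)
        results[url] = items
    return results

# results = asyncio.run(scrape_all(list_of_urls))

With this shape, the total wall-clock time is bounded by the slowest responses rather than by the sum of all round trips.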

Gathering scraping results

I don't think you actually need an Excel writer here, since you are only writing simple text data - you are not concerned with advanced data types or with workbook styling and formatting. Use a CSV writer instead - Python has a built-in csv module.
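
For example, assuming the scraping step produces one item -> description mapping per URL (as in the sketch above), csv.DictWriter can build the exhaustive header for you in one pass before writing:

import csv

def write_results(results, path='Destination.csv'):
    # Build the exhaustive header: every item name seen in any combo
    fieldnames = ['url']
    seen = set()
    for items in results.values():
        for name in items:
            if name not in seen:
                seen.add(name)
                fieldnames.append(name)
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
        writer.writeheader()
        for url, items in results.items():
            row = {'url': url}
            row.update(items)  # missing items are filled with restval
            writer.writerow(row)

This also avoids rescanning the header row and resaving the whole workbook for every cell, which is a large part of the slowdown in the original loop.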

Note that if you switch to Scrapy, you get its built-in CSV exporter "for free".
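
Roughly, a spider could look like the sketch below (again an assumption-laden illustration, not the answerer's code); the selectors follow the question's tc_table markup, and you would run it with something like scrapy runspider combos_spider.py -o combos.csv. One caveat: because each combo yields a different set of keys, you may need to declare the full column list (for example via the FEED_EXPORT_FIELDS setting) or post-process the output to get the exhaustive header.

import scrapy

class CombosSpider(scrapy.Spider):
    name = 'combos'

    def start_requests(self):
        # urls.csv: first column holds the URL, first line is a header row
        with open('urls.csv') as f:
            next(f)
            for line in f:
                url = line.split(',')[0].strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        record = {'url': response.url}
        for row in response.css('table.tc_table tr'):
            cells = row.css('td::text').getall()
            if len(cells) >= 3:
                record[cells[1].strip()] = cells[2].strip()
        yield record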

answered May 29, 2017 at 16:54
