
I am using requests and BeautifulSoup to scrape 20,000+ URLs, each of which contains a table of information. Essentially, each web page represents a combo, and each combo has several items, each with a description. I am scraping two elements, item and description, from each row of the table for all 20,000+ combos.

I then write this information to an Excel file. Each row represents a combo, with the first cell of the row containing the combo's URL. The header row of the file contains the item names, so for a particular combo (a particular row) and a particular item (a particular column) there is a description. Any two combos may have some items in common and others not in common, so I want the header of my Excel file to hold an exhaustive list of all items.

So for each row in a web page, I first check whether the item already exists in my Excel header:

have = False  # boolean to check if the header already contains the name
item_position = 1  # if found, find out its column number
for cell in list(ws1.rows)[0]:
    value = cell.value
    val = value.encode('utf-8')
    if item == val:
        have = True
        break
    else:
        item_position += 1

If it is found, I note its column number and write the description into that column; if not, I append the item to the next empty header cell (tracked by empty_header_cell), note that column number, and write the description there.

However, after fewer than a hundred URLs have been scraped, the process becomes drastically slower. I think this is due both to requesting the web pages and to searching the header for existing items. Is there any improvement I can make to the code to speed up the process? Here is my complete code:

from lxml import html
from bs4 import BeautifulSoup
import requests
import csv
import openpyxl
from openpyxl.workbook import Workbook

wb = openpyxl.load_workbook('Destination.xlsx')
ws1 = wb.get_sheet_by_name('Sheet1')
empty_header_cell = 2
# maintains the column number of the next empty cell in the excel file header
with open('urls.csv') as f:
    f_csv = csv.reader(f)
    header = next(f_csv)
    row_number = 2  # maintains a row number which increments after each url is scraped
    for row in f_csv:
        url = row[0]
        ws1.cell(row=row_number, column=1).value = url
        wb.save(filename="Destination.xlsx")
        try:
            page = requests.get(url)
            web = page.text
            soup = BeautifulSoup(web, 'lxml')
            table = soup.find('table', {'class': "tc_table"})  # find the table in each web page that I am going to scrape
            trs = table.find_all('tr')
            for tr in trs:
                ls = []
                for td in tr.find_all('td'):
                    ls.append(td.text)
                ls = [x.encode('utf-8') for x in ls]
                try:
                    item = ls[1]
                    description = ls[2]
                    have = False  # boolean to check if the header already contains the name
                    item_position = 1  # if found, find out its column number
                    for cell in list(ws1.rows)[0]:
                        value = cell.value
                        val = value.encode('utf-8')
                        if item == val:
                            have = True
                            break
                        else:
                            item_position += 1
                    if have == True:  # if item found
                        ws1.cell(row=row_number, column=item_position).value = description
                        wb.save(filename='Destination.xlsx')
                    elif have == False:  # if item not found
                        ws1.cell(row=1, column=empty_header_cell).value = item  # append item to the next empty header cell
                        ws1.cell(row=row_number, column=empty_header_cell).value = description
                        empty_header_cell += 1  # update next empty header cell
                        wb.save(filename='Destination.xlsx')
                except IndexError:
                    print("i am an IndexError")
            row_number += 1  # start scraping the next url
        except IndexError:  # to skip those webpages that have slightly different format so data cannot be located
            print("skipping this website")
            row_number += 1
        except AttributeError:
            print("attribute error")
            row_number += 1
asked May 26, 2017 at 12:57

1 Answer


Let me look at this from the performance perspective.

The main bottleneck

You are scraping the pages sequentially, in a blocking manner - processing URLs one at a time and not proceeding to the next URL until you are done with the current one.

There are a number of tools that can help you switch to an asynchronous strategy. Look into the Scrapy web-scraping framework. There is also aiohttp, which is built on top of asyncio.
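
As an illustration only (not part of the original answer), here is a minimal sketch of concurrent fetching with aiohttp; the table and cell structure mirror the tc_table markup from the question, while the URL list, concurrency limit, and error handling are assumptions left for you to adapt:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch(session, semaphore, url):
    # Cap the number of simultaneous requests so the target site is not hammered
    async with semaphore:
        async with session.get(url) as response:
            return url, await response.text()

async def scrape_all(urls, max_concurrency=20):
    semaphore = asyncio.Semaphore(max_concurrency)
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))
    results = {}
    for url, html_text in pages:
        soup = BeautifulSoup(html_text, 'lxml')
        table = soup.find('table', {'class': 'tc_table'})
        if table is None:
            continue  # page has a different layout, skip it
        items = {}  # item name -> description for this combo
        for tr in table.find_all('tr'):
            tds = tr.find_all('td')
            if len(tds) >= 3:
                items[tds[1].get_text(strip=True)] = tds[2].get_text(strip=True)
        results[url] = items
    return results

# results = asyncio.run(scrape_all(list_of_urls))

With this shape, the total wall-clock time is bounded by the slowest responses rather than by the sum of all round trips.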

Gathering scraping results

I don't think you actually need an Excel writer here, since you are only writing simple text data - you are not concerned with advanced data types or with workbook styling and formatting. Use a CSV writer instead - Python has a built-in csv module.
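
For example, assuming the scraping step produces one item -> description mapping per URL (as in the sketch above), csv.DictWriter can build the exhaustive header for you in one pass before writing:

import csv

def write_results(results, path='Destination.csv'):
    # Build the exhaustive header: every item name seen in any combo
    fieldnames = ['url']
    seen = set()
    for items in results.values():
        for name in items:
            if name not in seen:
                seen.add(name)
                fieldnames.append(name)
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
        writer.writeheader()
        for url, items in results.items():
            row = {'url': url}
            row.update(items)  # missing items are filled with restval
            writer.writerow(row)

This also avoids rescanning the header row and resaving the whole workbook for every cell, which is a large part of the slowdown in the original loop.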

Note that if you switch to Scrapy, you get its built-in CSV exporter "for free".
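
Roughly, a spider could look like the sketch below (again an assumption-laden illustration, not the answerer's code); the selectors follow the question's tc_table markup, and you would run it with something like scrapy runspider combos_spider.py -o combos.csv. One caveat: because each combo yields a different set of keys, you may need to declare the full column list (for example via the FEED_EXPORT_FIELDS setting) or post-process the output to get the exhaustive header.

import scrapy

class CombosSpider(scrapy.Spider):
    name = 'combos'

    def start_requests(self):
        # urls.csv: first column holds the URL, first line is a header row
        with open('urls.csv') as f:
            next(f)
            for line in f:
                url = line.split(',')[0].strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        record = {'url': response.url}
        for row in response.css('table.tc_table tr'):
            cells = row.css('td::text').getall()
            if len(cells) >= 3:
                record[cells[1].strip()] = cells[2].strip()
        yield record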

answered May 29, 2017 at 16:54
