This is my first time writing a web scraper using Selenium and BeautifulSoup. The website I'm scraping is https://www.grainger.com/, and the script pulls a specific set of SKUs stored in an Excel file. Scraping 1,000 items takes ~8 hours, and I need to scrape 30,000 items. Is there anything I can improve to make the scrape run faster?
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from random import randint
import datetime
from selenium.webdriver.chrome.options import Options
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
# open the file
data = pd.read_excel(r'Grainger Sku List.xlsx','Sheet1')
options = Options()
options.headless = True
# pass the headless options in, and let webdriver_manager fetch the driver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
# get the urls
urls = data.URL
Graingerlist = []
for url in urls:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    time.sleep(randint(1, 11))
    try:
        Name = soup.find('h1', class_="lypQpT").text.strip()
    except:
        Name = 'Check Name'
    try:
        Price = soup.find('span', class_="vOg9Zc Jlt5uj").text.strip()
    except:
        pass
    try:
        Price = soup.find('span', class_="YrWqzV").text.strip()
    except:
        Price = 'Check Price'
    try:
        SPrice = soup.find('span', class_="vOg9Zc KHonQU Jlt5uj").text.strip()
    except:
        SPrice = 'No Sale Price'
    try:
        Item = soup.findAll("div", {"class": "vDgTDH"})
    except:
        Item = 'Check Item'
    try:
        TierR = soup.findAll("td", {"class": "TfXnvH"})
    except:
        TierR = 'No Tier Price'
    try:
        Tier1R = TierR[0].text.strip()
    except:
        Tier1R = 'No Tier 1'
    try:
        Tier2R = TierR[1].text.strip()
    except:
        Tier2R = 'No Tier 2'
    try:
        Tier3R = TierR[2].text.strip()
    except:
        Tier3R = 'No Tier 3'
    try:
        Tier4R = TierR[3].text.strip()
    except:
        Tier4R = 'No Tier 4'
    try:
        Tier5R = TierR[4].text.strip()
    except:
        Tier5R = 'No Tier 5'
    try:
        Tier6R = TierR[5].text.strip()
    except:
        Tier6R = 'No Tier 6'
    try:
        Tier7R = TierR[6].text.strip()
    except:
        Tier7R = 'No Tier 7'
    try:
        TierP = soup.findAll("span", {"class": "MLh0qn"})
    except:
        TierP = 'No Tier Price'
    try:
        Tier1P = TierP[0].text.strip()
    except:
        Tier1P = 'No Tier 1'
    try:
        Tier2P = TierP[1].text.strip()
    except:
        Tier2P = 'No Tier 2'
    try:
        Tier3P = TierP[2].text.strip()
    except:
        Tier3P = 'No Tier 3'
    try:
        Tier4P = TierP[3].text.strip()
    except:
        Tier4P = 'No Tier 4'
    try:
        Tier5P = TierP[4].text.strip()
    except:
        Tier5P = 'No Tier 5'
    try:
        Tier6P = TierP[5].text.strip()
    except:
        Tier6P = 'No Tier 6'
    try:
        Tier7P = TierP[6].text.strip()
    except:
        Tier7P = 'No Tier 7'
    try:
        ItemNum = Item[0].text.strip()
    except:
        ItemNum = 'Check Item Number'
    try:
        MPN = Item[1].text.strip()
    except:
        MPN = 'Check MPN'
    try:
        UOM = soup.find('span', class_="tqfrFT").text.strip()
    except:
        UOM = 'Check UOM'
    try:
        Tax = soup.findAll("li", {"class": "sIWwJ-"})
    except:
        Tax = 'Check Taxonomy'
    try:
        Link = url
    except:
        Link = 'Check Link'
    try:
        Tax0 = Tax[0].text.strip()
    except:
        Tax0 = 'No Tax0'
    try:
        Tax1 = Tax[1].text.strip()
    except:
        Tax1 = 'No Tax1'
    try:
        Tax2 = Tax[2].text.strip()
    except:
        Tax2 = 'No Tax2'
    try:
        Tax3 = Tax[3].text.strip()
    except:
        Tax3 = 'No Tax3'
    try:
        Tax4 = Tax[4].text.strip()
    except:
        Tax4 = 'No Tax4'
    try:
        Tax5 = Tax[5].text.strip()
    except:
        Tax5 = 'No Tax5'
    try:
        Tax6 = Tax[6].text.strip()
    except:
        Tax6 = 'No Tax6'
    try:
        Tax7 = Tax[7].text.strip()
    except:
        Tax7 = 'No Tax7'
    Grainger = {
        'Name': Name,
        'Price': Price,
        'Sale Price': SPrice,
        'Tier 1 Range': Tier1R,
        'Tier 1 Price': Tier1P,
        'Tier 2 Range': Tier2R,
        'Tier 2 Price': Tier2P,
        'Tier 3 Range': Tier3R,
        'Tier 3 Price': Tier3P,
        'Tier 4 Range': Tier4R,
        'Tier 4 Price': Tier4P,
        'Tier 5 Range': Tier5R,
        'Tier 5 Price': Tier5P,
        'Tier 6 Range': Tier6R,
        'Tier 6 Price': Tier6P,
        'Tier 7 Range': Tier7R,
        'Tier 7 Price': Tier7P,
        'Item #': ItemNum,
        'MPN': MPN,
        'UOM': UOM,
        'Tax0': Tax0,
        'Tax1': Tax1,
        'Tax2': Tax2,
        'Tax3': Tax3,
        'Tax4': Tax4,
        'Tax5': Tax5,
        'Tax6': Tax6,
        'Tax7': Tax7,
        'url': Link
    }
    Graingerlist.append(Grainger)
    print('Saving', Grainger['Name'])
df = pd.DataFrame(Graingerlist)
now = datetime.datetime.now()
e = '{}-{}-{}'.format(now.year, now.month, now.day)
df.to_excel(rf'GRG Sheet 1 {e}.xlsx', index=False)
- Are you doing this for fun/practice, or do you specifically need info from Grainger? If the latter, your approach is incorrect and you need to read the Integrated Ordering section of grainger.ca/en/content/services/ecommerce-solutions. – Reinderien, Dec 28, 2022 at 18:26
- Please show a representative sample of your Excel file. – Reinderien, Dec 28, 2022 at 18:29
- It is not a "for fun" project, but the link you posted does not relate to what I'm trying to do. As for the Excel file, it's simply a list of URLs in column A that I want scraped, like: grainger.com/product/10A593, grainger.com/product/10A598, grainger.com/product/10A666, grainger.com/product/10A994, grainger.com/product/10C002. – cahilltyler, Dec 28, 2022 at 18:40
- I beg to differ: if you do what they suggest and contact them about integration options, they will almost surely recommend against scraping and toward the use of a pre-existing API. – Reinderien, Dec 28, 2022 at 18:41
- That option is for customers of Grainger, so it is not a viable option for my project. I welcome any recommendations on improving my existing code. – cahilltyler, Dec 28, 2022 at 18:48
1 Answer
If you actually care about speed, this is not a serious attempt at commerce integration and you need to contact the vendor to get API details. You seem extremely resistant to this but it is the only path. If you do not do this, in a week or two the vendor could very easily change their class names and/or block your IP and your efforts will have been for naught. Said another way, Grainger has good reason to block you as you present a traffic load that does not match (and may interfere with) their intended use case. Quoting their terms of access,
C. Harm to Our Systems, Property and Security
When using the Grainger Property, you will not: [...] (ii) retrieve, index, scrape, data mine or otherwise gather any Grainger Content, Grainger Property, or other data, content, or materials (including through use of any robot, spider, screen scraping, web harvesting, data extraction, or similar software or technologies).
A half-measure that would be just as fragile but may somewhat increase speed is to avoid crawling individual URLs, and instead make a user product list (https://www.grainger.com/myaccount/mylistdetails) with your interesting products. I do not know the upper limit on the size of this list, but if you can fetch it all in one GET, or even in paginated GETs, then this should be faster.
Since your use case violates the vendor's terms of access I will not paste fully-formed alternative code, but I will offer bullet-point ways that your use of Python should improve:
- Do not use Selenium. Use bare Requests (a minimal sketch follows this list).
- Do not capitalise local variables.
- Do not dump all of your code into the global namespace. Write functions, and add PEP 484 type hints to their signatures.
- Computers are good at loops: do not spell out all of the tier and taxonomy items manually; instead, build them in loops (see the second sketch below).
- Do not call .format() with fragments of a datetime. Instead, embed the datetime format into an interpolated string, as in f'GRG Sheet 1 {now:%Y-%m-%d}.xlsx'.
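To illustrate the first point, here is a minimal sketch of what the page fetch could look like with Requests instead of a browser. This assumes the product data is present in the static HTML response (not guaranteed on a JavaScript-heavy site), reuses the User-Agent header the question already defines but never uses, and the fetch_product_page name is purely illustrative:

import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

def fetch_product_page(session: requests.Session, url: str) -> BeautifulSoup:
    # One plain HTTP GET on a shared session instead of a full browser round-trip
    response = session.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')

Using one shared requests.Session also keeps the connection alive across requests, which is another easy saving compared to driving every page load through Chrome.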
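Similarly, for the function and loop points, the fourteen tier try/except blocks could collapse into a single typed helper along these lines. This is only a sketch; the class names are copied from the question and are just as fragile here as there, and parse_tiers is an illustrative name:

from bs4 import BeautifulSoup

def parse_tiers(soup: BeautifulSoup) -> dict[str, str]:
    # Collect every tier range/price in one pass instead of one try/except per tier
    ranges = [td.text.strip() for td in soup.find_all('td', class_='TfXnvH')]
    prices = [span.text.strip() for span in soup.find_all('span', class_='MLh0qn')]
    row: dict[str, str] = {}
    for i in range(7):
        row[f'Tier {i + 1} Range'] = ranges[i] if i < len(ranges) else f'No Tier {i + 1}'
        row[f'Tier {i + 1} Price'] = prices[i] if i < len(prices) else f'No Tier {i + 1}'
    return row

The eight taxonomy fields can be built the same way from the li elements, and the resulting dictionaries merged into the row that gets appended to the output list.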