I am trying to get the url and sneaker titles at https://stockx.com/sneakers.
This is my code so far:
in main.py
from bs4 import BeautifulSoup
from utils import generate_request_header
import requests
url = "https://stockx.com/sneakers"
html = requests.get(url, headers=generate_request_header()).content
soup = BeautifulSoup(html, "lxml")
print soup
in utils.py
def generate_request_header():
header = BASE_REQUEST_HEADER
header["User-Agent"] = random.choice(USER_AGENT_HEADER_LIST)
return header
But whenever I print soup, I get the following output: https://pastebin.com/Ua6B6241. There doesn't seem to be any HTML extracted. How would I get it? Should I be using something like Selenium?
2 Answers 2
requests doesn't seem to be able to verify the ssl certificates, to temporarily bypass this error, you can use verify=False, i.e.:
requests.get(url, headers=generate_request_header(), verify=False)
To fix it permanently, you may want to read:
http://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification
Comments
I'm guessing the data you're looking for are at line 126 in the pastebin. I've never tried to extract the text of a script but I'm sure it could be done.
In lxml, something like:
source_code.xpath('//script[@type="text/javascript"]') should return a list of all the scripts as objects.
Or to try and get straight to the "tickers":
[i for i in source_code.xpath('//script[@type="text/javascript"]') if 'tickers' in i.xpath('string')]
BASE_REQUEST_HEADERandUSER_AGENT_HEADER_LIST? are they inside the functionsscope?