Scraping website in which html is injected with javascript

Asked 8 years, 9 months ago

Viewed 188 times

I am trying to get the url and sneaker titles at https://stockx.com/sneakers.

This is my code so far:

In main.py:

from bs4 import BeautifulSoup
from utils import generate_request_header
import requests

url = "https://stockx.com/sneakers"
# Fetch the raw page; this is the HTML before any JavaScript has run
html = requests.get(url, headers=generate_request_header()).content
soup = BeautifulSoup(html, "lxml")
print(soup)

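In utils.py:

import random

# BASE_REQUEST_HEADER and USER_AGENT_HEADER_LIST are defined elsewhere
def generate_request_header():
    # Copy the base header so the shared dict is not mutated on each call
    header = dict(BASE_REQUEST_HEADER)
    header["User-Agent"] = random.choice(USER_AGENT_HEADER_LIST)
    return header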
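For readers who cannot open the pastebin, here is a minimal self-contained sketch of the header helper. The contents of BASE_REQUEST_HEADER and USER_AGENT_HEADER_LIST below are hypothetical placeholders, not the asker's real values; only the function itself follows the code above.

```python
import random

# Hypothetical values for illustration only; the asker's real ones are not shown here.
BASE_REQUEST_HEADER = {
    "Accept": "text/html",
    "Accept-Language": "en-US,en;q=0.8",
}
USER_AGENT_HEADER_LIST = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


def generate_request_header():
    # Copy first so repeated calls never modify the shared base dict.
    header = dict(BASE_REQUEST_HEADER)
    header["User-Agent"] = random.choice(USER_AGENT_HEADER_LIST)
    return header


header = generate_request_header()
print(header["User-Agent"])
```

Copying the dict is a small design choice: assigning BASE_REQUEST_HEADER directly would make every call write into the same module-level object.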

But whenever I print soup, I get the following output: https://pastebin.com/Ua6B6241. There doesn't seem to be any HTML extracted. How would I get it? Should I be using something like Selenium?

python

asked Apr 8, 2017 at 8:58

methuselah

Where is the code for BASE_REQUEST_HEADER and USER_AGENT_HEADER_LIST? Are they inside the function's scope?

– Pedro Lobito, commented Apr 8, 2017 at 9:11
Here: pastebin.com/E19rtbZy

– methuselah, commented Apr 8, 2017 at 9:12

2 Answers

requests doesn't seem to be able to verify the site's SSL certificate. To bypass this error temporarily, you can pass verify=False, i.e.:

requests.get(url, headers=generate_request_header(), verify=False)

To fix it permanently, you may want to read:

http://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification
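As a rough sketch of the permanent route, you can point requests at a CA bundle instead of switching verification off. This assumes the certifi package is installed (it normally comes with requests); build_verified_session is a hypothetical helper name, not part of requests.

```python
import certifi
import requests


def build_verified_session(ca_bundle=None):
    """Return a Session that verifies TLS certificates against a CA bundle.

    ca_bundle is a path to a PEM file; when omitted, certifi's bundle is used.
    """
    session = requests.Session()
    # Session.verify accepts True/False or a path to a CA bundle file.
    session.verify = ca_bundle or certifi.where()
    return session


session = build_verified_session()
print(session.verify)
```

Calling session.get(url, headers=generate_request_header()) would then use that bundle for every request. Whether this resolves the error depends on why verification failed in the first place, which the question does not show.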

answered Apr 8, 2017 at 9:17

Pedro Lobito

I'm guessing the data you're looking for are at line 126 in the pastebin. I've never tried to extract the text of a script but I'm sure it could be done.

In lxml, something like: source_code.xpath('//script[@type="text/javascript"]') should return a list of all the scripts as objects.

Or to try and get straight to the "tickers":

[i for i in source_code.xpath('//script[@type="text/javascript"]') if 'tickers' in i.xpath('string()')]
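Once the right script is found, its text still has to be turned into data. Below is a minimal sketch of that step, with several assumptions: it uses the standard library's HTMLParser instead of lxml so it runs without extra packages, and the sample page, the window.preloaded variable name and the shape of the "tickers" data are all made up, since the real page structure is only in the pastebin.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def extract_json_after(script_text, marker):
    """Decode the first JSON object that follows `marker` in script_text."""
    start = script_text.index("{", script_text.index(marker))
    obj, _ = json.JSONDecoder().raw_decode(script_text, start)
    return obj


# Hypothetical page: the variable name and the data shape are invented.
SAMPLE_HTML = """
<html><body>
<script type="text/javascript">var other = 1;</script>
<script type="text/javascript">
  window.preloaded = {"tickers": [{"title": "Example Shoe", "url": "/example-shoe"}]};
</script>
</body></html>
"""

collector = ScriptCollector()
collector.feed(SAMPLE_HTML)
# Same filter as in the answer: keep only scripts that mention "tickers".
ticker_scripts = [s for s in collector.scripts if "tickers" in s]
data = extract_json_after(ticker_scripts[0], "window.preloaded")
for item in data["tickers"]:
    print(item["title"], item["url"])
```

This only works when the embedded value is valid JSON; a JavaScript object literal with unquoted keys or trailing commas would need a different approach.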

answered Apr 9, 2017 at 0:52

AutomaticStatic