Scraping elements generated by JavaScript queries using Python

Question 1

I am trying to access the text of an element whose content is generated by JavaScript, for example the number of Twitter shares shown on this site.

I've tried using urllib and PyQt to obtain the HTML of the page. However, because the content is generated by JavaScript, it is not present in the response that urllib/PyQt returns. I am currently using Selenium for this task, but it is slower than I would like.

Is it possible to get this data without opening the page in a browser?

This question has been asked before, but the answers I found are either C#-specific or link to a solution that has since gone dead.

Question 2

Selenium is usually the best bet for this, unless you can isolate the specific JavaScript request that fetches the data.

Question 3

Working example:

import json
import urllib.parse

import requests

url = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"
# URL-encode the page address so it can be passed as a query parameter
encoded = urllib.parse.quote_plus(url)
# encoded = urllib.quote_plus(url) # for python 2 replace previous line by this (and use `import urllib`)
j = requests.get('https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded).text
obj = json.loads(j)
# Add the Twitter clicks and the Twitter shares reported for the page
print(obj['clicks']['twitter'] + obj['shares']['twitter'])
# => 5008

Explanation:

Inspecting the webpage, you can see that it makes a request to this URL:

https://count-server.sharethis.com/v2.0/get_counts?url=https%3A%2F%2Fdaphnecaruanagalizia.com%2F2017%2F10%2Fcrook-schembri-court-today-pleading-not-crook%2F&cb=stButtons.processCB&wd=true

If you paste it into your browser, you can see the full response. Playing with the URL a bit shows that removing the extra parameters (cb and wd) gives you clean JSON.

So to get the Twitter counts for another page, you just have to replace the url parameter of the request with the URL of that page.
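The two steps described above (building the request URL and dealing with the extra parameters) can be sketched without touching the network. This is only an illustration: the helper names `build_count_url` and `strip_jsonp` are made up here, and the idea that the `cb` parameter makes the server wrap its JSON in a function call (JSONP) is an assumption based on the parameter name, not something confirmed above.

```python
import json
import re
from urllib.parse import urlencode

COUNT_ENDPOINT = "https://count-server.sharethis.com/v2.0/get_counts"


def build_count_url(page_url):
    """Build the count-server URL for page_url, leaving out the cb/wd extras."""
    # urlencode percent-encodes the value the same way quote_plus does
    return COUNT_ENDPOINT + "?" + urlencode({"url": page_url})


def strip_jsonp(body):
    """Parse a body that is either plain JSON or JSON wrapped as callback(...)."""
    # Assumed wrapper shape: some.callback.name({...}) with an optional ';'
    match = re.fullmatch(r"\s*[\w.$]+\s*\((.*)\)\s*;?\s*", body, re.DOTALL)
    payload = match.group(1) if match else body
    return json.loads(payload)


page = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"
print(build_count_url(page))

# Made-up response bodies, only to show both shapes being handled
print(strip_jsonp('stButtons.processCB({"shares": {"twitter": 3}});'))
print(strip_jsonp('{"shares": {"twitter": 3}}'))
```

The string returned by `build_count_url` is what you would pass to `requests.get`, and `strip_jsonp` would only be needed if you keep the `cb` parameter in the request.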

Question 4

Thanks a lot, I can confirm that this worked. Could you give me some more information on how you found the request? I would like to do the same thing for the number of comments. I am using Chrome and was looking for the request in the Network tab. Are there any tricks that help in identifying the desired request?

Question 5

You could filter on "javascript", "XHR", and "WS". What I did was open the "Response" tab and scroll through the requests until I found "twitter".
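If scrolling through the responses by hand gets tedious, the same search can be done on a HAR file exported from the browser's network tab. The sketch below assumes the standard HAR layout (`log.entries[].request.url` and `log.entries[].response.content.text`); the function name `find_requests` and the sample data are made up for illustration.

```python
import base64


def find_requests(har, keyword):
    """Return the URLs of HAR entries whose response body contains keyword."""
    needle = keyword.lower()
    hits = []
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text") or ""
        # HAR files mark base64-encoded bodies with an "encoding" field
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        if needle in text.lower():
            hits.append(entry["request"]["url"])
    return hits


# Made-up HAR fragment; a real one would come from json.load() on the exported file
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://example.com/style.css"},
                "response": {"content": {"text": "body { margin: 0 }"}},
            },
            {
                "request": {"url": "https://example.com/counts?url=page"},
                "response": {"content": {"text": '{"shares": {"twitter": 12}}'}},
            },
        ]
    }
}

print(find_requests(sample_har, "twitter"))
```

Searching for a value you can see on the page (a share count, a comment count) instead of a name like "twitter" is another way to narrow down which request carries the data.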

Question 6

For the number of comments, you should check Disqus.

Question 7

Also, on the previous page, daphnecaruanagalizia.com, you can see the number of comments per post. It is not generated by JavaScript, so it is quite easy to get. Check out BeautifulSoup.

Question 8

You could do something like the following: launch a Selenium web browser, load the page, then pass driver.page_source to the BeautifulSoup library (unfortunately I cannot test this at work with firewalls in place):

from bs4 import BeautifulSoup
# driver is assumed to be an already-launched Selenium WebDriver with the page loaded
soup = BeautifulSoup(driver.page_source, 'html.parser')
shares = soup.find('span', {'class': 'st_twitter_hcount'}).find('span', {'class': 'stBubble_hcount'})

Accepted answer (the working example above) by Loïc, 2017-10-23 21:27:45Z

