I don't have much experience scraping data from websites. I normally use Python "requests" and "BeautifulSoup".
I need to download the table from https://publons.com/awards/highly-cited/2019/. I do the usual right click and Inspect, but the format is not the one I'm used to working with. After a bit of reading it seems to be JavaScript, and I could potentially extract the data from https://publons.com/static/cache/js/app-59ff4a.js. Other posts I read recommend Selenium and PhantomJS, but I can't modify the paths as I'm not an admin on this computer (I'm using Windows). Any idea how to tackle this? Happy to go with R if Python isn't an option.
Thanks!
1 Answer
If you monitor the web traffic via dev tools, you will see the API calls the page makes to update content. The info returned is in JSON format.
For example: page 1
import requests
r = requests.get('https://publons.com/awards/api/2019/hcr/?page=1&per_page=10').json()
You can alter the page param in a loop to get all the results.
The total number of results is indicated in the first response via r['count'], so it is easy enough to calculate the number of pages to loop over at 10 results per page. Just be sure to be polite in how you make your requests.
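For instance, with a hypothetical total of 6000 results reported by the 'count' field, the page count works out as:

```python
import math

count = 6000      # hypothetical value of r['count'] from the first response
per_page = 10     # results per page, matching the per_page query param
number_pages = math.ceil(count / per_page)
print(number_pages)  # -> 600
```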
Outline:
import math, requests

with requests.Session() as s:
    r = s.get('https://publons.com/awards/api/2019/hcr/?page=1&per_page=10').json()
    # do something with the json: parse items of interest into a list and add to a final list? Convert to a dataframe at the end?
    number_pages = math.ceil(r['count'] / 10)
    for page in range(2, number_pages + 1):
        # perhaps add a delay after every X requests
        r = s.get(f'https://publons.com/awards/api/2019/hcr/?page={page}&per_page=10').json()
        # do something with the json, as above
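Fleshing out the outline, here is one sketch of the full loop with a polite delay between requests and a dataframe built at the end. Note the 'results' key used below is an assumption about the JSON payload; inspect the actual response in dev tools and adjust the field names accordingly:

```python
import math
import time

import requests
import pandas as pd

API = 'https://publons.com/awards/api/2019/hcr/?page={page}&per_page=10'

def pages_needed(count, per_page=10):
    """Number of pages required to cover `count` results."""
    return math.ceil(count / per_page)

def fetch_all(delay=1.0):
    """Download every page and return a DataFrame.

    NOTE: the 'results' key is an assumption -- check the real JSON
    structure in dev tools and adjust to the actual field names.
    """
    rows = []
    with requests.Session() as s:
        r = s.get(API.format(page=1)).json()
        rows.extend(r['results'])          # assumed key
        for page in range(2, pages_needed(r['count']) + 1):
            time.sleep(delay)              # be polite between requests
            r = s.get(API.format(page=page)).json()
            rows.extend(r['results'])      # assumed key
    return pd.DataFrame(rows)

if __name__ == '__main__':
    df = fetch_all()
    print(df.shape)
```

Using a Session reuses the underlying TCP connection across requests, which is both faster and kinder to the server than opening a new connection per page.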