I don't have much experience scraping data from websites. I normally use Python "requests" and "BeautifulSoup".
I need to download the table from https://publons.com/awards/highly-cited/2019/. I do the usual right click and Inspect, but the format is not the one I'm used to working with. After a bit of reading it seems to be JavaScript, and I could potentially extract the data from https://publons.com/static/cache/js/app-59ff4a.js. Other posts I read recommend Selenium and PhantomJS, but I can't modify the paths as I'm not an admin on this computer (I'm using Windows). Any idea how to tackle this? Happy to go with R if Python isn't an option.
Thanks!
1 Answer
If you monitor the web traffic via dev tools, you will see the API calls the page makes to update content. The info returned is in JSON format.
For example: page 1
import requests
r = requests.get('https://publons.com/awards/api/2019/hcr/?page=1&per_page=10').json()
You can alter the page param in a loop to get all the results.
The total number of results is indicated in the first response via r['count'], so it is easy enough to calculate the number of pages to loop over at 10 results per page. Just be sure to be polite in how you make your requests.
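For instance, with a hypothetical total of 6000 results reported by the 'count' field, the page count works out as:

```python
import math

count = 6000      # hypothetical value of r['count'] from the first response
per_page = 10     # results per page, matching the per_page query param
number_pages = math.ceil(count / per_page)
print(number_pages)  # -> 600
```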
Outline:
import math, requests

with requests.Session() as s:
    r = s.get('https://publons.com/awards/api/2019/hcr/?page=1&per_page=10').json()
    # do something with the json: parse items of interest into a list and add to a final list? Convert to a dataframe at the end?
    number_pages = math.ceil(r['count'] / 10)
    for page in range(2, number_pages + 1):
        # perhaps add a delay after every X requests
        r = s.get(f'https://publons.com/awards/api/2019/hcr/?page={page}&per_page=10').json()
        # do something with the json, as above
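Fleshing out the outline, here is one sketch of the full loop with a polite delay between requests and a dataframe built at the end. Note the 'results' key used below is an assumption about the JSON payload; inspect the actual response in dev tools and adjust the field names accordingly:

```python
import math
import time

import requests
import pandas as pd

API = 'https://publons.com/awards/api/2019/hcr/?page={page}&per_page=10'

def pages_needed(count, per_page=10):
    """Number of pages required to cover `count` results."""
    return math.ceil(count / per_page)

def fetch_all(delay=1.0):
    """Download every page and return a DataFrame.

    NOTE: the 'results' key is an assumption -- check the real JSON
    structure in dev tools and adjust to the actual field names.
    """
    rows = []
    with requests.Session() as s:
        r = s.get(API.format(page=1)).json()
        rows.extend(r['results'])          # assumed key
        for page in range(2, pages_needed(r['count']) + 1):
            time.sleep(delay)              # be polite between requests
            r = s.get(API.format(page=page)).json()
            rows.extend(r['results'])      # assumed key
    return pd.DataFrame(rows)

if __name__ == '__main__':
    df = fetch_all()
    print(df.shape)
```

Using a Session reuses the underlying TCP connection across requests, which is both faster and kinder to the server than opening a new connection per page.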