
I am scraping a website with Python, but the site is rendered with JavaScript and all of the links are generated by JavaScript. When I use requests.get(url) I only get the static source code, not the links that the JavaScript generates. Is there any way to scrape those links automatically?
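To illustrate the problem, here is a minimal, self-contained sketch with a hypothetical page: one link is present in the static markup, another would only be injected by a script running in a real browser. A plain HTTP fetch returns the markup as text and never executes the script, so only the static link is visible to a parser:

```python
from html.parser import HTMLParser

# Hypothetical page: one link is in the static markup, the other would
# be injected by JavaScript at runtime. requests.get() only ever sees
# the text below -- the script is returned verbatim, never executed.
PAGE = """
<html><body>
  <a href="/static-link">visible to requests</a>
  <script>
    // runs only in a real browser:
    // document.body.innerHTML += '<a href="/js-link">invisible</a>';
  </script>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkCollector()
parser.feed(PAGE)
print(parser.links)  # ['/static-link'] -- the JS-generated link is missing
```

This is exactly why a JavaScript-capable renderer (a real browser or headless engine) is needed before the links can be scraped.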

I also tried something like what's described here: Ultimate guide for scraping JavaScript rendered web pages. But that approach is too slow.

So is there any faster way, using Mechanize, PhantomJS or some other library? (Note: I have already tried using PyQt4, but that is too slow - I'm looking for a faster solution.)

ekhumoro
asked Apr 11, 2016 at 10:47

2 Answers


You can try PhantomJS or CasperJS.

There are also several Node wrappers built on top of PhantomJS and CasperJS; one of the most efficient and scalable is "ghost town".

answered Apr 11, 2016 at 11:15



One approach that may not be the fastest, but is the most likely to succeed, is to use Selenium. The following function should do the job: given a URL that serves JavaScript-generated content, it retrieves the dynamic website and returns its rendered HTML. Note that instead of Chrome you can use any other supported browser (e.g., Firefox, Safari or Edge). Have a look at the docs:

https://www.selenium.dev/selenium/docs/api/py/api.html#

def retrieve_html_from_js_website(url, path_to_chromedriver, threshold_waiting_time=4):
    """Load `url` in a real browser, wait for its JavaScript to run,
    and return the rendered HTML."""
    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/50.0.2661.102 Safari/537.36')
    options = webdriver.ChromeOptions()
    options.add_argument(f'user-agent={user_agent}')
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_experimental_option('detach', True)
    # Service expects the path to the chromedriver executable,
    # not the Chrome browser binary itself.
    with webdriver.Chrome(service=Service(path_to_chromedriver), options=options) as driver:
        # Note that there are many creative websites that use mechanisms
        # to prevent browsers instantiated with Selenium from crawling
        # their content. Some mechanisms are listed in the following:
        # https://piprogramming.org/articles/How-to-make-Selenium-undetectable-and-stealth--7-Ways-to-hide-your-Bot-Automation-from-Detection-0000000017.html
        driver.get(url)
        # Give the page's JavaScript time to finish rendering.
        time.sleep(threshold_waiting_time)
        return driver.page_source

From here you can perform any parsing operation, such as extracting the JavaScript-generated URLs. For this particular task I prefer Beautiful Soup, although Selenium can do the job as well.

https://beautiful-soup-4.readthedocs.io/en/latest/
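For illustration, here is a minimal sketch of that extraction step with Beautiful Soup. The `rendered_html` string is a hypothetical stand-in for the output of the function above:

```python
from bs4 import BeautifulSoup

# Hypothetical rendered HTML, standing in for the return value of
# retrieve_html_from_js_website(...).
rendered_html = """
<html><body>
  <a href="https://example.com/page1">Page 1</a>
  <a href="/relative/page2">Page 2</a>
  <a name="anchor-without-href">no link here</a>
</body></html>
"""

soup = BeautifulSoup(rendered_html, "html.parser")
# href=True skips anchors that have no href attribute.
urls = [a["href"] for a in soup.find_all("a", href=True)]
print(urls)  # ['https://example.com/page1', '/relative/page2']
```

Relative URLs like `/relative/page2` can then be resolved against the page URL with `urllib.parse.urljoin` before being fetched.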

answered Jul 20, 2022 at 15:27

