
I am scraping a website with Python, but the site is rendered with JavaScript and all of the links are generated by JavaScript. When I use requests.get(url) I only get the static source code, not the links that the JavaScript generates. Is there any way to scrape those links automatically?
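To illustrate the problem, here is a minimal, self-contained sketch with a hypothetical page: one link is present in the static markup, another would only be injected by a script running in a real browser. A plain HTTP fetch returns the markup as text and never executes the script, so only the static link is visible to a parser:

```python
from html.parser import HTMLParser

# Hypothetical page: one link is in the static markup, the other would
# be injected by JavaScript at runtime. requests.get() only ever sees
# the text below -- the script is returned verbatim, never executed.
PAGE = """
<html><body>
  <a href="/static-link">visible to requests</a>
  <script>
    // runs only in a real browser:
    // document.body.innerHTML += '<a href="/js-link">invisible</a>';
  </script>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkCollector()
parser.feed(PAGE)
print(parser.links)  # ['/static-link'] -- the JS-generated link is missing
```

This is exactly why a JavaScript-capable renderer (a real browser or headless engine) is needed before the links can be scraped.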

I also tried something like what's described here: Ultimate guide for scraping JavaScript rendered web pages. But that approach is too slow.

So is there any faster way, using Mechanize, PhantomJS or some other library? (Note: I have already tried using PyQt4, but that is too slow - I'm looking for a faster solution.)

ekhumoro
asked Apr 11, 2016 at 10:47

2 Answers


You can try PhantomJS or CasperJS.

There are also several Node wrappers built on top of PhantomJS and CasperJS; one of the most efficient and scalable is "ghost town".

answered Apr 11, 2016 at 11:15



One approach that may not be the fastest, but is the most likely to succeed, is to use Selenium. The following function should do the job: given a URL that serves JavaScript-generated content, it retrieves the dynamic website and returns its rendered HTML. Note that instead of Chrome you can use any other supported browser (e.g., Firefox, Safari or Edge). Have a look at the docs:

https://www.selenium.dev/selenium/docs/api/py/api.html#

def retrieve_html_from_js_website(url, path_to_chromedriver, threshold_waiting_time=4):
    """Load `url` in a real browser, wait for its JavaScript to run,
    and return the rendered HTML."""
    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/50.0.2661.102 Safari/537.36')
    options = webdriver.ChromeOptions()
    options.add_argument(f'user-agent={user_agent}')
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_experimental_option('detach', True)
    # Service expects the path to the chromedriver executable,
    # not the Chrome browser binary itself.
    with webdriver.Chrome(service=Service(path_to_chromedriver), options=options) as driver:
        # Note that there are many creative websites that use mechanisms
        # to prevent browsers instantiated with Selenium from crawling
        # their content. Some mechanisms are listed in the following:
        # https://piprogramming.org/articles/How-to-make-Selenium-undetectable-and-stealth--7-Ways-to-hide-your-Bot-Automation-from-Detection-0000000017.html
        driver.get(url)
        # Give the page's JavaScript time to finish rendering.
        time.sleep(threshold_waiting_time)
        return driver.page_source

From here you can perform any parsing operation, such as extracting the JavaScript-generated URLs. For this particular task I prefer Beautiful Soup, although Selenium can do the job as well.

https://beautiful-soup-4.readthedocs.io/en/latest/
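For illustration, here is a minimal sketch of that extraction step with Beautiful Soup. The `rendered_html` string is a hypothetical stand-in for the output of the function above:

```python
from bs4 import BeautifulSoup

# Hypothetical rendered HTML, standing in for the return value of
# retrieve_html_from_js_website(...).
rendered_html = """
<html><body>
  <a href="https://example.com/page1">Page 1</a>
  <a href="/relative/page2">Page 2</a>
  <a name="anchor-without-href">no link here</a>
</body></html>
"""

soup = BeautifulSoup(rendered_html, "html.parser")
# href=True skips anchors that have no href attribute.
urls = [a["href"] for a in soup.find_all("a", href=True)]
print(urls)  # ['https://example.com/page1', '/relative/page2']
```

Relative URLs like `/relative/page2` can then be resolved against the page URL with `urllib.parse.urljoin` before being fetched.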

answered Jul 20, 2022 at 15:27

