I am working on a project in which I need to crawl several websites and gather different kinds of information from them: text, links, images, and so on.
I am using Python for this. I have tried BeautifulSoup on plain HTML pages and it works, but I am stuck on sites that rely heavily on JavaScript, where most of the information is generated inside the <script> tags.
Any ideas how to do this?
Comment (Ehvince, Mar 31, 2014): and another resource: stackoverflow.com/questions/22624255/…
Comment (Ehvince, Mar 31, 2014): as a side note, selenium is much more lightweight than Ghost.
4 Answers
First of all, scraping and parsing JavaScript-heavy pages is not trivial. It can, however, be vastly simplified if you use a headless web client instead, which will execute and render everything for you just like a regular browser would.
The only difference is that its main interface is an API rather than a GUI.
For example, you can use PhantomJS, or recent versions of Chrome and Firefox, which both support headless mode.
For a more complete list of headless browsers check here.
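As a sketch of the headless approach, driving headless Chrome through Selenium might look like the following. This assumes Selenium 4+ with Chrome and a matching chromedriver installed; the URL is a placeholder.

```python
def get_rendered_html(url):
    """Return the HTML of `url` after JavaScript has executed."""
    # Imported inside the function so the sketch stays importable
    # even where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")   # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source            # HTML *after* scripts have run
    finally:
        driver.quit()

# Usage (requires a Chrome install):
#   html = get_rendered_html("http://example.com")
```

The returned `page_source` can then be fed to BeautifulSoup exactly like static HTML.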
If there is a lot of javascript dynamic load involved in the page loading, things get more complicated.
Basically, you have 3 ways to crawl the data from the website:
- use the browser's developer tools to see which AJAX requests are made during page load, then simulate those requests in your crawler. You will probably need the help of the json and requests modules.
- use tools that drive a real browser, like selenium. In this case you don't care how the page is loaded: you get what a real user sees. Note: you can use a headless browser here too.
- see if the website provides an API (e.g. the Walmart API)
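The first option above, replaying an AJAX request, can be sketched as follows. The endpoint URL is hypothetical, and a canned JSON string stands in for the network call so the parsing step is visible on its own; with the requests module the fetch itself would be `data = requests.get(api_url).json()`.

```python
import json

# Suppose DevTools shows the page calling a JSON endpoint such as
#   https://example.com/api/products?page=1        (hypothetical URL)
# Replaying that request yields structured data directly -- no HTML parsing.
fake_response = '{"products": [{"name": "Widget", "price": 9.99}]}'

data = json.loads(fake_response)             # same step as response.json()
for item in data["products"]:
    print(item["name"], item["price"])
```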
Also take a look at the Scrapy web-scraping framework. It doesn't execute JavaScript either, but it is really the best tool in the web-scraping world I've ever worked with.
Also see these resources:
- Web-scraping JavaScript page with Python
- Scraping javascript-generated data using Python
- web scraping dynamic content with python
- How to use Selenium with Python?
- Headless Selenium Testing with Python and PhantomJS
- selenium with scrapy for dynamic page
Hope that helps.
To get you started with selenium and BeautifulSoup:
Install phantomjs with npm (Node Package Manager):
apt-get install nodejs
npm install phantomjs
install selenium:
pip install selenium
and get the resulting page like this, then parse it with BeautifulSoup as usual:
from bs4 import BeautifulSoup as bs
from selenium import webdriver
client = webdriver.PhantomJS()
client.get("http://foo")
soup = bs(client.page_source, "html.parser")
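Once the rendered HTML is in hand, the usual BeautifulSoup queries apply, matching the text/links/images from the question. A self-contained sketch, where a small inline document stands in for a real page:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the rendered page_source of a real site.
html = '<p>Hi <a href="/about">about</a> <img src="/logo.png"></p>'
soup = BeautifulSoup(html, "html.parser")

links = [a["href"] for a in soup.find_all("a", href=True)]    # all link targets
images = [img["src"] for img in soup.find_all("img", src=True)]  # all image sources
text = soup.get_text(" ", strip=True)                          # visible text
print(links, images, text)
```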
A very fast way would be to iterate through all the tags and get textContent
This is the JS snippet:
var page = ""; var all = document.getElementsByTagName("*"); for (var tag of all) page += tag.textContent;
or in selenium/python:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://ranprieur.com")
pagetext = driver.execute_script(
    'var page = ""; var all = document.getElementsByTagName("*"); '
    'for (var tag of all) page += tag.textContent; return page;')
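For pages that don't need JavaScript at all, a rough standard-library analogue of the same "collect all the text" idea works without any browser. The inline HTML here is a stand-in for a fetched page:

```python
from html.parser import HTMLParser

class TextGrabber(HTMLParser):
    """Collects every text node encountered while parsing."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

grabber = TextGrabber()
grabber.feed("<div><h1>Title</h1><p>Body text</p></div>")
pagetext = "".join(grabber.parts)
print(pagetext)  # -> TitleBody text
```

Unlike the browser snippet, each text node is counted once rather than repeated for every enclosing tag.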