how to scrape search results if returned in javascript (using python)

Question 1

The site I want to scrape populates returns using JavaScript.

Can I simply call the script somehow and work with its results? (Then without pagination, of course.) I don't want to run the entire thing to scrape the resulting formatted HTML, but the raw source is blank.

Have a look: http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0

The source of the return is simply

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/templates/base_template.xsl"?>
<content>
 <head>
 <SCRIPT type="text/javascript" src="/js/searchResultsView.js"></SCRIPT> 
 </head>
 <whitebox>
 <div id = "hits"></div> 
 </whitebox>
</content>

I would prefer simple Python tools.

Question 2

I'm only just looking into this, but try PhantomJS and Selenium WebDriver. I'll try and get you an answer.

Question 3

I downloaded Selenium and ChromeDriver.

from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0')
for e in driver.find_elements_by_class_name('result'):
 link = e.find_element_by_tag_name('a')
 print(link.text.encode('ascii', 'ignore'), link.get_attribute('href').encode('ascii', 'ignore'))
driver.quit()

If you're using Chrome, you can inspect the page attributes using F12, which is pretty useful.

Question 4

Indeed you can do that with Python. You either need python-ghost or Selenium. I prefer the latter combined with PhantomJS, much lighter and simpler to install, and easy to use:

Install phantomjs with npm (Node Package Manager):

apt-get install nodejs
npm install phantomjs

install selenium:

pip install selenium

and get the resulted page like this, and parse it with beautifulSoup (or another lib) as usual:

from BeautifulSoup4 import BeautifulSoup as bs
from selenium import webdriver
client = webdriver.PhantomJS()
client.get("http://foo")
soup = bs(client.page_source)

Question 5

In nutshell: you can't do this with Python only.

As you've said, this is populated by javascript (jquery), which adds content on-the fly.

You can try running script with nodejs locally and at some point dump DOM as html. But you need to dig into js code anyway.

Question 6

Thanks, then can you help me (or help rephrase the question) how to run the right piece of JavaScript e.g. with an AppleScript call ('tell application Google Chrome to execute ....js', but how exactly?). If you have a look at the .js file, I am happy with its returns in 'resp,' without the pagination I only need to run this only once for each year 1998-2014.

Question 7

nodejs is js interpreter that you can install and run js scripts with it. Just look at it, it's no harder to use than python shell/interpreter.

Question 8

Will do, what I am not sure how I can specify the function arguments for this remote function which is built to work with queries in the page that contained it. Thanks!

Jeffrey Tang 7936 silver badges17 bronze badges · Accepted Answer · 2014-03-25 03:35:31Z

I downloaded Selenium and ChromeDriver.

from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0')
for e in driver.find_elements_by_class_name('result'):
 link = e.find_element_by_tag_name('a')
 print(link.text.encode('ascii', 'ignore'), link.get_attribute('href').encode('ascii', 'ignore'))
driver.quit()

If you're using Chrome, you can inspect the page attributes using F12, which is pretty useful.

CollectivesTM on Stack Overflow

how to scrape search results if returned in javascript (using python)

3 Answers 3

Comments

Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

3 Answers 3

Comments

Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related