The site I want to scrape populates its results using JavaScript.
Can I simply call the script somehow and work with its results directly (and then without pagination, of course)? I don't want to render the entire page just to scrape the resulting formatted HTML, but the raw source is blank.
Have a look: http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0
The source that comes back is simply:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/templates/base_template.xsl"?>
<content>
<head>
<SCRIPT type="text/javascript" src="/js/searchResultsView.js"></SCRIPT>
</head>
<whitebox>
<div id = "hits"></div>
</whitebox>
</content>
I would prefer simple Python tools.
I'm only just looking into this, but try PhantomJS and Selenium WebDriver. I'll try and get you an answer. – Jeffrey Tang, Mar 25, 2014 at 3:07
3 Answers
I downloaded Selenium and ChromeDriver.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0')
# Each hit is rendered into an element with class "result"; print its link text and URL.
for e in driver.find_elements(By.CLASS_NAME, 'result'):
    link = e.find_element(By.TAG_NAME, 'a')
    print(link.text.encode('ascii', 'ignore'), link.get_attribute('href').encode('ascii', 'ignore'))
driver.quit()
If you're using Chrome, you can inspect the page's elements and their attributes with the developer tools (F12), which is pretty useful for finding class names like the one above.
Indeed you can do that with Python. You need either python-ghost or Selenium. I prefer the latter combined with PhantomJS: it is much lighter, simpler to install, and easy to use.
Install phantomjs with npm (Node Package Manager):
apt-get install nodejs
npm install phantomjs
Install Selenium:
pip install selenium
Then get the resulting page like this, and parse it with BeautifulSoup (or another library) as usual:
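# The package is imported as bs4 (there is no module named BeautifulSoup4)
from bs4 import BeautifulSoup as bs
from selenium import webdriver

client = webdriver.PhantomJS()
client.get("http://foo")
# page_source holds the DOM serialized after the page's JavaScript has run
soup = bs(client.page_source, "html.parser")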
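If you would rather not pull in another library for the parsing step, the same idea can be sketched with the standard library. This assumes each hit sits in an element with class `result` containing an `<a>` (the class name comes from the other answer; the sample markup, `ResultLinkParser`, and `SAMPLE` are made up for illustration). In practice you would feed it `client.page_source`.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ResultLinkParser(HTMLParser):
    """Collect (text, href) for every <a> nested inside a class="result" element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._depth = 0        # nesting depth inside a "result" element (0 = outside)
        self._in_link = False  # True while reading the text of an <a>
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1
        elif "result" in (attrs.get("class") or "").split():
            self._depth = 1
        if self._depth and tag == "a":
            self._in_link = True
            self._href = attrs.get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        if tag == "a" and self._in_link:
            self.links.append(("".join(self._text).strip(), self._href))
            self._in_link = False
        self._depth -= 1

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)


# Hypothetical rendered markup standing in for client.page_source.
SAMPLE = """
<div id="hits">
<div class="result"><a href="/entity/1">First <b>hit</b></a><br></div>
<div class="other"><a href="/skip">skip</a></div>
<div class="result"><a href="/entity/2">Second hit</a></div>
</div>
"""

parser = ResultLinkParser()
parser.feed(SAMPLE)
for text, href in parser.links:
    print(text, href)
```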
In a nutshell: you can't do this with Python alone.
As you've said, the page is populated by JavaScript (jQuery), which adds content on the fly.
You can try running the script locally with Node.js and dumping the DOM as HTML at some point, but you will need to dig into the JS code either way.
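As a starting point for digging into the JS, here is a small sketch that pulls the script URLs out of the raw XML quoted in the question, so you know which file to read. It only locates the script; what that script requests or how it builds the results is not shown here and would have to be read from the file itself. `script_urls` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# The raw page source exactly as quoted in the question.
RAW_SOURCE = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/templates/base_template.xsl"?>
<content>
<head>
<SCRIPT type="text/javascript" src="/js/searchResultsView.js"></SCRIPT>
</head>
<whitebox>
<div id = "hits"></div>
</whitebox>
</content>"""

PAGE_URL = "http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0"


def script_urls(xml_bytes, base_url):
    """Return absolute URLs of every script element that has a src attribute."""
    root = ET.fromstring(xml_bytes)
    return [
        urljoin(base_url, el.get("src"))
        for el in root.iter()
        # XML tag names are case-sensitive, and this page uses upper-case SCRIPT
        if el.tag.lower() == "script" and el.get("src")
    ]


urls = script_urls(RAW_SOURCE, PAGE_URL)
print(urls)
```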