I'm working in Python 3.2 (newbie) on a Windows machine (though I have Ubuntu 10.04 in VirtualBox if needed, I'd prefer to work on the Windows machine).
Basically I'm able to use the http and urllib modules to scrape web pages, but only those that don't use JavaScript like document.write("<div....") to add data that isn't there when I fetch the actual page (meaning pages without real AJAX scripts).
To process those kinds of sites as well, I'm pretty sure I need a browser JavaScript engine to execute the page's scripts and give me the final result, ideally as a dict or text.
I tried to compile python-spidermonkey, but I understand it doesn't support Windows and doesn't work with Python 3.x.
Any suggestions? If anyone has done something like this before, I'd appreciate the help!
- Maybe this answer can help you: stackoverflow.com/questions/5196408/… – Danilo Bargen, Mar 17, 2011
- Doesn't help so much, but it does imply there is no real solution. – codeScriber, Mar 17, 2011
- Yeah. But maybe the mentioned spidermonkey wrapper could help. – Danilo Bargen, Mar 17, 2011
- It's not a spidermonkey wrapper, it's python-spidermonkey, and it's only for 2.7 and below and only for Mac and Linux... – codeScriber, Mar 17, 2011
3 Answers
I recommend Python's bindings to the WebKit library; here is an example. WebKit is cross-platform and is used to render web pages in Chrome and Safari. An excellent library.
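A minimal sketch of the idea, assuming PyQt4 with its QtWebKit module installed (PySide or pywebkitgtk bindings would work similarly); the URL is just a placeholder:

```python
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL, let WebKit execute its JavaScript, then grab the final HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _finished() quits the event loop

    def _finished(self, ok):
        self.html = self.mainFrame().toHtml()  # DOM serialized after scripts ran
        self.app.quit()

page = Render('http://example.com/')  # placeholder URL
print(page.html)  # post-JavaScript markup, ready for your usual parser
```

You can then feed `page.html` to whatever HTML parser you already use, since by that point the document.write() content is part of the DOM.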
Use Firebug to see exactly what request is made to get the data to display (a POST or GET URL?). I suspect there's an AJAX call that retrieves the data from the server as either XML or JSON. Just make the same AJAX call yourself and parse the data.
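For example, once Firebug's Net panel shows you the endpoint, you can hit it directly with the standard library; the URL and header below are hypothetical stand-ins for whatever the site actually uses:

```python
import json
import urllib.request

# Hypothetical endpoint: substitute the URL Firebug shows in its Net panel.
url = 'http://example.com/ajax/data?page=1'
req = urllib.request.Request(url, headers={
    'X-Requested-With': 'XMLHttpRequest',  # some servers check for this header
})
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read().decode('utf-8'))  # assuming a JSON response
print(data)  # already a dict/list, no HTML parsing needed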
Optionally, you can download Selenium for Firefox, start a Selenium server, download the page via Selenium, and get the DOM contents. MozRepl works as well, but doesn't have as much documentation since it's not widely used.
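A sketch of the Selenium route, assuming the Selenium WebDriver Python bindings and a local Firefox install; again the URL is a placeholder:

```python
from selenium import webdriver

driver = webdriver.Firefox()  # starts a real browser, so all JavaScript runs
try:
    driver.get('http://example.com/')  # placeholder URL
    html = driver.page_source  # DOM serialized after scripts have executed
finally:
    driver.quit()
print(html)
```

This is slower than calling the AJAX endpoint directly, but it works even when the page's JavaScript is too tangled to reverse-engineer.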
document.write is usually used because the content is generated on the fly, often by fetching data from a server. The result is a web app that is more about JavaScript than HTML. "Scraping" is really a matter of downloading HTML and processing it, but here there is no HTML to download; you are essentially trying to scrape a GUI program.
Most of these applications have some sort of API, often returning XML or JSON data, that you can use instead. If there isn't one, you should probably remote-control a real web browser instead.