I have a number of web pages, obtained using curl, that I am attempting to parse information from. Each page uses jQuery to transform its content once the document has loaded in the browser (via a document-ready handler), mostly setting the classes/ids of divs. The information is much easier to parse after those JavaScript functions have run.
What are my options (preferably from the command line) for executing the JavaScript content of the pages and dumping the transformed HTML?
- getfirebug.com/commandline – is this what you are looking for? – Tats_innit, May 20, 2012 at 8:41
- +1 sounds interesting :) I thought about node.js for a while, but that won't work for you =/ – Ja͢ck, May 20, 2012 at 8:44
1 Answer
To scrape dynamic web pages, don't use static download tools like curl. Instead, use a headless web browser that you can control from your programming language. The most popular tool for this is Selenium:
http://code.google.com/p/selenium/
With Selenium you can export the modified DOM tree out of the browser as HTML.
An example use case:
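Here is a minimal sketch using Selenium's Python bindings, assuming headless Chrome is available; the URL is a placeholder, and polling document.readyState is a simple heuristic for "the document-ready handlers have had a chance to run":

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    # Placeholder URL: substitute one of the pages you are scraping.
    driver.get("https://example.com/page.html")

    # Wait until the page has finished loading, so jQuery's
    # document-ready handlers have had a chance to rewrite the DOM.
    WebDriverWait(driver, 10).until(
        lambda d: d.execute_script("return document.readyState") == "complete"
    )

    # Serialize the transformed DOM back out as HTML.
    print(driver.execute_script("return document.documentElement.outerHTML"))
finally:
    driver.quit()
```

Saved as, say, dump_dom.py, this runs from the command line (python dump_dom.py > page.html), and the dumped HTML can then be fed to whatever parser you were pointing at the curl output.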