I'm working in Python 3.2 (newbie) on a Windows machine (though I have Ubuntu 10.04 in VirtualBox if needed, I'd prefer to work on the Windows machine).
Basically I'm able to use the http and urllib modules to scrape web pages, but only those that don't use JavaScript like document.write("<div....") to add data that isn't there when I fetch the actual page (meaning pages without real AJAX scripts).
To process those kinds of sites as well, I'm pretty sure I need a browser JavaScript engine to execute the page's scripts and give me the final result, ideally as a dict or text.
I tried to compile python-spidermonkey, but I understand it doesn't support Windows and doesn't work with Python 3.x.
Any suggestions? If anyone has done something like this before, I'd appreciate the help!
- Maybe this answer can help you: stackoverflow.com/questions/5196408/… – Danilo Bargen, Mar 17, 2011
- Doesn't help so much, but it does imply there is no real solution. – codeScriber, Mar 17, 2011
- Yeah. But maybe the mentioned spidermonkey wrapper could help. – Danilo Bargen, Mar 17, 2011
- It's not a spidermonkey wrapper, it's python-spidermonkey, and it's only for 2.7 and below and only for Mac and Linux... – codeScriber, Mar 17, 2011
3 Answers
I recommend Python's bindings to the WebKit library; here is an example. WebKit is cross-platform and is used to render web pages in Chrome and Safari. An excellent library.
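A minimal sketch of the idea, assuming PyQt4 with its QtWebKit module installed (PySide or pywebkitgtk bindings would work similarly); the URL is just a placeholder:

```python
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL, let WebKit execute its JavaScript, then grab the final HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _finished() quits the event loop

    def _finished(self, ok):
        self.html = self.mainFrame().toHtml()  # DOM serialized after scripts ran
        self.app.quit()

page = Render('http://example.com/')  # placeholder URL
print(page.html)  # post-JavaScript markup, ready for your usual parser
```

You can then feed `page.html` to whatever HTML parser you already use, since by that point the document.write() content is part of the DOM.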
Use Firebug to see exactly what request is made to get the data to display (a POST or GET URL?). I suspect there's an AJAX call that retrieves the data from the server as either XML or JSON. Just make the same AJAX call yourself and parse the data.
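For example, once Firebug's Net panel shows you the endpoint, you can hit it directly with the standard library; the URL and header below are hypothetical stand-ins for whatever the site actually uses:

```python
import json
import urllib.request

# Hypothetical endpoint: substitute the URL Firebug shows in its Net panel.
url = 'http://example.com/ajax/data?page=1'
req = urllib.request.Request(url, headers={
    'X-Requested-With': 'XMLHttpRequest',  # some servers check for this header
})
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read().decode('utf-8'))  # assuming a JSON response
print(data)  # already a dict/list, no HTML parsing needed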
Optionally, you can download Selenium for Firefox, start a Selenium server, download the page via Selenium, and get the DOM contents. MozRepl works as well, but doesn't have as much documentation since it's not widely used.
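A sketch of the Selenium route, assuming the Selenium WebDriver Python bindings and a local Firefox install; again the URL is a placeholder:

```python
from selenium import webdriver

driver = webdriver.Firefox()  # starts a real browser, so all JavaScript runs
try:
    driver.get('http://example.com/')  # placeholder URL
    html = driver.page_source  # DOM serialized after scripts have executed
finally:
    driver.quit()
print(html)
```

This is slower than calling the AJAX endpoint directly, but it works even when the page's JavaScript is too tangled to reverse-engineer.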
document.write is usually used because the content is generated on the fly, often by fetching data from a server. The result is a web app that is more about JavaScript than HTML. "Scraping" is really a matter of downloading HTML and processing it, but here there is no HTML to download; you are essentially trying to scrape a GUI program.
Most of these applications have some sort of API, often returning XML or JSON data, that you can use instead. If there isn't one, you should probably remote-control a real web browser instead.