Can anyone point me to a good Python screen-scraping library that can handle pages driven by JavaScript (ideally one with good documentation and tutorials)? I'd like to know what options exist, but above all which is easiest to learn and gets results fastest. Does anyone have experience with this? I've heard a little about SpiderMonkey, but maybe there are better options?
Specifically, I use BeautifulSoup and Mechanize to get as far as the link below, but I need a way to open the JavaScript popup, submit data, and download and parse the results shown in the popup.
<a href="javascript:openFindItem(12510109)" onclick="s_objectID=&quot;javascript:openFindItem(12510109)_1&quot;;return this.s_oc?this.s_oc(e):true">Find Item</a>
I'd like to implement this with Google App Engine and Django. Thanks!
3 Answers
What I usually do is automate an actual browser in these cases, and grab the processed HTML from there.
Edit:
Here's an example of automating Internet Explorer to navigate to a URL and grab the title and location after the page loads.
from win32com.client import Dispatch
from ctypes import Structure, pointer, windll
from ctypes import c_int, c_long, c_uint
import win32con
import pywintypes


class POINT(Structure):
    _fields_ = [('x', c_long),
                ('y', c_long)]

    def __init__(self, x=0, y=0):
        self.x = x
        self.y = y


class MSG(Structure):
    _fields_ = [('hwnd', c_int),
                ('message', c_uint),
                ('wParam', c_int),
                ('lParam', c_int),
                ('time', c_int),
                ('pt', POINT)]


def wait_until_ready(ie):
    # Pump pending Windows messages until the browser's ReadyState is 4 (complete).
    pMsg = pointer(MSG())
    NULL = c_int(win32con.NULL)

    while True:
        while windll.user32.PeekMessageW(pMsg, NULL, 0, 0, win32con.PM_REMOVE) != 0:
            windll.user32.TranslateMessage(pMsg)
            windll.user32.DispatchMessageW(pMsg)

        if ie.ReadyState == 4:
            break


ie = Dispatch("InternetExplorer.Application")
ie.Visible = True
ie.Navigate("http://google.com/")
wait_until_ready(ie)

# %-formatting keeps these lines valid under both Python 2 and Python 3.
print("title: %s" % ie.Document.Title)
print("location: %s" % ie.Document.location)
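Once the processed HTML has been grabbed, the next step is usually to find the links that trigger the popup. As a minimal sketch, assuming the page source is already available as a string, the standard-library `html.parser` and `re` modules can pull the numeric argument out of `javascript:openFindItem(...)` links like the one in the question. The names `FindItemLinkParser` and `find_item_ids` and the sample markup are made up for illustration and are not part of any library mentioned here.

```python
import re
from html.parser import HTMLParser

# Matches hrefs such as "javascript:openFindItem(12510109)" and captures the ID.
FIND_ITEM_RE = re.compile(r"^javascript:openFindItem\((\d+)\)$")


class FindItemLinkParser(HTMLParser):
    """Collects the numeric IDs passed to openFindItem() in <a href> values."""

    def __init__(self):
        super().__init__()
        self.item_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get("href") or ""
        match = FIND_ITEM_RE.match(href.strip())
        if match:
            self.item_ids.append(int(match.group(1)))


def find_item_ids(html):
    """Return the openFindItem() IDs found in the given HTML string."""
    parser = FindItemLinkParser()
    parser.feed(html)
    parser.close()
    return parser.item_ids


# Hypothetical sample markup modelled on the link in the question.
sample = (
    '<a href="javascript:openFindItem(12510109)">Find Item</a> '
    '<a href="/other">Other</a>'
)
print(find_item_ids(sample))
```

This only locates the IDs; actually opening the popup and submitting the form still needs something that executes JavaScript, such as the automated browser above.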
I use the Python bindings to WebKit to render basic JavaScript, and Chickenfoot for more advanced interactions. See this WebKit example for more info.
You can also use a "programmatic web browser" named Spynner. I found this to be the best solution, and it is relatively easy to use.