in reply to more screen scraping with embedded Javascript

I think it's probably easier to just reverse engineer the request with HTTP::Recorder or, more low-level, log your browser's actual requests through a basic HTTP proxy.
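
For the recording route, the HTTP::Recorder docs show it plugging into HTTP::Proxy; a minimal setup along those lines (the log file path is just an example) looks like:

    use HTTP::Proxy;
    use HTTP::Recorder;

    my $proxy = HTTP::Proxy->new();

    # HTTP::Recorder acts as the proxy's user agent and logs each
    # request it passes through as a WWW::Mechanize script
    my $agent = HTTP::Recorder->new();
    $agent->file("/tmp/myfile");    # where the generated script goes

    $proxy->agent($agent);
    $proxy->start();                # point your browser at this proxy

Then you just drive the site by hand in your browser and read the script it wrote.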

update: The SpiderMonkey JavaScript engine only does JavaScript. It has no concept of a browser: that means no document, no DOM, no HTML forms. A simple document.write() will not work because there is no document object. You might be able to extract the script from the HTML page, hand it a fake document object, and have the script write to that (provided it doesn't try to handle any events, or read from or write to the DOM, or anything like that), and then have that document object return its content to you.
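
A rough sketch of that fake-document trick, assuming the JavaScript::SpiderMonkey bindings from CPAN (the script extraction here is deliberately naive):

    use JavaScript::SpiderMonkey;

    my $html = $page_source;    # the fetched page, however you got it
    # naively grab the first inline script block
    my ($script) = $html =~ m{<script[^>]*>(.*?)</script>}si;

    my $js = JavaScript::SpiderMonkey->new();
    $js->init();

    # fake document whose write() just accumulates its arguments
    my $written = '';
    my $doc = $js->object_by_path("document");
    $js->function_set("write", sub { $written .= join('', @_) }, $doc);

    $js->eval($script);
    $js->destroy();

    # $written now holds whatever the script tried to document.write()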

Then you will have to figure out where the written pieces go in your HTML form, pass the result to WWW::Mechanize, convince WWW::Mechanize that the page you've just created is actually located on a remote server (not that hard, probably), and submit the form.
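
The "convince WWW::Mechanize" part may be as simple as fetching the real page first and then swapping in your rewritten HTML; WWW::Mechanize documents an update_html() method for exactly this kind of in-place replacement (the form details below are placeholders):

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('http://example.com/form-page');   # real base URI comes from here

    my $patched = $mech->content();
    # ... splice the document.write() output into $patched here ...

    $mech->update_html($patched);   # replace the page content, keep the base URI
    $mech->submit_form(
        form_number => 1,           # placeholder: whichever form the script built
        fields      => { foo => 'bar' },
    );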

Repeat until you've reached the last page.

Actually, what you want is a complete automated browser. I hear IE can be controlled via OLE or something like that, but I don't know how well that works. I'm not familiar with any automation options for Mozilla.
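
For the IE route, the usual Win32::OLE incantation looks something like this (Windows only, and untested on my end):

    use Win32::OLE;

    my $ie = Win32::OLE->new('InternetExplorer.Application')
        or die "Can't start IE: ", Win32::OLE->LastError();

    $ie->{Visible} = 1;
    $ie->Navigate('http://example.com/page-with-script.html');

    # wait for the page (and its JavaScript) to finish loading
    sleep 1 while $ie->{Busy} || $ie->{ReadyState} != 4;

    # the DOM here is post-JavaScript, document.write() output included
    my $html = $ie->{Document}{documentElement}{outerHTML};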

updated: fixed some typos


Re^2: more screen scraping with embedded Javascript
by geektron (Curate) on Oct 25, 2004 at 22:21 UTC
    well, the biggest issue is that i need to take an array of elements ( defined in the javascript ) and somehow re-engineer the  document.write() calls to build links. i can easily build the links once i have the array, but i'm unclear from reading the docs for HTTP::Recorder how i'll be able to extract the javascript-based array from the page content ... i don't need to fill out forms; i need to rebuild parts of a page using the javascript contents of a remote page.