geektron has asked for the wisdom of the Perl Monks concerning the following question:

I know this question has been asked a few times before (How to scrape an HTTPS website that has JavaScript, Thwarting Screen Scrapers), but neither thread seems to hit on the part I need.

I've been tasked with screen-scraping what originally looked like an easy page, but it turns out the entire page is built with calls to JavaScript's document.write(). I suspect the engineers were trying to prevent screen scraping in the first place ...

I can get the information I need out of the page with a little reverse engineering and by parsing an array of arrays (defined in JavaScript). I've read through JavaScript::SpiderMonkey to see whether it will DWIM ... but I can't tell from the perldoc whether I can use JavaScript::SpiderMonkey to extract arrays from the page code, or whether I'm going to have to resort to some brute-force parsing of the page.

Is JavaScript::SpiderMonkey what I'm looking for in this case? Or should I stick with some combination of WWW::Mechanize, LWP, etc.?

Replies are listed 'Best First'.
Re: more screen scraping with embedded Javascript
by Joost (Canon) on Oct 25, 2004 at 20:52 UTC
    I think it's probably easier to just reverse engineer the request using HTTP::Recorder, or (more low-level) to log your browser's actual requests with a basic HTTP proxy.
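
    A minimal proxy/recorder setup, loosely following the HTTP::Recorder synopsis (the port and log file path are just examples):

        use HTTP::Proxy;
        use HTTP::Recorder;

        # listen locally; point the browser's proxy setting at localhost:8080
        my $proxy = HTTP::Proxy->new( port => 8080 );

        # log every request that passes through the proxy
        my $agent = HTTP::Recorder->new( file => "/tmp/requests.log" );
        $proxy->agent($agent);

        # click through the site in the browser, then read the log
        $proxy->start();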

    update: The SpiderMonkey JavaScript engine only does JavaScript. It has no concept of a browser: that means no document, no DOM, no HTML forms. A simple document.write() will not work because there is no document object. You might be able to extract the script from the HTML page, hand it a fake document object, and have the script write to that (provided it doesn't try to handle any events, or read from or write to the DOM, or anything like that), and then have that document object return its content to you.
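
    If you want to try that, something along these lines might do it with JavaScript::SpiderMonkey (an untested sketch; check the module's docs for the exact object_by_path()/function_set() calling convention):

        use JavaScript::SpiderMonkey;

        # $script holds the JavaScript pulled out of the page's <script> tags
        # (extracted elsewhere; "page_script.js" is just a placeholder)
        my $script = do {
            local $/;
            open my $fh, '<', 'page_script.js' or die $!;
            <$fh>;
        };

        my $js = JavaScript::SpiderMonkey->new();
        $js->init();

        # fake "document" whose write() appends to a Perl buffer;
        # assumes function_set() can attach a method to the object
        # returned by object_by_path()
        my $buffer = "";
        my $doc    = $js->object_by_path("document");
        $js->function_set( "write", sub { $buffer .= join "", @_ }, $doc );

        $js->eval($script) or warn "JS error: $@";
        $js->destroy();

        # $buffer now holds whatever the script tried to document.write()
        print $buffer;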

    Then you will have to figure out where the written pieces go in your HTML form, pass the result into WWW::Mechanize, convince WWW::Mechanize that the page you've just created is actually located on a remote server (not that hard, probably), and submit the form.

    Repeat until you've reached the last page.
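
    If you go that route, WWW::Mechanize's update_html() is probably the easiest way to swap the rebuilt page in; a rough sketch (the URL and the placeholder marker are made up):

        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();
        $mech->get("http://example.com/start.html");

        # suppose $generated_markup holds the HTML the page's JavaScript
        # would have written (reconstructed elsewhere)
        my $generated_markup = "...";

        # splice it into the fetched page where the script output belongs
        ( my $new_html = $mech->content() ) =~
            s/<!--\s*placeholder\s*-->/$generated_markup/;

        # hand the patched page back to Mech and submit the now-visible form
        $mech->update_html($new_html);
        $mech->submit_form( form_number => 1 );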

    Actually, what you want is a complete, automated browser. I hear IE can be controlled via OLE or something like that, but I don't know how well that works. I'm not familiar with any automation options for Mozilla.

    updated: fixed some typos

      Well, the biggest issue is that I need to take an array of elements (defined in the JavaScript) and somehow re-engineer the document.write() calls to build links. I can easily build the links once I have the array, but it's unclear to me from reading the HTTP::Recorder docs how I'd be able to extract the JavaScript-based array from the page content ... I don't need to fill out forms; I need to rebuild parts of a page using the JavaScript contents of a remote page.
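
      My fallback is probably brute force: fetch the page with LWP and pull the array literal out with a regex. A rough sketch, assuming the array is assigned to a variable (linkData is a made-up name) and holds simple quoted strings:

          use LWP::Simple qw(get);

          my $html = get("http://example.com/page.html")
              or die "couldn't fetch page";

          # grab the JavaScript array-of-arrays literal, e.g.
          #   var linkData = [ ["foo", "/foo.html"], ["bar", "/bar.html"] ];
          my ($literal) = $html =~ /var\s+linkData\s*=\s*(\[.+?\])\s*;/s
              or die "array not found";

          # pull out each [ "text", "url" ] pair
          my @rows;
          while ( $literal =~ /\[\s*"([^"]*)"\s*,\s*"([^"]*)"\s*\]/g ) {
              push @rows, [ $1, $2 ];
          }

          # rebuild the links document.write() would have produced
          printf qq{<a href="%s">%s</a>\n}, $_->[1], $_->[0] for @rows;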
Re: more screen scraping with embedded Javascript
by johnnywang (Priest) on Oct 26, 2004 at 02:17 UTC
    If you're on Win32, you can use Win32::OLE, and especially Win32::IE::Mechanize, which starts IE and drives it (i.e., follows links, submits forms, clicks buttons, etc.). I've used it to run tests against an application that uses JavaScript. Since you're really driving IE, JavaScript/browser events are all handled, but I did not try to access a dynamically constructed document, as in your case. There is also a project called Samie, which basically does the same thing (it uses Win32::OLE directly).
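
    For the archives, a minimal Win32::IE::Mechanize session looks roughly like this (the URL is a placeholder, and as noted above I haven't verified how it behaves with document.write()):

        use Win32::IE::Mechanize;

        # drives a real IE instance, so the page's JavaScript actually runs
        my $ie = Win32::IE::Mechanize->new( visible => 1 );
        $ie->get("http://example.com/page.html");

        # content() returns the HTML as IE sees it
        my $html = $ie->content;
        print $html;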
      Unfortunately for this task, I'm not on Win32.

      It's something I'd need to handle via cron on a *nix server.