morgon has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks,

for many years I have used WWW::Mechanize with excellent results to scrape a weekly magazine's website for offline consumption on my Palm Pilot.

Now all of a sudden they have changed their table of contents ...

Where there used to be nice HTML, there is now an ugly mix of HTML fragments and JavaScript that builds the page dynamically (a lot of document.write), which unfortunately breaks my conversion scripts completely...

So what I want now is a way to capture the html that the javascript generates - i.e. a tool that interprets the javascript and saves the resulting document-html in a file.

Any ideas on how to achieve this?

Replies are listed 'Best First'.
Re: getting rid of javascript
by alexlc (Beadle) on May 02, 2009 at 04:51 UTC

    The only way to really do this is to harness an actual web browser to load the page, interpret all the js dhtml stuff, and then get it from there.
    Selenium is pretty good for this, and the perl module utilizing it looks pretty good as well, though I have not used it.
    Test::WWW::Selenium
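
A rough, untested sketch of how that might look with WWW::Selenium (the class Test::WWW::Selenium builds on). It assumes a Selenium RC server is already listening on localhost:4444 and that Firefox is installed; the helper name rendered_html and the '/' path are made up for illustration.

```perl
use strict;
use warnings;

# Open $path in the browser that Selenium drives and return the page
# source as the browser reports it, i.e. after the page's scripts ran.
# $sel is expected to behave like a started WWW::Selenium object.
sub rendered_html {
    my ( $sel, $path ) = @_;
    $sel->open($path);
    return $sel->get_html_source;
}

# Only runs when a site URL is given on the command line.
if ( my $site = shift @ARGV ) {
    require WWW::Selenium;
    my $sel = WWW::Selenium->new(
        host        => 'localhost',    # assumed Selenium RC host
        port        => 4444,           # assumed Selenium RC port
        browser     => '*firefox',
        browser_url => $site,
    );
    $sel->start;
    print rendered_html( $sel, '/' );
    $sel->stop;
}
```

The printed output can be redirected to a file and fed to the existing conversion scripts.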

    -- AlexLC
Re: getting rid of javascript
by spx2 (Deacon) on May 02, 2009 at 07:04 UTC

    I've seen at least 20 nodes here on PerlMonks where people are asking for this (actually, Google says there are about 307). Selenium is an option, but I think it's easier to write your own Selenium than to use the existing one, with all of its ugly configuration issues.

    I had the same problem when almost all my code for parsing some pages was rendered useless because the site switched to AJAX.

      I have posted a Q&A node addressing this question so that future posts asking/answering this same general question can be pointed there.
Re: getting rid of javascript
by whakka (Hermit) on May 02, 2009 at 16:56 UTC
    I'm on Windows and have always (lazily?) avoided the JavaScript issue with Win32::IE::Mechanize. I've also seen Mozilla::Mechanize get mentions. I say lazy because these use the same interface as WWW::Mechanize, so it's straightforward to port code.

    With these modules you just let the page render and take the html from there. It's ugly and slow but usually works.
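
A minimal sketch of that approach, assuming Win32::IE::Mechanize is installed on a Windows machine with Internet Explorer. The helper save_page and the output name toc.html are invented for the example; the helper only relies on get() and content(), so a plain WWW::Mechanize object should work with it too.

```perl
use strict;
use warnings;

# Fetch $url with any object offering WWW::Mechanize-style get() and
# content() methods, write the resulting HTML to $file, and return the
# number of characters written.
sub save_page {
    my ( $mech, $url, $file ) = @_;
    $mech->get($url);
    my $html = $mech->content;
    open my $fh, '>:encoding(UTF-8)', $file
        or die "Cannot write $file: $!";
    print {$fh} $html;
    close $fh or die "Cannot close $file: $!";
    return length $html;
}

# Only runs when a URL is given on the command line.
if ( my $url = shift @ARGV ) {
    require Win32::IE::Mechanize;
    my $ie = Win32::IE::Mechanize->new( visible => 0 );
    save_page( $ie, $url, 'toc.html' );    # hypothetical output file
    $ie->close;
}
```

Because the browser does the rendering, the saved file should contain whatever the document.write calls produced, at the cost of starting IE for every run.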

Re: getting rid of javascript
by Anonymous Monk on May 02, 2009 at 06:28 UTC
    You could try WWW::Scripter; it's for scripting web sites that have scripts :)