doc has asked for the wisdom of the Perl Monks concerning the following question:

I have a widget that needs to get HTML source code from a variety of websites. LWP is fine for getting most Web pages. It does not, however, get the 'real' HTML source for pages like:

<script>window.location = 'http://somewhere.else'</script>

What I want to do is get the terminal HTML source that represents the page(s) that a real user gets to see finally rendered in their browser once the javascript redirection dust settles as it were.

While parsing javascript like the simple example above is of course easy, there are an infinite number of variations on this, and handling them all takes a full-blown JavaScript/DOM engine. Since these are already written for the major browsers, there seems no point in not just hooking into them.

For IE there is OLE or SAMIE, but this constrains you to Windows for the OS, which is what we are currently using.

Is there an equivalent for Mozilla so I can run this widget on *nix?

Replies are listed 'Best First'.
Re: Getting HTML Source Code where Javascript Redirects Foil LWP
by Corion (Patriarch) on Oct 24, 2003 at 12:44 UTC

    If you're desperate enough to follow the road of generic JavaScript execution instead of the simple template-based approach, take a look at JavaScript.pm, which uses the Mozilla SpiderMonkey JavaScript engine. Then all you have to do is supply an appropriate browser DOM to the JavaScript, and you're all set.

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: Getting HTML Source Code where Javascript Redirects Foil LWP
by bunnyman (Hermit) on Oct 24, 2003 at 19:51 UTC

    Mozilla can be embedded - get info here.

    There is a project underway to get Mozilla supporting OLE with the same interface as IE has. Go here. This won't help you on *nix though.

      Will the Mozilla OLE support include the patented IE "IRemoteExploit" call?

        I think there's still a bug or two in that part. Too bad, you can't have everything with free software.

Re: Getting HTML Source Code where Javascript Redirects Foil LWP
by cbraga (Pilgrim) on Oct 24, 2003 at 17:39 UTC
    LWP does get the real source for the page; it's just that the real source includes a JavaScript redirect. That is a very poor approach from the person who wrote the page, considering that search engines in general won't be able to follow it, not to mention that it's The Evil Way To Do It. :)
    <sig>ESC[78;89;13p ESC[110;121;13p</sig>
Re: Getting HTML Source Code where Javascript Redirects Foil LWP
by Willard B. Trophy (Hermit) on Oct 24, 2003 at 18:44 UTC
    Maybe the use of JavaScript is a hint that the owners don't want you to scrape their content. Check their terms and conditions.

    At $firm, we have a lawyer who spends quite a bit of his time chasing down people who don't want to pay for our commercial services. Some surprisingly large companies are getting quite a shock when they get a cease and desist.

    --
    bowling trophy thieves, die!

Re: Getting HTML Source Code where Javascript Redirects Foil LWP
by Art_XIV (Hermit) on Oct 24, 2003 at 19:21 UTC

    I might be missing something here, but...

    Why not have your code LWP-slurp the URL that it finds within the script tags?

    You will probably need to configure your app with some rules on when to do so, but that should be half the fun.
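
    A minimal sketch of that idea, assuming the redirect is a plain quoted-string assignment to window.location (or location.href) like the one in the question. The helper name find_js_redirect is made up for illustration, and the regex is only a heuristic: it will miss URLs built from variables, string concatenation, or anything computed at run time, which is exactly where a real JavaScript engine becomes necessary.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): return the first URL assigned to
# window.location / location.href inside a <script> element, or undef.
sub find_js_redirect {
    my ($html) = @_;

    # Walk over every <script>...</script> body in the page.
    while ($html =~ m{<script\b[^>]*>(.*?)</script>}gis) {
        my $script = $1;

        # Match: [window.|document.]location[.href] = 'url' or "url"
        if ($script =~ m{\b(?:window\.|document\.)?location(?:\.href)?\s*=\s*(['"])(.*?)\1}s) {
            return $2;
        }
    }
    return;
}

my $page   = q{<script>window.location = 'http://somewhere.else'</script>};
my $target = find_js_redirect($page);

print defined $target ? "redirects to $target\n" : "no redirect found\n";

# The extracted URL could then be fetched in turn, for example with
# LWP::Simple::get($target), repeating until no further redirect is found.
```

    A cap on the number of hops is probably one of the "rules" worth adding, so that two pages redirecting to each other do not loop forever.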

Re: Getting HTML Source Code where Javascript Redirects Foil LWP
by petdance (Parson) on Oct 25, 2003 at 02:45 UTC
    The brand new WWW::Mechanize that I put up last night starts down this road. So far, it only finds tags in an A onClick, but it's a start.

    xoxo,
    Andy