jonjacobmoon has asked for the wisdom of the Perl Monks concerning the following question:

I have been asked to write a script that is a sort of limited crawler that acts like a browser. That is to say, the script will be looking for a specific link on a page and reporting whether or not it is present.

This is the easy part.

The hard part is that these pages that the crawler will go to may present the link in any of a variety of ways. It may be in a frame, it may be generated by JavaScript, it may come via a meta-refresh, or it may be rendered on the page in some unforeseen way that a browser knows how to handle. In short, I need my program to look at the final rendered HTML source just as a browser would.

To illustrate: if I have LWP go to http://www.foo.com and foo.com has frames, then I need to check the source of each frame, not just the frameset source.

I have some ideas of how to do this by following links and adding exceptions for JavaScript, frames, meta-refresh, and any others I can come up with. But browsers already have all these exceptions handled, so if my script runs as if it were a browser, I won't have to add exceptions as they turn up.

Does anyone have an easy way to do this that goes beyond, but may well include, HTML::Parser and LWP? I have researched this and know I can do it with an HTML::Parser/LWP combination where I follow certain links, but as I said, if the script can act like a browser, I don't need to worry about following links to get the source for the page the user would eventually see.


I admit it, I am Paco.

(cLive ;-) Re: Browser Emulation
by cLive ;-) (Prior) on Feb 02, 2002 at 23:18 UTC
    The easiest way to emulate Internet Explorer is to add something to the script that randomly freezes your machine, and then forces you to reboot.

    Repeat as necessary.

    my $time = time;
    1 while $time + 30 > time;    # hang for 30 seconds
    `shutdown -r now`;            # then reboot

    Of course, you may not be able to do this unless you run it as root, but that should give you some idea.

    cLive ;-)

Re (tilly) 1: Browser Emulation
by tilly (Archbishop) on Feb 03, 2002 at 03:15 UTC
    Occasionally I think of stupid hacks to accomplish things. Most of the time I remind myself that they are stupid, and they never see the light of day. But just to be silly I will tell you a way that you can get your script to act just like a browser, complete with proper handling of frames, JavaScript, meta tags, cookies and the like.

    Go find an HTTP proxy written in Perl. (merlyn's columns would be a good place to start.) Modify it to parse the documents passing through to find links in them. (HTML::LinkExtor may help here.) Then set a copy of IE to use that as your proxy server. Use the OLE modules to drive IE around the web.
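
    For the last step, here is a minimal sketch of driving IE from Perl via Win32::OLE (untested; the ProgID and the Busy polling loop are the standard COM idiom, and the URL is a placeholder):

    use strict;
    use Win32::OLE;

    # Start an Internet Explorer instance via COM
    my $ie = Win32::OLE->new('InternetExplorer.Application')
        or die "Can't start IE: " . Win32::OLE->LastError;
    $ie->{Visible} = 1;

    # IE fetches the page and all of its frames; your proxy sees
    # every request it makes along the way.
    $ie->Navigate('http://www.foo.com');
    sleep 1 while $ie->{Busy};    # wait for the load to finish

    $ie->Quit;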

    Voila! :-)

    If you just need an answer to a question, this might be acceptable. But I (obviously) wouldn't put this into production. (Says the man who once had a temporary hack which used IE to produce PDFs left in production for half a year...)

Re: Browser Emulation
by trs80 (Priest) on Feb 02, 2002 at 21:24 UTC
    I have worked with MonkeyWrench, and now there is HTTP::TestEngine, which works with it. HTTP::TestEngine records your 'clicks' in a browser and then lets you play them back.
    Have you looked at those already?
Re: Browser Emulation
by gellyfish (Monsignor) on Feb 03, 2002 at 12:49 UTC

    This is a pretty simple-minded example of how you might do this using LWP::UserAgent and HTML::Parser:

    #!/usr/bin/perl -w
    use strict;

    use LWP::UserAgent;
    use HTML::Parser;
    use URI;

    my $starturl = shift || die "No url supplied\n";
    my $baseuri  = URI->new($starturl);
    my @urls     = ($starturl);

    my $agent  = LWP::UserAgent->new;
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ \&start, "tagname, attr" ],
    );
    $agent->agent("Gelzilla/666");

    # Fetch each queued URL; the start handler below queues any
    # frame sources it finds, so frame content gets fetched too.
    while ( my $url = shift @urls ) {
        my $request = HTTP::Request->new( GET => $url );
        my $result  = $agent->request($request);
        if ( $result->is_success ) {
            print $result->as_string;
            $parser->parse( $result->content );
        }
        else {
            print "Error: " . $result->status_line . "\n";
        }
    }

    # Called for every start tag; queues the src of any frame,
    # resolved relative to the start URL.
    sub start {
        my ( $tag, $attr ) = @_;
        if ( $tag eq 'frame' ) {
            my $thisuri = URI->new( $attr->{src} );
            push @urls, $thisuri->abs($baseuri);
        }
    }
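
    The start handler is also the natural place to catch meta-refresh redirects. Here is a sketch of an extended handler (untested; the content attribute parsing is deliberately simplistic, and the iframe case is my own addition):

    sub start {
        my ( $tag, $attr ) = @_;
        if ( $tag eq 'frame' or $tag eq 'iframe' ) {
            push @urls, URI->new( $attr->{src} )->abs($baseuri);
        }
        elsif ( $tag eq 'meta'
            and lc( $attr->{'http-equiv'} || '' ) eq 'refresh' )
        {
            # content looks like: 5; url=http://www.foo.com/
            if ( ( $attr->{content} || '' ) =~ /url\s*=\s*["']?([^"'\s>]+)/i ) {
                push @urls, URI->new($1)->abs($baseuri);
            }
        }
    }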

    Of course you are going to have to make your own arrangements for displaying things properly :)

    /J\

      Wow, that was way too easy :)

      Major ++ for that. I was expecting a partial answer, but you pushed it nearly all the way to the end. This is very close to what I wanted. I have to do more tests to see if it will handle all possible cases (or at least as many as I can come up with), but it looks like a great solution.

      Thanks.


      I admit it, I am Paco.
Re: Browser Emulation
by drifter (Scribe) on Feb 02, 2002 at 21:22 UTC
    There are no easy answers for this; I wrote an HTTP crawler once and it wasn't simple. What I'd suggest is switching from LWP to HTTP::GHTTP, and looking at an existing module or script, for example HTTP::SimpleLinkChecker.
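
    For what it's worth, HTTP::SimpleLinkChecker keeps the interface tiny; something like this (untested, from memory of its docs):

    use HTTP::SimpleLinkChecker qw(check_link);

    my $url  = 'http://www.foo.com/';
    my $code = check_link($url);    # HTTP status code, or undef on failure

    print defined $code
        ? "$url -> $code\n"
        : "$url -> could not check: $HTTP::SimpleLinkChecker::ERROR\n";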
Re: Browser Emulation
by Cody Pendant (Prior) on Feb 02, 2002 at 22:34 UTC
    these pages that the crawler will go to may present the link in any of a variety of ways. It may be in a frame, it may be generated by JavaScript, or it may come via a meta-refresh
    Can I ask you to clarify what you mean by "link"?

    It seems from what you wrote above that what you're actually saying is that the page may contain the URL in any form, not a link as such.

    A Meta-Refresh would take you to the other URL, but isn't technically a link. Also, a JavaScript statement like window.location='http://URL.com' would take a JavaScript-enabled browser to the new URL, but again, isn't a link.

    Both these examples require that the URL, as a string, be present, but a link requires that it be present and surrounded with the right HTML code.

    Apart from recursively testing frames, this shouldn't be hard to do.
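
    To make the distinction concrete, here is one way to test for both conditions with HTML::LinkExtor (a sketch; the target URL and the exact-match comparison are placeholders):

    use strict;
    use HTML::LinkExtor;

    my $target = 'http://www.foo.com/';
    my $html   = do { local $/; <> };    # page source from a file or stdin

    # Present anywhere in the source as a plain string?
    my $as_string = index( $html, $target ) >= 0;

    # Present as an actual link, i.e. inside a link attribute?
    my $as_link = 0;
    my $extor   = HTML::LinkExtor->new( sub {
        my ( $tag, %attr ) = @_;
        $as_link = 1 if grep { $_ eq $target } values %attr;
    } );
    $extor->parse($html);
    $extor->eof;

    printf "as string: %s, as link: %s\n",
        ( $as_string ? "yes" : "no" ), ( $as_link ? "yes" : "no" );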

    --
    Weaselling out of things is important. It's what separates us from the animals ... except the weasel.

Re: Browser Emulation
by Zaxo (Archbishop) on Feb 02, 2002 at 21:31 UTC

    Maybe the Inline::Java family of modules can help with rendering the javascripted bits.

    Warning, I've never tried it.

    Update: Inline::Java should be able to parse JavaScript syntax. You would need to provide it with a navigator class, OnThisNThat methods, etc., which I think could be written in Perl with that module. Inline is a remarkable namespace.

    After Compline,
    Zaxo

      Java and JavaScript are completely different languages. However, Netscape Navigator (I don't know about Mozilla) has fairly good bridging between the two.

      Also, it may be possible to perlify Mozilla's JavaScript engine (which is designed to be used as a library separate from Mozilla, IIRC). That would certainly be a worthwhile, and probably rather difficult, project.

      Update: I actually did Recall Correctly. The JavaScript interpreter Mozilla uses is called JSRef, AKA SpiderMonkey. There is also Rhino, which is pure Java, and thus probably harder for you to use.
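
      There is in fact a JavaScript::SpiderMonkey module on CPAN that wraps JSRef; a minimal sketch from memory (the method names may have drifted, so check the docs):

      use JavaScript::SpiderMonkey;

      my $js = JavaScript::SpiderMonkey->new();
      $js->init();    # set up the runtime and global object

      # eval() returns false and sets $@ on a JavaScript error
      $js->eval(q{ var url = "http://www.foo.com/"; })
          or warn "JS error: $@";

      $js->destroy();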

      Update: Fixed links.

      TACCTGTTTGAGTGTAACAATCATTCGCTCGGTGTATCCATCTTTG ACACAATGAATCTTTGACTCGAACAATCGTTCGGTCGCTCCGACGC