Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Here is a script that checks the links on a single web page. It doesn't always work: it reports "BAD" for some links that are actually okay, so it is only about 80% accurate. I need some help making it 100% accurate. Is there any way to improve it? I only need it to check http links, not email or ftp links.

use HTML::LinkExtor;
use LWP::Simple qw(get head);

$base_url = shift or die "not working here: $0 <start_url>\n";
$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse(get($base_url));
@links = $parser->links;
print "$base_url: \n";
foreach $linkarray (@links) {
    my @element = @$linkarray;
    my $elt_type = shift @element;
    while (@element) {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        if ($attr_value->scheme =~ /\b(ftp|https?|file)\b/) {
            print " $attr_value ", head($attr_value) ? "OK" : "BAD", "\n";
        }
    }
}

Replies are listed 'Best First'.
Re: Script is accurate 80% of time
by Zaxo (Archbishop) on Mar 27, 2002 at 16:56 UTC

    See LWP head mystery and replies, particularly merlyn's succinct explanation. Many sites deny HEAD requests (does anyone know why?), so failed ones should be retried with get().
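
The retry suggested above could be sketched roughly as follows. This is only an illustration under some assumptions: it uses LWP::UserAgent instead of the LWP::Simple calls in the original script, and the helper name link_ok and the 15-second timeout are made up for the example.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Check one URL: try HEAD first, and fall back to GET only if HEAD fails.
# $ua is any object whose head() and get() methods return objects with an
# is_success() method (LWP::UserAgent and HTTP::Response behave this way).
sub link_ok {
    my ($ua, $url) = @_;

    my $res = $ua->head($url);
    return 1 if $res->is_success;

    # Some servers refuse or mishandle HEAD, so retry with a full GET.
    $res = $ua->get($url);
    return $res->is_success ? 1 : 0;
}

# Run the check only when this file is executed as a script.
unless (caller) {
    my $ua = LWP::UserAgent->new(timeout => 15);   # timeout value is arbitrary
    for my $url (@ARGV) {
        print " $url ", (link_ok($ua, $url) ? "OK" : "BAD"), "\n";
    }
}
```

Saved under a name of your choice, it would be run with one or more URLs as arguments. GET downloads the whole body, so the fallback is slower than HEAD, but it only happens for links that failed the first attempt.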

    After Compline,
    Zaxo

      I don't know why, but the only reason I can think of makes the site operator out to be really stupid or vicious. Without HEAD, a client cannot check the state of a page cheaply and has to refetch it with GET. For users this means the site might be able to show more ads. It might also be possible to exploit this against search engines. However, according to this page, it would seem that HTTP/0.9 did not have HEAD. Though I hope there aren't *that* many HTTP/0.9 hosts out there...

      --
      perl -pe "s/\b;([st])/'\1/mg"

(crazyinsomniac) Re: Script is accurate 80% of time
by crazyinsomniac (Prior) on Mar 28, 2002 at 09:04 UTC

Case sensitive regex problem?
by RMGir (Prior) on Mar 27, 2002 at 11:55 UTC
    I'm not sure how HTML::LinkExtor works, but if it doesn't lowercase the scheme for you, you need an i on the end of that regex, as in:
    $attr_value->scheme =~ /\b(ftp|https?|file)\b/i # <---
    Otherwise, you'd miss anything like
    <A HREF="HTTP://www.perlmonks.org">
    where the protocol is in caps.

    Then again, maybe HTML::LinkExtor does that for you, and in that case I have no idea what the problem is :)
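
On the case question: as far as I know, the URI module documents that scheme() always returns the scheme in lowercase, and HTML::LinkExtor should hand back URI objects when it is given a base URL, so the /i is probably harmless but not needed. Here is a small sketch to check that, which also keeps only http and https links as the original question asked. The helper name is_http_link and the sample links are hypothetical.

```perl
use strict;
use warnings;
use URI;

# Return 1 only for http/https links. mailto:, ftp:, file: and relative
# links (which have no scheme at all) return 0.
sub is_http_link {
    my ($link) = @_;
    my $scheme = URI->new($link)->scheme;   # undef for relative URIs
    return defined $scheme && $scheme =~ /\Ahttps?\z/ ? 1 : 0;
}

# Sample links, including one with the scheme written in capitals.
for my $link ('HTTP://www.perlmonks.org',
              'mailto:monk@example.com',
              'ftp://example.com/pub/') {
    printf "%-28s %s\n", $link, is_http_link($link) ? "check" : "skip";
}
```

Anchoring the pattern with \A and \z avoids matching a scheme that merely contains "http" somewhere inside it, which the \b version in the original script would allow.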
    --
    Mike

    Edit: D'oh! Completely misread the question; I blame the lack of caffeine in my system this early :)