shajiindia has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to run code originally written by Steve Oualline.

Description : One of the most vexing problems facing a webmaster is making sure that all the links on their website are correct. Internal links are difficult to deal with. Every time a file is added, removed, or changed on your website, there is the possibility of generating dead links. External links are even worse. Not only are they not under your control, but they disappear without a moment's notice. What's needed is a way of automatically checking a site for links that just don't work.

When I try to run the script, I get the following error. I am using Windows 7.

C:\demo>site-check.pl http://www.codeinside.com
ERROR: No such file http://www.codeinside.com

Please advise on how to run the program successfully. Where am I going wrong?

    #
    # Usage: site-check.pl <top-file>
    #
    # Checks for:
    #   1. Broken links
    #   2. Orphaned files
    #
    use strict;
    use warnings;

    use HTML::SimpleLinkExtor;
    use LWP::Simple;
    use File::Basename;
    use File::Spec::Functions;
    use File::Find ();    # Generated by find2pl

    # for the convenience of &wanted calls,
    # including -eval statements:
    use vars qw/*name *dir *prune/;
    *name  = *File::Find::name;
    *dir   = *File::Find::dir;
    *prune = *File::Find::prune;

    my %file_seen      = ();  # True if we've seen a file
    my @external_links = ();  # List of external links
    my @bad_files      = ();  # Files we did not see
    my @full_file_list = ();  # List of all the files

    ########################################################
    # wanted -- Called by the find routine, this returns
    #       true if the file is wanted. As a side effect
    #       it records any normal file seen in "full_file_list".
    ########################################################
    sub wanted {
        if (-f "$name") {
            push(@full_file_list, $name);
        }
        return (1);
    }

    ########################################################
    # process_file($file)
    #
    # Read an html file and extract the tags.
    #
    # If the file does not exist, put it in the list of
    # bad files.
    ########################################################
    no warnings 'recursion';  # Turn off recursion warning

    sub process_file($);      # Needed because this is recursive

    sub process_file($) {
        my $file_name = shift;              # The file to process
        my $dir_name  = dirname($file_name);

        # Did we do it already
        if ($file_seen{$file_name}) {
            return;
        }
        $file_seen{$file_name} = 1;

        if (! -f $file_name) {
            push(@bad_files, $file_name);
            return;
        }

        # Skip non-html files
        if (($file_name !~ /\.html$/) and ($file_name !~ /\.htm$/)) {
            return;
        }

        # The parser object to extract the list
        my $extractor = HTML::SimpleLinkExtor->new();

        # Parse the file
        $extractor->parse_file($file_name);

        # The list of all the links in the file
        my @all_links = $extractor->links();

        # Check each link
        foreach my $cur_link (@all_links) {
            # Is the link external
            if ($cur_link =~ /^http:\/\//) {
                # Put it on the list of external links
                push(@external_links,
                    { file => $file_name, link => $cur_link });
                next;
            }

            # Remove the "#name" part of the link
            # We don't check that
            if ($cur_link =~ /([^#]*)#/) {
                $cur_link = $1;
            }
            if ($cur_link eq "") {
                next;
            }

            # Get the name of the file
            my $next_file = "$dir_name/$cur_link";

            # Remove any funny characters in the name
            $next_file = File::Spec->canonpath($next_file);

            # Follow the links in this file
            process_file($next_file);
        }
    }

    # Turn on deep recursion warning
    use warnings 'recursion';

    if ($#ARGV != 0) {
        print STDERR "Usage: $0 <top-file>\n";
        exit (8);
    }

    # Top level file
    my $top_file = $ARGV[0];

    if (-d $top_file) {
        $top_file .= "/index.html";
    }

    if (! -f $top_file) {
        print STDERR "ERROR: No such file $top_file\n";
        exit (8);
    }

    # Scan all the links
    process_file($top_file);

    print "Broken Internal Links\n";
    foreach my $cur_file (sort @bad_files) {
        print "\t$cur_file\n";
    }

    # Traverse desired filesystems
    File::Find::find({wanted => \&wanted}, dirname($ARGV[0]));

    print "Orphan Files\n";
    foreach my $cur_file (sort @full_file_list) {
        if (not defined($file_seen{$cur_file})) {
            print "\t$cur_file\n";
        }
    }

    print "Broken External Links\n";
    foreach my $cur_file (sort @external_links) {
        if (not (head($cur_file->{link}))) {
            print "\t$cur_file->{file} => $cur_file->{link}\n";
        }
    }
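The heart of the script's internal-link handling is the step that strips any "#fragment" part and then resolves the link relative to the referring page's directory. Here is a minimal, self-contained sketch of just that step using only core modules; resolve_link is a hypothetical helper name introduced for illustration, not a function in the original script:

    use strict;
    use warnings;
    use File::Spec;
    use File::Basename;

    # Strip the "#fragment" part (the script does not check fragments),
    # then resolve the link relative to the referring file's directory.
    sub resolve_link {
        my ($referring_file, $link) = @_;
        $link =~ s/#.*$//;            # drop any fragment
        return undef if $link eq '';  # pure-fragment link ("#top") stays on the same page
        my $dir = dirname($referring_file);
        return File::Spec->canonpath("$dir/$link");
    }

    # Note: canonpath() only cleans up the path syntactically; it does
    # NOT collapse "..", so 'site/docs/../about.html' is returned as-is.
    print resolve_link('site/docs/index.html', '../about.html#team'), "\n";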

Re: Website link checker
by BrowserUk (Patriarch) on Jan 24, 2012 at 07:35 UTC

    The problem is that the script is written to recurse through your own website by running it on your web server, against the filesystem.

    You are trying to run it against someone else's website over the internet. That is not what it is designed for, and it won't do that.
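    In other words, the script takes a local file or directory path, not a URL. Assuming your site's files live in a local directory such as C:\demo\website (a hypothetical path for illustration), the invocation would look like:

        # Run against a local copy of the site, not a URL:
        perl site-check.pl C:\demo\website\index.html

        # Or point it at the directory; the script appends /index.html itself:
        perl site-check.pl C:\demo\website

    Passing http://www.codeinside.com fails the script's -f file-existence test, which is exactly the "ERROR: No such file" message you saw.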


      Thanks BrowserUk for your timely help.
Re: Website link checker
by Anonymous Monk on Jan 24, 2012 at 08:43 UTC

    http://linkchecker.sourceforge.net/

    http://validator.w3.org/checklink

      Thanks Anonymous Monk