Pilot has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks. I am fairly new to programming in Perl and I have a relatively easy problem that I am not sure how to start solving. Basically, what I need to do is write a Perl script that will recursively go down through an entire web directory and check all of the links to make sure they are relative. It has to look in every folder and make sure that none of the links point outside of www.blah.blah.com. If the links are not relative and point elsewhere, I need the code to print those for me and where they are located. Suggestions are GREATLY appreciated.

Thanks,
Pilot

update (broquaint): title change (was Recursive)


Replies are listed 'Best First'.
Re: Recursively traversing and checking for relative links
by Zaxo (Archbishop) on Feb 11, 2003 at 17:57 UTC

    There are two modules which will do most of the work for you. File::Find will handle the directory recursion. HTML::Parser or HTML::TokeParser will help decipher the anchor tags to get you the links.

    Try a first cut at a program working from the module doc examples, and show us what you get. We'll be glad to help.
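    As a starting point, the traversal half might look like this rough sketch (the sub name is illustrative, and it assumes the site lives in a local directory tree):

```perl
use strict;
use warnings;
use File::Find;

# Collect every .html/.htm file under a root directory.
sub find_html_files {
    my ($root) = @_;
    my @files;
    find(
        sub {
            # File::Find chdirs into each directory; $_ is the bare
            # file name and $File::Find::name is the full path.
            push @files, $File::Find::name if -f && /\.html?$/i;
        },
        $root
    );
    return @files;
}
```

    find() calls the wanted sub for every entry under the root, so there is no need to write the recursion by hand.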

    After Compline,
    Zaxo

Re: Recursively traversing and checking for relative links
by bart (Canon) on Feb 11, 2003 at 18:00 UTC
    Is that a "local" directory tree? I'm thinking of running a script on the site itself...

    You can use File::Find to go through the whole tree. For each file, you can use HTML::LinkExtor, or one of the other available link checker modules (I think there's even one on this site), to extract the links. Next, all you have to do is check the actual links.

    If HTML::LinkExtor doesn't work, it's easy to throw something together with HTML::TokeParser or HTML::TokeParser::Simple, go through all the tags, and check the attributes for only the proper tags.

    Before I go too far and do all the work myself, do check if you can make it work yourself. I'm willing to look into it, if I know it will be appreciated.
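    For that last step, checking the actual links, one way to do it without any extra modules is a pattern test on each extracted URL: anything with a scheme (http:, mailto:, ...) or a leading // is absolute, everything else is relative. A rough sketch; HTML::LinkExtor would hand you the URLs to feed it:

```perl
use strict;
use warnings;

# Returns true if a URL is absolute (has a scheme such as http:
# or is protocol-relative, starting with //); false otherwise.
sub is_absolute {
    my ($url) = @_;
    return $url =~ m{^(?:[a-z][a-z0-9+.-]*:|//)}i ? 1 : 0;
}
```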

Re: Recursively traversing and checking for relative links
by jammin (Novice) on Feb 11, 2003 at 18:07 UTC
    For the actual link checking I would recommend that you use HTML::TokeParser; for the directory scanning you will need to use recursion. Something like:
     sub getDirectoryList {
         ##################### Get list of files in a dir
         my $dir  = shift;
         my $list = [];
         opendir(DIR, $dir) or die "Cannot open $dir: $!";
         while (defined(my $file = readdir(DIR))) {
             unless ($file eq "." || $file eq "..") {
                 push @$list, $file;
             }
         }
         closedir DIR;
         return $list;
     }

     sub getFileList {
         ################ Get list of files from dir
         my ($dir, $fileList) = @_;
         my $dirList = getDirectoryList($dir);
         foreach my $file (@$dirList) {
             print "checking $dir - $file\n";
             ## if it's a directory, recurse into it
             if (-d "$dir/$file") {
                 getFileList("$dir/$file", $fileList);
             }
             else {
                 my ($size, $mtime) = (stat "$dir/$file")[7, 9];
                 $fileList->{"$dir/$file"} = {
                     "size" => $size,
                     "mod"  => $mtime,
                 };
             }
         }
     }
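     Once you have the file list, each file still needs to be scanned for off-site links. A rough, self-contained sketch of that part (the sub name is illustrative, and a real parser such as HTML::TokeParser is more robust than this regex):

```perl
use strict;
use warnings;

# Scan one HTML file and report every href/src value that is an
# absolute link, along with the file name and line number.
sub report_absolute_links {
    my ($path) = @_;
    my @hits;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        while ($line =~ /(?:href|src)\s*=\s*["']([^"']+)["']/gi) {
            my $url = $1;
            # a scheme or a leading // means the link is not relative
            if ($url =~ m{^(?:[a-z][a-z0-9+.-]*:|//)}i) {
                push @hits, "$path:$.: $url";
            }
        }
    }
    close $fh;
    print "$_\n" for @hits;
    return @hits;
}
```

     Running it over the keys of the hashref that getFileList fills in would then print each offending link with its location.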