Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

My script fetches the filename and HTML title for all my HTML files. My problem now is finding all the URLs. Here is an example of a URL and link in one of my web pages:
<A HREF="http://mylink/index.html/"><FONT SIZE="-1"><STRONG>LINK NAME</STRONG></FONT></A>
Here is my attempt (in my subroutine) at getting the info, but I'm struggling with the href part:
sub wanted {
    if ( $_ =~ /\.html?$/ ) {
        my $name = $File::Find::name;
        open( F, $name ) or die "$!: $name\n";
        while ( $line = <F> ) {
            if ( ($line =~ /<title>(.+)<\/title>/i)
                 && ($line =~ /<a href(.+   ##not sure here???
            {
                $ct++;
                print "FILE = $File::Find::name TITLE = $1 URL = $2\n";
            }
        }
        close F;
    }
}
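To show roughly where I'm stuck, here is a stripped-down, runnable sketch of what I'm after. It assumes the title and the link sit on the same line and that the href is double-quoted, which is probably too fragile:

use strict;
use warnings;
use File::Find;

my $ct = 0;
find( \&wanted, '.' );

sub wanted {
    return unless /\.html?$/;
    my $name = $File::Find::name;
    open my $fh, '<', $_ or die "$!: $name\n";
    while ( my $line = <$fh> ) {
        # Both patterns must match the same line -- fragile.
        if ( my ($title) = $line =~ m{<title>(.+?)</title>}i ) {
            if ( my ($url) = $line =~ m{<a\s+href\s*=\s*"([^"]+)"}i ) {
                $ct++;
                print "FILE = $name TITLE = $title URL = $url\n";
            }
        }
    }
    close $fh;
}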

Re: Getting URL info
by Abigail-II (Bishop) on Jul 22, 2002 at 16:15 UTC
    Please use one of the many HTML parsing modules from CPAN. There is even one whose main purpose is specifically what you want to do. (It has something like 'Linkextractor' in the name).
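    If the module I am thinking of is HTML::LinkExtor, a minimal sketch (the file name here is just a placeholder) would be:

    use strict;
    use warnings;
    use HTML::LinkExtor;

    my $file = 'page.html';    # placeholder file name
    my @urls;

    # The callback fires for every link-bearing tag;
    # collect the href of each anchor.
    my $parser = HTML::LinkExtor->new( sub {
        my ( $tag, %attr ) = @_;
        push @urls, $attr{href} if $tag eq 'a' && defined $attr{href};
    } );
    $parser->parse_file( $file );

    print "URL = $_\n" for @urls;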

    Abigail

Re: Getting URL info
by valdez (Monsignor) on Jul 22, 2002 at 18:20 UTC

    The module cited by Abigail-II is distributed with HTML::Parser and its name is HTML::LinkExtor. It is also very simple to create a parser using HTML::TokeParser; give it a try.
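    For instance, a minimal sketch with HTML::TokeParser (again, the file name is just a placeholder) could look like this:

    use strict;
    use warnings;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new( 'page.html' )    # placeholder file name
        or die "Can't open file: $!";

    # Skip ahead to each <a> start tag; the token holds its attributes.
    while ( my $token = $p->get_tag( 'a' ) ) {
        my $url  = $token->[1]{href} or next;
        my $text = $p->get_trimmed_text( '/a' );
        print "URL = $url TEXT = $text\n";
    }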

    I think there is something wrong in your script: you are looking for a line that contains a title tag and a link at the same time... is this correct?

    Ciao, Valerio