Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

My script fetches the filename and HTML title for all my HTML files. My problem now is finding all the URLs. Here is an example of a URL and link in one of my web pages:
<A HREF="http://mylink/index.html/"><FONT SIZE="-1"><STRONG>LINK NAME</STRONG></FONT></A>
Here is my attempt (in my subroutine) at getting the info, but I'm struggling with the href part:
sub wanted {
    if ( $_ =~ /\.html?$/ ) {
        my $name = $File::Find::name;
        open( F, $name ) or die "$!: $name\n";
        while ( $line = <F> ) {
            if ( ($line =~ /<title>(.+)<\/title>/i)
                 && ($line =~ /<a href(.+   ##not sure here???
            {
                $ct++;
                print "FILE = $File::Find::name TITLE = $1 URL = $2\n";
            }
        }
        close F;
    }
}
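To show roughly where I'm stuck, here is a stripped-down, runnable sketch of what I'm after. It assumes the title and the link sit on the same line and that the href is double-quoted, which is probably too fragile:

use strict;
use warnings;
use File::Find;

my $ct = 0;
find( \&wanted, '.' );

sub wanted {
    return unless /\.html?$/;
    my $name = $File::Find::name;
    open my $fh, '<', $_ or die "$!: $name\n";
    while ( my $line = <$fh> ) {
        # Both patterns must match the same line -- fragile.
        if ( my ($title) = $line =~ m{<title>(.+?)</title>}i ) {
            if ( my ($url) = $line =~ m{<a\s+href\s*=\s*"([^"]+)"}i ) {
                $ct++;
                print "FILE = $name TITLE = $title URL = $url\n";
            }
        }
    }
    close $fh;
}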

Re: Getting URL info
by Abigail-II (Bishop) on Jul 22, 2002 at 16:15 UTC
    Please use one of the many HTML parsing modules from CPAN. There is even one whose main purpose is specifically what you want to do. (It has something like 'Linkextractor' in the name).
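    If the module I am thinking of is HTML::LinkExtor, a minimal sketch (the file name here is just a placeholder) would be:

    use strict;
    use warnings;
    use HTML::LinkExtor;

    my $file = 'page.html';    # placeholder file name
    my @urls;

    # The callback fires for every link-bearing tag;
    # collect the href of each anchor.
    my $parser = HTML::LinkExtor->new( sub {
        my ( $tag, %attr ) = @_;
        push @urls, $attr{href} if $tag eq 'a' && defined $attr{href};
    } );
    $parser->parse_file( $file );

    print "URL = $_\n" for @urls;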

    Abigail

Re: Getting URL info
by valdez (Monsignor) on Jul 22, 2002 at 18:20 UTC

    The module cited by Abigail-II is distributed with HTML::Parser and its name is HTML::LinkExtor. It is also very simple to create a parser using HTML::TokeParser; give it a try.
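    For instance, a minimal sketch with HTML::TokeParser (again, the file name is just a placeholder) could look like this:

    use strict;
    use warnings;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new( 'page.html' )    # placeholder file name
        or die "Can't open file: $!";

    # Skip ahead to each <a> start tag; the token holds its attributes.
    while ( my $token = $p->get_tag( 'a' ) ) {
        my $url  = $token->[1]{href} or next;
        my $text = $p->get_trimmed_text( '/a' );
        print "URL = $url TEXT = $text\n";
    }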

    I think there is something wrong in your script: you are looking for a line that contains a title tag and a link at the same time... is this correct?

    Ciao, Valerio