Extracting information

oaklander has asked for the wisdom of the Perl Monks concerning the following question:

This script finds and extracts image links in an HTML file, but it doesn't handle link text with embedded HTML. For example, if the link text is wrapped in a FONT tag with a size attribute, it won't pick it up...

<A HREF="path/name"><FONT SIZE=-1>path name</FONT></A>

but it will pick it up without the embedded HTML...

<A HREF="path/name">path name</A>

Here is the Perl script:

$/ = "";
$raw = "";
$linktext = "";
%atts = ();

while (<>)
{
while (/<A\s([^>]+)>([^<]+)<\/A>/ig)
{
$raw = $1;
$linktext = $2;
$linktext =~ s/[\s]*\n/ /g;
while ($raw =~ /([^\s=]+)\s*=\s*("([^"]+)"|[^\s]+\s*)/ig)
{
if (defined $3)
{
    $atts{ uc($1) } = $3;
}
else
{
    $atts{ uc($1) } = $2;
}

print '-' x 15;
print "\nLink text: $linktext\n";

foreach $key ("HREF", "NAME", "TITLE", "REL", "REV", "TARGET")
          
{
    if (exists($atts{$key}))
            
{
             
    $atts{$key} =~ s/[\s]*\n/ /g;
    print "   $key: $atts{$key}\n";
}
}
    %atts = ();
}
}
}

Replies are listed 'Best First'.
Re: Extracting information by quent (Beadle) on Jan 09, 2002 at 20:12 UTC
There are a few things wrong with your code. It doesn't use warnings or strict. It has inconsistent indentation. It is trying to parse HTML with simple regular expressions. You could do worse than using something already built for such a purpose like HTML::LinkExtor or HTML::SimpleLinkExtor.
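As a rough illustration of the module route, here is a minimal sketch using HTML::LinkExtor. The sample HTML and the variable names are assumptions made up for the example. Note that this module reports link attributes such as HREF and SRC only; it does not return the link text.

```perl
use strict;
use warnings;
use HTML::LinkExtor;    # CPAN module, shipped with the HTML::Parser distribution

# Sample input (an assumption for illustration).
my $html = <<'HTML';
<A HREF="path/one"><FONT SIZE=-1>first link</FONT></A>
<A HREF="path/two">second link</A>
<IMG SRC="images/logo.gif">
HTML

# With no callback given, the parser accumulates links for later retrieval.
my $extor = HTML::LinkExtor->new;
$extor->parse($html);
$extor->eof;

my @found;
for my $link ($extor->links) {
    # Each entry is an array ref: tag name followed by attribute/URL pairs.
    my ($tag, %attr) = @$link;
    push @found, "$tag $_=$attr{$_}" for sort keys %attr;
}
print "$_\n" for @found;
```

The embedded FONT tag makes no difference here, because the parser looks at the tags themselves and not at what lies between them.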
Re: Re: Extracting information by oaklander (Acolyte) on Jan 09, 2002 at 23:00 UTC
Thanks to all for your suggestions. This Perl web site is great! I appreciate everyone's help.
Re: HTML Parsing by BazB (Priest) on Jan 09, 2002 at 20:20 UTC
Why try to parse this sort of thing yourself when Perl's not-exactly-secret weapon, CPAN, has plenty of well-tested modules that will do all this for you? HTML::Parser would be one place to start: its distribution includes modules to slice and dice your HTML in several different ways (check the README). As far as I can see, you should be able to replace the snippet you've posted with these modules. Cheers. BazB.
Update: Is that even valid HTML? It looks pretty horrid either way. A much nicer way of doing it would be: `<font size="-1"><a href="link">this is a link</a></font>`
Further update: Yes, it is valid HTML. Still looks hideous :-)
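For the original question (link text plus attributes), one possible approach is HTML::TokeParser, which ships with the HTML::Parser distribution. The following is only a sketch: the sample HTML, the @links structure and the output layout are assumptions chosen to resemble the posted script, not a drop-in replacement for it.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # part of the HTML::Parser distribution on CPAN

# Sample input (an assumption for illustration).
my $html = <<'HTML';
<A HREF="path/one"><FONT SIZE=-1>first link</FONT></A>
<A HREF="path/two" TITLE="Two">second link</A>
HTML

my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @links;
# get_tag("a") skips ahead to the next <a ...> start tag.
while (my $tag = $parser->get_tag("a")) {
    my $attr = $tag->[1];                        # hash ref, attribute names lowercased
    my $text = $parser->get_trimmed_text("/a");  # text up to </a>, inner tags skipped
    push @links, { text => $text, attr => $attr };
}

for my $link (@links) {
    print '-' x 15, "\n";
    print "Link text: $link->{text}\n";
    for my $key (qw(href name title rel rev target)) {
        print "   \U$key\E: $link->{attr}{$key}\n"
            if exists $link->{attr}{$key};
    }
}
```

Because the parser tokenizes the document, the FONT tag inside the anchor is simply skipped when the text is collected, so both sample links are reported.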
Re: Extracting information by fuzzysteve (Beadle) on Jan 09, 2002 at 20:03 UTC
My regexes aren't what they could be, but after some experimentation with the code, the problem arises when you have a `<` before the `</a>`. The problem appears to be in your first regex (the while loop check): you've written it to exclude any data that has a `<` before the `</a>`. The culprit is the `([^<]+)`, which specifically excludes any link text that has tags between the anchor tags.
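To make the point concrete, here is a small sketch comparing the posted pattern with a relaxed variant that captures the link text with a non-greedy `(.*?)` and then strips any inner tags. The sample HTML and variable names are assumptions for illustration, and a regex like this remains fragile on real-world HTML (nested anchors, comments, `>` inside attribute values).

```perl
use strict;
use warnings;

# Sample input (an assumption for illustration): one link with embedded
# markup and one without.
my $html = <<'HTML';
<A HREF="path/one"><FONT SIZE=-1>first link</FONT></A>
<A HREF="path/two">second link</A>
HTML

# Posted pattern: [^<]+ cannot match across the inner <FONT> tag,
# so the first link is silently skipped.
my @strict_text = $html =~ /<A\s[^>]+>([^<]+)<\/A>/ig;

# Relaxed pattern: non-greedy .*? (with /s so it can span newlines)
# accepts anything up to the nearest </A>.
my @loose_text;
while ($html =~ /<A\s([^>]+)>(.*?)<\/A>/igs) {
    my $text = $2;
    $text =~ s/<[^>]+>//g;   # drop embedded tags such as <FONT>
    $text =~ s/\s+/ /g;      # collapse whitespace and newlines
    push @loose_text, $text;
}

print "posted pattern:  @strict_text\n";
print "relaxed pattern: @loose_text\n";
```

With this input, the posted pattern finds only `second link`, while the relaxed pattern finds both `first link` and `second link`.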