Extracting information

by oaklander (Acolyte)
on Jan 09, 2002 at 19:34 UTC

oaklander has asked for the wisdom of the Perl Monks concerning the following question:

This script finds and extracts image links in an HTML file, but it doesn't handle link text with embedded HTML. For example, if the link text is wrapped in a FONT tag, it won't pick it up:
<A HREF="path/name"><FONT SIZE=-1>path name</FONT></A>
but it does pick up the link when there is no embedded HTML:
<A HREF="path/name">path name</A>
Here is the Perl script:
    $/ = "";
    $raw = "";
    $linktext = "";
    %atts = ();
    while (<>) {
        while (/<A\s([^>]+)>([^<]+)<\/A>/ig) {
            $raw = $1;
            $linktext = $2;
            $linktext =~ s/[\s]*\n/ /g;
            while ($raw =~ /([^\s=]+)\s*=\s*("([^"]+)"|[^\s]+\s*)/ig) {
                if (defined $3) {
                    $atts{ uc($1) } = $3;
                } else {
                    $atts{ uc($1) } = $2;
                }
                print '-' x 15;
                print "\nLink text: $linktext\n";
                foreach $key ("HREF", "NAME", "TITLE", "REL", "REV", "TARGET") {
                    if (exists($atts{$key})) {
                        $atts{$key} =~ s/[\s]*\n/ /g;
                        print " $key: $atts{$key}\n";
                    }
                }
                %atts = ();
            }
        }
    }

Replies are listed 'Best First'.
Re: Extracting information
by quent (Beadle) on Jan 09, 2002 at 20:12 UTC
    There are a few things wrong with your code. It doesn't use warnings or strict. It has inconsistent indentation. It is trying to parse HTML with simple regular expressions. You could do worse than using something already built for such a purpose like HTML::LinkExtor or HTML::SimpleLinkExtor.
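For illustration, here is a minimal sketch of the HTML::LinkExtor route. It assumes the module is installed from CPAN (it ships in the HTML-Parser distribution); the sample input and the collect_link helper are made up for this example. Note that HTML::LinkExtor reports link attributes only, not the link text, so on its own it covers just the HREF part of the original script.

```perl
use strict;
use warnings;
use HTML::LinkExtor;    # CPAN module, part of the HTML-Parser distribution

# Sample input taken from the question (hypothetical second link added).
my $html = <<'HTML';
<A HREF="path/name"><FONT SIZE=-1>path name</FONT></A>
<A HREF="other/name">other name</A>
HTML

my @hrefs;

# Hypothetical helper: the callback receives the lowercased tag name
# followed by the tag's link attributes as name/value pairs.
sub collect_link {
    my ($tag, %attr) = @_;
    return unless $tag eq 'a' && defined $attr{href};
    push @hrefs, $attr{href};
}

my $extor = HTML::LinkExtor->new(\&collect_link);
$extor->parse($html);
$extor->eof;

print "HREF: $_\n" for @hrefs;
```

The FONT tag inside the first anchor makes no difference here, because the parser handles tags as tags rather than as characters to match around.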
      Thanks to all for your suggestions. This Perl web site is great! I appreciate everyone's help.
Re: HTML Parsing
by BazB (Priest) on Jan 09, 2002 at 20:20 UTC

    Why try and parse this sort of thing when Perl's not-exactly-secret weapon, CPAN, has plenty of well-tested modules that'll do all this for you?

    HTML::Parser would be one place to start - it includes modules to slice and dice your HTML in several different ways - check the README. As far as I can see, you should be able to replace the snippet you've posted with these modules.
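As a rough sketch of what that replacement might look like, the following uses the version 3 handler interface of HTML::Parser to collect each anchor's attributes and its text, including text wrapped in other tags. It assumes HTML::Parser is installed from CPAN; the sample input and the @links / $current names are invented for this example, and nested or unclosed anchors are not handled.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module; the version 3 API is assumed

# Sample input based on the question (hypothetical second link added).
my $html = <<'HTML';
<A HREF="path/name"><FONT SIZE=-1>path name</FONT></A>
<A HREF="other/name" TITLE="Other">other
name</A>
HTML

my @links;      # finished links, each a hash of attributes plus "text"
my $current;    # the link being collected; undef outside <a>...</a>

my $parser = HTML::Parser->new(
    api_version => 3,
    # Tag and attribute names arrive lowercased.
    start_h => [ sub {
        my ($tag, $attr) = @_;
        $current = { %$attr, text => '' } if $tag eq 'a';
    }, 'tagname, attr' ],
    # Only text reaches this handler; markup such as <FONT> does not.
    text_h => [ sub {
        my ($text) = @_;
        $current->{text} .= $text if $current;
    }, 'dtext' ],
    end_h => [ sub {
        my ($tag) = @_;
        return unless $tag eq 'a' && $current;
        $current->{text} =~ s/\s+/ /g;    # collapse newlines and runs of spaces
        push @links, $current;
        undef $current;
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

for my $link (@links) {
    print '-' x 15, "\nLink text: $link->{text}\n";
    for my $key (qw(href name title rel rev target)) {
        print " \U$key\E: $link->{$key}\n" if exists $link->{$key};
    }
}
```

The output format mirrors the original script, but the attribute parsing and the handling of embedded tags are left to the parser instead of hand-written regexes.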

    Cheers.

    BazB.

    Update: Is that even valid HTML? It looks pretty horrid either way. A much nicer way of doing it would be:

    <font size="-1"><a href="link">this is a link</a></font>
    Further update: Yes, it is valid HTML. Still looks hideous :-)

Re: Extracting information
by fuzzysteve (Beadle) on Jan 09, 2002 at 20:03 UTC
    My regexes aren't what they could be, but after some experimentation with the code, the problem arises when there is a < before the </A>.
    The problem appears to be in your first regex (the check in the while loop).
    As written, that regex excludes any link that has a < before the </A>. The culprit is the ([^<]+): you've specifically excluded any link text that has tags between the anchor tags.
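One possible way to act on that diagnosis while staying with regexes is sketched below: replace ([^<]+) with a non-greedy (.*?) so the match runs to the nearest </A>, then strip any tags out of the captured text. This is only an illustration on made-up sample input, not a robust parser; it will still trip over nested anchors, comments, or attribute values that contain a >.

```perl
use strict;
use warnings;

# Sample input taken from the question.
my $html = <<'HTML';
<A HREF="path/name"><FONT SIZE=-1>path name</FONT></A>
<A HREF="other/name">other name</A>
HTML

my @found;    # pairs of [ raw attribute string, cleaned link text ]

# (.*?) matches as little as possible up to the nearest </A>, so tags inside
# the link text no longer stop the match; /s lets "." cross newlines.
while ($html =~ /<A\s([^>]+)>(.*?)<\/A>/isg) {
    my ($raw, $linktext) = ($1, $2);
    $linktext =~ s/<[^>]*>//g;    # drop embedded tags such as <FONT ...>
    $linktext =~ s/\s+/ /g;       # collapse whitespace and newlines
    push @found, [ $raw, $linktext ];
}

print "Attributes: $_->[0]\nLink text: $_->[1]\n" for @found;
```

The raw attribute string can then be fed to the attribute loop from the original script unchanged.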
