
*Ahem*

I believe you meant to say all except HTML::LinkExtractor, which gets them all :).

perldoc HTML::LinkExtractor
...
WHAT'S A LINK-type tag
    Take a look at %HTML::LinkExtractor::TAGS to see what I consider to be
    link-type-tag.

    Take a look at @HTML::LinkExtractor::VALID_URL_ATTRIBUTES to see all the
    possible tag attributes which can contain URI's (the links!!)
...
use HTML::LinkExtractor;
use Data::Dumper;
local $Data::Dumper::Indent = 1;
print Dumper(
    \%HTML::LinkExtractor::TAGS,
    \@HTML::LinkExtractor::VALID_URL_ATTRIBUTES
);
__END__
$VAR1 = {
  'tr' => [ 'background' ],
  'base' => [ 'href' ],
  'form' => [ 'action' ],
  'body' => [ 'background' ],
  'input' => [ 'dynsrc', 'lowsrc', 'src' ],
  'a' => [ 'href' ],
  '!doctype' => [ 'url' ],
  'img' => [ 'dynsrc', 'longdesc', 'lowsrc', 'src', 'usemap' ],
  'object' => [ 'archive', 'classid', 'code', 'codebase', 'data', 'usemap' ],
  'bgsound' => [ 'src' ],
  'sound' => [ 'src' ],
  'del' => [ 'cite' ],
  'script' => [ 'src' ],
  'applet' => [ 'archive', 'code', 'codebase', 'src' ],
  'embed' => [ 'pluginspage', 'pluginurl', 'src' ],
  'area' => [ 'href' ],
  'iframe' => [ 'src' ],
  'ilayer' => [ 'background', 'src' ],
  'td' => [ 'background' ],
  'blockquote' => [ 'cite' ],
  'q' => [ 'cite' ],
  'ins' => [ 'cite' ],
  'th' => [ 'background' ],
  'layer' => [ 'src' ],
  'frame' => [ 'src', 'longdesc' ],
  'meta' => undef,
  'table' => [ 'background' ],
  'isindex' => [ 'action' ],
  'div' => [ 'src' ],
  'link' => [ 'src', 'href' ]
};
$VAR2 = [
  'action', 'archive', 'background', 'cite',
  'classid', 'code', 'codebase', 'data',
  'dynsrc', 'href', 'longdesc', 'lowsrc',
  'pluginspage', 'pluginurl', 'src', 'usemap'
];
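
If you want to see those tables at work, here is a minimal sketch. It assumes the interface shown in the module's synopsis (new, parse with a scalar ref, links returning an array ref of hash refs with a 'tag' key). The sample HTML and the variable names are made up for illustration, so check perldoc HTML::LinkExtractor before relying on the details.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical sample input, not taken from this thread.
my $html = <<'HTML';
<body background="bg.png">
<a href="http://example.com/">a link</a>
<img src="pic.png" longdesc="pic.html">
<blockquote cite="http://example.org/source">quoted</blockquote>
</body>
HTML

my $lx = HTML::LinkExtractor->new();
$lx->parse(\$html);    # a scalar ref is treated as HTML text (assumed per the synopsis)

# Walk every extracted link and report each URL-bearing attribute it carries,
# using the module's own list of valid URL attributes.
my @found;
for my $link ( @{ $lx->links } ) {
    for my $attr (@HTML::LinkExtractor::VALID_URL_ATTRIBUTES) {
        next unless exists $link->{$attr};
        push @found, "$link->{tag} $attr=$link->{$attr}";
    }
}
print "$_\n" for @found;
```

The point is that tags like body, img and blockquote should show up next to the plain a href, which is what "gets them all" refers to.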
