in reply to (jeffa) Re: Pretty cool link extractor.
in thread Pretty cool link extractor.

Even that won't catch everything: a mixed-case 'href', single quotes instead of double (or none at all!), other attributes after the href (such as JavaScript event handlers), or a label containing an unquoted '<'. You're not just trying to match valid HTML; you're trying to match the HTML that's "out there". Your example also doesn't catch the case where there is no text label at all; the link might wrap an image.

For these reasons I wholeheartedly recommend using one of the HTML:: modules, e.g. HTML::LinkExtor or HTML::TreeBuilder. Just because I'm feeling perverse, I've come up with a perverse regex that seems to work with my 'odd' cases (though it will fail if there is whitespace in the URL):

    use strict;
    use warnings;

    use Data::Dump qw(dump);

    my @links = ();
    my $html  = do { local $/; <DATA> };

    while ($html =~ /[Hh][Rr][Ee][Ff]\s*=\s*['"]?([^\s"'>]+)['"]?.*?>(.*?)<\s*\/\s*[Aa]\s*>/gs) {
        push @links, [$1, $2];
    }

    print dump(@links), "\n";

    __DATA__
    <a href="http://foo.com">bar</a>
    <a href="index.html">index</a>
    <a href='/blah'>some text</a>
    <a href="http://some.url.com" onClick="">blah</a>
    <a href=http://bad.bad.bad>text</a>
    <a href="encodeme.html">< back</a>
    <a href="image.gif"><img src="blah.jpg"></a>
    <a href="/multline/example.html"
       >this is some text
    </a>
ick! :)
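For comparison, here's a rough sketch of the HTML::LinkExtor route I'm recommending above. The sample links in __DATA__ are my own made-up 'odd' cases; the callback API (tag name plus an attribute hash, with attribute names lowercased by the parser) is the module's documented interface:

    use strict;
    use warnings;

    use HTML::LinkExtor;

    my @links;

    # The callback fires once per link-bearing tag; %attr maps
    # attribute names (lowercased by the parser) to their values.
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
    });

    my $html = do { local $/; <DATA> };
    $parser->parse($html);
    $parser->eof;

    print "$_\n" for @links;

    __DATA__
    <a HREF='single.html'>mixed case, single quotes</a>
    <a href=unquoted.html onclick="">no quotes, extra attribute</a>
    <a href="image.gif"><img src="blah.jpg"></a>

Note that it copes with the mixed-case, unquoted and image-label cases without any regex heroics, because a real parser is doing the tokenizing.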

Update: belg4mit suggested URI::Find, which sounds like a sensible idea.
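For the curious, URI::Find scans free text for URIs rather than parsing href attributes, so it's a different tool for a related job. A minimal sketch (the sample text is mine; the constructor-takes-a-callback, find-takes-a-string-ref interface is the module's documented API):

    use strict;
    use warnings;

    use URI::Find;

    my @found;
    my $finder = URI::Find->new(sub {
        my ($uri, $orig_text) = @_;
        push @found, $uri->as_string;
        return $orig_text;    # leave the scanned text unchanged
    });

    my $text = 'See http://foo.com/ and also http://bar.org/baz for details.';
    $finder->find(\$text);

    print "$_\n" for @found;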

gav^