in reply to (jeffa) Re: Pretty cool link extractor.
in thread Pretty cool link extractor.
For these reasons I wholeheartedly recommend using one of the HTML:: modules, e.g. HTML::LinkExtor or HTML::TreeBuilder. Just because I am feeling perverse, I've come up with a perverse regex that seems to work with my 'odd' cases (though it will fail if there is whitespace in the URL):
ick! :)

use strict;
use warnings;
use Data::Dump qw(dump);

my @links = ();
my $html  = do { local $/; <DATA> };

# capture the href value (quoted or not) and the link text up to the closing </a>
while ($html =~ /[Hh][Rr][Ee][Ff]\s*=\s*['"]?([^\s"'>]+)['"]?.*?>(.*?)<\s*\/\s*[Aa]\s*>/gs) {
    push @links, [$1, $2];
}

print dump(@links), "\n";

__DATA__
<a href="http://foo.com">bar</a>
<a href="index.html">index</a>
<a href='/blah'>some text</a>
<a href="http://some.url.com" onClick="">blah</a>
<a href=http://bad.bad.bad>text</a>
<a href="encodeme.html">< back</a>
<a href="image.gif"><img src="blah.jpg"></a>
<a href="/multline/example.html"
   >this is some text
</a>
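For comparison, here's a rough (untested) sketch of what the HTML::LinkExtor version might look like, using the same sort of DATA input. Keep in mind that LinkExtor only hands you the link attributes, not the link text, so if you need the text as well, HTML::TreeBuilder is the better fit:

use strict;
use warnings;
use HTML::LinkExtor;

my $html = do { local $/; <DATA> };

my @links;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attrs) = @_;
    # keep only the href attribute of anchor tags
    push @links, $attrs{href} if $tag eq 'a' and exists $attrs{href};
});
$parser->parse($html);
$parser->eof;

print "$_\n" for @links;

__DATA__
<a href="http://foo.com">bar</a>
<a href='/blah'>some text</a>
<a href="image.gif"><img src="blah.jpg"></a>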
Update: belg4mit suggested URI::Find, which sounds like a sensible idea.
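Something along these lines is probably all it takes (untested sketch); note that URI::Find works on plain text and only picks up URIs with a scheme (http://, ftp://, etc.), so relative hrefs like "index.html" won't be found:

use strict;
use warnings;
use URI::Find;

my $text = do { local $/; <DATA> };

my @found;
my $finder = URI::Find->new(sub {
    my ($uri, $orig_text) = @_;
    push @found, $uri;
    return $orig_text;   # put the original text back unchanged
});
$finder->find(\$text);

print "$_\n" for @found;

__DATA__
Check out http://foo.com and http://some.url.com/index.html some time.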
gav^