in reply to Pretty cool link extractor.

That didn't work for the following:
<a href="http://foo.com">bar</a>
<a href="index.html">index</a>
Maybe I am missing something, but I think that the regex you use to 'remove all HTML tags' isn't working the way you think it should. Here is how I would do it:
use strict;
use Data::Dumper;

my @link;
my @data = <DATA>;

for (@data) {
    my ($url,$label) = $_ =~ /href\s*=\s*"([^"]+)"\s*>([^<]+)/;
    next unless $url and $label;
    push @link, [$url,$label];
}

print Dumper \@link;

__DATA__
<a href="http://foo.com">bar</a>
<a href="index.html">index</a>
But I would NEVER use that in any serious code (it has its limitations - only one link per line). I would use a module. Now, why people think that writing code to bypass using a module (that has already been tested and used by many, many people around the world) is a 'good thing' eludes me. Is it because you don't have permission? Then please read A Guide to Installing Modules - there is no excuse.
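
For what it's worth, here is a minimal sketch of the module approach using HTML::LinkExtor (just one of several CPAN options; the callback here only collects the href values, and the data lines are the two examples above):

use strict;
use warnings;
use HTML::LinkExtor;

# collect every href attribute found in <a> tags
my @link;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @link, $attr{href} if $tag eq 'a' and defined $attr{href};
});

$parser->parse($_) for <DATA>;
$parser->eof;

print "$_\n" for @link;

__DATA__
<a href="http://foo.com">bar</a>
<a href="index.html">index</a>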

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Re: (jeffa) Re: Pretty cool link extractor.
by gav^ (Curate) on Mar 26, 2002 at 04:23 UTC
    Even that won't catch everything: a mixed-case 'href', single quotes instead of double (or none at all!), other attributes after the href (such as javascript events), or a label that contains an unquoted '<'. You're not just trying to match valid HTML; you're trying to match the HTML that is "out there". Your example also doesn't handle the case where there is no label at all because the link wraps an image.

    For these reasons I wholeheartedly recommend using one of the HTML:: modules, e.g. HTML::LinkExtor or HTML::TreeBuilder. Just because I am feeling perverse, I've come up with a perverse regex that seems to work with my 'odd' cases (though it will fail if there is whitespace in the URL):

    use strict;
    use warnings;
    use Data::Dump qw(dump);

    my @links = ();
    my $html = do { local $/; <DATA> };

    while ($html =~ /[Hh][Rr][Ee][Ff]\s*=\s*['"]?([^\s"'>]+)['"]?.*?>(.*?)<\s*\/\s*[Aa]\s*>/gs) {
        push @links, [$1, $2];
    }

    print dump(@links), "\n";

    __DATA__
    <a href="http://foo.com">bar</a>
    <a href="index.html">index</a>
    <a href='/blah'>some text</a>
    <a href="http://some.url.com" onClick="">blah</a>
    <a href=http://bad.bad.bad>text</a>
    <a href="encodeme.html">< back</a>
    <a href="image.gif"><img src="blah.jpg"></a>
    <a href="/multline/example.html"
    >this is some text
    </a>
    ick! :)
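
    For contrast, here is a rough sketch of the HTML::TreeBuilder route mentioned above, which hands back both the href and the label without any regex heroics (the __DATA__ lines are just a few of the odd cases from the snippet above):

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $html = do { local $/; <DATA> };
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # every <a> element, whatever the quoting or attribute order
    for my $a ($tree->look_down(_tag => 'a')) {
        my $url   = $a->attr('href');
        my $label = $a->as_text;
        print "$url => $label\n" if defined $url;
    }

    $tree->delete;

    __DATA__
    <a href="http://foo.com">bar</a>
    <a href='/blah'>some text</a>
    <a href=http://bad.bad.bad>text</a>
    <a href="image.gif"><img src="blah.jpg"></a>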

    Update: belg4mit suggested URI::Find, which sounds like a sensible idea.
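
    Note that URI::Find scans plain text for anything that looks like an absolute URI rather than parsing HTML attributes, so it solves a slightly different problem; a quick sketch (the sample text is made up):

    use strict;
    use warnings;
    use URI::Find;

    my $text = do { local $/; <DATA> };

    my @found;
    my $finder = URI::Find->new(sub {
        my ($uri, $orig_text) = @_;
        push @found, $uri;
        return $orig_text;    # put the original text back untouched
    });
    $finder->find(\$text);

    print "$_\n" for @found;

    __DATA__
    See http://foo.com and http://some.url.com/page.html for the gory details.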

    gav^