Even that won't catch everything, such as a mixed case 'href', using single quotes instead of double (or none at all!), any other attributes after the href (such as javascript events) or that the label might contain an unquoted '<'. You're not just trying to match valid HTML you're trying to match HTML that is "out there". Your example also doesn't catch the case where there might not be a label at all, it might be an image.

For these reasons I whole heartedly recommend using one of the HTML:: modules eg, HTML::LinkExtor or HTML::TreeBuilder. Just because I am feeling perverse I've come up with a perverse regex that seems to work with my 'odd' cases (though it will fail if there is whitespace in the url):

use strict; use warnings; use Data::Dump qw(dump); my @links = (); my $html = do { local $/; <DATA> }; while ($html =~ /[Hh][Rr][Ee][Ff]\s*=\s*['"]?([^\s"'>]+)['"]?.*?>(.*?) +<\s*\/\s*[Aa]\s*>/gs) { push @links, [$1, $2]; } print dump(@links), "\n"; __DATA__ <a href="http://foo.com">bar</a> <a href="index.html">index</a> <a href='/blah'>some text</a> <a href="http://some.url.com" onClick="">blah</a> <a href=http://bad.bad.bad>text</a> <a href="encodeme.html">< back</a> <a href="image.gif"><img src="blah.jpg"></a> <a href="/multline/example.html" >this is some text </a>
ick! :)

Update:belg4mit suggested URI::Find which sounds like a sensible idea.

gav^


In reply to Re: (jeffa) Re: Pretty cool link extractor. by gav^
in thread Pretty cool link extractor. by DigitalKitty

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.