
demerphq++; matching URLs with a regex should be on some list of Frequently Made Mistakes, and HTML::TreeBuilder is what I would usually use. However, I do occasionally parse HTML myself, when:

- I am writing quick, one-shot, throw-away code (or prototyping, with XXX FIXME notes about the parsing), and
- the input HTML is known to be very regular (i.e. no newlines between A and HREF, and no extra attributes).

In the 3 examples that Anonymous Monk gives, the first two work (although with extraneous back-whacks), while the third fails to compile because it does not back-whack the pattern delimiters where they occur inside the pattern. If the first two fail for him, then something else may be wrong; perhaps the unposted code which extracts the tag is wonky.
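To illustrate the delimiter point, here is a minimal sketch. The tag and URL are hypothetical stand-ins, not Anonymous Monk's actual patterns. It shows a slash-delimited pattern that needs its slashes back-whacked, and the same pattern with alternate delimiters, which needs none:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tag, for illustration only.
my $tag = '<a href="http://www.example.com/index.html">';

# With the default / delimiters, every / inside the pattern must be back-whacked.
my $slashes = $tag =~ /href="http:\/\/www\.example\.com\/index\.html"/i ? 1 : 0;

# With alternate delimiters such as m{...}, the slashes need no escaping.
my $braces = $tag =~ m{href="http://www\.example\.com/index\.html"}i ? 1 : 0;

# This form would not compile, because the first bare / ends the pattern early:
#   $tag =~ /href="http://www\.example\.com/index\.html"/i;

print "slash-delimited match: $slashes\n";
print "brace-delimited match: $braces\n";
```

Choosing a delimiter that does not appear in the pattern, as m{...} does here, is usually the easier way to avoid this class of error when matching URLs.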

Finally, I would recommend extracting the true URL (the part inside the quotes in <A HREF="...">) and matching it against a hash of URLs to skip. Here is code that demonstrates both the regex and HTML::TreeBuilder methods:

#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;

my $data = <<'EOF';
  <html><head></head><body>
  <a href="ftp://debian.secsup.org/pub/linux/debian/README"></a>
  <a href="http://perlgolf.sourceforge.net/"></a>
  <A HREF="http://www.cnn.com/WEATHER/index.html"></a>
  <a href="http://www.ethereal.com/appnotes/enpa-sa-00006.html"></a>
  <a href="http://www.onlamp.com/lpt/a/2680"></a>
  <a href="http://www.perl.com/lpt/a/2002/08/22/exegesis5.html"></a>
  </body></html>
EOF

# This list of URLs to omit is case-insensitive.
my %omit = map {lc($_),1} qw(
  http://www.cnn.com/WEATHER/index.html
  ftp://DEBIAN.SECSUP.ORG/PUB/LINUX/DEBIAN/README
);

print "\nParsing with HTML::TreeBuilder...\n";

my $tree = HTML::TreeBuilder->new;
$tree->parse($data);
$tree->eof();
for (@{ $tree->extract_links('a') }) {
  my($real_url, $element, $attr, $tag) = @$_;
  if( $omit{ lc($real_url) } ) {
    print "Skip this url: $real_url\n";
    next;
  }
  print "Good URL: $real_url\n";
}
$tree = $tree->delete;

print "\nParsing with regex...\n";
my @tags = ($data =~ m{(<a\s+href=".+?">)}ig );

foreach my $url (@tags) {
  my ($real_url) = ( $url=~ m{<A\s+HREF="(.+?)">}i )
    or die "URL '$url' failed pattern match"; 
  if( $omit{ lc($real_url) } ) {
    print "Skip this url: $real_url\n";
    next;
  }
  print "Good URL: $real_url\n";
}

In reply to Re: matching a Url by Util
in thread matching a Url by Anonymous Monk
