in reply to matching a Url
demerphq++; matching URLs with a regex should be on some list of Frequently Made Mistakes, and HTML::TreeBuilder is what I would usually use. However, I do occasionally parse HTML myself, when:
Of the three examples that Anonymous Monk gives, the first two work (although with extraneous back-whacks), while the third fails to compile because it does not back-whack the pattern delimiters where they occur inside the pattern. If the first two fail for him, then something else may be wrong; perhaps the unposted code that extracts the tag is wonky.
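To illustrate the delimiter problem, here is a minimal sketch. The tag and the example.com URL are made up for the demonstration; they are not Anonymous Monk's actual patterns, which I am only guessing resemble these. It compares escaped slashes, alternate delimiters, and an unescaped pattern that is compiled inside a string eval so the failure can be observed without killing the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tag, for illustration only (not taken from the original question).
my $tag = '<a href="http://www.example.com/index.html">';

# With the default / delimiters, every / inside the pattern must be back-whacked.
my $slashes = $tag =~ /href="http:\/\/www\.example\.com\/index\.html"/i ? 1 : 0;

# With m{...} delimiters the slashes need no escaping; only the dots still do.
my $braces = $tag =~ m{href="http://www\.example\.com/index\.html"}i ? 1 : 0;

# Unescaped slashes between / delimiters end the pattern early, so the
# statement does not compile. A string eval traps the compile error.
my $ok = eval q{ $tag =~ /href="http://www\.example\.com/"/i; 1 };
my $broken_compiles = $ok ? 1 : 0;

print "escaped slashes match:     $slashes\n";
print "brace delimiters match:    $braces\n";
print "unescaped slashes compile: $broken_compiles\n";
```

Choosing a delimiter that does not appear in the pattern, such as m{...}, removes the need for most of the back-whacks in the first place.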
Finally, I would recommend extracting the true URL (the part inside the quotes in <A HREF="...">) and matching it against a hash of URLs to skip. Here is code that demonstrates both the regex and HTML::TreeBuilder methods:
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;

my $data = <<'EOF';
<html><head></head><body>
<a href="ftp://debian.secsup.org/pub/linux/debian/README"></a>
<a href="http://perlgolf.sourceforge.net/"></a>
<A HREF="http://www.cnn.com/WEATHER/index.html"></a>
<a href="http://www.ethereal.com/appnotes/enpa-sa-00006.html"></a>
<a href="http://www.onlamp.com/lpt/a/2680"></a>
<a href="http://www.perl.com/lpt/a/2002/08/22/exegesis5.html"></a>
</body></html>
EOF

# This list of URLs to omit is case-insensitive.
my %omit = map {lc($_),1} qw(
    http://www.cnn.com/WEATHER/index.html
    ftp://DEBIAN.SECSUP.ORG/PUB/LINUX/DEBIAN/README
);

print "\nParsing with HTML::TreeBuilder...\n";
my $tree = HTML::TreeBuilder->new;
$tree->parse($data);
$tree->eof();
for (@{ $tree->extract_links('a') }) {
    my($real_url, $element, $attr, $tag) = @$_;
    if( $omit{ lc($real_url) } ) {
        print "Skip this url: $real_url\n";
        next;
    }
    print "Good URL: $real_url\n";
}
$tree = $tree->delete;

print "\nParsing with regex...\n";
my @tags = ($data =~ m{(<a\s+href=".+?">)}ig );
foreach my $url (@tags) {
    my ($real_url) = ( $url=~ m{<A\s+HREF="(.+?)">}i )
        or die "URL '$url' failed pattern match";
    if( $omit{ lc($real_url) } ) {
        print "Skip this url: $real_url\n";
        next;
    }
    print "Good URL: $real_url\n";
}
Re: Re: matching a Url
by demerphq (Chancellor) on Sep 09, 2002 at 15:07 UTC