Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am processing URLs but don't want to process a URL called:
<A HREF="http://www.cnn.com/WEATHER/index.html">
In order to ignore the URL I tried many conditional statements, but nothing is working.

I can't get my script to "skip this url" and not process it. It seems to ignore my if conditional statement. I need some help with what I am doing wrong:
if($url =~ /^<A\s+HREF\=\"http\:\/\/www.cnn.com/i) { print "skip this url.\n"; next; }
Also tried this:
if($url =~ /<A HREF\=\"http\:\/\/www\.cnn\.com\/WEATHER\/index\.html\">/i) { print "Skip this url.\n"; next; }
And another attempt:
if($url =~ /<A HREF="http://www.cnn.com/WEATHER/index.html">/i) { print "skip this url.\n"; next; }

Re: matching a Url
by demerphq (Chancellor) on Sep 09, 2002 at 13:11 UTC
    First off, <A HREF="http://www.cnn.com/WEATHER/index.html"> isn't a URL. It's the opening tag of an anchor element. If you are trying to skip anchor tags based on the URL, then I wouldn't try to match the full tag (think of all the other attributes a tag can have, any of which will make the match fail) but rather the URL within the tag. Even then I wonder; have a look at HTML::LinkExtor for other ideas.

    The below will print "matches".

    my $str = '<A HREF="http://www.cnn.com/WEATHER/index.html">';
    print "matches!" if $str =~ m!href\s*=\s*['"]\Qhttp://www.cnn.com/WEATHER/index.html\E['"]!i;
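
    As a rough illustration of where \Q...\E belongs in such a pattern: everything between the two markers is matched literally, so the quote character class has to stay outside them or it would be treated as the literal text ['"]. The helper name href_is and the extra sample tags below are made up for this sketch; only the CNN URL comes from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns 1 if $tag has an
# href attribute whose value is exactly $url, in single or double quotes.
sub href_is {
    my ($tag, $url) = @_;
    # \Q...\E quotes only the interpolated URL; the ['"] class sits
    # outside so it keeps its meaning as "either quote character".
    return $tag =~ m!href\s*=\s*['"]\Q$url\E['"]!i ? 1 : 0;
}

my $skip = 'http://www.cnn.com/WEATHER/index.html';

# Sample tags: the first is from the question, the others are invented.
my @tags = (
    '<A HREF="http://www.cnn.com/WEATHER/index.html">',
    "<a class='nav' href = 'http://www.cnn.com/weather/index.html'>",
    '<A HREF="http://wwwXcnn.com/WEATHER/index.html">',
);

for my $tag (@tags) {
    print href_is($tag, $skip) ? "skip: $tag\n" : "keep: $tag\n";
}
```

    Because the dots in the URL are quoted, the third tag is kept even though it differs from the target by only one character.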

    If this is part of some regex-based HTML parser then I suggest you look into using HTML::Parser or its more useful (but greedy) child class, HTML::TreeBuilder. Frankly I would use modules like that because the intricacies of HTML make it difficult to parse properly with regexen.

    update: thanks to the CB for some HTML clarifications while writing this.

    Yves / DeMerphq
    ---
    Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)

Re: matching a Url
by talexb (Chancellor) on Sep 09, 2002 at 13:18 UTC
      I am processing urls but dont want to process a url called:

      <A HREF="http://www.cnn.com/WEATHER/index.html">

    You haven't explained why this URL is undesirable. What criteria are you using to choose whether (sorry, pun intended) or not to retain a URL? Describe that to us and we can give you a better answer to 'How'.

    --t. alex
    but my friends call me T.

Re: matching a Url
by Util (Priest) on Sep 09, 2002 at 14:45 UTC

    demerphq++; matching URLs with a regex should be on some list of Frequently Made Mistakes, and HTML::TreeBuilder is what I would usually use. However, I do occasionally parse HTML myself, when:

    1. I am writing quick, one-shot, throw-away code (or prototyping with XXX FIXME notes about the parsing), and
    2. the input HTML is known to be very regular (i.e. no newlines between A and HREF, and no extra attributes).

    In the 3 examples that Anonymous Monk gives, the first two work (although with extraneous back-whacks), while the third fails to compile because it did not back-whack the pattern delimiters where found inside the pattern. If the first two fail for him, then something else may be wrong; perhaps the unposted code which extracts the tag is wonky.
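
    To make the delimiter point concrete, here is a minimal sketch of the third attempt rewritten with m{...}, so the slashes in the URL no longer end the pattern. The contents of @tags are an assumption, since the code that extracts the tags was never posted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed input: stands in for whatever the unposted extraction code yields.
my @tags = (
    '<A HREF="http://www.cnn.com/WEATHER/index.html">',
    '<A HREF="http://www.perl.com/lpt/a/2002/08/22/exegesis5.html">',
);

my @kept;
for my $url (@tags) {
    # With m{...} as the delimiter, "/" is an ordinary character;
    # only the dots need a backslash to be taken literally.
    if ($url =~ m{<A HREF="http://www\.cnn\.com/WEATHER/index\.html">}i) {
        print "skip this url.\n";
        next;
    }
    push @kept, $url;
}

print "kept: $_\n" for @kept;
```

    If a loop like this still fails to skip the tag, the value in $url is probably not what the pattern expects, which points back at the extraction step.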

    Finally, I would recommend extracting the true URL (the part inside the quotes in <A HREF="...">) and matching it against a hash of URLs to skip. Here is code that demonstrates both the regex and HTML::TreeBuilder methods:

    #!/usr/bin/perl -w
    use strict;
    use HTML::TreeBuilder;

    my $data = <<'EOF';
    <html><head></head><body>
    <a href="ftp://debian.secsup.org/pub/linux/debian/README"></a>
    <a href="http://perlgolf.sourceforge.net/"></a>
    <A HREF="http://www.cnn.com/WEATHER/index.html"></a>
    <a href="http://www.ethereal.com/appnotes/enpa-sa-00006.html"></a>
    <a href="http://www.onlamp.com/lpt/a/2680"></a>
    <a href="http://www.perl.com/lpt/a/2002/08/22/exegesis5.html"></a>
    </body></html>
    EOF

    # This list of URLs to omit is case-insensitive.
    my %omit = map { lc($_), 1 } qw(
        http://www.cnn.com/WEATHER/index.html
        ftp://DEBIAN.SECSUP.ORG/PUB/LINUX/DEBIAN/README
    );

    print "\nParsing with HTML::TreeBuilder...\n";
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($data);
    $tree->eof();
    for (@{ $tree->extract_links('a') }) {
        my ($real_url, $element, $attr, $tag) = @$_;
        if ( $omit{ lc($real_url) } ) {
            print "Skip this url: $real_url\n";
            next;
        }
        print "Good URL: $real_url\n";
    }
    $tree = $tree->delete;

    print "\nParsing with regex...\n";
    my @tags = ( $data =~ m{(<a\s+href=".+?">)}ig );
    foreach my $url (@tags) {
        my ($real_url) = ( $url =~ m{<A\s+HREF="(.+?)">}i )
            or die "URL '$url' failed pattern match";
        if ( $omit{ lc($real_url) } ) {
            print "Skip this url: $real_url\n";
            next;
        }
        print "Good URL: $real_url\n";
    }

      Hmm, I think I would have used HTML::TreeBuilder->look_down() to do this, but TMTOWTDI

      :-)

      --- demerphq
      my friends call me, usually because I'm late....

Re: matching a Url
by Popcorn Dave (Abbot) on Sep 09, 2002 at 16:12 UTC
    Granted, the modules others have suggested should give you more flexibility here, but here's a regex solution for you.

    Does weather appear more than once in your list of URLs? If not you could just do:

    next if $url =~ m/weather/i;

    or if you're trying to block the cnn site altogether, you could just use:

    next if $url =~ m!cnn.com/weather/index.html!i;

    I've used ! for my delimiters there so that I don't have to escape my /'s. Makes it a bit easier to read. Also with the URL, I don't know that you have to escape the . as it's going to find 1 char, and there should be 1 char between cnn and com which is the '.'
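
    For what it's worth, here is a small sketch comparing the unescaped and escaped dot. The unescaped form does match the real URL, but it also accepts any other character in that position. The second URL is invented purely to show the difference.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $loose  = qr!cnn.com/weather/index.html!i;    # "." matches any one character
my $strict = qr!cnn\.com/weather/index\.html!i;  # "\." matches only a literal dot

# The first URL is from the question; the second is made up for illustration.
my @urls = (
    'http://www.cnn.com/WEATHER/index.html',
    'http://www.cnn-com/weather/index_html',
);

for my $url (@urls) {
    printf "%-40s loose=%s strict=%s\n", $url,
        ( $url =~ $loose  ? 'skip' : 'keep' ),
        ( $url =~ $strict ? 'skip' : 'keep' );
}
```

    Whether the looser match matters depends on the URL list; for a one-off script it probably doesn't, but escaping the dots costs nothing.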

    Hope that helps!

    Some people fall from grace. I prefer a running start...