Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to match all URLs that end in ?C=N;O=D or ?C=M;O=A and return zero if the URL matches. The following function is supposed to do that, and all of the regexes work as expected except for that one. Any help is appreciated.

sub test_url {
    my ( $uri, $server ) = @_;
    # return 1;  # Ok to index/spider
    # return 0;  # No, don't index or spider;

    # ignore any common image files
    return if $uri->path =~ /\.(gif|jpg|jpeg|png)?$/;

    # ignore directory listing sorting links
    # DOESN'T WORK AS EXPECTED
    return if $uri->path =~ /\?(C=N;O=D|C=M;O=A)?$/;

    # make sure that the path is limited to the docs path
    return $uri->path =~ m[^/starteam_area/];
}

Replies are listed 'Best First'.
Re: Difficult? regex
by hipowls (Curate) on Feb 22, 2008 at 11:03 UTC

    The regex /\?(C=N;O=D|C=M;O=A)?$/ has a question mark after (C=N;O=D|C=M;O=A) and therefore it is optional. The regex matches any URL that terminates with a '?'.

      It has a question mark because I just copied and pasted the prior regex that matches image files below:
      # ignore any common image files
      return if $uri->path =~ /\.(gif|jpg|jpeg|png)?$/;
      Does this mean that the regex for filtering image files will also match anything that ends in a period? I thought the final question mark would make a "non greedy" match. I should note that I'm not much of a Perl programmer and definitely not very good at regular expressions. I've read several sites on regex and tried all sorts of variations on this regex but just can't get it to work. I've put several hours of work into this already so I'm not just looking for someone to write the code for me. Until now that is :-) So anyway, I also tried the regex without the question mark and it doesn't work either:
      return if $uri->path =~ /\?(C=N;O=D|C=M;O=A)$/;

        The question mark makes the term optional. Non-greedy matching doesn't apply here, because there isn't a '*' or '+' quantifier to say "grab as much as you can". In this case you want to match the file endings, so the group should not be optional.
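To make the difference concrete, here is a small sketch comparing the optional group from the original post with a required one. The sample paths are made up for illustration; only the first pattern comes from the thread.

```perl
use strict;
use warnings;

# Optional group, as in the original post: the trailing "?" lets the
# alternation match zero times, so a bare "." at the end is enough.
my $optional = qr/\.(gif|jpg|jpeg|png)?$/;

# Required group: one of the extensions must actually be present.
my $required = qr/\.(gif|jpg|jpeg|png)$/;

# Hypothetical sample paths, chosen only for illustration.
for my $path ( '/img/logo.png', '/docs/notes.', '/docs/index.html' ) {
    printf "%-18s optional=%s required=%s\n",
        $path,
        ( $path =~ $optional ? 'match' : 'no' ),
        ( $path =~ $required ? 'match' : 'no' );
}
```

Only the path ending in a bare period is treated differently by the two patterns, which is the case the poster asked about.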

Re: Difficult? regex
by moritz (Cardinal) on Feb 22, 2008 at 11:06 UTC
    /\?(C=N;O=D|C=M;O=A)?$/;
                        ^

    Why do you need that last question mark? That way it will also match \?$. Probably not what you want.

    It would also help if your error description was better than "DOESN'T WORK AS EXPECTED"

Re: Difficult? regex
by olus (Curate) on Feb 22, 2008 at 11:00 UTC
    return 0 if $uri->path =~ /\?(C=N;O=D|C=M;O=A)?$/;

    Update: There are two issues on that regexp line. In my reply I considered only one. You said you wanted to have 0 returned in case of a match, but you are returning nothing. What you will have is an undefined value.

    The second is addressed in the replies below.
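The difference between a bare return and an explicit return 0 can be seen in a short sketch. The two helper subs are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.
sub bare_return { return; }      # undef in scalar context, () in list context
sub return_zero { return 0; }    # always the number 0

my $bare = bare_return();
my $zero = return_zero();

print "bare return: ", ( defined $bare ? "defined ($bare)" : "undef" ), "\n";
print "return 0:    ", ( defined $zero ? "defined ($zero)" : "undef" ), "\n";

# In list context the two differ as well: a bare return yields an empty
# list, while "return 0" yields a one-element list.
my @bare = bare_return();
my @zero = return_zero();
print "list sizes:  ", scalar(@bare), " vs ", scalar(@zero), "\n";
```

Both results are false in a boolean test, so a caller that only checks truthiness would probably not notice the difference; it matters when the caller compares against 0 or prints the value.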

      Thanks to everyone for the suggestions. My code looks like this now and still doesn't work as expected (see original posting). It does return zero for URLs ending in .jpg so that regex works. The final regex also works.
      sub test_url {
          my ( $uri, $server ) = @_;
          # return 1;  # Ok to index/spider
          # return 0;  # No, don't index or spider;

          # ignore any common image files
          return 0 if $uri->path =~ /\.(gif|jpg|jpeg|png)?$/;

          # ignore directory listing sorting links
          return 0 if $uri->path =~ /\?(C=N;O=D|C=M;O=A)$/;

          # make sure that the path is limited to the docs path
          return $uri->path =~ m[^/starteam_area/];
      }

        Regarding your original post, the replies given so far do solve the problem you mentioned.

        If the behavior is still not what you expected, then there are other things that you will want to say, because we cannot guess what that expected behavior is.

        You say the first and third regexps work. Let me show you that the second also works, and the returned value is '0', just like you want (unless you really mean 'zero' and not '0').

        sub test_url {
            my ( $s, $server ) = @_;
            # return 1;  # Ok to index/spider
            # return 0;  # No, don't index or spider;

            # ignore any common image files
            return 0 if $s =~ /\.(gif|jpg|jpeg|png)?$/;

            # ignore directory listing sorting links
            return 0 if $s =~ /\?(C=N;O=D|C=M;O=A)$/;

            # make sure that the path is limited to the docs path
            return $s =~ m[^/starteam_area/];
        }

        my $res;
        $res = test_url('http://someurl.com/?C=N;O=D');
        print "returned value was - ".$res."\n";
        $res = test_url('http://someurl.com/?C=M;O=A');
        print "returned value was - ".$res."\n";
        $res = test_url('http://someurl.com/');
        print "returned value was - ".$res."\n";
        $res = test_url('http://someurl.com/?C=X;O=A');
        print "returned value was - ".$res."\n";

        ----
        #output
        returned value was - 0
        returned value was - 0
        returned value was -
        returned value was -

        Note that I replaced $uri with $s because I don't know what kind of structure $uri is.

        Try:

        my $uripath = 'http://www.somewhere.com/~s/reports/?C=M;O=D';

        # easier to maintain as a partial expression
        # added support for more sortorders
        # which is probably where you erred
        my $rx_dirsort = qr{\?(C=[NMSD];O=[AD])$};

        print "hit!\n" if $uripath =~ /$rx_dirsort/;

        hth

        Edit: Actually, the most important error you made was using the final '?', which has been pointed out by many here.

Re: Difficult? regex
by peter (Sexton) on Feb 22, 2008 at 13:43 UTC
    Do you use the URI module? If you do, then your problem is that the value returned by $uri->path doesn't contain the query parameters. Take a look at this simple test:

    use URI;

    my $uri = URI->new("http://localhost/test.html?test=1");
    print $uri->path . "\n";

    Will show:
    /test.html
    Peter Stuifzand
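As a follow-up to the test above, this sketch prints the three accessors of the URI module side by side for a made-up URL shaped like a directory-listing sort link, and then matches the sort keys against the query part instead of the path. It assumes the URI module from CPAN is installed; the anchored pattern without the leading "?" is an adaptation for illustration, not code from the thread.

```perl
use strict;
use warnings;
use URI;

# Made-up URL in the shape of a directory-listing sort link.
my $uri = URI->new('http://localhost/starteam_area/?C=N;O=D');

print "path:       ", $uri->path,       "\n";   # path only, no query string
print "query:      ", $uri->query,      "\n";   # the part after "?"
print "path_query: ", $uri->path_query, "\n";   # path plus "?" plus query

# query() returns undef when there is no query string, so check first.
# The leading "?" is not part of the query, so the pattern drops it.
my $query = $uri->query;
my $is_sort_link = ( defined $query && $query =~ /^(C=N;O=D|C=M;O=A)$/ ) ? 1 : 0;
print "sort link:  $is_sort_link\n";
```

Matching the original pattern, including its leading "\?", against $uri->path_query would be another way to get at the query string.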
      Yes I do use the URI module but didn't know what that was until I started doing some more debugging with the help of the examples posted above. I found the problem on my own shortly before reading your post but it's nice to see that you caught it just by looking at the code. Thank you again to everyone for your help. Donation time :-)
Re: Difficult? regex
by Your Mother (Archbishop) on Feb 22, 2008 at 22:19 UTC

    I would argue against doing this with a regex. When you have something that is known to conform to a standard, like URIs or HTML, better to use a parser. It's easier to adapt, easier to extend use cases, immune to argument order, and generally more likely to be bomb-proof.

    How about this for your thing.

    use URI::QueryParam; # introduces a new method to URI

    sub test_url {
        my ( $uri, $server ) = @_;
        # returns true, ok to index/spider
        # return false, don't index or spider

        # A white list is always better than
        # a black list if you can make one
        return unless $uri->path =~ /\.html$/;

        # Note about what this condition really means
        return if $uri->query_param("C") eq "N"
              and $uri->query_param("O") eq "D";

        # Note about what this condition really means
        return if $uri->query_param("C") eq "M"
              and $uri->query_param("O") eq "A";

        # make sure that the path is limited to the docs path
        return $uri->path =~ m[^/starteam_area/];
    }
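The "immune to argument order" point can be checked with a small sketch. The helper is_sort_link and the sample URLs are hypothetical, and the defined checks are an addition so that URLs without those parameters do not trigger "uninitialized value" warnings; it assumes the URI distribution from CPAN is installed.

```perl
use strict;
use warnings;
use URI;
use URI::QueryParam;   # adds query_param() to URI objects

# Hypothetical helper: returns 1 when the URL carries one of the two
# sort combinations from the original question, 0 otherwise.
sub is_sort_link {
    my ($uri) = @_;
    my $c = $uri->query_param('C');
    my $o = $uri->query_param('O');
    return 0 unless defined $c && defined $o;   # avoid "eq" on undef
    return ( ( $c eq 'N' && $o eq 'D' ) || ( $c eq 'M' && $o eq 'A' ) ) ? 1 : 0;
}

# Made-up URLs: same parameters in a different order and with a
# different separator, plus one URL with no query string at all.
for my $url (
    'http://localhost/starteam_area/?C=N;O=D',
    'http://localhost/starteam_area/?O=D&C=N',
    'http://localhost/starteam_area/index.html',
) {
    printf "%-45s %s\n", $url, is_sort_link( URI->new($url) );
}
```

A regex anchored on the literal text "C=N;O=D" would miss the second URL, while the parameter lookup treats both orderings the same.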
      This is a really useful suggestion which I will incorporate into my final solution. Thank you very much.