Difficult? regex

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Difficult? regex by hipowls (Curate) on Feb 22, 2008 at 11:03 UTC
The regex `/\?(C=N;O=D|C=M;O=A)?$/` has a question mark after `(C=N;O=D|C=M;O=A)`, which makes that group optional. As written, the regex matches any URL that ends with a '?'.
Re^2: Difficult? regex by Anonymous Monk on Feb 22, 2008 at 11:46 UTC
It has a question mark because I just copied and pasted the prior regex, which matches image files:

    # ignore any common image files
    return if $uri->path =~ /\.(gif|jpg|jpeg|png)?$/;

Does this mean that the regex for filtering image files will also match anything that ends in a period? I thought the final question mark would make a "non-greedy" match. I should note that I'm not much of a Perl programmer and definitely not very good at regular expressions. I've read several sites on regex and tried all sorts of variations on this regex but just can't get it to work. I've put several hours of work into this already, so I'm not just looking for someone to write the code for me. Until now, that is :-)

So anyway, I also tried the regex without the question mark and it doesn't work either:

    return if $uri->path =~ /\?(C=N;O=D|C=M;O=A)$/;
Re^3: Difficult? regex by hipowls (Curate) on Feb 23, 2008 at 00:52 UTC
The question mark makes the term optional. A non-greedy match doesn't make sense here, because there isn't a '*' or '+' quantifier to say "grab as much as you can". In this case you want to match the file endings, so the group must not be optional.
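To make the distinction concrete, here is a minimal sketch. The sample paths and the `<...>` string are made up for illustration; only the image regex comes from the thread. It compares the optional group with a required one, then shows a '?' acting as a non-greedy marker, which it does only when it follows a quantifier.

```perl
use strict;
use warnings;

# A '?' directly after a group makes the whole group optional,
# so this pattern also matches a path that ends in a bare '.'.
my $optional = qr/\.(gif|jpg|jpeg|png)?$/;

# Without the '?', one of the extensions must be present.
my $required = qr/\.(gif|jpg|jpeg|png)$/;

# Hypothetical sample paths, chosen only for illustration.
for my $path ('/img/logo.png', '/docs/readme.', '/docs/index.html') {
    printf "%-18s optional=%-5s required=%s\n", $path,
        ( $path =~ $optional ? 'match' : 'no' ),
        ( $path =~ $required ? 'match' : 'no' );
}

# A '?' means "non-greedy" only when it follows a quantifier such as '+'.
my ($greedy) = 'a<b>c<d>' =~ /(<.+>)/;     # captures '<b>c<d>'
my ($lazy)   = 'a<b>c<d>' =~ /(<.+?>)/;    # captures '<b>'
print "greedy=$greedy lazy=$lazy\n";
```

Only `/docs/readme.` behaves differently between the two patterns, which is exactly the "ends in a period" case asked about above.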
Re: Difficult? regex by moritz (Cardinal) on Feb 22, 2008 at 11:06 UTC
    /\?(C=N;O=D|C=M;O=A)?$/;
                        ^

Why do you need that last question mark? That way it will also match `\?$`. Probably not what you want. It would also help if your error description was better than "DOESN'T WORK AS EXPECTED".
Re: Difficult? regex by olus (Curate) on Feb 22, 2008 at 11:00 UTC
`return 0 if $uri->path =~ /\?(C=N;O=D|C=M;O=A)?$/;` Update: There are two issues with that regexp line. In my reply I considered only one. You said you wanted to have 0 returned in case of a match, but you are returning nothing. What you will have is an undefined value. The second issue is addressed in the replies below.
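As a small illustration of the difference between a bare `return` and `return 0`, consider the sketch below. The two subroutine names are hypothetical and stand in for the real `test_url`.

```perl
use strict;
use warnings;

# Hypothetical helpers that only differ in what they return.
sub bare_return { return if 1; }      # bare return: undef in scalar context
sub zero_return { return 0 if 1; }    # explicit 0

my $bare = bare_return();
my $zero = zero_return();

print defined $bare ? "bare: defined\n" : "bare: undef\n";
print defined $zero ? "zero: '$zero'\n" : "zero: undef\n";

# In list context a bare return yields an empty list.
my @list = bare_return();
print "list elements: ", scalar(@list), "\n";
```

Both results are false in a boolean test, so the difference only matters if the caller checks specifically for 0 or for definedness.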
Re^2: Difficult? regex by Anonymous Monk on Feb 22, 2008 at 12:14 UTC
Thanks to everyone for the suggestions. My code looks like this now and still doesn't work as expected (see original posting). It does return zero for URLs ending in .jpg, so that regex works. The final regex also works.

    sub test_url {
        my ( $uri, $server ) = @_;
        # return 1; # Ok to index/spider
        # return 0; # No, don't index or spider;

        # ignore any common image files
        return 0 if $uri->path =~ /\.(gif|jpg|jpeg|png)?$/;

        # ignore directory listing sorting links
        return 0 if $uri->path =~ /\?(C=N;O=D|C=M;O=A)$/;

        # make sure that the path is limited to the docs path
        return $uri->path =~ m[^/starteam_area/];
    }
Re^3: Difficult? regex by olus (Curate) on Feb 22, 2008 at 12:49 UTC
Regarding your original post, the replies given so far do solve the problem you mentioned. If the behavior is still not what you expected, then you will need to tell us more, because we cannot guess what that expected behavior is. You say the first and third regexps work. Let me show you that the second also works, and that the returned value is '0', just like you want (unless you really mean 'zero' and not '0').

    sub test_url {
        my ( $s, $server ) = @_;
        # return 1; # Ok to index/spider
        # return 0; # No, don't index or spider;

        # ignore any common image files
        return 0 if $s =~ /\.(gif|jpg|jpeg|png)?$/;

        # ignore directory listing sorting links
        return 0 if $s =~ /\?(C=N;O=D|C=M;O=A)$/;

        # make sure that the path is limited to the docs path
        return $s =~ m[^/starteam_area/];
    }

    my $res;
    $res = test_url('http://someurl.com/?C=N;O=D');
    print "returned value was - ".$res."\n";
    $res = test_url('http://someurl.com/?C=M;O=A');
    print "returned value was - ".$res."\n";
    $res = test_url('http://someurl.com/');
    print "returned value was - ".$res."\n";
    $res = test_url('http://someurl.com/?C=X;O=A');
    print "returned value was - ".$res."\n";

    ----
    #output
    returned value was - 0
    returned value was - 0
    returned value was - 
    returned value was - 

Note that I replaced `$uri` with `$s` because I don't know what kind of structure `$uri` is.
Re^4: Difficult? regex by Anonymous Monk on Feb 22, 2008 at 15:43 UTC
Re^3: Difficult? regex by stiller (Friar) on Feb 22, 2008 at 12:44 UTC
Try:

    my $uripath = 'http://www.somewhere.com/~s/reports/?C=M;O=D';

    # easier to maintain as a partial expression
    # added support for more sort orders
    # which is probably where you erred
    my $rx_dirsort = qr{\?(C=[NMSD];O=[AD])$};

    print "hit!\n" if $uripath =~ /$rx_dirsort/;

hth

Edit: Actually, the most important error you made was using the final '?', which has been pointed out by many here.
Re: Difficult? regex by peter (Sexton) on Feb 22, 2008 at 13:43 UTC
Do you use the URI module? If you do, then your problem is that the value returned by `$uri->path` doesn't contain the query parameters. Take a look at this simple test:

    use URI;
    my $uri = URI->new("http://localhost/test.html?test=1");
    print $uri->path . "\n";

This prints:

    /test.html

Peter Stuifzand
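Building on that observation, one possible fix is to test the query string rather than the path. The sketch below assumes the CPAN `URI` module is installed and uses a made-up URL. `query` and `path_query` are accessors provided by `URI`; the variable `$is_sort_link` is a hypothetical name used only here.

```perl
use strict;
use warnings;
use URI;

# Hypothetical URL, shaped like the sorting links discussed in the thread.
my $uri = URI->new('http://localhost/starteam_area/?C=N;O=D');

print $uri->path,       "\n";    # /starteam_area/
print $uri->query,      "\n";    # C=N;O=D  (no leading '?')
print $uri->path_query, "\n";    # /starteam_area/?C=N;O=D

# The query accessor returns undef when the URL has no query part,
# so check definedness before matching.
my $query        = $uri->query;
my $is_sort_link = defined $query && $query =~ /^(C=N;O=D|C=M;O=A)$/;

print $is_sort_link ? "sorting link\n" : "ordinary link\n";
```

Because `query` drops the leading '?', the pattern no longer needs `\?`. Matching the original pattern against `path_query` would be another way to get the same effect.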
Re^2: Difficult? regex by Anonymous Monk on Feb 22, 2008 at 16:13 UTC
Yes, I do use the URI module, but I didn't know what it was until I started doing some more debugging with the help of the examples posted above. I found the problem on my own shortly before reading your post, but it's nice to see that you caught it just by looking at the code. Thank you again to everyone for your help. Donation time :-)
Re: Difficult? regex by Your Mother (Archbishop) on Feb 22, 2008 at 22:19 UTC
I would argue against doing this with a regex. When you have something that is known to conform to a standard, like URIs or HTML, it is better to use a parser. It's easier to adapt, easier to extend to new use cases, immune to argument order, and generally more likely to be bomb-proof. How about this for your thing.

    use URI::QueryParam; # introduces a new method to URI

    sub test_url {
        my ( $uri, $server ) = @_;
        # returns true, ok to index/spider
        # returns false, don't index or spider

        # A white list is always better than
        # a black list if you can make one
        return unless $uri->path =~ /\.html$/;

        # Note about what this condition really means
        return if $uri->query_param("C") eq "N"
              and $uri->query_param("O") eq "D";

        # Note about what this condition really means
        return if $uri->query_param("C") eq "M"
              and $uri->query_param("O") eq "A";

        # make sure that the path is limited to the docs path
        return $uri->path =~ m[^/starteam_area/];
    }
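The "immune to argument order" point can be checked with a short sketch. It assumes the CPAN `URI` distribution (which ships `URI::QueryParam`) is installed and that perl is 5.10 or newer for the `//` operator. The two URLs are made up and differ only in parameter order.

```perl
use strict;
use warnings;
use URI;
use URI::QueryParam;

my @results;

# Same parameters, opposite order. A regex anchored on "C=N;O=D"
# would only recognise the first form.
for my $url ('http://localhost/starteam_area/?C=N;O=D',
             'http://localhost/starteam_area/?O=D;C=N') {
    my $uri = URI->new($url);

    # Default to '' so a missing parameter does not trigger a warning in eq.
    my $c = $uri->query_param('C') // '';
    my $o = $uri->query_param('O') // '';

    push @results, "C=$c O=$o";
    print "$url => C=$c O=$o\n";
}
```

Both URLs yield the same pair of values, so a check such as `$c eq 'N' and $o eq 'D'` treats them alike.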
Re^2: Difficult? regex by Anonymous Monk on Feb 26, 2008 at 09:14 UTC
This is a really useful suggestion which I will incorporate into my final solution. Thank you very much.