Re: Difficult? regex
by hipowls (Curate) on Feb 22, 2008 at 11:03 UTC
|
The regex /\?(C=N;O=D|C=M;O=A)?$/ has a question mark after (C=N;O=D|C=M;O=A), which makes that whole group optional. As a result the regex also matches any URL that simply ends with a '?'.
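A quick way to see the difference is to run both variants side by side. This is only a minimal sketch: the sub names and the sample strings are made up for illustration and are not from the original code.

```perl
use strict;
use warnings;

# Pattern as posted: the trailing "?" makes the whole group optional.
sub matches_optional { return $_[0] =~ /\?(C=N;O=D|C=M;O=A)?$/ ? 1 : 0 }

# Same pattern without the trailing "?": the group is now required.
sub matches_required { return $_[0] =~ /\?(C=N;O=D|C=M;O=A)$/ ? 1 : 0 }

# Hypothetical sample strings, chosen only to show the three cases.
for my $s ('/dir/?C=N;O=D', '/dir/?', '/dir/') {
    printf "%-15s optional=%d required=%d\n",
        $s, matches_optional($s), matches_required($s);
}
```

Only the string ending in a bare '?' is treated differently by the two patterns. Note that a '?' placed after a group is a quantifier ("zero or one"); it only means "non-greedy" when it follows another quantifier, as in .*? or +?.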
|
|
It has a question mark because I just copied and pasted the prior regex that matches image files below:
# ignore any common image files
return if $uri->path =~ /\.(gif|jpg|jpeg|png)?$/;
Does this mean that the regex for filtering image files will also match anything that ends in a period?
I thought the final question mark would make a "non greedy" match.
I should note that I'm not much of a Perl programmer and definitely not very good at regular expressions. I've read several sites on regex and tried all sorts of variations on this regex but just can't get it to work. I've put several hours of work into this already so I'm not just looking for someone to write the code for me. Until now that is :-)
So anyway, I also tried the regex without the question mark and it doesn't work either:
return if $uri->path =~ /\?(C=N;O=D|C=M;O=A)$/;
Re: Difficult? regex
by moritz (Cardinal) on Feb 22, 2008 at 11:06 UTC
|
/\?(C=N;O=D|C=M;O=A)?$/;
                    ^
Why do you need that last question mark? With it, the regex will also match \?$. Probably not what you want.
It would also help if your error description were better than "DOESN'T WORK AS EXPECTED".
Re: Difficult? regex
by olus (Curate) on Feb 22, 2008 at 11:00 UTC
|
return 0 if $uri->path =~ /\?(C=N;O=D|C=M;O=A)?$/;
Update: There are two issues on that regexp line. In my reply I considered only one. You said you wanted to have 0 returned in case of a match, but you are returning nothing. What you will get is an undefined value (or an empty list in list context).
The second is addressed in the replies below.
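To illustrate the first point, here is a minimal sketch of what the caller sees from a bare return compared with return 0. The sub names are hypothetical and exist only for this demonstration.

```perl
use strict;
use warnings;

# A bare "return" yields undef in scalar context and an empty list in list context.
sub bare_return { return; }

# "return 0" yields a defined, false value in both contexts.
sub zero_return { return 0; }

my $bare = bare_return();
my $zero = zero_return();

print "bare return: ", (defined $bare ? "defined" : "undef"), "\n";
print "return 0:    ", (defined $zero ? "defined ($zero)" : "undef"), "\n";

# In list context the difference shows up as the number of elements.
my @bare_list = bare_return();
my @zero_list = zero_return();
print "list sizes:  ", scalar(@bare_list), " vs ", scalar(@zero_list), "\n";
```

Both results are false in a boolean test, so the difference only matters to a caller that checks for definedness or prints the value.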
|
|
Thanks to everyone for the suggestions. My code looks like this now and still doesn't work as expected (see original posting). It does return zero for URLs ending in .jpg so that regex works. The final regex also works.
sub test_url {
    my ( $uri, $server ) = @_;
    # return 1; # Ok to index/spider
    # return 0; # No, don't index or spider;
    # ignore any common image files
    return 0 if $uri->path =~ /\.(gif|jpg|jpeg|png)?$/;
    # ignore directory listing sorting links
    return 0 if $uri->path =~ /\?(C=N;O=D|C=M;O=A)$/;
    # make sure that the path is limited to the docs path
    return $uri->path =~ m[^/starteam_area/];
}
|
|
Regarding your original post, the replies given so far do solve the problem you mentioned.
If the behavior is still not what you expected, then you will need to tell us more, because we cannot guess what that expected behavior is.
You say the first and third regexps work. Let me show you that the second also works, and the returned value is '0', just like you want (unless you really mean 'zero' and not '0').
sub test_url {
    my ( $s, $server ) = @_;
    # return 1; # Ok to index/spider
    # return 0; # No, don't index or spider;
    # ignore any common image files
    return 0 if $s =~ /\.(gif|jpg|jpeg|png)?$/;
    # ignore directory listing sorting links
    return 0 if $s =~ /\?(C=N;O=D|C=M;O=A)$/;
    # make sure that the path is limited to the docs path
    return $s =~ m[^/starteam_area/];
}
my $res;
$res = test_url('http://someurl.com/?C=N;O=D');
print "returned value was - ".$res."\n";
$res = test_url('http://someurl.com/?C=M;O=A');
print "returned value was - ".$res."\n";
$res = test_url('http://someurl.com/');
print "returned value was - ".$res."\n";
$res = test_url('http://someurl.com/?C=X;O=A');
print "returned value was - ".$res."\n";
----
#output
returned value was - 0
returned value was - 0
returned value was -
returned value was -
Note that I replaced $uri with $s because I don't know what kind of structure $uri is.
|
|
|
|
my $uripath = 'http://www.somewhere.com/~s/reports/?C=M;O=D';
# easier to maintain as a partial expression
# added support for more sort orders
# which is probably where you erred
my $rx_dirsort = qr{\?(C=[NMSD];O=[AD])$};
print "hit!\n" if $uripath =~ /$rx_dirsort/;
hth
Edit: Actually, the most important error you made was using the final '?', which has been pointed out by many here.
Re: Difficult? regex
by peter (Sexton) on Feb 22, 2008 at 13:43 UTC
|
Do you use the URI module? If you do, then your problem is that the value returned by $uri->path doesn't contain the query parameters.
Take a look at the simple test:
use URI;
my $uri = URI->new("http://localhost/test.html?test=1");
print $uri->path . "\n";
Will show:
/test.html
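Building on that, here is a minimal sketch of one way to test the sort links against the query string instead of the path. The URL and the helper name is_sort_link are assumptions made for illustration; path, query and path_query are standard URI methods.

```perl
use strict;
use warnings;
use URI;

# Hypothetical URL of a directory listing sort link.
my $uri = URI->new('http://localhost/docs/?C=N;O=D');

print "path:       ", $uri->path,       "\n";   # the part before the "?"
print "query:      ", $uri->query,      "\n";   # the part after the "?"
print "path_query: ", $uri->path_query, "\n";   # both, joined by "?"

# Hypothetical helper: true when the query string is one of the two sort links.
sub is_sort_link {
    my ($u) = @_;
    my $q = $u->query;
    return 0 unless defined $q;    # no query string at all
    return $q =~ /^(?:C=N;O=D|C=M;O=A)$/ ? 1 : 0;
}

print "sort link?  ", is_sort_link($uri), "\n";
```

Matching against $uri->path_query with the original pattern would be another option; checking the query on its own just keeps the pattern free of the escaped '?'.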
|
|
Yes I do use the URI module but didn't know what that was until I started doing some more debugging with the help of the examples posted above. I found the problem on my own shortly before reading your post but it's nice to see that you caught it just by looking at the code.
Thank you again to everyone for your help. Donation time :-)
Re: Difficult? regex
by Your Mother (Archbishop) on Feb 22, 2008 at 22:19 UTC
|
I would argue against doing this with a regex. When you have something that is known to conform to a standard, like URIs or HTML, better to use a parser. It's easier to adapt, easier to extend use cases, immune to argument order, and generally more likely to be bomb-proof.
How about this for your thing.
use URI::QueryParam; # introduces a new method to URI
sub test_url {
    my ( $uri, $server ) = @_;
    # returns true, ok to index/spider
    # return false, don't index or spider
    # A white list is always better than
    # a black list if you can make one
    return unless $uri->path =~ /\.html$/;
    # Note about what this condition really means
    return if
        $uri->query_param("C") eq "N"
        and
        $uri->query_param("O") eq "D";
    # Note about what this condition really means
    return if
        $uri->query_param("C") eq "M"
        and
        $uri->query_param("O") eq "A";
    # make sure that the path is limited to the docs path
    return $uri->path =~ m[^/starteam_area/];
}
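One thing to watch for with the parameter checks: query_param returns undef when a parameter is missing, so eq will warn under "use warnings" for URLs without C or O. A possible variation, sketched here with a hypothetical helper name and a hypothetical lookup table, checks definedness first:

```perl
use strict;
use warnings;
use URI;
use URI::QueryParam;

# Hypothetical table of "C,O" pairs that should not be indexed.
my %blocked = ( 'N,D' => 1, 'M,A' => 1 );

# Hypothetical helper: true when the URI carries one of the blocked sort pairs.
sub is_blocked_sort {
    my ($uri) = @_;
    my $c = $uri->query_param('C');
    my $o = $uri->query_param('O');
    return 0 unless defined $c && defined $o;   # avoids "uninitialized" warnings
    return $blocked{"$c,$o"} ? 1 : 0;
}

# Sample URLs are made up for illustration.
for my $url ('http://localhost/docs/?C=N;O=D', 'http://localhost/docs/index.html') {
    print "$url => ", is_blocked_sort( URI->new($url) ), "\n";
}
```

Adding another sort order to skip then only means adding a key to the table.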
|
|
This is a really useful suggestion which I will incorporate into my final solution. Thank you very much.