Weird Regex in CGI::Cookie

enoch has asked for the wisdom of the Perl Monks concerning the following question:

I stumbled across a bit of code with the following loop in it:

foreach (@pairs) {
    s/\s*(.*?)\s*/$1/;
    .
    .
    .
}
[download]

That regex sent up a warning signal in my brain.

Match 0 or more spaces greedily, followed by
Capture into $1, any character (other than a new line) 0 or more times non-greedily, followed by
Match 0 or more spaces greedily

Replace everything matched with what was caught in $1

Looking at it, I assumed the author was trying, albeit in a broken way, to strip out leading and trailing spaces. To further investigate its broken-ness, I wrote a small snippet:

push @strings, 'nospace';
push @strings, 'trailingspace      ';
push @strings, '      leadingspace';
push @strings, 'internal space';
push @strings, '    surroundedbyspace    ';
push @strings, '    spaces every    where    ';

for my $wtf (@strings)
{
    print "Before: '$wtf'\n";
    $wtf =~ s/\s*(.*?)\s*/$1/;
    print "After: '$wtf'\n\n";
}
[download]

Which, output:

Before: 'nospace'
After: 'nospace'

Before: 'trailingspace      '
After: 'trailingspace      '

Before: '      leadingspace'
After: 'leadingspace'

Before: 'internal space'
After: 'internal space'

Before: '    surroundedbyspace    '
After: 'surroundedbyspace    '

Before: '    spaces every    where    '
After: 'spaces every    where    '
[download]

Which is, basically, what I expected. The regex is only replacing leading spaces... The things is, this code is located in CGI::Cookie in the raw_fetch subroutine (view the source here). Here is the code in the subroutine:

# Fetch a list of cookies from the environment or the incoming headers
+ and
# return as a hash. The cookie values are not unescaped or altered in 
+any way.
 sub raw_fetch {
   my $class = shift;
   my $raw_cookie = get_raw_cookie(@_) or return;
   my %results;
   my($key,$value);
   
   my(@pairs) = split("; ?",$raw_cookie);
   foreach (@pairs) {
     s/\s*(.*?)\s*/$1/;
     if (/^([^=]+)=(.*)/) {
       $key = $1;
       $value = $2;
     }
     else {
       $key = $_;
       $value = '';
     }
     $results{$key} = $value;
   }
   return \%results unless wantarray;
   return %results;
}
[download]

So, is this regex doing something that I am missing, or is it a broken regex that was placed in a seldom-used subroutine that no one has bother correcting?
Can anyone shed some light...

enoch

update: Fixed a typo.

Comment on Weird Regex in CGI::Cookie Select or Download Code

Replies are listed 'Best First'.
Re: Weird Regex in CGI::Cookie by davido (Cardinal) on Oct 16, 2003 at 00:33 UTC
I just verified that CGI::Cookie v1.24 still contains that ineffectual regexp. Per the CGI::Cookie POD, "Address bug reports and comments to: lstein@cshl.org". Go for it. You might want to submit your report along with a recommended patch, if you want to get it implemented quicker. I know that Lincoln is pretty aggressive about stamping bugs out of his modules. The problem you're seeing is that the regexp engine is acting lazy. The following regexp seems to work for your test strings: `s/\s\b(.+)\b\s/$1/` [download] But it's probably better to anchor the regexp to the start and end of string, to draw it out to match the entire string. Even then it still just is a slightly ambiguous way to do things; to rely on greed outweighing non-greed. Especially when there are better ways to do it. It is faster and more reliable to do it this way: `s/^\s+//; s/\s+$//;` [download] Alternation could be used along with the /g modifier, but then you slow things down just a little. Dave "If I had my life to do over again, I'd be a plumber." -- Albert Einstein	[reply] [d/l] [select]
Re: Re: Weird Regex in CGI::Cookie by graff (Chancellor) on Oct 16, 2003 at 02:30 UTC
You might want to submit your report along with a recommended patch, if you want to get it implemented quicker. I second that, whole-heartedly. In fact, here's my suggestion for a better version (hoping I understand the intent correctly): `sub raw_fetch { my $class = shift; my $raw_cookie = get_raw_cookie(@_) or return; my %results; $raw_cookie =~ s/^\s+//; $raw_cookie =~ s/\s+$//; my(@pairs) = split(/\s;\s/,$raw_cookie); foreach (@pairs) { my ($key,$val) = (/^([^=]+)=?(.*)/); $results{$key} = $val; } return \%results unless wantarray; return %results; }` [download] I've only tested it to the extent that the assignments to %results match the original code (except for the trailing whitespace glitch) -- that is, a "raw_cookie" string like this: one=1 ;two=2; three=the third ; four= ; five ; six = the sixth ; produces these key/value pairs: 'one'=>'1', 'two'=>'2', 'three'=>'the third', 'four'=>'', 'five'=>'', 'six '=>' the sixth' (note that spaces next to "=" are not deleted)	[reply] [d/l]
Re: Weird Regex in CGI::Cookie by enoch (Chaplain) on Oct 16, 2003 at 04:39 UTC
I emailed in the following patch: `--- /usr/local/lib/perl5/5.6.1/CGI/Cookie.pm +++ /usr/local/lib/perl5/5.6.1/CGI/Cookie.pm.new @@ -42,7 +42,7 @@ my(@pairs) = split("; ?",$raw_cookie); foreach (@pairs) { - s/\s(.?)\s/$1/; + s/^\s+//; s/\s+$//; if (/^([^=]+)=(.)/) { $key = $1; $value = $2;` [download]	[reply] [d/l]
Re: Weird Regex in CGI::Cookie by d_i_r_t_y (Monk) on Oct 16, 2003 at 00:59 UTC
yes, it is broken. the reason is that the pattern is only matching the leading space and nothing else -- ie: `$_ = ' blee blah '; s/\s(.?)\s/$1/; print "matched '$&'; retained '$1'; str is '$_'\n";` [download] which gives `matched ' '; retained ''; str is 'blee blah '` to strip whitespace, the minimal match needs to be forcibly anchored to match to the end of the string, ie: `s/^\s(.+?)\s$/$1/`, which now successfully strips leading & training whitespace. however this is a poor way to do it. it is clearer and faster to use 2 regexps, one to strip leading \s, the other the trailing \s; ie: `s/^\s+//; s/\s+$//;` [download] using 2 regexps is also quite a bit faster than the single regexp too: [matt@blade8 matt]$ perl -MBenchmark -e '@list1 = @list2 = map { " "x +rand(100) . "some text" . " "x rand(100) } 1 .. 100; timethese( 100_ +000, { single_regexp => sub { return map { s/^\s(.+?)\s*$/$1/; $_ } + @list1 }, dual_regexp => sub { return map { s/^\s+//; s/\s+$//; $_ } + @list2 } } ); ' Benchmark: timing 100000 iterations of single_regexp, dual_regexp... single_regexp: 56 wallclock secs (56.04 usr + 0.00 sys = 56.04 CPU +) @ 1784.44/s (n=100000) dual_regexp: 17 wallclock secs (18.63 usr + 0.00 sys = 18.63 CPU) +@ 5367.69/s (n=100000) [download]	[reply] [d/l] [select]
Re: Weird Regex in CGI::Cookie by Abigail-II (Bishop) on Oct 15, 2003 at 23:40 UTC
It looks very broken to me. It's also strange that the code is deleting leading whitespace, while the split uses `"; ?"`. Abigail	[reply] [d/l]
Re: Weird Regex in CGI::Cookie by hossman (Prior) on Oct 20, 2003 at 05:06 UTC
So, is this regex doing something that I am missing, or is it a broken regex that was placed in a seldom-used subroutine that no one has bother correcting? Can anyone shed some light... I have a feeling that it is broken in the sense that it doesn't do what (i/you/others suspect) it was intended to do. However.... based on the cookie header format this code seems pretty unneccessary. It was probably added as a "sanity check" in case some external client/server was over-zealous with it's white space.	[reply]


Keep It Simple, Stupid
	PerlMonks