Puzzled by regex

syphilis has asked for the wisdom of the Perl Monks concerning the following question:


Re: Puzzled by regex
by davido (Cardinal) on Apr 10, 2013 at 06:16 UTC

There is a difference, but it's not necessarily in what is getting matched, but rather, in how it's matching. Example:

use strict;
use warnings;

my @strings = (
  "____\n",   "__ __\n",  "__X __\n",
  "__Z__\n",  "__\n__\n", "__^__\n",
  "_____\n",  "________\n",
); 


foreach my $string ( @strings ) {
  print "<<$string>>\n";
  if( $string =~ m/__(\S+?)__/ ) {
    print "\tNon-Greedy -- Match: (($1)).\n";
  }
  else {
    print "\tNon-Greedy -- No Match.\n";
  }
  if( $string =~ m/__(\S+)__/ ) {
    print "\tGreedy     -- Match: [[$1]].\n";
  }
  else {
    print "\tGreedy     -- No Match.\n";
  }
}

Most of that is going to be pretty boring, until you get to the last item in the list, where you'll get the following output:

<<________
>>
    Non-Greedy -- Match: ((_)).
    Greedy     -- Match: [[____]].

I have no idea whether non-greedy matching is going to have any practical effect in the type of strings you're matching with the regex though.

Dave


Re^2: Puzzled by regex

by syphilis (Archbishop) on Apr 10, 2013 at 07:17 UTC

\n

        Non-Greedy -- Match: ((____)).
        Greedy     -- Match: [[____]].

?


Re^3: Puzzled by regex

by Anonymous Monk on Apr 10, 2013 at 07:36 UTC

but at least now I'm starting to feel a little confident that it serves no purpose. (I'll still probably leave it there ... because I'm feeling even more confident that it doesn't do any harm :-)

It's probably a reflex :) I know when I write regexes I make more mistakes from greediness than from non-greediness, so I tend to write +? and *? to be on the safe side.

I know I'm not alone in getting bitten by it; it is a frequent cause/solution in questions from newbies.


Re: Puzzled by regex
by Athanasius (Archbishop) on Apr 10, 2013 at 04:26 UTC

When I saw the regex expression \S+?, my first thought was that this is equivalent to \S*. But it isn’t, as a little experimentation shows.

Consulting the Camel Book (4th Edition, page 214), I found that + means “1 or more times maximally” and +? means “1 or more times minimally.”

So, the difference between the two forms is not whether they match: if one matches, both must match. The difference lies only in what is matched, and this is relevant only if this part is captured (or, just possibly, if efficiency is an issue).
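A minimal sketch of that point, using a made-up test string (`__a__b__` is an assumption chosen for illustration, not something from the original question): both patterns match, and only the captured text differs.

```perl
use strict;
use warnings;

# Hypothetical input with two possible closing delimiters.
my $s = "__a__b__";

# Minimal: stop at the first "__" that allows the match to succeed.
my ($lazy)   = $s =~ /__(\S+?)__/;

# Maximal: take as much as possible, then back off to the last "__".
my ($greedy) = $s =~ /__(\S+)__/;

print "lazy:   $lazy\n";    # lazy:   a
print "greedy: $greedy\n";  # greedy: a__b
```

If the capture is not used, the two patterns are interchangeable as a yes/no test on this string.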

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Puzzled by regex

by syphilis (Archbishop) on Apr 10, 2013 at 06:53 UTC

The difference lies only in what is matched, and this is relevant only if this part is captured

/\S+?/

/\S+/

/__\S+?__\n/

/__\S+__\n/

@{$DATA{$pkg}} = split /(?m)(__\S+?__\n)/, $data;
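As a rough check of the split case, here is a sketch with invented sample data (the `$data` contents below are an assumption, not taken from Inline.pm). Because `\S` cannot match the newline and the pattern must end in `__\n`, both forms appear to have only one possible end point per marker line, so the resulting lists should be identical.

```perl
use strict;
use warnings;

# Hypothetical data: marker lines of the form __NAME__ followed by a body.
my $data = "__C__\nint x;\n__Python__\nx = 1\n";

my @lazy   = split /(?m)(__\S+?__\n)/, $data;
my @greedy = split /(?m)(__\S+__\n)/,  $data;

# Compare the two lists field by field via a separator that cannot occur in the data.
my $same = join("\0", @lazy) eq join("\0", @greedy) ? "yes" : "no";

print "fields: ", scalar(@lazy), "\n";
print "identical: $same\n";
```

The first field is an empty string because the data starts with a marker; the captured markers are kept in the list because the pattern is in parentheses.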


Re: Puzzled by regex
by Don Coyote (Hermit) on Apr 10, 2013 at 08:18 UTC

Random thoughts on match operating efficiency

The difference between the maximally matching (greedy) quantifier (.+) and the minimally matching (non-greedy) quantifier (.+?), in the case of the + (1 or more) quantifier, is in what is matched but, more importantly, in how, or from where, it is matched.

In the greedy case the quantifier first consumes as much as it can (for .+, up to the end of the line), then backtracks one position at a time, checking for a match, until it succeeds or reaches the starting match position.

In the non-greedy case the quantifier starts at the starting match position and moves forward one character at a time until it succeeds or reaches the end of the line.

application of + quantifier behaviour to ? quantifier behaviour:

Applying this to the ? (0 or 1) quantifier, I would expect the greedy match to start by trying one position ahead (taking the optional item), and the non-greedy match to start at the starting match position (skipping the optional item).
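A small sketch of that expectation for `?` versus `??`, using an invented two-character string (the input `aa` and the variable names are assumptions for illustration): the greedy form takes the optional character first, the non-greedy form skips it first.

```perl
use strict;
use warnings;

my $s = "aa";

# Greedy ?: try to take one "a" first, leaving the rest for a*.
my ($g1, $g2) = $s =~ /^(a?)(a*)$/;

# Non-greedy ??: try to take nothing first, so a* gets everything.
my ($l1, $l2) = $s =~ /^(a??)(a*)$/;

print "greedy:     [$g1] [$g2]\n";  # greedy:     [a] [a]
print "non-greedy: [$l1] [$l2]\n";  # non-greedy: [] [aa]
```

Both patterns match the whole string; only the division of characters between the two groups changes.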

Random summation:

The difference is not in what is matched, but how, or from where, the matching starts. This effectively increases the nongreedy match efficiency by the reduction of one jump ahead operation per usage.

Just Random:

I would imagine this will have been internally optimised, unless (or even especially if) there is perhaps a security benefit of a look forward match opposed to a look behind match

update later the same day

crumbs, +(0 or 1) quantifier, well that is incorrect. This '+' is the (1 or more) quantifier.

ok so to fix the above example i have replaced the '*' quantifiers with '+' quantifiers. And I have replaced the '+' quantifiers with '?' quantifiers, so at least what I wrote makes sense. Which it does despite the syntax errors now rectified.

After attempting to provide some examples where differences would be found between the default greedy behaviour and the non-greedy behaviour indicated by a secondary '?' quantifier, I realised that you are right: there are no differences in what is matched when the '\n' is included. This agrees with davido's response and my own, namely that the difference is in how the match is carried out.


Re: Puzzled by regex
by Loops (Curate) on Apr 10, 2013 at 04:20 UTC

In the first regular expression you're using the '?' operator which says the previous character or group is optional. But you're applying it against the '+' operator which makes no sense, since it means one-or-more.

If instead you group the \S and + together and then apply the '?', you'll see different results:

use warnings;

my @str = ("____\n", "__ __\n", "__X __\n", "__Z__\n", "__\n__\n", "__^__\n");

for(@str) {
   if($_ =~ /__(\S+)?__\n/) {print "1 "}
   else {print "0 "}

   if($_ =~ /__\S+__\n/) {print "1\n"}
   else {print "0\n"}
}
__END__
1 0
0 0
0 0
1 1
0 0
1 1

So the answer is, Inline.pm just has a bug.


Re^2: Puzzled by regex

by davido (Cardinal) on Apr 10, 2013 at 06:04 UTC

This is not exactly accurate. \S+ greedily matches one or more non-space characters. \S+? non-greedily matches one or more non-space characters. The syntax does make sense.
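To make the distinction concrete, here is a sketch on a made-up string of four underscores (the input and variable names are assumptions): an optional group `(\S+)?` allows nothing between the delimiters, while the non-greedy `\S+?` still requires at least one character.

```perl
use strict;
use warnings;

my $s = "____";   # four underscores, nothing between the delimiters

# (\S+)? : the whole group is optional, so zero characters are allowed.
my $optional_group = $s =~ /^__(\S+)?__$/ ? 1 : 0;

# \S+? : still needs at least one character, just as few as possible.
my $lazy_plus = $s =~ /^__\S+?__$/ ? 1 : 0;

print "optional group: $optional_group, lazy plus: $lazy_plus\n";
# optional group: 1, lazy plus: 0
```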

Dave


Re^3: Puzzled by regex

by Loops (Curate) on Apr 10, 2013 at 06:48 UTC

Thanks for correcting me. I was just dead wrong.

It's explained here: Matching repetitions. Down a bit, it explains "minimal match or non-greedy quantifiers ??, *?, +?, and {}?".

Learn something new every day.


Re^4: Puzzled by regex

by davido (Cardinal) on Apr 10, 2013 at 07:05 UTC