Matching a regular expression group multiple times

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Matching a regular expression group multiple times by AnomalousMonk (Archbishop) on Aug 12, 2014 at 17:29 UTC
The capturing parentheses in a regular expression like `qr/(?:(simple).*?)+/` always capture to the same capture group, no matter how many times they are repeated by a quantifier. Which capture group (by number) is determined by the position of the capturing parentheses in the final regex. After interpolation, the statement `$string =~ /$re$re$re/g;` looks like `$string =~ /(?:(simple).*?)+(?:(simple).*?)+(?:(simple).*?)+/g;` which clearly contains three sets of capturing parentheses, capturing to `$1`, `$2` and `$3` respectively. Perhaps a way to do what you want is:

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le "my $re = qr/(?:(simple).*?)+/; ;; my $string = 'This is a simple string, just a simple simple thing.'; my @captures = $string =~ /$re/g; dd \@captures; "
    ["simple", "simple", "simple"]

Update: This particular example can be expressed even more simply as:

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le "my $re = qr/simple/; ;; my $string = 'This is a simple string, just a simple simple thing.'; my @captures = $string =~ /$re/g; dd \@captures; "
    ["simple", "simple", "simple"]
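To make the "same capture group every time" point concrete, here is a minimal sketch. The sample string `'a1 b2 c3'` and the helper names `last_capture` and `all_captures` are made up for illustration (they are not from the thread); the data is chosen so that each repetition captures something different.

```perl
use strict;
use warnings;

# Hypothetical sample data: each repetition captures a different value.
my $string = 'a1 b2 c3';

# One set of capturing parentheses inside a quantified group:
# every repetition overwrites $1, so only the last one survives.
sub last_capture {
    my ($text) = @_;
    return $text =~ /(?:(\w\d)\s?)+/ ? $1 : undef;
}

# The same capture without the outer quantifier, applied with /g in
# list context: one capture is returned per match.
sub all_captures {
    my ($text) = @_;
    return $text =~ /(\w\d)/g;
}

print "last: ", last_capture($string), "\n";              # last: c3
print "all:  ", join(',', all_captures($string)), "\n";   # all:  a1,b2,c3
```

The quantified version still matches the whole run `a1 b2 c3`, but only the final iteration is left in `$1`; the `/g` version hands back every capture.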
Re^2: Matching a regular expression group multiple times by kennethk (Abbot) on Aug 12, 2014 at 17:37 UTC
You are correct that the OP is confused about the numbering of the capture buffers, but (to my mind) there's something a little odd going on here with the greedy `+`.

    #!/usr/bin/perl
    use 5.10.0;
    my $re = qr/(?:(simple).*?)+/;
    my $string = "This is a simple thing just a simple simple thing.";
    $string =~ /$re/g;
    say $&;

outputs

    simple

but changing line 3 to

    my $re = qr/(?:(simple).*?){3}/;

outputs

    simple thing just a simple simple

Why is the repeat failing? Is it because the non-greediness of the inner term somehow trumps the greediness of the outer?
Re^3: Matching a regular expression group multiple times by AppleFritter (Vicar) on Aug 12, 2014 at 18:09 UTC
"Why is the repeat failing? Is it because the non-greediness of the inner term somehow trumps the greediness of the outer?"

Yes, this is due to the way the regex engine works. Perl will match the literal string "`simple`", and then match any number of characters, but as few as possible (`.*?`), subject to the constraints imposed by the rest of the pattern. But there IS no rest of the pattern; so there are no constraints, and Perl does its utmost and matches zero extra characters. Only after this is done does the `+` quantifier kick in, but since it finds that there isn't another literal "`simple`" following what was already matched, nothing further is matched, and the entire match consists only of the initial "`simple`" followed by the empty string that the `.*?` matched.

Wait, I hear you say, there is more to the pattern! The `+` itself surely follows? However, that's not how the regex engine works; the `+` is part of the pattern currently being matched, and the fact that it trails the non-capturing group is a mere artifact of Perl's regex syntax. It helps to think of the `+` as being at the front of that group instead, where you'd also find other modifiers (e.g. `(?i:...)`).

So there is no pattern following the first, and Perl isn't cunning enough to match a bigger part of the string. Neither should it be: in order to do so, it'd have to ignore what you're explicitly telling it to do (match any number of characters, but as few as possible) in order to be able to match more later on. And how would it know that this is what you wanted, anyway? Perl is a DWIMmy language, but it can't read minds yet. ;)

The regex engine's inner workings are explained in detail in chapter 5 of Programming Perl, BTW, in the section titled "The Little Engine That /Could(n't)?/".
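A small sketch of the "no rest of the pattern, no constraints" point, using the OP's pattern `(?:(simple).*?)+`. The helper `matched_text` and the trailing `thing\.` added to the second pattern are assumptions introduced only for this illustration.

```perl
use strict;
use warnings;

my $string = 'This is a simple thing just a simple simple thing.';

# Returns the text of the whole match ($&), or undef if there is none.
sub matched_text {
    my ($text, $re) = @_;
    return $text =~ $re ? $& : undef;
}

# Nothing follows the quantified group, so the lazy .*? stays empty
# and a single "simple" satisfies the +.
my $unconstrained = matched_text($string, qr/(?:(simple).*?)+/);

# Here a literal "thing." must follow the group, so the engine has to
# backtrack and let .*? grow until the rest of the pattern can match.
my $constrained = matched_text($string, qr/(?:(simple).*?)+thing\./);

print "unconstrained: $unconstrained\n";   # unconstrained: simple
print "constrained:   $constrained\n";     # constrained:   simple thing just a simple simple thing.
```

The lazy quantifier only expands when something after it would otherwise fail; a `+` that is already satisfied never forces that.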
Re^3: Matching a regular expression group multiple times by AnomalousMonk (Archbishop) on Aug 12, 2014 at 18:20 UTC
In the `qr/(?:(simple).*?)+/` regex, `.*?` is satisfied with nothing, so it's happy. Then `(?:pattern)+` is satisfied with a single `'simple'`. If there were more `simple...` sequences immediately following, greedy `+` would try to grab them, but there aren't, so it don't. If `+` is satisfied with what it has, it can't force preceding satisfied assertions to fail. In the `qr/(?:(simple).*?){3}/` regex, the `{3}` quantifier cannot be satisfied until it forces the preceding `.*?` to grab a bunch more stuff. (I've removed the `/g` modifier in these examples because it just confuses the issue.)

    c:\@Work\Perl\monks>perl -wMstrict -lE "my $re = qr/(?:(s \d mple).*?)+/x; my $string = 'This is a s1mple thing just a s2mple s3mple thing.'; $string =~ $re; say $&; ;; my $string2 = 'This is a s1mples2mples3mple thing'; $string2 =~ $re; say $&; ;; $re = qr/(?:(s \d mple).*?){3}/x; $string =~ $re; say $&; "
    s1mple
    s1mples2mples3mple
    s1mple thing just a s2mple s3mple
Re: Matching a regular expression group multiple times by Anonymous Monk on Aug 12, 2014 at 17:00 UTC
There is a mistake in the second section of code. It should read like this:

    my $re = qr/(?:(simple).*?)+/;
    my $string = "This is a simple thing just a simple simple thing.";
    $string =~ /$re$re$re/g;
    say $1 if $1; # says simple
    say $2 if $2; # says simple
    say $3 if $3; # says simple
Re: Matching a regular expression group multiple times by Anonymous Monk on Sep 11, 2024 at 00:56 UTC
REALLY interesting question; I learned quite a bit myself on this one. I commented the code heavily to say what was happening as I went along. Basically, it matters whether you do a global match in scalar or list context.

If you do a global match in SCALAR context, like this: `$string =~ /(simple)/g;` the match finds only the FIRST occurrence, returns true, and stops searching. It sets `pos($string)` to the offset just past the end of that match. If you run a global match AGAIN on that same string, it will not start from the beginning of the string but from the ending position of the last match. A match with the `/g` modifier therefore has side effects. `pos($string)` is reset by a failed match, or you can reset it manually via `pos($string) = 0`.

However, if you want to keep matching all the way to the END of the string and collect every capture, you need to run the match in LIST context, like this: `my @matches = $string =~ /(simple)/g;` This matches all the way to the end of the string and finds all instances of "simple".

If you wanted to do the same thing in scalar context, you would have to use a while loop, like this:

    while ($string =~ /(simple)/g) {
        my $position = pos($string);
        print "most recent match is $1, at position $position\n";
    }

This goes all the way to the end of the string in SCALAR context and finds all instances of "simple".

In case that is confusing, here is the entire code sample, which is also commented and should hopefully explain the difference between matching with the `/g` modifier in SCALAR context vs LIST context.

    #!/usr/bin/perl -w

    =begin comment

    Running a global match on the same string twice can have side effects.
    pos($testString), the ending position of the last match, changes behind
    the scenes. When using the /g modifier in scalar context, the next search
    starts from the position of the last match, not from the beginning of the
    string each time. To reset this position and restart from the beginning,
    you need a failed match, or you can reset it manually using the pos()
    function, i.e. pos($test3) = 0 restarts matching from the beginning of
    the string.

    Also, @- is the builtin array that holds the start offsets of the most
    recent match ($-[0]) and of its capture groups ($-[1], ...), in case you
    are tracking positions manually, as in a while loop.

    =end comment

    =cut

    my $test1 = my $test2 = my $test3 = "This is a simple thing, just a simple simple thing.";
    my @matches;
    my $position;

    print "String: \"$test1\"\n\n";

    print "first test scalar context:\n";
    $test1 =~ /(simple)/g;
    $position = pos($test1);
    print "\$1 is $1, pos(\$test1) is $position\n" if($1);
    print "\$2 is $2, pos(\$test1) is $position\n" if($2); # never prints: the pattern has only one capture group
    print "\$3 is $3, pos(\$test1) is $position\n" if($3); # never prints (same)
    print "\n";

    print "second test list context:\n";
    # Matches all three in list context. pos() is left unset afterwards, so use
    # a while loop rather than a foreach loop if you need the positions.
    @matches = $test2 =~ /(simple)/g;
    my $i = 0;
    for (@matches){
        print "Match $i is $_\n";
        $i++;
    }
    print "\n";

    =begin comment

    Scalar context /g doesn't go to the end of the string; it stops at a match.
    The next search begins at the position of this match. To match all the way
    to the end of a string, use list context, or scalar context in a loop
    structure. A loop structure is useful if you need the position of each
    match, which is in pos($test3).
    https://www.oreilly.com/library/view/perl-in-a/1565922867/re148.html

    =end comment

    =cut

    print "third test scalar context but with looping:\n";
    $i = 0;
    while($test3 =~ /(simple)/g){
        my $position = pos($test3);
        print "\$1 is $1, pos(\$test3) is $position, loop counter is $i\n" if($1);
        print "\$2 is $2, pos(\$test3) is $position, loop counter is $i\n" if($2); # never prints: only one capture group
        print "\$3 is $3, pos(\$test3) is $position, loop counter is $i\n" if($3); # never prints (same)
        # can reset position like this if you needed to:
        # pos($test3) = 0;
        $i++;
        if($i > 10){ last; } # watch out for infinite loops too if you reset the position in the while loop
    }
    print "\n";

The output looks like this:

    $ perl simple.pl
    String: "This is a simple thing, just a simple simple thing."

    first test scalar context:
    $1 is simple, pos($test1) is 16

    second test list context:
    Match 0 is simple
    Match 1 is simple
    Match 2 is simple

    third test scalar context but with looping:
    $1 is simple, pos($test3) is 16, loop counter is 0
    $1 is simple, pos($test3) is 37, loop counter is 1
    $1 is simple, pos($test3) is 44, loop counter is 2
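If both ends of each match are wanted, a scalar-context `/g` loop can read them from the builtin `@-` and `@+` arrays. This is only a sketch: the helper name `match_positions` is made up here and is not part of the script above.

```perl
use strict;
use warnings;

# Collect [text, start, end] for every match of $re in $text.
# $re should not carry /g itself; the loop below supplies it.
sub match_positions {
    my ($text, $re) = @_;
    my @found;
    while ($text =~ /$re/g) {
        # $-[0] and $+[0] are the start and end offsets of the whole match;
        # $+[0] is the same value pos($text) reports at this point.
        push @found, [ substr($text, $-[0], $+[0] - $-[0]), $-[0], $+[0] ];
    }
    return @found;
}

my $string = 'This is a simple thing, just a simple simple thing.';
printf "%s at %d..%d\n", @$_ for match_positions($string, qr/simple/);
# simple at 10..16
# simple at 31..37
# simple at 38..44
```

The end offsets agree with the `pos()` values in the output shown above (16, 37, 44).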
Re: Matching a regular expression group multiple times by locked_user sundialsvc4 (Abbot) on Aug 12, 2014 at 23:42 UTC
As far as I know, there are (just) two issues at bar here. First, if you want to match one or more adjacent occurrences of a single string at a particular place within the source, then you must use a quantifier such as `+` (or `*` for zero or more), and the search should be “greedy.” Second, if you need to match multiple times within the same string, the `/g` modifier should be used within a loop that searches through and exhausts the input string; adding `/c` keeps `pos()` from being reset when a match fails. And that’s it.
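One common place where `/gc` in a loop earns its keep is a small tokenizer, where several patterns are tried in turn at the same position. The sketch below is an illustration only; the `tokenize` helper and its token types are invented for the example and do not come from the thread.

```perl
use strict;
use warnings;

# Split $text into [type, value] tokens. Each pattern is anchored with \G,
# so it must match exactly where the previous match stopped. The /c flag
# keeps pos() in place when a pattern fails, so the next alternative can
# be tried at the same position.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length $text) {
        if    ($text =~ /\G\s+/gc)   { next }
        elsif ($text =~ /\G(\d+)/gc) { push @tokens, [ number => $1 ] }
        elsif ($text =~ /\G(\w+)/gc) { push @tokens, [ word   => $1 ] }
        elsif ($text =~ /\G(.)/gcs)  { push @tokens, [ other  => $1 ] }
    }
    return @tokens;
}

print join(' ', map { "$_->[0]:$_->[1]" } tokenize('simple 42 simple!')), "\n";
# word:simple number:42 word:simple other:!
```

Without `/c`, the first failed alternative would reset `pos()` and the following `\G` patterns would start over from the beginning of the string. For simply walking through every match of a single pattern, plain `/g` in a `while` loop is enough.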