Capturing brackets within a repeat group

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Capturing brackets within a repeat group by Arien (Pilgrim) on Jan 11, 2003 at 03:39 UTC
The passage from perlre is:

"The numbered variables ($1, $2, $3, etc.) and the related punctuation set ($+, $&, $`, $', and $^N) are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first."

So you can only get at the last captured repeated group from outside the regex this way: the earlier captures get overwritten when repeatedly matching the repeated sub-pattern to get an overall match.

-- Arien
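A minimal sketch of the overwriting described above. The sample MAC string and the variable names are assumptions chosen for illustration; only the character class comes from later in the thread.

```perl
use strict;
use warnings;

# Hypothetical sample input.
my $mac  = '0:0A:0C:B:B8:F';
my $part = qr/[0-9A-Z]{1,2}/;

# Two capture groups in total: the first one sits inside the {5} repeat.
my @parts = $mac =~ /^(?:($part):){5}($part)$/;

# Group 1 is rewritten on every pass of the repeat, so only the value
# from the fifth pass is left when the match finishes.
print scalar(@parts), " captures: @parts\n";
```

The list holds two elements (`B8` and `F`), not six: one per pair of parentheses in the pattern, however many times the repeat runs.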
Re: Re: Capturing brackets within a repeat group by BrowserUk (Patriarch) on Jan 11, 2003 at 03:51 UTC
That's interesting. I've read that passage many times but never interpreted it that way. The bracketed comment you omitted from the end of the paragraph:

"The numbered variables ($1, $2, $3, etc.) and the related punctuation set ($+, $&, $`, and $') are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See Compound Statements in the perlsyn manpage.)"

led me to think that this was only relevant to the scope of the capture buffers external to the statement itself. I read the phrase you highlighted, "... or until the next successful match ...", to mean a successful match as part of a distinctly separate m// or s/// (hence the reference to "Compound Statements"), rather than the next successful match within the same statement.

Oh well. It was a nice idea. Thanks for setting me straight.

Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
Re: Re: Re: Capturing brackets within a repeat group by John M. Dlugosz (Monsignor) on Jan 11, 2003 at 06:15 UTC
That's how I read that passage, too. But I have no problem with the way repeating a capture works. The ()'s generate code to refer to a particular slot, since the numbering is static from left to right. The repeat redoes the same parens. It's no different from backtracking in that respect.

I'm sure you've figured this out: `/( (?: blah blah) {5} ) /x` Put the repeat inside the parens.

-- John
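As a rough illustration of putting the repeat inside the parens, here is a sketch applied to the MAC-address pattern discussed elsewhere in the thread. The input string and the names `$head` and `$last` are assumptions.

```perl
use strict;
use warnings;

# Hypothetical sample input.
my $mac = '0:0A:0C:B:B8:F';

# The outer parentheses wrap the whole repeat, so group 1 holds all
# five "field:" pieces as a single string instead of only the last one.
my ($head, $last) =
    $mac =~ /^ ( (?: [0-9A-Z]{1,2} : ){5} ) ( [0-9A-Z]{1,2} ) $/x;

print "repeated part: $head\n";
print "last field:    $last\n";
```

Here `$head` is `0:0A:0C:B:B8:` and `$last` is `F`. The individual fields still have to be pulled out of `$head` in a second step.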
Re: Re: Re: Re: Capturing brackets within a repeat group by BrowserUk (Patriarch) on Jan 11, 2003 at 10:19 UTC
Re: Re: Re: Re: Re: Capturing brackets within a repeat group by John M. Dlugosz (Monsignor) on Jan 11, 2003 at 17:55 UTC
Re: Capturing brackets within a repeat group [plus dynamic backreferences] by ihb (Deacon) on Jan 11, 2003 at 11:30 UTC
The typical expression to illustrate this is `/(.)*/s`. That will leave the last char in the string, if any, in `$1`. If you step back from the screen and look at the pattern again, you might think this makes sense. Looking at the capturing part `(.)`, I think you want `$1` to be one char long.

Expanding the issue a bit, would you want `'abcd' =~ /(?:(.)(.))*/s` or `'abcd' =~ /(?:(.){2})*/s` to set `$1 eq 'a'`, `$2 eq 'b'`, `$3 eq 'c'`, `$4 eq 'd'`?

What could potentially get really messy is if you have another group at the end: `'abcdx' =~ /(?:(.)(.))*(.)/s`. How would you easily know what the last group matched? (Ignoring Re: Multiple matches of a regex.) Sure, you can use `$+`, or even `$^N` in recent perls. But what if it's the second-to-last match? This also raises the question of how you'd do backreferences, if at regex compile time you can't decide which variable will hold the submatch.

But this being Perl, you can of course do what you want. Here's a little demonstration where I want to match subsequent words with nothing but spaces in between:

`$_ = 'foo bar baz burk | gah'; my @words; /(?:(\w+)\s+(?{push @words => $1}))*/; # Not backtracking safe! See below. Submatches are in @words now.`

If we look back at the issue of backreferencing, you can use `(??{})` to create dynamic backreferences. The pattern below requires the last two words to be identical (but it doesn't include the last word in `@words`; compare to `/(.)\1/`).

`my @words; 'foo bar baz baz burk | gah' =~ / (?{ local @_words }) (?: (\w+) \s+ (?{ local @_words = (@_words, $1) }) )+ (??{ quotemeta $_words[-1] }) (?{ @words = @_words }) /x;`

This version is also backtracking safe. The one above wasn't, but it didn't need to be. As you see, it's a bit of extra work to make it backtracking safe, so I kept it simple in the one that didn't need it.

Hope I've helped,
ihb
Re: Re: Capturing brackets within a repeat group [plus dynamic backreferences] by BrowserUk (Patriarch) on Jan 11, 2003 at 22:18 UTC
"Hope I've helped,"

In truth, I think you missed the point entirely. :^)
Re: Re: Re: Capturing brackets within a repeat group [plus dynamic backreferences] by ihb (Deacon) on Jan 12, 2003 at 00:50 UTC
Besides looking for documentation, I thought you perhaps wanted an explanation of why Perl's current behaviour is sane and to be expected, and a way to achieve what you thought Perl would do for you. Let me set my general reply in the context of MAC address parsing. First things first, though:

`local $_ = join ':', qw/0 0A 0C B B8 F/; # $mac`
`my $part = qr/[0-9A-Z]{1,2}/;`

First you used `my @parts = /^($part):($part):($part):($part):($part):($part)$/;` which worked. Then you tried to shrink it to `my @parts = / ^ (?: ($part) : ){5} ($part) $ /x;` but that didn't work. Now, using "my" technique you just need to add three to four lines to achieve what you want.

`use re 'eval'; # Needed due to interpolation of $part`
`my @parts; / (?{ local @_parts }) ^ (?: ($part) : (?{ local @_parts = (@_parts, $1) }) ){5} ($part) $ (?{ @parts = (@_parts, $2) }) /x;`

The beauty of this technique is that you don't have to know how many times you need/want to match; something that is required if you use the `x` operator.

If you just want to solve this particular problem, why not simply verify with your second, more compact regex and then `split` it up on `/:/`?

Update: Since I got a negative response to this reply, I reworded the beginning to make it better express what I meant. If it sounded offensive or bad in any way, then that wasn't how it was meant and I apologize.

ihb
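The closing suggestion (verify with the compact regex, then `split` on `/:/`) might be sketched like this. The helper name `mac_parts` and the test strings are assumptions, not something from the thread.

```perl
use strict;
use warnings;

my $part = qr/[0-9A-Z]{1,2}/;

# Hypothetical helper: returns the six fields of a well-formed address,
# or an empty list when the string does not have the expected shape.
sub mac_parts {
    my ($mac) = @_;

    # Step 1: validate the overall format; nothing is captured here.
    return unless $mac =~ /^ (?: $part : ){5} $part $/x;

    # Step 2: the format is known to be good, so a plain split suffices.
    return split /:/, $mac;
}

my @ok  = mac_parts('0:0A:0C:B:B8:F');   # six fields
my @bad = mac_parts('0:0A:0C');          # empty list: too few fields

print "ok:  @ok\n";
print "bad: ", scalar(@bad), " fields\n";
```

This keeps the pattern short and avoids embedded code blocks, at the cost of walking the string twice.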
Re: Re: Re: Re: Capturing brackets within a repeat group [plus dynamic backreferences] by BrowserUk (Patriarch) on Jan 12, 2003 at 03:27 UTC
Re: Re: Re: Re: Re: Capturing brackets within a repeat group [plus dynamic backreferences] by ihb (Deacon) on Jan 12, 2003 at 13:21 UTC
Re: Capturing brackets within a repeat group by Hofmator (Curate) on Jan 11, 2003 at 15:58 UTC
To get round your problem you could of course also do this:

`my @mac_bits = $mac =~ /\G ( [0-9A-Z]{1,2} ) (?: :|$ ) /igx; print 'no MAC address' unless (@mac_bits == 6);`

-- Hofmator
Re: Re: Capturing brackets within a repeat group by BrowserUk (Patriarch) on Jan 11, 2003 at 22:16 UTC
Yup! That's essentially the solution I arrived at here, although the \G in your version isn't doing anything in this case. I believe (but am open to correction) that \G doesn't have any effect unless you also use the /c modifier, and even then it only has an effect once a failure has occurred, in which case a subsequent match on the same target string will start from the point of the previous failure.

The reason for wanting the capturing group within a repeat count to work in the way I described was that it would allow the s/// used in the above reference to effect the substitution on the target string only if the format of the target string exactly matched the regex.

Your regex will happily match 'ff:ff' or 'ff:ff:ff:ff:ff:ff:ff:ff:ff:', as you are aware, which is why you are checking the size of the array afterwards. That's OK, but in the case where you want to modify the target using the s/// operator, it forces you to match and capture, test, and then modify IF the number of matches is correct:

`my $mac = 'ff:ff:ff:ff:ff:ff'; if (6 == ($_ = () = $mac =~ /([0-9A-Z]{1,2})(?::|$)/ig) ){ $mac =~ s/([0-9A-Z]{1,2})(?::|$)/ substr "0$1", -2 /ieg; }`

in order to ensure that you only modify the target if it actually conforms to the required format. That makes for a hell of a lot more work, redundancy, needless capturing and duplication than it would if the repeat group repeated the capture group as well. It might then look like this:

`my $mac = 'ff:ff:ff:ff:ff:ff'; $mac =~ s[^ (?: ( [0-9A-Z]{1,2} ) : ){5} ( [0-9A-Z]{1,2} ) $] [ sprintf '%02s' x 6, $1, $2, $3, $4, $5, $6 ]iex;`

No need for the redundant capturing or the duplicated matching, nor even to test, as the substitution will only occur if the target matches the pattern exactly.

I think that John M. Dlugosz hit the nail on the head. The best way to achieve my aim is to use the x operator to build the regex, then compile it with qr// like this:

`my $re_mac = '(?: ( [0-9A-Z]{1,2} ) : )' x 5 . '( [0-9A-Z]{1,2} )'; $re_mac = qr[$re_mac]ix; .... $mac =~ s[^ $re_mac $] [ sprintf '%02s' x 6, $1, $2, $3, $4, $5, $6 ]ex;`

That satisfies my desire to avoid redundancy whilst only performing the substitution if the tightly specified regex is matched exactly. If I need to know whether the substitution occurred, I can simply test its return value.

The main reason for the SoPW was simply that this was the first time I had ever tried to apply a repeat count to a capture group, and when it didn't work the way my instincts told me it would, I tried to look up the description that explained the behaviour, and came up short.

I'm still not entirely convinced that the passage that Arien cites is an explanation for the behaviour. Given the context of the passage, it seems entirely disparate from the usage I am describing. However, if it was in the author's mind to cover both situations in that short passage, then I think this is a case where a few more words, or perhaps a second short paragraph to separate and clarify the two, would have been a benefit.
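A self-contained sketch of the `x`-operator approach. The wrapper `pad_mac` and the sample strings are assumptions, and `substr "0$_", -2` (the padding idiom from the earlier s///eg version) stands in for the `sprintf` call.

```perl
use strict;
use warnings;

# Repeat the source text of the group five times, then add the final
# field. The compiled pattern has six separate capture groups.
my $re_mac = '(?:([0-9A-Z]{1,2}):)' x 5 . '([0-9A-Z]{1,2})';
$re_mac = qr/^$re_mac$/i;

# Hypothetical wrapper: returns the padded, colon-free form of the
# address, or undef when the input does not match the pattern exactly.
sub pad_mac {
    my ($mac) = @_;
    my $changed = $mac =~ s{$re_mac}{
        join '', map { substr "0$_", -2 } $1, $2, $3, $4, $5, $6
    }e;
    return $changed ? $mac : undef;
}

print pad_mac('0:0A:0C:B:B8:F'), "\n";
print defined pad_mac('ff:ff') ? "changed\n" : "left alone\n";
```

The first call prints `000A0C0BB80F`. The second prints `left alone`, because s/// reports no substitution when the anchored pattern fails.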
Re3: Capturing brackets within a repeat group by Hofmator (Curate) on Jan 12, 2003 at 15:44 UTC
"although the \G in your version isn't doing anything in this case. I believe (but am open to correction) that \G doesn't have any effect unless you also use the /c modifier and even then, it only has an effect once a failure has occurred in which case, a subsequent match on the same target string will start from the point of the previous failure."

Then let me correct ;-) The '\G' forces the next match to start where the last one ended. When the regex is executed the first time, '\G' is thus equivalent to '\A' (beginning of string). The next matches (due to the /g modifier) have to start where the previous one ended, so no part of the string can be skipped. This sure makes a difference; see the examples below.

`sub test_regex { local $_ = shift; local $\ = "\n"; print 'string: ', $_; print 'with \G: ', join(':', m/\G ( [0-9A-Z]{1,2} ) (?: :|$ ) /igx); print 'without \G: ', join(':', m/ ( [0-9A-Z]{1,2} ) (?: :|$ ) /igx), "\n"; }`

`test_regex('0:0A:0C:B:B8:F'); test_regex('#0:0A:0C:B:B8:F'); test_regex('0: 0A:0C:B:B8:F'); test_regex('0:0Aa:0C:B:B8:F');`

-- Hofmator
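To make the difference concrete for the last of those test strings, here is a small sketch that counts the fields each variant returns. The helper names `with_G` and `without_G` are made up for this example.

```perl
use strict;
use warnings;

# Both helpers return every field captured by a list-context /g match.
sub with_G    { my ($s) = @_; return $s =~ /\G ([0-9A-Z]{1,2}) (?: : | $ )/igx }
sub without_G { my ($s) = @_; return $s =~ /   ([0-9A-Z]{1,2}) (?: : | $ )/igx }

my $bad = '0:0Aa:0C:B:B8:F';       # second field is three characters long

my @anchored   = with_G($bad);     # must continue exactly where it left off
my @unanchored = without_G($bad);  # free to skip characters between matches

print "with \\G:    ", scalar(@anchored),   " (@anchored)\n";
print "without \\G: ", scalar(@unanchored), " (@unanchored)\n";
```

With `\G` the match stops after the first field, so the count is 1. Without it the engine skips the stray `0` and reports six fields (`0 Aa 0C B B8 F`), so a check such as `@mac_bits == 6` would wrongly accept the string.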
Re: Re3: Capturing brackets within a repeat group by BrowserUk (Patriarch) on Jan 12, 2003 at 16:52 UTC
Re5: Capturing brackets within a repeat group by Hofmator (Curate) on Jan 12, 2003 at 18:03 UTC