BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I know I'm gonna feel like an idiot when I read the explanation of this but...what the hey.

I was playing around with regexes to match MAC addresses as formatted in a recent SoPW and tried this.

    print $mac =~ /^([0-9A-Z]{1,2}):([0-9A-Z]{1,2}):([0-9A-Z]{1,2}):([0-9A-Z]{1,2}):([0-9A-Z]{1,2}):([0-9A-Z]{1,2})$/;
    0 0A 0C B B8 F

Works fine, but the obvious step is to reduce the repeating elements by grouping and adding a repeat count. So I tried this

    print $mac =~ /^(?:([0-9A-Z]{1,2}):){5}([0-9A-Z]{1,2})$/;
    B8 F

Which wasn't what I expected at all. So I tried this

    print $mac =~ /^(?:([0-9A-Z]{1,2}):){2}([0-9A-Z]{1,2}):([0-9A-Z]{1,2}):([0-9A-Z]{1,2}):([0-9A-Z]{1,2})$/;
    0A 0C B B8 F

Which seems to indicate that if you have a capturing group within a repeat group, then whilst the number of repetitions is honoured and must be consistent with the data, only the last occurrence of the capturing group actually captures?
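
For anyone wanting to reproduce the observation, here is a minimal sketch. The sample address is the one implied by the output shown above; the variable name @caps is mine, not from the original code.

```perl
use strict;
use warnings;

my $mac = '0:0A:0C:B:B8:F';

# One capture group inside the {5} repeat, plus one after it.
# The pattern has only two capture slots, so only two values come back.
my @caps = $mac =~ /^(?:([0-9A-Z]{1,2}):){5}([0-9A-Z]{1,2})$/;

print scalar(@caps), " captures: @caps\n";   # 2 captures: B8 F
```

The first slot holds whatever the fifth and final pass through the repeat captured; the four earlier passes are gone.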

This is contrary to my expectations, and despite reading perlre and other docs, I don't see anything that would indicate this is the case.

Did I miss the relevant passage in the docs, and this is working as designed, or have I uncovered a bug?


Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Capturing brackets within a repeat group
by Arien (Pilgrim) on Jan 11, 2003 at 03:39 UTC

    The passage from perlre is:

    The numbered variables ($1, $2, $3, etc.) and the related punctuation set ($+, $&, $`, $', and $^N) are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first.

    So you can only get at the last captured repeated group from outside the regex this way: the earlier captures get overwritten when repeatedly matching the repeated sub-pattern to get an overall match.

    -- Arien

      That's interesting. I've read that passage many times but never interpreted it that way. The bracketed comment you omitted from the end of the paragraph:

      The numbered variables ($1, $2, $3, etc.) and the related punctuation set ($+, $&, $`, and $') are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See Compound Statements in the perlsyn manpage.)

      led me to think that this was only relevant to the scope of the capture buffers external to the statement itself. I read the phrase you highlighted, "... or until the next successful match ...", to mean a successful match as part of a distinctly separate m// or s/// (hence the reference to "Compound Statements"), rather than the next successful match within the same statement.

      Oh well. It was a nice idea. Thanks for setting me straight.



        That's how I read that passage, too.

        But I have no problem with the way repeating a capture works. The ()'s generate code to refer to a particular slot, since the numbering is static from left-to-right. The repeat redoes the same parens. It's no different from backtracking in that respect.

        I'm sure you've figured this out:

        /( (?: blah blah) {5} ) /x
        Put the repeat inside the parens.
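
        A minimal sketch of that suggestion applied to the MAC address case, assuming the individual parts are then recovered with split. The names $part, $mac and @parts are illustrative only.

```perl
use strict;
use warnings;

my $part = qr/[0-9A-Z]{1,2}/;
my $mac  = '0:0A:0C:B:B8:F';

# The capturing parens wrap the whole repeated run, so $1 holds the full
# validated string; the repeat itself uses a non-capturing group.
my @parts = $mac =~ /\A ( (?: $part : ){5} $part ) \z/x
          ? split(/:/, $1)
          : ();

print scalar(@parts), " parts: @parts\n";   # 6 parts: 0 0A 0C B B8 F
```

        The pattern does the validation and split does the extraction, so the number of groups never has to be spelled out as separate captures.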

        —John

Re: Capturing brackets within a repeat group [plus dynamic backreferences]
by ihb (Deacon) on Jan 11, 2003 at 11:30 UTC
    The typical expression to illustrate this is /(.)*/s. That will capture the last char in the string, if any. If you step back from the screen and look at the pattern again, you might think this makes sense. Looking at the capturing part (.), I think you want $1 to be one char long.

    Expanding the issue a bit, would you want 'abcd' =~ /(?:(.)(.))*/s or 'abcd' =~ /(?:(.){2})*/s to set
      $1 eq 'a'
      $2 eq 'b'
      $3 eq 'c'
      $4 eq 'd'
    ?
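
    For comparison, a small sketch of what those patterns actually return in list context. The array names are made up for the illustration.

```perl
use strict;
use warnings;

# One capture slot, repeated: only the last iteration's value survives.
my @single = 'abcd' =~ /(.)*/s;

# Two capture slots, repeated twice: the second pass overwrites the first.
my @pair = 'abcd' =~ /(?:(.)(.))*/s;

print "single: @single\n";   # single: d
print "pair:   @pair\n";     # pair:   c d
```

    The number of values returned is fixed by the number of parens in the pattern, not by how many times the repeat ran.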

    What could potentially get really messy would be if you have another group at the end: 'abcdx' =~ /(?:(.)(.))*(.)/s. How would you easily know what the last match matched? (Ignoring Re: Multiple matches of a regex.) Sure, you can use $+, or even $^N in recent perls. But what if it's the second-to-last match?

    This also raises the question of how you'd do backreferences if, at regex compile time, you can't decide which variable will hold the submatch.

    But this being Perl you of course can do what you want. Here's a little demonstration where I want to match subsequent words with nothing but spaces in between:
      $_ = 'foo bar baz burk | gah';
      my @words;
      /(?:(\w+)\s+(?{push @words => $1}))*/; # Not backtracking safe! See below.
      # Submatches are in @words now.
    If we look back at the issue of backreferencing, you can use (??{}) to create dynamic backreferences. The pattern below requires the last two words to be identical (but it doesn't include the last word in @words; compare to /(.)\1/).
      my @words;
      'foo bar baz baz burk | gah' =~ /
        (?{ local @_words })
        (?:
          (\w+)
          \s+
          (?{ local @_words = (@_words, $1) })
        )+
        (??{ quotemeta $_words[-1] })
        (?{ @words = @_words })
      /x;
    This version is also backtracking safe. The one above wasn't, but it didn't need to be. As you can see, it's a bit of extra work to make it backtracking safe, so I kept it simple in the one that didn't need it.

    Hope I've helped,
    ihb

      Hope I've helped,

      In truth, I think you missed the point entirely. :^)



        Besides looking for documentation, I thought you perhaps wanted an explanation of why Perl's current behaviour is sane and to be expected, and a way to achieve what you thought Perl would do for you. Let me put my general reply in the context of MAC address parsing:

        First things first though:
          local $_ = join ':', qw/0 0A 0C B B8 F/; # $mac
          my $part = qr/[0-9A-Z]{1,2}/;
        First you used
          my @parts = /^($part):($part):($part):($part):($part):($part)$/;
        which worked. Then you tried to shrink it to
          my @parts = / ^ (?: ($part) : ){5} ($part) $ /x;
        but that didn't work. Now, using "my" technique you just need to add three to four lines to achieve what you want.
          use re 'eval'; # Needed due to interpolation of $part
          my @parts;
          /
            (?{ local @_parts })
        
            ^
            (?:
              ($part)
              :
              (?{ local @_parts = (@_parts, $1) })
            ){5}
            ($part)
            $
        
            (?{ @parts = (@_parts, $2) })
          /x;
        
        The beauty of this technique is that you don't have to know how many times you need/want to match; something that is required if you use the x operator.

        If you just want to solve this particular problem, why not simply verify with your second more compact regex and then split it up on /:/?

        Update:
        Since I got negative response on this reply I reworded the beginning to make it better express what I meant. If it sounded offensive or bad in any way then that wasn't how it was meant and I apologize.

        ihb
Re: Capturing brackets within a repeat group
by Hofmator (Curate) on Jan 11, 2003 at 15:58 UTC
    To get round your problem you could of course also use
      my @mac_bits = $mac =~ /\G ( [0-9A-Z]{1,2} ) (?: :|$ ) /igx;
      print 'no MAC address' unless (@mac_bits == 6);

    -- Hofmator

      Yup! That's essentially the solution I arrived at here, although the \G in your version isn't doing anything in this case. I believe (but am open to correction) that \G doesn't have any effect unless you also use the /c modifier and even then, it only has an effect once a failure has occurred in which case, a subsequent match on the same target string will start from the point of the previous failure.

      The reason for wanting the capturing group within a repeat count to work in the way I described was that it would allow the s/// used in the above reference to only effect the substitution on the target string if the format of the target string exactly matched the regex.

      Your regex will happily match 'ff:ff' or 'ff:ff:ff:ff:ff:ff:ff:ff:ff:' as you are aware, which is why you are checking the size of the array afterwards. That's OK, but in the case where you want to modify the target using the s/// operator, it forces you to match and capture, test, and then modify *IF* the number of matches is correct:

      my $mac = 'ff:ff:ff:ff:ff:ff';
      if (6 == ($_ = () = $mac =~ /([0-9A-Z]{1,2})(?::|$)/ig) ){
          $mac =~ s/([0-9A-Z]{1,2})(?::|$)/ substr "0$1", -2 /ieg;
      }

      in order that you ensure that you only modify the target if it actually conforms to the required format.
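
      The count-then-substitute approach can be sketched as a small function. This is an illustration only: pad_mac is a hypothetical helper name, and I have written \z in place of $ for the end-of-string test. As in the snippet above, the substitution consumes the colons, so the result is the bare padded digits.

```perl
use strict;
use warnings;

# pad_mac is a hypothetical helper, not part of the original code.
sub pad_mac {
    my ($mac) = @_;

    # First pass: count the groups without modifying anything.
    my $count = () = $mac =~ /([0-9A-Z]{1,2})(?::|\z)/ig;
    return undef unless $count == 6;

    # Second pass: the same pattern again, this time padding each group.
    (my $padded = $mac) =~ s/([0-9A-Z]{1,2})(?::|\z)/ substr "0$1", -2 /ieg;
    return $padded;
}

print pad_mac('0:A:0C:B:B8:F'), "\n";                          # 000A0C0BB80F
print defined(pad_mac('ff:ff')) ? "changed\n" : "rejected\n";  # rejected
```

      The pattern appears twice and the string is scanned twice, which is the duplication being complained about.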

      That makes for a hell of a lot more work, redundancy, needless capturing and duplication than it would if the repeat group repeated the capture group as well.

      It might then look like this:

      my $mac = 'ff:ff:ff:ff:ff:ff';
      $mac =~ s[^ (?: ( [0-9A-Z]{1,2} ) : ){5} ( [0-9A-Z]{1,2} ) $]
               [ sprintf '%02s' x 6, $1, $2, $3, $4, $5, $6 ]iex;

      No need for the redundant capturing, duplicated matching, nor even to test as the substitution will only occur if the target matches the pattern exactly.

      I think that John M. Dlugosz hit the nail on the head. The best way to achieve my aim is to use the x operator to build the regex, then compile it with qr//, like this.

      my $re_mac = '(?: ( [0-9A-Z]{1,2} ) : )' x 5 . '( [0-9A-Z]{1,2} )';
      $re_mac = qr[$re_mac]ix;
      ....
      $mac =~ s[^ $re_mac $]
               [ sprintf '%02s' x 6, $1, $2, $3, $4, $5, $6 ]ex;
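
      A self-contained sketch of the same idea, for anyone who wants to run it. The helper name pad_exact is hypothetical, and I have swapped the sprintf for the substr padding used earlier and \A/\z for ^/$; neither change is essential to the technique.

```perl
use strict;
use warnings;

# The x operator copies the group text five times, so the compiled
# pattern really does contain six separate capture groups.
my $re_mac = '(?: ( [0-9A-Z]{1,2} ) : )' x 5 . '( [0-9A-Z]{1,2} )';
$re_mac = qr/$re_mac/ix;

# pad_exact is a hypothetical helper, not part of the original code.
sub pad_exact {
    my ($mac) = @_;
    my $ok = $mac =~ s{\A $re_mac \z}
                      { join '', map { substr("0$_", -2) } $1, $2, $3, $4, $5, $6 }ex;
    return $ok ? $mac : undef;   # undef when the format did not match exactly
}

print pad_exact('0:0A:0C:B:B8:F'), "\n";                          # 000A0C0BB80F
print defined(pad_exact('ff:ff')) ? "changed\n" : "unchanged\n";  # unchanged
```

      One pattern, one pass, and the return value of s/// says whether anything happened.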

      That satisfies my desire to avoid redundancy whilst only performing the substitution if the tightly specified regex is matched exactly. If I need to know whether the substitution occurred, I can simply test its return.

      The main reason for the SoPW was simply that this was the first time I had ever tried to apply a repeat count to a capture group and when it didn't work the way my instincts told me it would, I tried to look up the description that explained the behaviour, and came up short. I'm still not entirely convinced that the passage that Arien cites is an explanation for the behaviour. Given the context of the passage, it seems entirely disparate from the usage I am describing. However, if it was in the author's mind to cover both situations in that short passage, then I think this is a case where a few more words, or perhaps a second short paragraph to separate and clarify the two, would have been of benefit.



        although the \G in your version isn't doing anything in this case. I believe (but am open to correction) that \G doesn't have any effect unless you also use the /c modifier and even then, it only has an effect once a failure has occurred in which case, a subsequent match on the same target string will start from the point of the previous failure.

        then let me correct ;-)

        The '\G' forces the next match to start where the last ended. When the regex is executed the first time, '\G' is thus equivalent to '\A' (beginning of string). The next matches (due to the /g modifier) have to start where the previous one ended, so no part of the string can be skipped. This sure makes a difference, see the examples below.

          sub test_regex {
              local $_ = shift;
              local $\ = "\n";
              print 'string: ', $_;
              print 'with \G: ', join(':', m/\G ( [0-9A-Z]{1,2} ) (?: :|$ ) /igx);
              print 'without \G: ', join(':', m/ ( [0-9A-Z]{1,2} ) (?: :|$ ) /igx), "\n";
          }

          test_regex('0:0A:0C:B:B8:F');
          test_regex('#0:0A:0C:B:B8:F');
          test_regex('0: 0A:0C:B:B8:F');
          test_regex('0:0Aa:0C:B:B8:F');
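
        The same comparison, condensed into a sketch that reports how many groups each variant finds. The two function names are made up for the illustration, and \z stands in for $.

```perl
use strict;
use warnings;

# With \G each match must begin exactly where the previous one ended.
sub groups_anchored {
    my ($s) = @_;
    return $s =~ /\G ([0-9A-Z]{1,2}) (?: : | \z )/igx;
}

# Without \G the engine may skip over characters between matches.
sub groups_floating {
    my ($s) = @_;
    return $s =~ /([0-9A-Z]{1,2}) (?: : | \z )/igx;
}

for my $s ('0:0A:0C:B:B8:F', '#0:0A:0C:B:B8:F', '0:0Aa:0C:B:B8:F') {
    my @with    = groups_anchored($s);
    my @without = groups_floating($s);
    printf "%-17s with \\G: %d  without \\G: %d\n",
        $s, scalar @with, scalar @without;
}
# 0:0A:0C:B:B8:F    with \G: 6  without \G: 6
# #0:0A:0C:B:B8:F   with \G: 0  without \G: 6
# 0:0Aa:0C:B:B8:F   with \G: 1  without \G: 6
```

        Both malformed strings still yield six groups without \G, so the "== 6" test would accept them. With \G the count drops as soon as something has to be skipped.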

        -- Hofmator