Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm struggling with capturing matched substrings when using a quantifier. It seems that if I apply a quantifier (e.g. * in this case) to a subpattern, only the final match is captured and returned by the regex. Here's some code that illustrates the problem. I'd like to capture the numbers 2 through 19, but I only get 2, 3 and 19.

Using ActiveState perl 5.8.8 (don't ask) if it matters.

Many thanks for any assistance!

+++ cut here +++

$str = '12/22/2005 20 Notice of Agenda of Matters Scheduled for Hearing (related document(s)2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, [17], 18, 19) Filed by Fubar, Inc.. Hearing scheduled for 4/11/2013 at 11:30 AM';
@matches = ($str =~ m/\(Related document\(s\)(\d+)\, (\d+)(?:\, (?:\[|)(\d+)(?:\]|))*\)/i);
print "dollar amp is $&\n";
print "found " . scalar(@matches) . " matches.\n";
foreach (@matches) {
    print " $_\n";
}

Replies are listed 'Best First'.
Re: regex capture and quantifiers
by kennethk (Abbot) on Apr 30, 2013 at 20:49 UTC
    When you have Capture groups with Quantifiers, only the last match is returned. For example,
    use strict;
    use warnings;
    $_ = '1,2,3,4,5';
    print /(,?\d)+/;
    will output
    ,5
    The solution for the task you've shown here (IMHO) is to capture the entire series of numbers, then split on commas:
    use strict;
    use warnings;
    my $str = '12/22/2005 20 Notice of Agenda of Matters Scheduled for Hearing (related document(s)2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, [17], 18, 19) Filed by Fubar, Inc.. Hearing scheduled for 4/11/2013 at 11:30 AM';
    # Capture the whole list (digits, commas, spaces, brackets) in one group
    my ($series) = $str =~ m/\(Related document\(s\)([\]\[\d, ]+)\)/i;
    # Break the list into items, then strip the optional square brackets
    my @matches = split /,\s*/, $series;
    s/\[|\]//g for @matches;
    print "dollar amp is $&\n";
    print "found " . scalar(@matches) . " matches.\n";
    foreach (@matches) {
        print " $_\n";
    }
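    A variation on the same idea, offered only as a sketch: once the series has been isolated, a global match in list context (/(\d+)/g) returns every number and skips the brackets without a separate cleanup pass. The shortened sample string is an assumption made for illustration.

```perl
use strict;
use warnings;

# Shortened sample string, for illustration only.
my $str = 'Hearing (related document(s)2, 3, 4, [17], 18, 19) Filed by Fubar, Inc.';

# Step 1: capture the whole comma-separated series in a single group.
my ($series) = $str =~ m/\(related document\(s\)([\[\]\d, ]+)\)/i
    or die "no document list found\n";

# Step 2: with /g in list context, the match returns every capture,
# not just the last one.
my @matches = $series =~ m/(\d+)/g;

print "found " . scalar(@matches) . " matches: @matches\n";
```

    The quantifier stays outside any capture group here; repetition is handled by /g, so each number gets its own slot in the returned list.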

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: regex capture and quantifiers
by AnomalousMonk (Archbishop) on Apr 30, 2013 at 23:32 UTC
    @matches = ($str =~ m/\(Related document\(s\)(\d+)\, (\d+)(?:\, (?:\[|)(\d+)(?:\]|))*\)/i);

    Others have discussed the answer to your original question (a capture group contains only the last sub-string matched and captured even though it may have matched and captured many times), but there is something else in the regex quoted above that I think merits comment.

    The (?:\[|) and (?:\]|) groups match either the [ (left square bracket) or ] (right square bracket), respectively, or else nothing at all: note the | alternation metacharacter followed by an empty alternative. An empty alternative inside a group simply matches the empty string; the special behavior described in "The empty pattern //" sub-section of the discussion of the m// operator in Regexp Quote-Like Operators in perlop applies only when the whole pattern is empty. The groups therefore work, but they are a roundabout way to make a bracket optional. What you probably want is, respectively, something like \[? and \]? entirely replacing the grouped expressions.
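    As a quick check, here is a small sketch that runs the original grouped form and the suggested \[? / \]? form against a few sample items. The anchors and the sample inputs are assumptions added for the comparison; they are not part of the original regex.

```perl
use strict;
use warnings;

# Compare the original grouped form with the suggested optional-bracket form.
my $with_alternation = qr/^(?:\[|)(\d+)(?:\]|)$/;
my $with_optional    = qr/^\[?(\d+)\]?$/;

for my $item ('17', '[17]', '[17', 'x17') {
    my ($a_num) = $item =~ $with_alternation;
    my ($b_num) = $item =~ $with_optional;
    printf "%-5s alternation=%s optional=%s\n",
        $item,
        defined $a_num ? $a_num : 'no match',
        defined $b_num ? $b_num : 'no match';
}
```

    On these inputs the two forms agree; the optional-bracket version is simply shorter and easier to read.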

Re: regex capture and quantifiers
by Laurent_R (Canon) on Apr 30, 2013 at 21:17 UTC

    Please provide the output that you got and the output that you expected. This way, I do not have to run your program and try to figure out why what you get is not what you wanted.

    I think that your problem probably has to do with the greediness of the * or + quantifiers in matches: they try to match as much as possible.

    Sometimes, the earlier part of your pattern matches much more of the string than you expect, and you end up with the wrong capture.

    For example, suppose that I want to match the second word of this sentence: "The quick brown fox jumps over the lazy dog." If I use this regexp: /.+ (\w+) /, I might think that the early part of the regexp will "eat" the first word up to the space and that the (\w+) will capture "quick". In fact, the '.+ ' will match as much as possible while still letting the '(\w+) ' match something. So the first part will match "The quick brown fox jumps over the " and the (\w+) will match "lazy", as can be seen in the following session under the Perl debugger:

    DB<5> $c = "The quick brown fox jumps over the lazy dog.";
    DB<6> print $1 if $c =~ /.+ (\w+) /;
    lazy
    DB<7>

    To prevent this, you have to either use the non-greedy quantifiers (+? and *?) or be more specific in your regexp. For example, the following regexps will all capture the second word as expected:

    DB<8> print $1 if $c =~ /[^ ]+ (\w+) /;
    quick
    DB<9> print $1 if $c =~ /\w+ (\w+) /;
    quick
    DB<10> print $1 if $c =~ /\S+ (\w+) /;
    quick
    DB<11> print $1 if $c =~ /.+? (\w+) /;
    quick
    DB<21>
Re: regex capture and quantifiers
by 2teez (Vicar) on Apr 30, 2013 at 21:11 UTC

    Something like this:

    use warnings;
    use strict;
    my $str = '12/22/2005 20 Notice of Agenda of Matters Scheduled for Hearing (related document(s)2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, [17], 18, 19) Filed by Fubar, Inc.. Hearing scheduled for 4/11/2013 at 11:30 AM';
    if ( $str =~ m{related.+\)(.+?)\[(.+?)\](.+?)\)} ) {
        # Split each captured chunk on commas; grep drops the empty
        # field produced by the leading ", " in the third capture.
        print join "\n" => grep { length } map { split /,\s*/, $_ } ( $1, $2, $3 );
    }
    Update: I think it is also a good practice to test what you match.
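    A minimal sketch of what testing the match can look like: read $1 only inside the branch where the match succeeded, and return an empty list otherwise. The helper name related_documents and both sample lines are hypothetical, introduced just for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the list of document numbers, or an empty
# list when the line has no "related document(s)" section.
sub related_documents {
    my ($line) = @_;
    if ( $line =~ m/\(related document\(s\)([\[\]\d, ]+)\)/i ) {
        my $series = $1;    # read $1 only after a successful match
        return $series =~ m/(\d+)/g;
    }
    return;
}

for my $line (
    'Notice (related document(s)2, 3, [17], 19) Filed by Fubar, Inc.',
    'Notice of hearing, no related documents listed',
) {
    my @docs = related_documents($line);
    if (@docs) {
        print "found: @docs\n";
    }
    else {
        print "no match, nothing captured\n";
    }
}
```

    Without the check, a failed match would leave $1 holding whatever an earlier successful match captured, which is easy to mistake for a real result.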

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me
Re: regex capture and quantifiers
by peanut59 (Initiate) on Apr 30, 2013 at 20:18 UTC

    This was my first post. Don't know why it was posted as anonymous?

      Because you were probably not logged in when you posted. I think I made the same mistake with my first post here. If I understood correctly what happened at the time, registering does not log you in, so if you post something immediately, you appear as Anonymous Monk.