jgeisler has asked for the wisdom of the Perl Monks concerning the following question:

I'd really like a + or * applied to a parenthesized group to produce multiple captures when the match is used in list context. For instance, I'd like to match some text followed by an unknown number of words, e.g. /$text \s (?: (\w+) \s?)+/x. Note the final +: I'd like it to produce one capture for each word matched inside the group. Perl doesn't let me do that; only the last repetition's capture is kept. What kinds of workarounds can I employ?

Re: Getting + and * to generate multiple captures
by ikegami (Patriarch) on Aug 17, 2006 at 17:04 UTC

    Yet another completely different approach is to embed code in the regexp.

    # We need use re 'eval' because we use interpolation and (?{...})
    # in the same regexp. Beware of the implications of this directive.
    use re 'eval';

    our @matches;    # Don't use a lexical for this.
    local *matches;  # Protect our caller's variables.

    /
       (?{ [] })  # Create a stack
       $text
       (?:
          \s
          (\w+)
          (?{ [ @{$^R}, $1 ] })  # Save last match on the stack.
       )+
       (?{ @matches = @{$^R}; })  # Success! Save the result.
    /x;

    Since Perl 5.8.0, the $1 in the above can be replaced with $^N.

    It's possible to simplify the above code since the regexp engine will never backtrack through (?{ [ @{$^R}, $1 ] }) in this particular regexp, but it's much safer to assume there's always the possibility of backtracking through any (?{...}). That's why $^R is used.

    Update: The stack is unnecessarily big in the above code. The following greatly reduces the size of the stack, which probably also speeds things up greatly.

    sub flatten_list {
       my ($rv, $p) = @_;
       @$rv = ();
       while ($p) {
          unshift @$rv, $p->[1];
          $p = $p->[0];
       }
    }

    our @matches;
    local *matches;

    /
       $text
       (?:
          \s
          (\w+)
          (?{ [ $^R, $1 ] })
       )+
       (?{ flatten_list \@matches, $^R })
    /x;
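    The shape of the chain that the update builds can be seen without a regexp at all. The following is a minimal sketch, assuming $^R was undef before the first iteration; the sample words are made up for illustration, and the explicit return in flatten_list is an addition that is not in the post.

```perl
use strict;
use warnings;

# Same helper as in the post: walk the chain from the newest node back to
# the oldest, unshifting each value so the result ends up in match order.
sub flatten_list {
    my ($rv, $p) = @_;
    @$rv = ();
    while ($p) {
        unshift @$rv, $p->[1];
        $p = $p->[0];
    }
    return;    # addition for this sketch: make the return value explicit
}

# Build by hand what (?{ [ $^R, $1 ] }) would leave behind after three
# iterations, assuming the chain starts out as undef.
my $chain = undef;
$chain = [ $chain, $_ ] for qw(jumps over the);    # hypothetical sample words

my @words;
flatten_list(\@words, $chain);
print join(',', @words), "\n";
```

    Each node holds only a reference to the previous node plus one word, so nothing is copied per iteration; the earlier version copied the whole array every time.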
      Thanks, that will do what I want. I'm assuming I need to access @matches explicitly after running the match to grab the values I care about? Can I do something like:
      / $text (?: \s (\w+) (?{ push @matches, $^N }) )+ /x;
      to just populate @matches instead of creating the stack and then flattening it?

        For that very specific regexp, yes. That's the simplification to which I alluded. I'll repeat the reason I didn't post the simplification:

        It's much safer to assume there's always the possibility of backtracking through any (?{...}). That's why $^R is used.

        It's too easy to miss a case where backtracking can occur.

        For example,
        / $text (?: \s (\w+) (?{ push @matches, $^N }) ){2,} / is wrong.
        / $text (?: \s (\w+) (?{ push @matches, $^N }) )+ ... / is wrong.
        / $text (?: \s (\w+) (?{ push @matches, $^N }) ... )+ / is wrong.
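        The second of these cases can be reproduced with a small script. This is only a sketch under assumptions: the input string and the trailing "\s c" are made up for illustration, the patterns are literal (no interpolation) so use re 'eval' is not needed, and the variable names @pushed and @stacked are hypothetical.

```perl
use strict;
use warnings;

our (@pushed, @stacked);    # package variables, as advised above

my $str = 'foo a b c';      # hypothetical input

# Naive version: push is a side effect, so it is not undone when the
# engine backtracks out of an iteration to let the trailing "\s c" match.
@pushed = ();
$str =~ /
    foo
    (?: \s (\w+) (?{ push @pushed, $^N }) )+
    \s c
/x;

# $^R version: the value of $^R is restored on backtracking, so the
# abandoned iteration does not leak into the final result.
@stacked = ();
$str =~ /
    (?{ [] })
    foo
    (?: \s (\w+) (?{ [ @{$^R}, $^N ] }) )+
    \s c
    (?{ @stacked = @{$^R} })
/x;

print "pushed:  @pushed\n";
print "stacked: @stacked\n";
```

        The group first consumes all three words, then has to give the last one back so that "\s c" can match. The pushed array still contains the word that was given back, while the $^R version reports only the words that are part of the final match.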

Re: Getting + and * to generate multiple captures
by liverpole (Monsignor) on Aug 17, 2006 at 16:55 UTC
    Hi jgeisler,

    Have you tried split?

    use strict;
    use warnings;

    my $msg   = "the quick brown fox jumps over the lazy dog";
    my $text  = "fox";
    my @words = ( );

    if ($msg =~ /$text \s (((\w+) \s?)+)/x) {
        @words = split(/\s+/, $1);
        # @words now contains the list you want...
        printf "Words: %s\n", join(',', @words);
    }

    __END__
    [Results]
    Words: jumps,over,the,lazy,dog

    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
      I've simplified my problem somewhat to ask the initial question. split() would cause me much pain because I really have multiple regular expressions with quite different separators (not the simple \s used in the example code). However, since all the regular expressions could capture the same content, I would like to only use one piece of code after the appropriate regular expression matches and returns the relevant information. In other words, I'm doing something like:
      foreach my $rx (@rxs) {
          if (my @captures = $text =~ /$rx/) {
              # do something meaningful with the captures
          }
      }
      I'd need a separate piece of code to split() the value apart for each regular expression, negating much of the gain of the loop.
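      One way the loop above might be combined with the embedded-code approach from elsewhere in this thread is to let every pattern fill the same package array, so the code after the match stays identical whatever the separator is. This is a sketch under assumptions: the two input formats ("words:" and "items="), the helper name capture_all, and the sample lines are all hypothetical, and it relies on qr// objects with literal code blocks being usable directly on the right-hand side of =~.

```perl
use strict;
use warnings;

our @matches;    # shared by all patterns; a package variable, not a lexical

# Hypothetical patterns with different separators. Each one collects its
# words through $^R and copies them into @matches only on success.
my @rxs = (
    qr/
        \A words:
        (?{ [] })
        (?: \s (\w+) (?{ [ @{$^R}, $^N ] }) )+
        \z
        (?{ @matches = @{$^R} })
    /x,
    qr/
        \A items=
        (\w+) (?{ [ $^N ] })
        (?: , (\w+) (?{ [ @{$^R}, $^N ] }) )*
        \z
        (?{ @matches = @{$^R} })
    /x,
);

# Hypothetical helper: try each pattern in turn and return the words
# collected by the first one that matches, or an empty list.
sub capture_all {
    my ($line) = @_;
    for my $rx (@rxs) {
        @matches = ();
        if ($line =~ $rx) {
            my @captures = @matches;
            return @captures;
        }
    }
    return;
}

for my $line ('words: alpha beta', 'items=x,y,z') {
    my @captures = capture_all($line);
    print join('|', @captures), "\n";
}
```

      The caller never needs to know which separator was involved, which is the property the loop was meant to preserve.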
Re: Getting + and * to generate multiple captures
by prasadbabu (Prior) on Aug 17, 2006 at 16:17 UTC

    Hi jgeisler

    See, you have used '+', which correctly matches the unknown number of words. But if you want to capture all of the matched words after the text, you have to add another set of parentheses, as shown below.

    use strict;
    use warnings;

    my $text = 'text';
    my $str  = 'text some words here';

    if ($str =~ /$text \s ((?: (\w+) \s?)+)/x) {
        print "The words after text are :$1\n";
    }

    prints:
    The words after text are :some words here

    Prasad

      The problem with this is that I want each word in a separate array element. I'd have to use split() on this outer-capture and it turns out that this is not ideal for my bigger problem.