
Regex - backreferencing multiple matches in one group.

by why_bird (Pilgrim)
on Mar 04, 2008 at 11:01 UTC (#671843)

why_bird has asked for the wisdom of the Perl Monks concerning the following question:

Me again---having some trouble with regexes (again..!)

This is about grouping, and (back?)referencing. I'm reading in a configuration file, and for each line I end up with a string like this:

$table_ref="!Frequency[A][B][C] ..some other text.."

I want to get the name of the table (in this case 'Frequency'), and then each of the table variables (here represented by A, B, C). The variables will always be inside square brackets, and there are between 1 and 4 of them inclusive (but I won't know how many).

So, I tried a regex like this:

$temp_ =~ /\!(\S+?)             #tablename
           (?:\[(.*?)\]){1,4}   #table variables
           \s+.*/x;             #white space then rest of text.
print Dumper ($1, $2, $3, $4, $5);

For the string above, the regex returned:

$VAR1 = 'Frequency';
$VAR2 = 'C';
$VAR3 = undef;
$VAR4 = undef;
$VAR5 = undef;

When I replaced the {1,4} with a {3} (I thought there might have been a problem with greediness), I got exactly the same output as above.

I sort of expected (or rather hoped!) that the {1,4} would create multiple matches which would go into $2, $3, $4. Should it, and have I done something wrong, or is that not how it works? I read the perldoc info (and some other stuff), but couldn't gather from that how it was supposed to work in this case. Any suggestions, in either case?

Thank you again, monks, you are giving me an enormous amount of help. It makes learning Perl (even more) fun when there are knowledgeable people to point out your stupid mistakes/misunderstandings :)

Those are my principles. If you don't like them I have others.
-- Groucho Marx

Replies are listed 'Best First'.
Re: Regex - backreferencing multiple matches in one group.
by moritz (Cardinal) on Mar 04, 2008 at 11:18 UTC
    Two ways:

    1) (recommended) parse all indexes in one group, and then postprocess it (for example with split)


    (capturing parenthesis on the outside)
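
    A minimal sketch of what option 1 could look like, assuming the input line from the question. The variable names ($line, $name, $index_run) are made up for illustration, and the post-processing here uses a list-context /g match rather than split; either would do.

```perl
use strict;
use warnings;

my $line = "!Frequency[A][B][C] ..some other text..";

# One capture for the table name, one for the whole run of bracketed indexes
# (the capturing parentheses sit outside the repeated, non-capturing group).
my ($name, $index_run) =
    $line =~ / ^ ! ( [^\[\s]+ ) ( (?: \[ [^\]]+ \] )+ ) /x
    or die "line does not look like a table reference\n";

# Post-process the run: collect whatever sits between each [ and ].
my @indexes = $index_run =~ / \[ ( [^\]]+ ) \] /gx;

print "$name: @indexes\n";
```

    The first regex only has to answer "is there a run of indexes, and where is it"; counting them is left to the second step, so the 1-to-4 limit could be checked afterwards with scalar(@indexes) if it matters.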

    2) "fancy": Use the experimental (?{ ... }) code assertions (but be sure to read the warnings in perlre first):

    my @indexes;
    my $re = qr{
        ...                     # everything before
        (?:
            \[                  # opening [
            ([^\]]+)            # everything except ]
            \]                  # closing ]
            (?{ push @indexes, $^N })   # store the match ($^N is the most recently closed group)
        )+                      # as many times as you want
    }x;

    Update: here's why your solution doesn't work:

    The numbering behind $1, $2, ... is fixed at the time the pattern is compiled (i.e. before the regex engine sees the string it will match on).

    It counts the opening capturing parentheses from left to right, binding the first one to $1, the second to $2, etc.

    So you get this mapping:

    re:    (..)  (..)+
    vars:   $1    $2

    Now each time the second group matches, it writes the captured string into $2, which means you'll get the last match of that group in $2.
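
    The overwriting can be seen directly with a small sketch (the pattern is a simplified stand-in for the one in the question, and @captures is just an illustrative name):

```perl
use strict;
use warnings;

my $line = "!Frequency[A][B][C] ..some other text..";

# The quantified group matches three times, but it is still only group 2,
# so the pattern has exactly two capture groups no matter how often it repeats.
my @captures = $line =~ / ^ ! ( [^\[\s]+ ) (?: \[ ( [^\]]+ ) \] ){1,4} /x;

print "number of captures: ", scalar(@captures), "\n";
print "group 1: $captures[0]\n";
print "group 2: $captures[1]\n";
```

    Group 2 ends up holding 'C', the value from its last iteration; 'A' and 'B' were captured along the way and then overwritten.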

      So, this is either non-responsive (since I'm not using regexes) or one of those "hey, I never thought of that" situations. A cheater approach to this problem is to use 'split':
      use strict;
      use warnings;

      my $input = "!Frequency[A][B][C] ..some other text..";
      $input =~ s/^!//;    # dump that leading '!'
      print "'$input'\n";

      my @pieces = split(/[\]\[]+/, $input);
      foreach my $piece (@pieces) {
          print "PIECE '$piece'\n";
      }
      I see, thank you :) Just deciding whether to play around with the 'fancy' option. I probably will, then decide to use the simple one anyway!
      Those are my principles. If you don't like them I have others.
      -- Groucho Marx
Re: Regex - backreferencing multiple matches in one group.
by clinton (Priest) on Mar 04, 2008 at 11:36 UTC
    Your captures will always be numbered according to the leftmost opening parenthesis, for instance:
    'abcdef' =~ / ( abc ( def) ) /x;
    $1 -> 'abcdef'
    $2 -> 'def'

    So you have two options here. You can either specify each matching group, or match in a loop. For instance:

    use Data::Dumper;

    my $table_ref="!Frequency[A][B][C] ..some other text..";

    ### Groups specified explicitly
    my $var_rx  = qr / \[ ( [^\]]+ ) \] /x;
    my $file_rx = qr / ! ( [^\[]+ ) /x;
    my $regex   = qr/^ $file_rx $var_rx $var_rx? $var_rx? $var_rx? \s+.*/x;

    my @matches = ( $table_ref =~ /$regex/);
    print Dumper (\@matches);

    ## Loop
    my ($filename) = ($table_ref =~ /^ $file_rx /gcx );
    my @vars;
    while ( $table_ref =~ /\G $var_rx/gcx ) {
        push @vars,$1;
    }
    print Dumper($filename,\@vars);

    See perlre for an explanation of \G and /c
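
    As a rough illustration of what \G and /c contribute (a sketch with made-up variable names, not a replacement for reading perlre): each /g match in scalar context leaves pos() just after the match, \G pins the next attempt to that position, and /c stops a failed attempt from resetting pos().

```perl
use strict;
use warnings;

my $line = "!Frequency[A][B][C] ..some other text..";

# Scalar-context /g match: on success, pos($line) sits just after the name.
$line =~ / ^ ! ( [^\[]+ ) /gcx or die "no table name\n";
my $name = $1;

# \G anchors every attempt at pos($line), so only adjacent [..] groups are taken.
my @vars;
while ($line =~ / \G \[ ( [^\]]+ ) \] /gcx) {
    push @vars, $1;
}

# Thanks to /c, the failed attempt that ended the loop did not reset pos(),
# so the remainder of the line can be picked up from where the loop stopped.
my $rest = '';
if ($line =~ / \G \s+ (.*) /gcx) {
    $rest = $1;
}

print "name: $name\nvars: @vars\nrest: $rest\n";
```

    Without /c, the failing match at the end of the while loop would reset pos() to the start of the string, and the final \G match would be attempted from position 0 instead.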


Re: Regex - backreferencing multiple matches in one group.
by ysth (Canon) on Mar 04, 2008 at 11:28 UTC
    Each () can only capture one thing; if it's repeated, the $<digit> variable gets whatever was captured last.

    Since there are only at most 4, try:

    $temp_ =~ /\!(\S+?)             #tablename
               \[(.*?)\]            #table variables
               (?:\[(.*?)\]
                 (?:\[(.*?)\]
                   (?:\[(.*?)\])?
                 )?
               )?
               \s+.*/x;             #white space then rest of text.
    print Dumper ($1, $2, $3, $4, $5);
      A couple minor suggestions:
      Nesting is not needed.
      The "\s+.*" at the end does nothing.
      /\!(\S+?)          #tablename
       \[(.*?)\]         #table variables
       (?:\[(.*?)\])?
       (?:\[(.*?)\])?
       (?:\[(.*?)\])?
      /x
        Nesting is not needed, but more clearly indicates the up-to-4.

        The +.* I grant you, but the \s does do something.
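
        One way to see what the \s contributes, sketched with a hypothetical input where the brackets are followed directly by non-space text (the string and the variable names are assumptions for illustration):

```perl
use strict;
use warnings;

# Hypothetical line: no whitespace anywhere after the bracketed variables.
my $line = "!Frequency[A][B]junk";

# The flat pattern from the reply above, without any trailing \s.
my $without_ws = qr/
    \! (\S+?)            # table name
    \[ (.*?) \]          # first table variable
    (?: \[ (.*?) \] )?   # optional second
    (?: \[ (.*?) \] )?   # optional third
    (?: \[ (.*?) \] )?   # optional fourth
/x;

# The same pattern, but requiring whitespace after the variables.
my $with_ws = qr/ $without_ws \s /x;

my $matched_without = $line =~ $without_ws ? 1 : 0;
my $matched_with    = $line =~ $with_ws    ? 1 : 0;

print "without \\s: $matched_without\n";
print "with \\s:    $matched_with\n";
```

        The version without \s accepts the line; the version with \s rejects it, because the whitespace is required and there is none to find. The trailing +.* adds nothing beyond that, which is the point conceded above.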
