perl-diddler has asked for the wisdom of the Perl Monks concerning the following question:

I want to read in a line with a text field, followed by 4 floats separated by unspecified non-numerics.

I used an RE of the form:
(text);(?:(float)[non-num]){4}

This matches as expected, but I'm only getting 2 captured substrings back instead of the 5 I want. $1 is the text field, but $2 holds only the final float, with floats 1-3 being tossed. This isn't desirable.

I could duplicate the sub-RE that has the {4} count 4 times, but that seems wasteful and less clear. Is there a way to keep my idea of matching the sub-RE 4 times while also returning the first 3 matches?

Seems like such a simple concept...sigh. Is this doable without nibbling at the line in a loop that picks off the trailing numerics with successive search & replace operations? TIA -l

2006-02-18 Retitled by planetscape, as per Monastery guidelines
Original title: 'simple question ?'


Replies are listed 'Best First'.
Re: capturing multiple repeated regex subparts
by ikegami (Patriarch) on Feb 17, 2006 at 21:01 UTC

    Regexps are useful for validation, extraction and tokenizing. However, they are not as strong at parsing, as you have discovered. Parsing is nonetheless possible, using advanced features.

    use v5.8.0;  # or higher
                 # For $^N

    our @rv;
    our @temp_rv;

    /
       (text);
       (?{ local @temp_rv = ( @temp_rv, $^N ) })
       (?:
          (float)
          (?{ local @temp_rv = ( @temp_rv, $^N ) })
          (?:non-num)
       ){4}
       (?{ @rv = @temp_rv })
    /x;

    Tested.

    • local is needed in case of backtracking.
    • @rv = @temp_rv is necessary because @temp_rv will wind back to its original value before the regexp exits.
    • Package (our) variables (rather than lexical (my) variables) are needed because the code blocks in the regexp act as closures.
      Don't eschew $^R! That's what it's there for:
      # UPDATED: added comments about what's going on
      our @rv;  # you like taking trips? ;)

      m{
        (text);

        # the (?{ ... }) block's return value
        # is given to $^R, the magical variable
        # whose value is auto-localized and gets
        # rolled back when backtracking occurs.
        # $^R's initial value, then, is an array
        # ref with one element, $1's value.
        (?{ [$1] })

        (?:
          (float)

          # then, four times, we add the float we
          # match in $2 to the end of $^R.  we
          # can't just do push(@{$^R}, $2), because
          # that would break the auto-rollback magic,
          # so instead, we just let the return value
          # set $^R again.
          (?{ [ @{$^R}, $2 ] })

          non-num
        ){4}

        # finally, we store @{$^R} in @rv.
        (?{ @rv = @{$^R} })
      }x;
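
For anyone who wants to try the $^R idea on real data, here is a minimal sketch with the placeholders filled in. The sample line and the concrete patterns (a word for the text field, a plain decimal for the float, any character that is not a digit or dot for the separator) are assumptions for illustration, not part of the original question.

```perl
use strict;
use warnings;

our @rv;    # package variable that receives the final result

# Hypothetical sample line: text field, ';', then four floats.
my $line = "asdf;12.34x23.45y34.56z45.67n";

my $ok = $line =~ m{
    ^ (\w+) ;
    (?{ [ $1 ] })                 # start the accumulator with the text field
    (?:
        ( \d+ (?: \. \d+ )? )
        (?{ [ @{$^R}, $2 ] })     # build a new array ref on each repetition
        [^\d.]
    ){4}
    (?{ @rv = @{$^R} })           # copy the result out after the last repetition
}x;

print join( ", ", @rv ), "\n" if $ok;
```

Because the pattern is a literal (nothing is interpolated into it), the code blocks should not need `use re 'eval'`.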

      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
Re: capturing multiple repeated regex subparts
by Roy Johnson (Monsignor) on Feb 17, 2006 at 20:57 UTC
    Every group to be captured has to have its own explicit set of parentheses in the regex. You can't populate $2, $3, $4, and $5 by putting a quantifier after the 2nd group.

    So you'll probably want to do this in two steps. I'll pretend you have a $float regex and a $non_num regex:

    ($text, $nums) = /(text);((?:$float$non_num){4})/;
    @nums = split $non_num, $nums;
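
A runnable sketch of the two-step approach above, with the pretend pieces filled in. The definitions of $float and $non_num and the sample line are assumptions made for the example; substitute whatever your real data needs.

```perl
use strict;
use warnings;

# Assumed patterns: a plain decimal float and one non-numeric separator.
my $float   = qr/\d+(?:\.\d+)?/;
my $non_num = qr/[^\d.]/;

# Hypothetical sample line: text field, ';', then four floats.
my $line = "asdf;12.34x23.45y34.56z45.67n";

# Step 1: validate the line and capture the whole numeric run as one group.
my ( $text, $nums ) = $line =~ /^(\w+);((?:$float$non_num){4})/
    or die "line does not match\n";

# Step 2: split the run on the separator (split drops the trailing empty field).
my @nums = split $non_num, $nums;

print join( ", ", $text, @nums ), "\n";
```

The first match does the validation, so the split in step 2 can stay simple.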

    Caution: Contents may have been coded under pressure.
Re: capturing multiple repeated regex subparts
by GrandFather (Saint) on Feb 17, 2006 at 20:55 UTC

    You have only two sets of capture brackets so you only get two captures. Your suggestion of duplicating the counted match is probably the best answer in this case - that's what copy and paste is for in your editor :)
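
If copy and paste feels too manual, one possible variation is to let Perl's string repetition operator (x) do the duplicating. The sketch below first shows the last-capture-wins behaviour from the question, then builds four separate capture groups. The patterns, variable names and sample line are assumptions for illustration only.

```perl
use strict;
use warnings;

# Assumed patterns and sample data.
my $float   = qr/\d+(?:\.\d+)?/;
my $non_num = qr/[^\d.]/;
my $line    = "asdf;12.34x23.45y34.56z45.67n";

# A quantified group has a single capture slot, so only the last
# repetition survives: this returns just the text and the fourth float.
my @last_only = $line =~ /^(\w+);(?:($float)$non_num){4}/;

# Repeating the sub-pattern as a string gives four distinct groups.
my $four = "($float)$non_num" x 4;
my @all  = $line =~ /^(\w+);$four/;

print "quantified: @last_only\n";
print "repeated:   @all\n";
```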


    DWIM is Perl's answer to Gödel
Re: capturing multiple repeated regex subparts
by kwaping (Priest) on Feb 18, 2006 at 00:01 UTC
    How's this?
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper::Simple;

    # (text);(?:(float)[non-num]){4};

    my $text = "asdf;12.34x23.45y34.56z45.67n";
    my @matches = ($text =~ /([\d.]+\D)/g);
    print Dumper(@matches);
Re: capturing multiple repeated regex subparts
by neilwatson (Priest) on Feb 17, 2006 at 20:49 UTC
    Some real code and real sample data would be helpful.

    Neil Watson
    watson-wilson.ca

Re: capturing multiple repeated regex subparts
by Aristotle (Chancellor) on Feb 20, 2006 at 03:44 UTC

    In addition to the other solutions posted, you can break up the regex using /g and \G and do the looping yourself.

    my @submatch;
    {
        $str =~ /(text);/g or last;
        push @submatch, $1;
        for( 1 .. 4 ) {
            # \G continues from where the previous /g match ended
            $str =~ /\G(?:(float)[non-num])/g or do { @submatch = (); last; };
            push @submatch, $1;
        }
    }
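
Here is the same loop as a self-contained sketch with concrete patterns in place of the placeholders. The patterns and the sample string are assumptions, and the do-block is written as an if/else purely for readability.

```perl
use strict;
use warnings;

# Assumed patterns and sample data.
my $float   = qr/\d+(?:\.\d+)?/;
my $non_num = qr/[^\d.]/;
my $str     = "asdf;12.34x23.45y34.56z45.67n";

my @submatch;
if ( $str =~ /^(\w+);/g ) {    # scalar-context /g remembers pos($str)
    push @submatch, $1;
    for ( 1 .. 4 ) {
        # \G anchors each match where the previous one left off.
        if ( $str =~ /\G($float)$non_num/g ) {
            push @submatch, $1;
        }
        else {
            @submatch = ();    # all or nothing: discard partial results
            last;
        }
    }
}

print join( ", ", @submatch ), "\n";
```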

    Makeshifts last the longest.