Re: Matching a pattern two or four times

(Edit: Added demerphq's variant to the mix. ++demerphq!)

I agree with LTjake: split is the way to go. If you want more validation, demerphq's answer below is good, but it comes with a performance hit.

Note that I tested demerphq's answers both with and without /o. Adding /o helps performance a LOT (roughly 34-36% in the run below), since you don't want to recompile the regex if you don't have to.
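To make the /o point concrete, here is a minimal sketch of what the flag does. The sub names are made up for illustration and are not part of the benchmark. With /o the interpolated pattern is compiled on first use and then frozen, which is where the speedup comes from, and also why it is only safe when the interpolated variable never changes:

```perl
use strict;
use warnings;

# Hypothetical helpers: match the fixed string "foo" against each word in turn.
sub matches_with_o {
    my @flags;
    for my $word (@_) {
        # /o: $word is interpolated and compiled on the first pass only
        push @flags, ("foo" =~ /^$word\z/o ? 1 : 0);
    }
    return @flags;
}

sub matches_without_o {
    my @flags;
    for my $word (@_) {
        # no /o: the pattern follows the current value of $word
        push @flags, ("foo" =~ /^$word\z/ ? 1 : 0);
    }
    return @flags;
}

print join(" ", matches_with_o(qw(foo bar))), "\n";     # prints "1 1"
print join(" ", matches_without_o(qw(foo bar))), "\n";  # prints "1 0"
```

In the benchmark, $num_rex and $nc_num_rex never change after they are set, so freezing the pattern is harmless there.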

If you want to stick with regexes, another alternative is to use //g.

Here's how your regex, split, demerphq's solution, a variant on demerphq's that avoids capturing, and 2 //g approaches compare on perl 5.6.1 under Cygwin:

$ perl testSplits.pl
Benchmark: running demsChkSplt, demsChkSplt_o, demsRE, demsRE_o, hisSplit, regex_g, regex_g2, yourRE, each for at least 3 CPU seconds...
demsChkSplt:  3 wallclock secs ( 3.03 usr + -0.01 sys =  3.02 CPU) @ 27103.21/s (n=81933)
demsChkSplt_o:  3 wallclock secs ( 3.02 usr +  0.01 sys =  3.03 CPU) @ 36854.04/s (n=111852)
    demsRE:  3 wallclock secs ( 3.16 usr +  0.00 sys =  3.16 CPU) @ 29056.42/s (n=91673)
  demsRE_o:  4 wallclock secs ( 3.04 usr +  0.00 sys =  3.04 CPU) @ 38914.00/s (n=118104)
  hisSplit:  3 wallclock secs ( 3.07 usr +  0.01 sys =  3.08 CPU) @ 75561.43/s (n=233107)
   regex_g:  4 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 28482.20/s (n=85589)
  regex_g2:  4 wallclock secs ( 3.09 usr +  0.01 sys =  3.10 CPU) @ 56279.32/s (n=174691)
    yourRE:  3 wallclock secs ( 3.09 usr +  0.01 sys =  3.10 CPU) @ 53248.47/s (n=164804)
                 Rate demsChkSplt regex_g demsRE demsChkSplt_o demsRE_o yourRE regex_g2 hisSplit
demsChkSplt   27103/s          --     -5%    -7%          -26%     -30%   -49%     -52%     -64%
regex_g       28482/s          5%      --    -2%          -23%     -27%   -47%     -49%     -62%
demsRE        29056/s          7%      2%     --          -21%     -25%   -45%     -48%     -62%
demsChkSplt_o 36854/s         36%     29%    27%            --      -5%   -31%     -35%     -51%
demsRE_o      38914/s         44%     37%    34%            6%       --   -27%     -31%     -49%
yourRE        53248/s         96%     87%    83%           44%      37%     --      -5%     -30%
regex_g2      56279/s        108%     98%    94%           53%      45%     6%       --     -26%
hisSplit      75561/s        179%    165%   160%          105%      94%    42%      34%       --

Split wins by a long shot.
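If you go with split but still want a cheap sanity check, something like the following might do. It is only a sketch: parse_line is a hypothetical helper, not code from this thread, and it only checks the field count (one name plus two or four values), not that the values are numeric.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line on whitespace and accept it only if
# it has a name followed by exactly two or four further fields.
sub parse_line {
    my ($line) = @_;
    my ($name, @numbers) = split ' ', $line;   # ' ' skips leading whitespace, like bare split
    return unless @numbers == 2 || @numbers == 4;
    return ($name, @numbers);
}

for my $line ("Abc     21223.7   21225.33   22270.3   22280.1",
              "Def    21600.23  24567.43",
              "Ghi    1.0  2.0  3.0") {
    my @fields = parse_line($line);
    print @fields ? "ok: @fields\n" : "rejected: $line\n";
}
```

The third line is rejected because it has three values rather than two or four.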

Doing /(\w+|[0-9.]+)/g is horrible (that was regex_g): it is slow, and because \w also matches digits, it splits a value like 21223.7 into 21223 and .7. Running /(\w+)/ and /([0-9.]+)/g separately (regex_g2) gives results comparable to your regex, and is more readable and easier to extend if you wind up with more columns.
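Here is a small standalone sketch of the two //g variants on one of the sample lines, so you can see what each one returns. The variable names are mine; note the list-context assignment to ($name), which is needed to get the captured text rather than a true/false value.

```perl
use strict;
use warnings;

my $line = "Abc     21223.7   21225.33   22270.3   22280.1";

# Two separate matches: first word, then every run of digits and dots.
my ($name)  = $line =~ /(\w+)/;
my @numbers = $line =~ /([0-9.]+)/g;

# Single alternation: \w+ is tried first and also matches digits,
# so each number is cut at its decimal point.
my @combined = $line =~ /(\w+|[0-9.]+)/g;

print "name: $name\n";            # prints "name: Abc"
print "numbers: @numbers\n";      # prints "numbers: 21223.7 21225.33 22270.3 22280.1"
print "combined: @combined\n";    # prints "combined: Abc 21223 .7 21225 .33 22270 .3 22280 .1"
```

This assumes the name contains no digits, as in the sample data; a name like Abc1 would also show up in @numbers.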

In case you want details, here's the benchmark:

#!/usr/bin/perl -w
#
#use re 'debug';

use strict;
use Benchmark qw(cmpthese);

my $i=0;

my @data=(
    "Abc     21223.7   21225.33   22270.3   22280.1",
    "Def    21600.23  24567.43"
);

sub yourRE {
    my @fields;
    foreach(@data) {
        @fields=($_=~m/^(\w+)\s+([\d\.]+)\s+([\d\.]+)(?:\s+([\d\.]+)\s+([\d\.]+))?/);
    }
}

sub hisSplit {
    my @fields;
    foreach(@data) {
        @fields=split;
    }
}

sub regex_g {
    my @fields;
    foreach(@data) {
        @fields=($_=~m/(\w+|[0-9.]+)/g);
    }
}

sub regex_g2 {
    my $name;
    my @digits;
    foreach(@data) {
        ($name)=($_=~m/(\w+)/);   # list context, so $name gets the capture, not 1
        @digits=($_=~m/([0-9.]+)/g);
    }
}

# modified from: perldoc -q scalar is a number
my $num_rex=qr/(-?(?:\d+(?:\.\d*)?|\.\d+))/;
sub demsRE {
    my @fields;
    foreach(@data) {
        @fields=($_=~/^\s* (\w+) \s+ $num_rex \s+ $num_rex
                      (?: \s+ $num_rex \s+ $num_rex )? \s*$/x);
    }
}

# /o helps a lot
sub demsRE_o {
    my @fields;
    foreach(@data) {
        @fields=($_=~/^\s* (\w+) \s+ $num_rex \s+ $num_rex
                      (?: \s+ $num_rex \s+ $num_rex )? \s*$/xo);
    }
}

# modified from: perldoc -q scalar is a number
my $nc_num_rex=qr/(?:\d+(?:\.\d*)?|\.\d+)/;

# let's see if it's the captures in dems's approach that slow things down?
# turns out it isn't
sub demsChkSplt {
    my @fields;
    foreach(@data) {
        if(/^\s* (\w+) \s+ $nc_num_rex \s+ $nc_num_rex
                            (?: \s+ $nc_num_rex \s+ $nc_num_rex )? \s*$/x) {
            @fields=split;
        }
    }
}

# /o helps a lot
sub demsChkSplt_o {
    my @fields;
    foreach(@data) {
        if(/^\s* (\w+) \s+ $nc_num_rex \s+ $nc_num_rex
                            (?: \s+ $nc_num_rex \s+ $nc_num_rex )? \s*$/ox) {
            @fields=split;
        }
    }
}

cmpthese(-3,
        {
            yourRE   => \&yourRE,
            hisSplit => \&hisSplit,
            regex_g => \&regex_g,
            regex_g2 => \&regex_g2,
            demsRE => \&demsRE,
            demsChkSplt => \&demsChkSplt,
            demsRE_o => \&demsRE_o,
            demsChkSplt_o => \&demsChkSplt_o,
        }
     );

--
Mike
