in reply to Matching a pattern two or four times

(Edit: Added demerphq's variant to the mix. ++demerphq!)

I agree with LTjake, split is the way to go. For more validation, demerphq's answer down below is good, but there's a performance hit.
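As a rough illustration of split with a light sanity check, here is a minimal sketch. The helper name parse_line and the "two or four numbers" field-count rule are assumptions made for illustration, not code from this thread, and it does not check that the fields are actually numeric:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_line is a hypothetical helper: split on whitespace, then accept
# only a name followed by exactly two or four further fields.
sub parse_line {
    my ($line) = @_;
    my @fields = split ' ', $line;
    my $count  = @fields - 1;                  # fields after the name
    return unless $count == 2 || $count == 4;  # empty list on a bad line
    return @fields;
}

for my $line ("Abc 21223.7 21225.33 22270.3 22280.1",
              "Def 21600.23 24567.43",
              "Bad 1.0 2.0 3.0") {
    my @fields = parse_line($line);
    print @fields ? "ok:   @fields\n" : "skip: $line\n";
}
```

The count check catches lines with three or five numbers cheaply; anything stricter (signs, malformed decimals) needs a regex along the lines of demerphq's.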

Note that I tested demerphq's answers both with and without a /o; adding the o helps performance a LOT, since you don't want to recompile the regex if you don't have to.
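Another way to make the compile-once intent explicit, without relying on /o, is to build the whole line pattern with qr// up front. This variant is not part of the benchmark in this post, so treat it as an untimed sketch; $num, $line_rex and parse_strict are names made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Number pattern in the spirit of the thread's $num_rex (capturing).
my $num = qr/(-?(?:\d+(?:\.\d*)?|\.\d+))/;

# Assumption: compile the full line pattern once, so the match inside
# the loop has nothing left to interpolate.
my $line_rex = qr/
    ^\s* (\w+)
    \s+ $num \s+ $num
    (?: \s+ $num \s+ $num )?
    \s*$
/x;

# Returns the name and numbers, or an empty list if the line is invalid.
sub parse_strict {
    my ($line) = @_;
    my @fields = $line =~ $line_rex;
    # A two-number line leaves the optional group's captures undefined.
    return grep { defined } @fields;
}

for my $line ("Abc 21223.7 21225.33 22270.3 22280.1", "Def -1.5 .25") {
    my @fields = parse_strict($line);
    print "matched: @fields\n" if @fields;
}
```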

If you do want to go with regexes, another alternative is to use //g.

Here's how your regex, split, demerphq's solution, a variant on demerphq's that avoids capturing, and 2 //g approaches compare on perl 5.6.1 under Cygwin:

$ perl testSplits.pl
Benchmark: running demsChkSplt, demsChkSplt_o, demsRE, demsRE_o, hisSplit, regex_g, regex_g2, yourRE, each for at least 3 CPU seconds...
  demsChkSplt:  3 wallclock secs ( 3.03 usr + -0.01 sys =  3.02 CPU) @ 27103.21/s (n=81933)
demsChkSplt_o:  3 wallclock secs ( 3.02 usr +  0.01 sys =  3.03 CPU) @ 36854.04/s (n=111852)
       demsRE:  3 wallclock secs ( 3.16 usr +  0.00 sys =  3.16 CPU) @ 29056.42/s (n=91673)
     demsRE_o:  4 wallclock secs ( 3.04 usr +  0.00 sys =  3.04 CPU) @ 38914.00/s (n=118104)
     hisSplit:  3 wallclock secs ( 3.07 usr +  0.01 sys =  3.08 CPU) @ 75561.43/s (n=233107)
      regex_g:  4 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 28482.20/s (n=85589)
     regex_g2:  4 wallclock secs ( 3.09 usr +  0.01 sys =  3.10 CPU) @ 56279.32/s (n=174691)
       yourRE:  3 wallclock secs ( 3.09 usr +  0.01 sys =  3.10 CPU) @ 53248.47/s (n=164804)
                 Rate demsChkSplt regex_g demsRE demsChkSplt_o demsRE_o yourRE regex_g2 hisSplit
demsChkSplt   27103/s          --     -5%    -7%          -26%     -30%   -49%     -52%     -64%
regex_g       28482/s          5%      --    -2%          -23%     -27%   -47%     -49%     -62%
demsRE        29056/s          7%      2%     --          -21%     -25%   -45%     -48%     -62%
demsChkSplt_o 36854/s         36%     29%    27%            --      -5%   -31%     -35%     -51%
demsRE_o      38914/s         44%     37%    34%            6%       --   -27%     -31%     -49%
yourRE        53248/s         96%     87%    83%           44%      37%     --      -5%     -30%
regex_g2      56279/s        108%     98%    94%           53%      45%     6%       --     -26%
hisSplit      75561/s        179%    165%   160%          105%      94%    42%      34%       --

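Split wins by a long shot: it is 42% faster than your regex and roughly twice as fast as either validating variant, even with /o.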

Doing /(\w+|[0-9.]+)/g is horrible (that was regex_g), but matching /(\w+)/ and /([0-9.]+)/g separately (regex_g2) gives results comparable to your regex, and it is more readable and easier to extend if you wind up with more columns.
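For reference, a minimal sketch of the two-step //g idea that actually keeps the captured name. The helper name parse_g2 is hypothetical, and the sketch assumes the name contains no digits, since /([0-9.]+)/g would otherwise pick those up as well:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_g2 is a hypothetical name for the two-step //g approach.
sub parse_g2 {
    my ($line) = @_;
    my ($name)  = $line =~ /(\w+)/;        # list context: the first word
    my @numbers = $line =~ /([0-9.]+)/g;   # every run of digits and dots
    return ($name, @numbers);
}

my ($name, @numbers) = parse_g2("Def 21600.23 24567.43");
print "$name: @numbers\n";   # prints "Def: 21600.23 24567.43"
```

Adding a column later only means the @numbers list grows by one; neither pattern has to change.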

In case you want details, here's the benchmark:

#!/usr/bin/perl -w
#
#use re 'debug';

use strict;
use Benchmark qw(cmpthese);

my $i=0;
my @data=(
    "Abc 21223.7 21225.33 22270.3 22280.1",
    "Def 21600.23 24567.43"
);

sub yourRE {
    my @fields;
    foreach(@data) {
        @fields=($_=~m/^(\w+)\s+([\d\.]+)\s+([\d\.]+)(?:\s+([\d\.]+)\s+([\d\.]+))?/);
    }
}

sub hisSplit {
    my @fields;
    foreach(@data) {
        @fields=split;
    }
}

sub regex_g {
    my @fields;
    foreach(@data) {
        @fields=($_=~m/(\w+|[0-9.]+)/g);
    }
}

sub regex_g2 {
    my $name;
    my @digits;
    foreach(@data) {
        # note: this is a scalar assignment, so $name gets the match
        # result (true/false), not the captured word; write
        # ($name)=($_=~m/(\w+)/) to keep the capture
        $name=($_=~m/(\w+)/);
        @digits=($_=~m/([0-9.]+)/g);
    }
}

# modified from: perldoc -q scalar is a number
my $num_rex=qr/(-?(?:\d+(?:\.\d*)?|\.\d+))/;

sub demsRE {
    my @fields;
    foreach(@data) {
        @fields=($_=~/^\s* (\w+) \s+ $num_rex \s+ $num_rex
                      (?: \s+ $num_rex \s+ $num_rex )? \s*$/x);
    }
}

# /o helps a lot
sub demsRE_o {
    my @fields;
    foreach(@data) {
        @fields=($_=~/^\s* (\w+) \s+ $num_rex \s+ $num_rex
                      (?: \s+ $num_rex \s+ $num_rex )? \s*$/xo);
    }
}

# modified from: perldoc -q scalar is a number
my $nc_num_rex=qr/(?:\d+(?:\.\d*)?|\.\d+)/;

# let's see if it's the captures in dems's approach that slow things down?
# turns out it isn't
sub demsChkSplt {
    my @fields;
    foreach(@data) {
        if(/^\s* (\w+) \s+ $nc_num_rex \s+ $nc_num_rex
               (?: \s+ $nc_num_rex \s+ $nc_num_rex )? \s*$/x) {
            @fields=split;
        }
    }
}

# /o helps a lot
sub demsChkSplt_o {
    my @fields;
    foreach(@data) {
        if(/^\s* (\w+) \s+ $nc_num_rex \s+ $nc_num_rex
               (?: \s+ $nc_num_rex \s+ $nc_num_rex )? \s*$/ox) {
            @fields=split;
        }
    }
}

cmpthese(-3, {
    yourRE        => \&yourRE,
    hisSplit      => \&hisSplit,
    regex_g       => \&regex_g,
    regex_g2      => \&regex_g2,
    demsRE        => \&demsRE,
    demsChkSplt   => \&demsChkSplt,
    demsRE_o      => \&demsRE_o,
    demsChkSplt_o => \&demsChkSplt_o,
} );

--
Mike

Re: Re: Matching a pattern two or four times
by demerphq (Chancellor) on Sep 24, 2002 at 17:03 UTC
    Nice analysis. :-)

    A few nits, however: for the regex versions you should state which handles what. For instance, only demsRE and demsRE_o will handle negative numbers. Also, the timing results would be more interesting if more cases were handled, including fail cases. I.e., what happens if there are 3 numbers, or 5? Yada yada...

    But they are just nits. Nice work.

    --- demerphq
    my friends call me, usually because I'm late....