Bilbo has asked for the wisdom of the Perl Monks concerning the following question:

I seem to be having trouble constructing a regex that I'm sure ought to be straightforward.

I want to read lines from a file which contain one or more alphanumeric characters, followed by either 2 or 4 positive real numbers, separated by white space. For example:

Abc 	21223.7   21225.33   22270.3   22280.1
Def	21600.23  24567.43

The only regexp I've managed to come up with which does what I want is:

m/^(\w+)\s+([\d\.]+)\s+([\d\.]+)(?:\s+([\d\.]+)\s+([\d\.]+))?/

but I feel sure that there must be something shorter and more elegant.

I tried using:

m/^(\w+)(?:\s+([\d\.]+)\s+([\d\.]+)){1,2}/
Which works when there are two numbers, but puts the last two numbers into $2 and $3, throwing away the first two numbers from lines with four numbers (higher number variables are all empty).

If I remove the ?: the same thing happens but in $3 and $4.

Replies are listed 'Best First'.
Re: Matching a pattern two or four times
by LTjake (Prior) on Sep 24, 2002 at 11:54 UTC
    Why not just split it up on white space?

    my @strings = ( 'Abc 21223.7 21225.33 22270.3 22280.1', 'Def 21600.23 24567.43' );
    foreach (@strings) {
        my @arr = split(/\s+/, $_);
        print join(' - ', @arr), "\n";
    }

    gives me:
    Abc - 21223.7 - 21225.33 - 22270.3 - 22280.1
    Def - 21600.23 - 24567.43
    All of your numeric values would be in $arr[1] to $arr[$#arr]
Re: Matching a pattern two or four times
by RMGir (Prior) on Sep 24, 2002 at 12:22 UTC
    (Edit: Added demerphq's variant to the mix. ++demerphq!)

    I agree with LTjake, split is the way to go. For more validation, demerphq's answer down below is good, but there's a performance hit.

    Note that I tested demerphq's answers both with and without a /o; adding the o helps performance a LOT, since you don't want to recompile the regex if you don't have to.

    If you want to go with regexes, another alternative would have been to use //g, though.

    Here's how your regex, split, demerphq's solution, a variant on demerphq's that avoids capturing, and 2 //g approaches compare on perl 5.6.1 under Cygwin:

    $ perl testSplits.pl
    Benchmark: running demsChkSplt, demsChkSplt_o, demsRE, demsRE_o, hisSplit, regex_g, regex_g2, yourRE, each for at least 3 CPU seconds...
    demsChkSplt:    3 wallclock secs ( 3.03 usr + -0.01 sys =  3.02 CPU) @ 27103.21/s (n=81933)
    demsChkSplt_o:  3 wallclock secs ( 3.02 usr +  0.01 sys =  3.03 CPU) @ 36854.04/s (n=111852)
    demsRE:         3 wallclock secs ( 3.16 usr +  0.00 sys =  3.16 CPU) @ 29056.42/s (n=91673)
    demsRE_o:       4 wallclock secs ( 3.04 usr +  0.00 sys =  3.04 CPU) @ 38914.00/s (n=118104)
    hisSplit:       3 wallclock secs ( 3.07 usr +  0.01 sys =  3.08 CPU) @ 75561.43/s (n=233107)
    regex_g:        4 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 28482.20/s (n=85589)
    regex_g2:       4 wallclock secs ( 3.09 usr +  0.01 sys =  3.10 CPU) @ 56279.32/s (n=174691)
    yourRE:         3 wallclock secs ( 3.09 usr +  0.01 sys =  3.10 CPU) @ 53248.47/s (n=164804)

                     Rate demsChkSplt regex_g demsRE demsChkSplt_o demsRE_o yourRE regex_g2 hisSplit
    demsChkSplt   27103/s          --     -5%    -7%          -26%     -30%   -49%     -52%     -64%
    regex_g       28482/s          5%      --    -2%          -23%     -27%   -47%     -49%     -62%
    demsRE        29056/s          7%      2%     --          -21%     -25%   -45%     -48%     -62%
    demsChkSplt_o 36854/s         36%     29%    27%            --      -5%   -31%     -35%     -51%
    demsRE_o      38914/s         44%     37%    34%            6%       --   -27%     -31%     -49%
    yourRE        53248/s         96%     87%    83%           44%      37%     --      -5%     -30%
    regex_g2      56279/s        108%     98%    94%           53%      45%     6%       --     -26%
    hisSplit      75561/s        179%    165%   160%          105%      94%    42%      34%       --

    Split wins by a long shot.

    Doing /(\w+|[0-9.]+)/g is horrible (that was regex_g), but doing the /(\w+)/ and /([0-9.]+)/g separately (regex_g2) gives results comparable to your regex, with more readability and easier extensibility if you wind up with more columns.
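The benchmark source is not reproduced here, so the following is only a guess at what the two-step regex_g2 approach might look like, not the code that was timed. The helper name `parse_line_g2` is made up for illustration. It leans on the fact that a scalar-context //g match leaves pos() set, so the list-context //g that follows starts after the label and does not pick up digits inside it.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the label and the numbers out of one line
# using two simple matches instead of one long regex.
sub parse_line_g2 {
    my ($line) = @_;

    # First match: the leading word. With /g in scalar context, the end
    # of the match is remembered in pos($line).
    return unless $line =~ /^(\w+)/g;
    my $name = $1;

    # Second match: /g in list context resumes from pos($line) and
    # returns every run of digits and dots found after the label.
    my @nums = $line =~ /([0-9.]+)/g;

    return ($name, @nums);
}

my ($name, @nums) = parse_line_g2('Abc 21223.7 21225.33 22270.3 22280.1');
print "$name has ", scalar(@nums), " numbers: @nums\n";
```

Adding more columns needs no change to either pattern, which is presumably the extensibility meant above. Note that this sketch does no validation of the count or the shape of the numbers.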

    In case you want details, here's the benchmark:


    --
    Mike
      Nice analysis. :-)

      A few nits, however: for the regex versions you should state which handles what. For instance, only demsRE and demsRE_o will handle negative numbers. Also, the timing results would be more interesting if more cases were handled, including fail cases. I.e., what happens if there are 3 numbers, or 5? Yada yada...

      But they are just nits. Nice work.

      --- demerphq
      my friends call me, usually because I'm late....

Re: Matching a pattern two or four times
by demerphq (Chancellor) on Sep 24, 2002 at 15:30 UTC
    Personally I wonder at what you are doing. If your file contains only records in the above format then use split as other monks have suggested. However if you are trying to extract lines that match from a bunch of other crud then the regex will have to be the way to go.

    Also your regex for floating point numbers leaves quite a bit to be desired. For instance it will match ip addresses as well as floating point values (not to mention things like "................."). There are a variety of regexes that will handle numbers like this correctly to be found in the FAQS.


    my $num_rex=qr/(-?(?:\d+(?:\.\d*)?|\.\d+))/; # modified from: perldoc -q scalar is a number
    while (<DATA>) {
        if (/^\s* (\w+) \s+ $num_rex \s+ $num_rex (?: \s+ $num_rex \s+ $num_rex )? \s*$/x) {
            print "Matched a word and ",(defined $4?"four":"two")," numbers: $1 $2 $3",(defined $4?" $4 $5\n":"\n");
        }
    }
    __DATA__
    Abc 21223.7 21225.33 22270.3 22280.1
    Def 21600.23 24567.43

    Oh, I changed it to be tolerant of leading and trailing whitespace. YMMV.
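To make the point about [\d\.]+ concrete, here is a small sketch that runs a few sample strings through both patterns. Both are anchored with ^ and $ here so that each string is tested as a whole; the sample strings and the variable names `$loose` and `$strict` are chosen purely for illustration.

```perl
use strict;
use warnings;

# The character class from the original question, anchored.
my $loose  = qr/^[\d\.]+$/;

# The stricter number pattern from this reply, anchored and without
# the capturing parentheses.
my $strict = qr/^-?(?:\d+(?:\.\d*)?|\.\d+)$/;

for my $s ('21223.7', '192.168.0.1', '.....', '-3.5', '.5') {
    printf "%-12s loose=%-3s strict=%s\n", $s,
        ($s =~ $loose  ? 'yes' : 'no'),
        ($s =~ $strict ? 'yes' : 'no');
}
```

The IP address and the run of dots pass the loose pattern but not the strict one, while -3.5 passes only the strict one because the character class has no minus sign in it.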

    HTH

    --- demerphq
    my friends call me, usually because I'm late....

Re: Matching a pattern two or four times
by Bilbo (Pilgrim) on Sep 24, 2002 at 17:18 UTC

    OK. Split seems to be the right answer - I'm not sure where my obsession with using a regex came from this morning. I think I was hoping to use it to do some level of validation of the input (just print out a warning and skip any lines which didn't match) but splitting then validating the results is more readable, if significantly longer.
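For what it's worth, the split-then-validate version need not be very long. This is a minimal sketch, assuming a line is valid when it has a \w+ label followed by exactly 2 or 4 unsigned decimal numbers. The helper name `parse_record` and the third sample line are made up, and the number pattern is demerphq's with the optional minus sign dropped, which is my assumption since the numbers are meant to be positive.

```perl
use strict;
use warnings;

# Assumption: unsigned decimals only (demerphq's pattern minus the "-?").
my $num = qr/^(?:\d+(?:\.\d*)?|\.\d+)$/;

# Hypothetical helper: returns ($name, @numbers) for a valid line,
# or an empty list if the line should be skipped.
sub parse_record {
    my ($line) = @_;

    # split ' ' splits on runs of whitespace and ignores leading whitespace.
    my ($name, @nums) = split ' ', $line;

    return unless defined $name && $name =~ /^\w+$/;
    return unless @nums == 2 || @nums == 4;
    return if grep { $_ !~ $num } @nums;   # any field that is not a number

    return ($name, @nums);
}

for my $line ('Abc 21223.7 21225.33 22270.3 22280.1',
              'Def 21600.23 24567.43',
              'Ghi 1.0 2.0 3.0') {
    my @rec = parse_record($line);
    if (@rec) { print "ok: @rec\n" }
    else      { warn "skipping: $line\n" }
}
```

Keeping the three checks on separate lines makes it easy to give a different warning for each kind of failure later, which a single regex cannot do as simply.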

    In this case regexes probably weren't the best way to do it, but I still don't understand how to retrieve matched groups from a repeated group. For example:


    my @lines = ("a 1 2", "b 3 4 5 6", "c 7 8 9");
    foreach (@lines) {
        my @list = m/^[a-z](\s+\d+)+/g;
        print @list, "\n";
    }

    Does not print
    1 2
    3 4 5 6
    7 8 9
    
    as I might have expected, but
    2
    6
    9
    
    What am I missing?
      What am I missing?

      2 things. First, you can't have multiple matches (//g) AND have your regex anchored at the start of the string.

      Second, the //g will return all the captured matches, so you don't need that last +. In fact, you can't HAVE that last +, or it doesn't work.


      my @lines = ("a 1 2", "b 3 4 5 6", "c 7 8 9");
      foreach (@lines) {
          # you can't match multiple times starting at ^!
          my @list = m/(\s+\d+)/g; # no last +
          print @list, "\n";
      }

      Of course, demerphq made a very good point about that regex not being sufficient to match all numbers...
      --
      Mike
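One way to keep both the anchoring and the full list of values is to do the two jobs separately. A quantified capture group such as (\s+\d+)+ only ever holds the text of its last iteration, so the sketch below uses an anchored regex purely as a yes/no check and then a second, unanchored //g match to collect the numbers. The helper name `numbers_after_label` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: check the overall shape with an anchored regex,
# then collect the repeated fields with a separate //g match.
sub numbers_after_label {
    my ($line) = @_;

    # Non-capturing group: this match only answers "is the line well formed?"
    return unless $line =~ /^[a-z](?:\s+\d+)+\s*$/;

    # List-context //g returns one capture per successful match.
    return $line =~ /(\d+)/g;
}

for my $line ("a 1 2", "b 3 4 5 6", "c 7 8 9") {
    print join(' ', numbers_after_label($line)), "\n";
}
```

This prints the three lines Bilbo expected (1 2, then 3 4 5 6, then 7 8 9), and a line that does not fit the anchored pattern yields an empty list.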