Matching a pattern two or four times

Bilbo has asked for the wisdom of the Perl Monks concerning the following question:

Re: Matching a pattern two or four times by LTjake (Prior) on Sep 24, 2002 at 11:54 UTC
Why not just split it up on white space?

    my @strings = (
        'Abc 21223.7 21225.33 22270.3 22280.1',
        'Def 21600.23 24567.43'
    );
    foreach (@strings) {
        my @arr = split(/\s+/, $_);
        print join(' - ', @arr), "\n";
    }

gives me:

    Abc - 21223.7 - 21225.33 - 22270.3 - 22280.1
    Def - 21600.23 - 24567.43

All of your numeric values would be in `$arr[1]` to `$arr[$#arr]`.
Re: Matching a pattern two or four times by RMGir (Prior) on Sep 24, 2002 at 12:22 UTC
(Edit: Added demerphq's variant to the mix. ++demerphq!)

I agree with LTjake, split is the way to go. For more validation, demerphq's answer down below is good, but there's a performance hit. Note that I tested demerphq's answers both with and without a /o; adding the o helps performance a LOT, since you don't want to recompile the regex if you don't have to.

If you want to go with regexes, another alternative would have been to use //g, though. Here's how your regex, split, demerphq's solution, a variant on demerphq's that avoids capturing, and 2 //g approaches compare on perl 5.6.1 under Cygwin:

    $ perl testSplits.pl
    Benchmark: running demsChkSplt, demsChkSplt_o, demsRE, demsRE_o, hisSplit, regex_g, regex_g2, yourRE, each for at least 3 CPU seconds...
    demsChkSplt:    3 wallclock secs ( 3.03 usr + -0.01 sys =  3.02 CPU) @ 27103.21/s (n=81933)
    demsChkSplt_o:  3 wallclock secs ( 3.02 usr +  0.01 sys =  3.03 CPU) @ 36854.04/s (n=111852)
    demsRE:         3 wallclock secs ( 3.16 usr +  0.00 sys =  3.16 CPU) @ 29056.42/s (n=91673)
    demsRE_o:       4 wallclock secs ( 3.04 usr +  0.00 sys =  3.04 CPU) @ 38914.00/s (n=118104)
    hisSplit:       3 wallclock secs ( 3.07 usr +  0.01 sys =  3.08 CPU) @ 75561.43/s (n=233107)
    regex_g:        4 wallclock secs ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 28482.20/s (n=85589)
    regex_g2:       4 wallclock secs ( 3.09 usr +  0.01 sys =  3.10 CPU) @ 56279.32/s (n=174691)
    yourRE:         3 wallclock secs ( 3.09 usr +  0.01 sys =  3.10 CPU) @ 53248.47/s (n=164804)

                     Rate demsChkSplt regex_g demsRE demsChkSplt_o demsRE_o yourRE regex_g2 hisSplit
    demsChkSplt   27103/s          --     -5%    -7%          -26%     -30%   -49%     -52%     -64%
    regex_g       28482/s          5%      --    -2%          -23%     -27%   -47%     -49%     -62%
    demsRE        29056/s          7%      2%     --          -21%     -25%   -45%     -48%     -62%
    demsChkSplt_o 36854/s         36%     29%    27%            --      -5%   -31%     -35%     -51%
    demsRE_o      38914/s         44%     37%    34%            6%       --   -27%     -31%     -49%
    yourRE        53248/s         96%     87%    83%           44%      37%     --      -5%     -30%
    regex_g2      56279/s        108%     98%    94%           53%      45%     6%       --     -26%
    hisSplit      75561/s        179%    165%   160%          105%      94%    42%      34%       --

Split wins by a long shot. Doing `/(\w+|[0-9.]+)/g` is horrible (that was regex_g), but doing the `/(\w+)/` and `/([0-9.]+)/g` separately (regex_g2) gives results comparable to your regex, with more readability and easier extensibility if you wind up with more columns.

In case you want details, here's the benchmark: Read more... (3 kB)

-- Mike
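For readers who want to see the two-step idea spelled out: the benchmark script itself is not shown in this thread, so the following is only a guess at what the regex_g2 approach might look like. The helper name `parse_line` is made up for illustration, and cutting the string at `$+[0]` is my own addition so that digits inside the leading word are not picked up as numbers.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): one match for the leading word,
# then a separate global match for every run of digits and dots after it.
sub parse_line {
    my ($line) = @_;
    my ($name) = $line =~ /(\w+)/;
    return unless defined $name;

    # $+[0] is the offset where the last successful match ended,
    # so the number search starts after the word.
    my $rest = substr($line, $+[0]);
    my @nums = $rest =~ /([0-9.]+)/g;   # list context: returns every capture
    return ($name, @nums);
}

for my $line ('Abc 21223.7 21225.33 22270.3 22280.1', 'Def 21600.23 24567.43') {
    my ($name, @nums) = parse_line($line);
    print "$name: @nums\n";
}
```

Adding a column later only means more elements in `@nums`; the patterns themselves stay the same, which is the extensibility point made above.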
Re: Re: Matching a pattern two or four times by demerphq (Chancellor) on Sep 24, 2002 at 17:03 UTC
Nice analysis. :-) A few nits, however. For the regex versions you should state which handles what; for instance, only demsRE and demsRE_o will handle negative numbers. Also, the timing results would be more interesting if more cases were handled, including fail cases, i.e. what happens if there are 3 numbers, or 5? Yada yada... But they are just nits. Nice work.

--- demerphq my friends call me, usually because I'm late....
Re: Matching a pattern two or four times by demerphq (Chancellor) on Sep 24, 2002 at 15:30 UTC
Personally I wonder at what you are doing. If your file contains only records in the above format, then use split as other monks have suggested. However, if you are trying to extract lines that match from a bunch of other crud, then the regex will have to be the way to go.

Also, your regex for floating point numbers leaves quite a bit to be desired. For instance, it will match IP addresses as well as floating point values (not to mention things like "................."). There are a variety of regexes that will handle numbers like this correctly to be found in the FAQs.

    my $num_rex = qr/(-?(?:\d+(?:\.\d*)?|\.\d+))/;  # modified from: perldoc -q scalar is a number
    while (<DATA>) {
        if (/^\s* (\w+) \s+ $num_rex \s+ $num_rex (?: \s+ $num_rex \s+ $num_rex )? \s*$/x) {
            print "Matched a word and ", (defined $4 ? "four" : "two"), " numbers: $1 $2 $3",
                  (defined $4 ? " $4 $5\n" : "\n");
        }
    }
    __DATA__
    Abc 21223.7 21225.33 22270.3 22280.1
    Def 21600.23 24567.43

Oh, I changed it to be tolerant of leading and trailing whitespace. YMMV.

HTH

--- demerphq my friends call me, usually because I'm late....
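To make the point about IP addresses and runs of dots concrete, here is a small comparison sketch. The helper names `is_loose` and `is_strict` are made up for illustration; the strict pattern follows the decimal-number style given in perlfaq4, and the loose one is a bare character class of digits and dots, anchored so each string is tested as a whole.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.
# Loose: any run of digits and dots counts as a "number".
sub is_loose  { return $_[0] =~ /^[0-9.]+$/ ? 1 : 0 }
# Strict: optional minus sign, then digits with an optional fraction,
# or a fraction with no leading digits.
sub is_strict { return $_[0] =~ /^-?(?:\d+(?:\.\d*)?|\.\d+)$/ ? 1 : 0 }

for my $s ('21223.7', '-3.5', '.5', '192.168.0.1', '.....') {
    printf "%-12s loose=%s strict=%s\n", $s,
        (is_loose($s)  ? 'yes' : 'no'),
        (is_strict($s) ? 'yes' : 'no');
}
```

The loose pattern accepts `192.168.0.1` and `.....` and rejects `-3.5`; the strict one does the opposite on all three.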
Re: Matching a pattern two or four times by Bilbo (Pilgrim) on Sep 24, 2002 at 17:18 UTC
OK. Split seems to be the right answer. I'm not sure where my obsession with using a regex came from this morning. I think I was hoping to use it to do some level of validation of the input (just print out a warning and skip any lines which didn't match), but splitting and then validating the results is more readable, if significantly longer.

In this case regexes probably weren't the best way to do it, but I still don't understand how to retrieve matched groups from a repeated group. For example:

    my @lines = ("a 1 2", "b 3 4 5 6", "c 7 8 9");
    foreach (@lines) {
        my @list = m/^[a-z](\s+\d+)+/g;
        print @list, "\n";
    }

does not print

    1 2
    3 4 5 6
    7 8 9

as I might have expected, but

    2
    6
    9

What am I missing?
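For anyone wondering what "split, then validate" might look like for the two-or-four-numbers case in the thread title, here is a minimal sketch. The function name `parse_record` and the exact checks are assumptions made for illustration, not code anyone posted; the number pattern is the perlfaq4-style decimal check.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (name, numbers...) for a valid record,
# or an empty list if the line should be skipped.
sub parse_record {
    my ($line) = @_;

    # split ' ' ignores leading whitespace and splits on runs of whitespace.
    my ($name, @nums) = split ' ', $line;
    return unless defined $name && $name =~ /^\w+$/;

    # Exactly two or exactly four numeric fields are allowed.
    return unless @nums == 2 || @nums == 4;

    for my $n (@nums) {
        return unless $n =~ /^-?(?:\d+(?:\.\d*)?|\.\d+)$/;
    }
    return ($name, @nums);
}

for my $line ('Abc 21223.7 21225.33 22270.3 22280.1', 'Def 21600.23 24567.43', 'c 7 8 9') {
    my ($name, @nums) = parse_record($line);
    if (defined $name) {
        print join(' - ', $name, @nums), "\n";
    }
    else {
        warn "Skipping malformed line: $line\n";
    }
}
```

It is longer than a single regex, as noted above, but each rule (field count, name shape, number shape) sits on its own line and can be changed independently.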
Re: Re: Matching a pattern two or four times by RMGir (Prior) on Sep 24, 2002 at 17:43 UTC
"What am I missing?"

Two things. First, you can't have multiple matches (//g) AND have your regex anchored at the start of the string. Second, the //g will return all the captured matches, so you don't need that last +. In fact, you can't HAVE that last +, or it doesn't work.

    my @lines = ("a 1 2", "b 3 4 5 6", "c 7 8 9");
    foreach (@lines) {
        # you can't match multiple times starting at ^!
        my @list = m/(\s+\d+)/g;  # no last +
        print @list, "\n";
    }

Of course, demerphq made a very good point about that regex not being sufficient to match all numbers...

-- Mike
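A side-by-side sketch may help here. A capture group inside a quantified group keeps only the value from its last repetition, which is why the anchored pattern yields one number per line. If the anchored shape check is still wanted, one option (an assumption of mine, not something proposed above) is to validate with one match and extract with a second, unanchored //g match. The names `last_only` and `all_numbers` are made up for illustration.

```perl
use strict;
use warnings;

# The capture sits inside a repeated group, so only the final
# repetition's value survives.
sub last_only {
    my ($s) = @_;
    return $s =~ /^[a-z](?:\s+(\d+))+/;
}

# Validate the overall shape first, then pull out every number
# with an unanchored global match in list context.
sub all_numbers {
    my ($s) = @_;
    return unless $s =~ /^[a-z](?:\s+\d+)+\s*$/;
    return $s =~ /(\d+)/g;
}

for my $line ("a 1 2", "b 3 4 5 6", "c 7 8 9") {
    my @last = last_only($line);
    my @all  = all_numbers($line);
    print "last: @last | all: @all\n";
}
```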