Using only one set of capturing parens will improve the performance of your regex, since the engine has fewer captures to save. In playing around with this, I also managed to optimize the split by giving it a limit, so the string is broken into the minimum number of segments. In all cases, with my example, split significantly outperformed the regex.
#!/usr/bin/perl -w
use strict;
use Benchmark;
use vars qw($myvar $result $a $b $c $d);
$myvar = "one,two,three,four";
timethese(1000000, {
Regex => '$a=$1, $b=$2, $c=$3, $d=$4 if $myvar =~ /^[^,]+,([^,]+),[^,]+,[^,]+$/',
Split1 => '$result = (split /,/, $myvar)[1]',
Split2 => '$result = (split /,/, $myvar, 4)[1]',
Split3 => '$result = (split /,/, $myvar, 3)[1]'
});
Benchmark: timing 1000000 iterations of Regex, Split1, Split2, Split3...
Regex: 26 wallclock secs (25.75 usr + 0.00 sys = 25.75 CPU)
Split1: 16 wallclock secs (16.31 usr + 0.00 sys = 16.31 CPU)
Split2: 16 wallclock secs (16.15 usr + 0.00 sys = 16.15 CPU)
Split3: 13 wallclock secs (12.74 usr + 0.00 sys = 12.74 CPU)
Note the whopping improvement in performance of Split3. In my benchmark, it's approximately twice as fast as the regex.
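Why the limit helps without changing the answer can be seen in a minimal sketch, assuming the same sample string as above: with a limit of 3, split stops after the second comma and leaves the rest of the string unsplit in the last element, so index 1 is still the second field.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $myvar = "one,two,three,four";

# No limit: every field is split out.
my @all = split /,/, $myvar;          # ("one", "two", "three", "four")

# Limit of 3: splitting stops after two separators, so the
# third element holds the unsplit remainder of the string.
my @limited = split /,/, $myvar, 3;   # ("one", "two", "three,four")

# Either way, index 1 is the second field.
print "$all[1] $limited[1]\n";                     # prints "two two"
print scalar(@limited), " fields: $limited[2]\n";  # prints "3 fields: three,four"
```

In general, to fetch field N (counting from zero) the smallest limit that still works is N + 2.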
Cheers,
Ovid
But your comparison isn't fair. You make the regex do far
more work than needed. There's no need to parse the entire
line and assign only if there are exactly four fields; you
aren't doing that in the split cases either. Also,
you have only one set of parens, yet you do four assignments.
Picking a simpler regex and doing just one assignment
improves the speed by 50%!
$a=$1 if $myvar =~ /^[^,]+,([^,]+)/
Still not as fast as the split, but it shows that
proper Benchmarking is an art.
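To make the "not the same work" point concrete, here is a small sketch comparing what the three approaches return. The helper names (second_strict, second_loose, second_split) are made up for illustration and the five-field input is an assumed test case, not something from the benchmark.

```perl
use strict;
use warnings;

# Hypothetical helpers, named for illustration only.

# Original regex: matches only when there are exactly four non-empty fields.
sub second_strict {
    my ($s) = @_;
    return $s =~ /^[^,]+,([^,]+),[^,]+,[^,]+$/ ? $1 : undef;
}

# Simpler regex: stops looking once the second field is captured.
sub second_loose {
    my ($s) = @_;
    return $s =~ /^[^,]+,([^,]+)/ ? $1 : undef;
}

# Split with a limit: never checks how many fields follow.
sub second_split {
    my ($s) = @_;
    my $field = (split /,/, $s, 3)[1];
    return $field;
}

for my $line ("one,two,three,four", "one,two,three,four,five") {
    my $strict = second_strict($line);
    my $loose  = second_loose($line);
    my $split  = second_split($line);
    $strict = "(no match)" unless defined $strict;
    print "$line: strict=$strict loose=$loose split=$split\n";
}
```

On the five-field line only the strict version refuses to match, which is the extra validation the split variants were never asked to do.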
-- Abigail
D'oh! I optimized the split but not the regex :( That'll teach me to be careless. For honesty's sake:
timethese(1000000, {
Regex => '$a=$1 if $myvar =~ /^[^,]+,([^,]+)/',
Split1 => '$result = (split /,/, $myvar)[1]',
Split2 => '$result = (split /,/, $myvar, 4)[1]',
Split3 => '$result = (split /,/, $myvar, 3)[1]'
});
Benchmark: timing 1000000 iterations of Regex, Split1, Split2, Split3...
Regex: 14 wallclock secs (14.12 usr + 0.00 sys = 14.12 CPU)
Split1: 17 wallclock secs (16.54 usr + 0.00 sys = 16.54 CPU)
Split2: 16 wallclock secs (16.75 usr + 0.00 sys = 16.75 CPU)
Split3: 14 wallclock secs (13.02 usr + 0.00 sys = 13.02 CPU)
I'm going to cry myself to sleep tonight.
Curiously, though, it was the null assignments that appeared to be killing the efficiency ($a=$1, $b=$2, $c=$3, $d=$4) much more than the unoptimized regex. Hmmm....
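A quick sketch of what those extra assignments were actually doing, assuming the same string and the original one-paren regex. The lexical names ($f1 through $f4) are stand-ins for $a through $d. With a single capture group, $2, $3 and $4 are never set, so three of the four assignments just copy undef on every iteration.

```perl
use strict;
use warnings;

my $myvar = "one,two,three,four";
my ($f1, $f2, $f3, $f4);

if ($myvar =~ /^[^,]+,([^,]+),[^,]+,[^,]+$/) {
    # Only one capture group exists, so only $1 holds a value.
    ($f1, $f2, $f3, $f4) = ($1, $2, $3, $4);
}

print "f1 is '$f1'\n";
for my $pair ([f2 => $f2], [f3 => $f3], [f4 => $f4]) {
    my ($name, $value) = @$pair;
    print "$name is ", (defined $value ? "'$value'" : "undef"), "\n";
}
```

This only shows that the assignments carry no data; why they cost as much as they appear to in the timings is not something this sketch measures.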
Cheers,
Ovid