Re^8: Best way to store/sum multiple-field records? ("significant")

Thank you very much, tye, for your very useful and interesting comments.

Well, if you had repeated the same check that I did above (verify that your different approaches are actually doing the same thing) ...

Sadly enough, I actually did it before running the benchmark, as shown only in part here:

$ perl -e '$_ = "USERID1|2215|Jones|";
> my( $x, $y, $z ) = split /\|/;
> print "( $x, $y, $z )\n";'
( USERID1, 2215, Jones )

$ perl -e '$_ = "USERID1|2215|Jones|";
> ( $x, $y, $z ) = split /\|/, 3;
> print "( $x, $y, $z )\n";
> '
( 3, ,  )

$ perl -e '$_ = "USERID1|2215|Jones|";
> ( $x, $y, $z ) = split /\|/, $_, 3;
> print "( $x, $y, $z )\n";
> '
( USERID1, 2215, Jones| )

but I looked at the results too quickly and failed to see the difference (i.e. "Jones" versus "Jones|"). And this difference is quite significant.
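For anyone who wants to see the three behaviours side by side, here is a minimal sketch (the variable names are mine, chosen only for illustration). As far as I understand perldoc -f split, trailing empty fields are dropped when no LIMIT is given, a positive LIMIT leaves the remainder of the string in the last field, and a negative LIMIT keeps the trailing empty fields:

```perl
use strict;
use warnings;

# Record with a trailing separator, as in the one-liners above.
my $record = "USERID1|2215|Jones|";

# No explicit LIMIT: trailing empty fields are dropped.
my @no_limit = split /\|/, $record;

# LIMIT of 3: at most three fields; the last one keeps the rest of the
# string, including the trailing separator.
my @limit_3 = split /\|/, $record, 3;

# Negative LIMIT: no cap on the field count, trailing empty fields are kept.
my @keep_all = split /\|/, $record, -1;

for my $case ( [ 'no limit', \@no_limit ],
               [ 'limit 3',  \@limit_3 ],
               [ 'limit -1', \@keep_all ] ) {
    my ( $label, $fields ) = @$case;
    printf "%-8s %d fields: [%s]\n",
        $label, scalar @$fields, join( '][', @$fields );
}
```

Printing each field between brackets makes a stray "|" or an empty trailing field much harder to overlook than in my quick check.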

So I decided to rerun the test, changing not the code but the data, to:

my @strings = qw(
  USERID1|2215|Jones
  USERID1|1000|Jones
  USERID3|1495|Dole
  USERID2|2500|Francis
  USERID2|1500|Francis
);

simply because this is closer to the kind of data I deal with most frequently (no separator at the end of the line). This is the result:

$ perl bench_inside_outside.pl
             Rate  outside outside2   inside  inside2
outside  110902/s       --      -2%     -39%     -40%
outside2 113390/s       2%       --     -38%     -39%
inside   181595/s      64%      60%       --      -2%
inside2  186121/s      68%      64%       2%       --

Now, clearly, a 2% difference is not significant. This shows that my original untested opinion (that putting a limit on the split does not really matter when the number of available fields equals the limit) was correct, and that my subsequent opposite opinion, based on a faulty test, was wrong. Thank you for your enlightenment on this. Just in case someone worries: I am not concluding from this that I should believe my untested opinions rather than my test results, but clearly I should be more cautious about the significance of my tests.
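For reference, a comparison of this kind can be sketched as follows. This is not my bench_inside_outside.pl script; the sub names sum_no_limit and sum_with_limit are hypothetical and only illustrate the idea of checking that both variants agree before timing them with cmpthese from the Benchmark module:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample records without a trailing separator (same data as above).
my @strings = qw(
  USERID1|2215|Jones
  USERID1|1000|Jones
  USERID3|1495|Dole
  USERID2|2500|Francis
  USERID2|1500|Francis
);

# Hypothetical variant 1: split without an explicit LIMIT.
sub sum_no_limit {
    my $total = 0;
    for my $line (@strings) {
        my ( $id, $amount, $name ) = split /\|/, $line;
        $total += $amount;
    }
    return $total;
}

# Hypothetical variant 2: split with LIMIT equal to the number of fields.
sub sum_with_limit {
    my $total = 0;
    for my $line (@strings) {
        my ( $id, $amount, $name ) = split /\|/, $line, 3;
        $total += $amount;
    }
    return $total;
}

# Verify that both variants do the same thing before comparing speeds.
die "variants disagree\n" unless sum_no_limit() == sum_with_limit();

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    no_limit   => \&sum_no_limit,
    with_limit => \&sum_with_limit,
} );
```

The rates will of course vary from machine to machine and from run to run, so differences of a few percent should not be read as meaningful.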

Without getting into the details of your very interesting post, I would say that sometimes I really need to know whether one way of doing things is significantly faster than another (say, for example, s/// versus tr///, or m// versus index(), etc.). The Benchmark module is quite useful for pruning the tree of possible courses of action early. But in the end, only tests with real data really matter.

I am dealing with a customer base of 35 million, with about a billion billing services and tens of billions of usage events (phone calls, SMS, Internet connections, video downloads, etc.) per month. Performance matters to me.

Benchmarks produced with the Benchmark module give quite interesting information about the best way to do things, but the really interesting data comes from actual testing.
