Re^2: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)

Okay, I already updated that a couple of times for corrections, and this is new data, so I'm replying to myself.

Update: As with the node above, these implementations are broken, yet they passed tests that they should have failed. The theme of the node, that different perl builds do drastically different things with the same code, still stands.

I downloaded, compiled, and installed 5.9.5 on my Linux box. I also have a few more tweaks I've tried. Here are some result summaries (the Linux box with 5.9.5, the first test listed, ran at 30 seconds; the rest are still at 2):

mrm_3, mrm_4, mrm_5, mrm_1, avar2_pos_inplace, and moritz are the top finishers in 5.9.5 on my 1GHz, 512MB RAM Athlon with Mandriva 2006 community edition, in that order. They're only separated by 2%, and I ran this test at cmpthese(-30,...) instead of -2 for extra reliability.
Strawberry 5.8.8 has them as mrm_1, mrm_2, mrm_4, mrm_4, avar2_pos_inplace, and moritz.
AS 5.8.0 has avar2_pos_inplace, mrm_3, mrm_4, mrm_1, mrm_5, and moritz. It still shows avar2_pos_inplace ahead of the next place by 5-20%.
cygperl 5.8.6 still shows avar2_pos_inplace in a dead heat with several of the mrm_ solutions. The top five change order on nearly every run. moritz's solution comes in sixth reliably.
perl 5.8.7 on the Linux box shows avar2_pos_inplace, mrm_1, mrm_4, mrm_5, mrm_3, then moritz. avar2_pos_inplace varies its lead from 4% to about 14% over mrm_1.
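
For anyone who wants to repeat this kind of comparison: a negative first argument to cmpthese() means "run each candidate for at least that many CPU seconds", which is what the -2 and -30 above refer to. Here's a minimal sketch of such a harness. The two candidates, the test data, and the names with_index and with_regex are stand-ins I made up for illustration, not the subs from this thread, and it uses 1 second only so that it finishes quickly.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: 1000 bytes, every third one a NUL byte.
my $template = join '', map { $_ % 3 ? 'a' : "\0" } 1 .. 1000;
my $filler   = 'b' x 1000;

# Stand-in candidate 1: find NULs with index(), patch with 4-arg substr().
sub with_index {
    my ( $s1, $s2 ) = @_;
    my $pos = 0;
    while ( -1 < ( $pos = index $$s1, "\0", $pos ) ) {
        substr( $$s1, $pos, 1, substr( $s2, $pos, 1 ) );
        $pos++;    # move past the byte just handled
    }
}

# Stand-in candidate 2: let the regex engine find the NULs.
sub with_regex {
    my ( $s1, $s2 ) = @_;
    $$s1 =~ s/\0/substr( $s2, $-[0], 1 )/ge;
}

# Negative count = minimum CPU seconds per candidate (the post used 2 and 30).
my $min_cpu_seconds = 1;

# Each run works on a fresh copy so every iteration has NULs to fill.
my $rows = cmpthese(
    -$min_cpu_seconds,
    {
        with_index => sub { my $s = $template; with_index( \$s, $filler ) },
        with_regex => sub { my $s = $template; with_regex( \$s, $filler ) },
    }
);
```

Longer runs smooth out noise, which matters when the candidates are only a few percent apart.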

I should note that moritz's solution is between 50% and 75% slower than the top pure-Perl solution in all of these tests, and the rest of the ones I've tested fall below that.

I should also note that my Linux 5.8.7 does nearly twice as many iterations per second of every solution (of those faster than about 200 iterations per second anyway) as my 5.9.5 does. I'm curious whether that's a development version thing or whether my new perl just isn't compiled with as much optimization as the one that came with the distro. Switching from -O2 to -O4 for optimization, replacing some older x86-family lib references in the makefiles, and rebuilding doesn't help much. I'm guessing the devel branch just isn't tuned at the source level as much as the stable branch, which makes sense.
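
One way to compare how two perls were built is to ask each one through the core Config module. This is a rough sketch, not a full diagnosis. The helper name build_summary and the choice of keys are mine; the keys are standard %Config entries, but which of them explains a slowdown on a given box is a guess. Seeing -DDEBUGGING in ccflags, or a threaded build, would be worth ruling out before blaming the source.

```perl
use strict;
use warnings;
use Config;

# Gather the build settings most likely to affect raw speed.
# build_summary is a hypothetical helper name, not part of Config.
sub build_summary {
    my %summary = ( perl => $^X, version => $] );
    for my $key (qw(optimize ccflags usethreads usemultiplicity use64bitint)) {
        $summary{$key} = defined $Config{$key} ? $Config{$key} : '(undef)';
    }
    return \%summary;
}

my $summary = build_summary();
printf "%-16s %s\n", $_, $summary->{$_} for sort keys %$summary;
```

Run it once under each perl binary and compare the two outputs side by side.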

Here's my code for mrm_4 and mrm_5:

sub mrm_4 {
    # from [bart]'s vec()
    my ($s1, $s2) = @_;
    use bytes;

    my $pos = 0;
    while ( -1 < ( $pos = index $$s1, '\0', $pos ) ) {
        vec( $$s1, $pos, 8 ) ||= vec( $s2, $pos, 8 );
    }
}

sub mrm_5 {
    # from moritz's, seeing if four-arg substr() is
    # faster or slower than lvalue substr()
    my ( $s1, $s2 ) = @_;
    use bytes;
    my $pos = 0;
    while ( -1 < ( $pos = index $$s1, '\0', $pos ) ) {
        substr( $$s1, $pos, 1, substr( $s2, $pos, 1 ) );
    }
}
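
A detail in the two subs above that is easy to miss: the search string is written '\0' in single quotes. Perl does not interpret that escape inside single quotes, so index() looks for a backslash followed by a zero rather than for a NUL byte. That would fit with code passing tests it should fail, though I can't say for sure it is the whole story. Here's a minimal sketch of the difference, plus a variant of the mrm_5 loop with a double-quoted "\0". The name fill_nuls and the sample strings are made up for illustration, and it assumes the first argument is a reference and the second a plain string, as the code above uses them.

```perl
use strict;
use warnings;

my $single = '\0';    # two characters: a backslash and the digit zero
my $double = "\0";    # one character: the NUL byte, chr(0)

my $data = "ab\0cd";
printf "length: single=%d double=%d\n", length $single, length $double;
printf "index:  single=%d double=%d\n",
    index( $data, $single ), index( $data, $double );

# Variant of the index()/substr() loop that searches for a real NUL byte.
sub fill_nuls {
    my ( $s1, $s2 ) = @_;
    my $pos = 0;
    while ( -1 < ( $pos = index $$s1, "\0", $pos ) ) {
        substr( $$s1, $pos, 1, substr( $s2, $pos, 1 ) );
        $pos++;    # step past this byte; otherwise a NUL copied from $s2
                   # would be found again and the loop would never end
    }
}

my $target = "a\0c\0e";
fill_nuls( \$target, "vwxyz" );
print "filled: $target\n";
```

The $pos++ matters only when both strings have a NUL at the same offset, but without it that case loops forever.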