in reply to Re^2: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
in thread Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)

This is very interesting. Your version rates at the same speed as moritz's and avar's once the bug is fixed - it's 0<, not -1<.

One thing that is odd is that the specific problem I'm working has all of the chr(0)'s in groups of 3. So, I figured that I could use that and change the primary line to:

vec( $$s1, $pos, 24 ) ||= vec( $$s2, $pos, 24 ) while 0 < ( $pos = index $$s1, chr(0)x3, $pos );
Except, that slows it down by 20%. If I change it so that the 24's stay, but it goes back to being chr(0) without the x3, it's back to being the same speed. I wonder why that is. I also wonder why the knowledge of being able to work 3 bytes at a time doesn't speed things up at all.

As for why this wasn't in the problem statement - I wanted to solve the general problem and was willing to pay a meter of beer to see the various solutions. That there's an additional constraint in what I'm actually using the solutions for doesn't change what I was willing to pay a meter of beer for. :-)


My criteria for good software:
  1. Does it work?
  2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
  • Comment on Re^3: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
  • Download Code

Replies are listed 'Best First'.
Re^4: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by Corion (Patriarch) on Sep 13, 2007 at 07:16 UTC

    If you're working with 3-byte groups, that suspiciously sounds to me like you're treating RGB-data, and even if you're not, there is a crazy yet highly efficient (I believe) way of manipulating such data if you have the hardware (and software) to do it:

    Using OpenGL and a chroma-key filter, you can "draw" the two strings over each other in parallel and then retrieve the resulting "image" from the massively parallel hardware again. It's not always certain that the parallelism you gain by offloading the work to the GPU outweighs the cost of transferring the data over the bus and the result back again, especially when benchmarked against the C versions you already have. See http://www.gpgpu.org/ for more information on the concept.

      Yes, I am treating RGB data. Specifically, I'm drawing the background PNGs for KayudaMaps. Unfortunately, since this is run on a server, I don't have a GPU (though that can be remedied if it's determined that it's worth it).

      The capability that this meter-of-beer is meant to give me is to overlay pictures with transparent bits on top of other pictures without having to use GL, ImageMagick, or Imager. It's not a rewrite of how any of those should do it - I'm just solving my specific problem in a very specialized manner. For example, since the images I'm working with don't have to be perfect, I've constrained my RGB values to 1-255, reserving 0 as transparent. The degradation is unnoticeable and I can handle transparent overlays without any speed penalties. This is also due to the fact that, for me, it's either fully transparent or fully opaque - another piece that specific to my problem that isn't generalizable to PNGs.


      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

        You can, of course, push the problem to the client by using CSS and transparency in the browser. But it's far nicer to do the processing server-side. My experience with OpenGL and GLSL has been very positive so far and I found the GLSL to be relatively readable and quite concise in formulating the process that I wanted to happen, as soon as I had a mental model of how the calculation I wanted to be done could be modelled as a single pass over an array.

        Maybe in the next 2 to 5 years, GPUs will become stock hardware, even without the graphics output, because they are incredibly suited to problems that can be formulated in an easily parallelizable way that reads from one or more memory sections and only write to one memory destination. But until then, you will be easier off by using the Inline::C version, especially in such one-off situations where it's likely that the bandwidth needed for uploading/downloading the image outweighs the performance gain of the (relatively simple) operation.

      That sounds like what I've been doing to solve a similar problem when I wanted to compute influences (in a 2D environment, with influences radiating out in a given radius from a given pixel, and finding the strongest for any given point). I've achieved a 500x speedup using gpgpu solutions compared to solving it in a purely CPU based way.

      Of course, it took longer to write and it needs capable hardware (vertex and fragment shaders are used), but it's fast.

Re^4: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by mr_mischief (Monsignor) on Sep 13, 2007 at 15:33 UTC
    I think demerphq mentioned it already, but index() returns 0 if the 0-th character matches. It returns -1 if nothing in the string matches ( unless someone has played with $[ ).

    I also had a couple of other thoughts. Surely the Inline::C solutions are faster, but I'm still trying to wring performance out of pure Perl. 1. Why not use $s2 through the reference as well? and 2. What about use bytes (); and then explicitly using bytes::substr and bytes::index instead of doing the import?

    This code tests both of those together, for a whopping improvement over my solutions using the import and just using $s1 by reference.

    sub mrm_6 { # from mrn_5, testing bytes::misc explicitly instead of importing # also in-place using of $s2 my ( $s1, $s2 ) = @_; use bytes (); my $pos = 0; while ( -1 < ( $pos = bytes::index( $$s1, '\0', $pos ) ) ) { bytes::substr( $$s1, $pos, 1, bytes::substr( $$s2, $pos, 1 ) ) +; } }

    The results are impressive for such simple changes:

    Strawberry Perl 5.8.8 on WinXP, AthlonXP 2400+, 1Gig

    Rate ikegami_s ikegami_s 37.0/s -- avar 188/s 409% avar2 192/s 418% avar2_pos 282/s 663% ikegami_tr 342/s 825% moritz 783/s 2018% avar2_pos_inplace 1640/s 4338% mrm_3 2436/s 6491% mrm_4 2449/s 6524% mrm_5 2459/s 6553% mrm_1 2517/s 6709% mrm_6 3372/s 9023%
    cygperl 5.8.6 on the same machine as above
    Rate ikegami_s ikegami_s 37.9/s -- avar 285/s 650% avar2 310/s 718% ikegami_tr 316/s 733% avar2_pos 709/s 1769% moritz 1002/s 2543% mrm_5 3631/s 9474% mrm_3 3841/s 10027% mrm_4 3913/s 10219% mrm_1 4044/s 10562% avar2_pos_inplace 4393/s 11484% mrm_6 6237/s 16345%
    perl 5.8.7 on Mandriva Linux 2006 Athlon 1000, 512MB RAM
    Rate ikegami_s ikegami_s 17.2/s -- avar 168/s 876% avar2 187/s 983% ikegami_tr 205/s 1091% avar2_pos 307/s 1684% moritz 620/s 3499% mrm_4 1088/s 6218% avar2_pos_inplace 1184/s 6775% mrm_3 1184/s 6775% mrm_5 1224/s 7006% mrm_1 1240/s 7101% mrm_6 1921/s 11052%
    ... and perhaps a sign of good things to come: perl 5.9.5 on the above-mentioned Linux box
    Rate ikegami_s ikegami_s 13.3/s -- avar 172/s 1189% ikegami_tr 176/s 1220% avar2 182/s 1267% avar2_pos 258/s 1837% moritz 412/s 2987% avar2_pos_inplace 776/s 5722% mrm_3 780/s 5749% mrm_1 785/s 5788% mrm_5 798/s 5882% mrm_4 806/s 5942% mrm_6 2683/s 20026%
    ActivePerl 5.8.0 on the Windows box didn't fare so well under this one

    I killed the benchmark for ActivePerl at 7 minutes CPU time, over 23 million page faults, and over 175MB of memory usage. I guess there's probably a bug in bytes or in perl in that build that's causing the thrashing.

      '\0' is not the same as chr(0). By changing to chr(0), your code drops dramatically in the rankings.

      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
        Update: Everything in this node is based on a silly mistake and is irrelevant. The readmore has my original lunatic ramblings.

        dragonchild's statement above is correct. chr(0) is a touch less efficient than "\0" or just plain 0 it seems, but not nearly so much as to make any difference in the real world. Still, the tests pass with the wrong string. The version of the benchmark and test code I grabbed from an earlier node must've reinforced my blunder.

        With correction "\x00" in place of the erroneous '\0' on Strawberry:
        Rate ikegami_s ikegami_s 36.4/s -- avar 185/s 408% avar2 192/s 426% avar2_pos 324/s 788% ikegami_tr 329/s 804% moritz 786/s 2056% mrm_5 1561/s 4183% mrm_3 1578/s 4229% avar2_pos_inplace 1610/s 4318% mrm_4 1620/s 4344% mrm_1 1640/s 4400% mrm_6 1899/s 5111% ok 1 - mrm_1 gets some value ok 2 - mrm_4 gets same value ok 3 - avar2_pos_inplace gets same value ok 4 - moritz gets same value ok 5 - mrm_3 gets same value ok 6 - mrm_5 gets same value ok 7 - avar2_pos gets same value ok 8 - ikegami_tr gets same value ok 9 - mrm_6 gets same value ok 10 - ikegami_s gets same value ok 11 - avar gets same value ok 12 - avar2 gets same value
        And this is with this code:
        #!/usr/bin/perl use 5.6.0; use strict; use warnings FATAL => 'all'; use Benchmark qw( cmpthese ); my $s1 = do_rand(0, 100_000); my $s2 = do_rand(1, 100_000); my $subs = { # 'split1' => sub { my $s3 = split1( $s1, $s2 ) }, # 'substr1' => sub { my $s3 = substr1( $s1, $s2 ) }, # 'kyle' => sub { my $s3 = kyle( $s1, $s2 ) }, 'moritz' => sub { my $s3 = moritz( $s1, $s2 ) }, # 'corion' => sub { my $s3 = corion( $s1, $s2 ) }, 'ikegami_s' => sub { my $s3 = ikegami_s( $s1, $s2 ) }, 'ikegami_tr' => sub { my $s3 = ikegami_tr( $s1, $s2 ) }, 'avar' => sub { my $s3 = avar( $s1, $s2 ) }, 'avar2' => sub { my $s3 = avar2( $s1, $s2 ) }, 'avar2_pos' => sub { my $s3 = avar2_pos( $s1, $s2 ) }, 'avar2_pos_inplace' => sub { avar2_pos_inplace( \$s1, $s2 ); $s1 +}, 'mrm_1' => sub { mrm_1( \$s1, $s2 ); $s1 }, # 'mrm_2' => sub { mrm_2( \$s1, $s2 ); $s1 }, 'mrm_3' => sub { mrm_3( \$s1, $s2 ); $s1 }, 'mrm_4' => sub { mrm_4( \$s1, $s2 ); $s1 }, 'mrm_5' => sub { mrm_5( \$s1, $s2 ); $s1 }, 'mrm_6' => sub { mrm_6( \$s1, \$s2 ); $s1 }, }; cmpthese( -2, $subs ); use Test::More; plan 'tests' => scalar keys %{$subs}; my $s3; foreach my $subname ( keys %{$subs} ) { my $sub = $subs->{$subname}; if ( defined $s3 ) { is( $sub->(), $s3, "$subname gets same value" ); } else { $s3 = $sub->(); ok( defined $s3, "$subname gets some value" ); } } sub split1 { my ($s1, $s2) = @_; my @s1 = split //, $s1; my @s2 = split //, $s2; foreach my $idx ( 0 .. $#s1 ) { if ( $s1[$idx] eq chr(0) ) { $s1[$idx] = $s2[$idx]; } } return join '', @s1; } sub substr1 { my ($s1, $s2) = @_; for my $idx ( 0 .. length($s1) ) { if ( substr($s1,$idx,1) eq chr(0) ) { substr($s1, $idx, 1) = substr($s2, $idx, 1); } } return $s1; } sub kyle { my ($s1, $s2) = @_; my $out = $s1; while ( $s1 =~ m/\000/g ) { my $pos = pos; substr( $out, $pos, 1 ) = substr( $s2, $pos, 1 ); } return $out; } sub moritz { my ($s1, $s2) = @_; my $pos = 0; while ( -1 < ( $pos = index $s1, "\000", $pos ) ) { substr( $s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } return $s1; } sub ikegami_s { my ($s1, $s2) = @_; (my $mask = $s1) =~ s/[^\x00]/\xFF/g; return ($s1 & $mask) | ($s2 & ~$mask); } sub ikegami_tr { my ($s1, $s2) = @_; (my $mask = $s1) =~ tr/\x00/\xFF/c; return ($s1 & $mask) | ($s2 & ~$mask); } sub corion { my ($s1, $s2) = @_; my $ofs = 0; return join "", map { $ofs += length; $_ => substr $s2, $ofs++, 1 } + split /\0/, $s1, -1; } sub avar { my ($s1, $s2) = @_; my $s3 = $s1; { use bytes; $s3 =~ s/(\0)/substr $s2, $+[0]-1, 1/eg; } $s3; } sub avar2 { my ($s1, $s2) = @_; use bytes; $s1 =~ s/(\0)/substr $s2, $+[0]-1, 1/eg; return $s1; } sub avar2_pos { my ($s1, $s2) = @_; use bytes; $s1 =~ s/\0/substr $s2, pos($s1), 1/eg; return $s1; } sub avar2_pos_inplace { my ($s1, $s2) = @_; use bytes; $$s1 =~ s/\0/substr $s2, pos($$s1), 1/eg; } sub mrm_1 { my ( $s1, $s2 ) = @_; # from [moritz]'s work use bytes; my $pos = 0; while ( -1 < ( $pos = index $$s1, "\x00", $pos ) ) { substr( $$s1, $pos, 1 ) = substr( $s2, $pos, 1 ); } } sub mrm_3 { my ( $s1, $s2 ) = @_; # from moritz's, builds a separate loop of zeros use bytes; my @zeros = (); my $pos = 0; while ( -1 < ( $pos = index $$s1, "\x00", $pos ) ) { push @zeros, $pos; } for ( @zeros ) { substr( $$s1, $_, 1 ) = substr( $s2, $_, 1 ); } } sub mrm_4 { # from [bart]'s vec() my ($s1, $s2) = @_; use bytes; my $pos = 0; while ( -1 < ( $pos = index $$s1, "\x00", $pos ) ) { vec( $$s1, $pos, 8 ) ||= vec( $s2, $pos, 8 ); } } sub mrm_5 { # from moritz's, seeing if four-arg substr() is faster or slower t +han lvalue substr() my ( $s1, $s2 ) = @_; use bytes; my $pos = 0; while ( -1 < ( $pos = index $$s1, "\x00", $pos ) ) { substr( $$s1, $pos, 1, substr( $s2, $pos, 1 ) ); } } sub mrm_6 { # from mrn_5, testing bytes::misc explicitly instead of importing # also in-place using of $s2 my ( $s1, $s2 ) = @_; use bytes (); my $pos = 0; while ( -1 < ( $pos = bytes::index( $$s1, "\x00", $pos ) ) ) { bytes::substr( $$s1, $pos, 1, bytes::substr( $$s2, $pos, 1 ) ) +; } } # This makes sure that $s1 has chr(0)'s in it and $s2 does not. sub do_rand { my $min = shift; my $len = shift; my $n = ""; for (1 .. $len) { $n .= chr( rand(255-$min)+$min ); } return $n; } #sub do_rand { # my $n = (shift) ? int(rand(255)) : int(rand(254)) + 1; # return chr( $n ); #} __END__
Re^4: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by NiJo (Friar) on Sep 14, 2007 at 19:28 UTC
    Algorithms should scale quite differently to this. Char based solutions (bit ops, tr) would not profit. String / regexp based solutions like index, s//, split could gain speed. I'm not a regexp engine guru, but I assume the engine is intelligent enough not to compare char with char in this situation. In the most common case you get away with comparing the 1st char of the search string with every 3rd char of the long "text" string. That's a 3-fold speed increase.

    Thanks to very experienced monks the regexp engine is not too bad in real world use cases.

    Edit: Even index takes more than one char.