in reply to Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
in thread Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)

It's not vec() that's slowing this down. According to my measurements, this version is one of the fastest pure-Perl ones so far:
sub mrm_4 { # from [bart]'s vec()
    my ( $s1, $s2 ) = @_;
    use bytes;
    my $pos = 0;
    vec( $$s1, $pos, 8 ) ||= vec( $s2, $pos, 8 )
        while -1 < ( $pos = index $$s1, '\0', $pos );
}

It must be the copying and the for loop that are slowing it down compared to working on the referenced scalar and using index().
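The measurements mentioned above look like Benchmark::cmpthese comparisons (the Rate tables further down have that shape). A minimal harness sketch - the data setup, sizes, and the fresh-copy-per-run detail are assumptions here, not the original benchmark - would be:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Hypothetical setup: $base has no NUL bytes; $holes gets a few
    # 3-byte NUL "holes" for the candidate subs to fill in.
    my $base  = join '', map { chr( 1 + int rand 255 ) } 1 .. 100_000;
    my $holes = $base;
    substr( $holes, $_ * 1_000, 3, "\0\0\0" ) for 0 .. 99;

    cmpthese( -5, {
        # work on a fresh copy so every iteration does the same amount of work
        mrm_4 => sub { my $copy = $holes; mrm_4( \$copy, $base ) },
        # ... one entry per candidate sub ...
    } );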

Re^3: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by dragonchild (Archbishop) on Sep 13, 2007 at 01:31 UTC
    This is very interesting. Your version rates at the same speed as moritz's and avar's once the bug is fixed - it's 0<, not -1<.

One thing that is odd is that the specific problem I'm working on has all of the chr(0)'s in groups of 3. So, I figured that I could use that and change the primary line to:

    vec( $$s1, $pos, 24 ) ||= vec( $$s2, $pos, 24 ) while 0 < ( $pos = index $$s1, chr(0)x3, $pos );
    Except, that slows it down by 20%. If I change it so that the 24's stay, but it goes back to being chr(0) without the x3, it's back to being the same speed. I wonder why that is. I also wonder why the knowledge of being able to work 3 bytes at a time doesn't speed things up at all.
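For what it's worth, perldoc -f vec documents BITS as a power of two, so 24 is not a supported width, and the 3-byte step may be easier to express with substr(). An untested sketch of that idea (the name mrm_rgb and the $pos += 3 skip are made up here, not from the thread):

    # Hypothetical 3-bytes-at-a-time variant using substr() instead of vec().
    sub mrm_rgb {
        my ( $s1, $s2 ) = @_;
        use bytes;
        my $nul3 = chr(0) x 3;    # a fully transparent 3-byte group
        my $pos  = 0;
        while ( -1 < ( $pos = index $$s1, $nul3, $pos ) ) {
            # copy the whole 3-byte group from the second string
            substr( $$s1, $pos, 3, substr( $s2, $pos, 3 ) );
            $pos += 3;    # step past it in case the copied bytes are NUL too
        }
    }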

    As for why this wasn't in the problem statement - I wanted to solve the general problem and was willing to pay a meter of beer to see the various solutions. That there's an additional constraint in what I'm actually using the solutions for doesn't change what I was willing to pay a meter of beer for. :-)



If you're working with 3-byte groups, that sounds suspiciously like you're processing RGB data, and even if you're not, there is a crazy yet highly efficient (I believe) way of manipulating such data if you have the hardware (and software) to do it:

      Using OpenGL and a chroma-key filter, you can "draw" the two strings over each other in parallel and then retrieve the resulting "image" from the massively parallel hardware again. It's not always certain that the parallelism you gain by offloading the work to the GPU outweighs the cost of transferring the data over the bus and the result back again, especially when benchmarked against the C versions you already have. See http://www.gpgpu.org/ for more information on the concept.

Yes, I am processing RGB data. Specifically, I'm drawing the background PNGs for KayudaMaps. Unfortunately, since this is run on a server, I don't have a GPU (though that can be remedied if it's determined that it's worth it).

The capability that this meter-of-beer is meant to give me is to overlay pictures with transparent bits on top of other pictures without having to use GL, ImageMagick, or Imager. It's not a rewrite of how any of those should do it - I'm just solving my specific problem in a very specialized manner. For example, since the images I'm working with don't have to be perfect, I've constrained my RGB values to 1-255, reserving 0 as transparent. The degradation is unnoticeable, and I can handle transparent overlays without any speed penalty. This is also due to the fact that, for me, a pixel is either fully transparent or fully opaque - another piece that's specific to my problem and isn't generalizable to PNGs.
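As a toy illustration of that scheme (the pixel values are invented), a byte-wise ||= makes the background show through exactly where the overlay is 0:

    # Two 2-pixel (6-byte) RGB strings; 0 is reserved to mean "transparent".
    my $overlay    = chr(0) x 3 . "\x80\x40\x20";  # pixel 1 transparent, pixel 2 opaque
    my $background = "\x11\x22\x33" x 2;

    # Byte-wise ||=: keep each overlay byte unless it is 0, else take the background's.
    my $pos = 0;
    while ( -1 < ( $pos = index $overlay, chr(0), $pos ) ) {
        substr( $overlay, $pos, 1, substr( $background, $pos, 1 ) );
        $pos++;    # move on even if the copied byte is also 0
    }
    # $overlay is now "\x11\x22\x33\x80\x40\x20": the background fills the
    # transparent pixel, and the opaque pixel is untouched.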



That sounds like what I've been doing to solve a similar problem: computing influences in a 2D environment, with influences radiating out over a given radius from a given pixel, and then finding the strongest one for any given point. I've achieved a 500x speedup using GPGPU solutions compared to solving it in a purely CPU-based way.

        Of course, it took longer to write and it needs capable hardware (vertex and fragment shaders are used), but it's fast.

      I think demerphq mentioned it already, but index() returns 0 if the 0-th character matches. It returns -1 if nothing in the string matches ( unless someone has played with $[ ).
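For example:

    print index( "abc", "a" ), "\n";   # 0: match at the very first character
    print index( "abc", "z" ), "\n";   # -1: no match anywhere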

I also had a couple of other thoughts. Surely the Inline::C solutions are faster, but I'm still trying to wring performance out of pure Perl. First, why not use $s2 through a reference as well? Second, what about use bytes (); and then explicitly calling bytes::substr and bytes::index instead of doing the import?

      This code tests both of those together, for a whopping improvement over my solutions using the import and just using $s1 by reference.

sub mrm_6 { # from mrm_5, calling the bytes:: functions explicitly instead of importing
    # also using $s2 in place through the reference
    my ( $s1, $s2 ) = @_;
    use bytes ();
    my $pos = 0;
    while ( -1 < ( $pos = bytes::index( $$s1, '\0', $pos ) ) ) {
        bytes::substr( $$s1, $pos, 1, bytes::substr( $$s2, $pos, 1 ) );
    }
}

      The results are impressive for such simple changes:

      Strawberry Perl 5.8.8 on WinXP, AthlonXP 2400+, 1Gig

                     Rate ikegami_s
ikegami_s          37.0/s        --
avar                188/s      409%
avar2               192/s      418%
avar2_pos           282/s      663%
ikegami_tr          342/s      825%
moritz              783/s     2018%
avar2_pos_inplace  1640/s     4338%
mrm_3              2436/s     6491%
mrm_4              2449/s     6524%
mrm_5              2459/s     6553%
mrm_1              2517/s     6709%
mrm_6              3372/s     9023%
      cygperl 5.8.6 on the same machine as above
                     Rate ikegami_s
ikegami_s          37.9/s        --
avar                285/s      650%
avar2               310/s      718%
ikegami_tr          316/s      733%
avar2_pos           709/s     1769%
moritz             1002/s     2543%
mrm_5              3631/s     9474%
mrm_3              3841/s    10027%
mrm_4              3913/s    10219%
mrm_1              4044/s    10562%
avar2_pos_inplace  4393/s    11484%
mrm_6              6237/s    16345%
      perl 5.8.7 on Mandriva Linux 2006 Athlon 1000, 512MB RAM
                     Rate ikegami_s
ikegami_s          17.2/s        --
avar                168/s      876%
avar2               187/s      983%
ikegami_tr          205/s     1091%
avar2_pos           307/s     1684%
moritz              620/s     3499%
mrm_4              1088/s     6218%
avar2_pos_inplace  1184/s     6775%
mrm_3              1184/s     6775%
mrm_5              1224/s     7006%
mrm_1              1240/s     7101%
mrm_6              1921/s    11052%
      ... and perhaps a sign of good things to come: perl 5.9.5 on the above-mentioned Linux box
                     Rate ikegami_s
ikegami_s          13.3/s        --
avar                172/s     1189%
ikegami_tr          176/s     1220%
avar2               182/s     1267%
avar2_pos           258/s     1837%
moritz              412/s     2987%
avar2_pos_inplace   776/s     5722%
mrm_3               780/s     5749%
mrm_1               785/s     5788%
mrm_5               798/s     5882%
mrm_4               806/s     5942%
mrm_6              2683/s    20026%
      ActivePerl 5.8.0 on the Windows box didn't fare so well under this one

      I killed the benchmark for ActivePerl at 7 minutes CPU time, over 23 million page faults, and over 175MB of memory usage. I guess there's probably a bug in bytes or in perl in that build that's causing the thrashing.

'\0' is not the same as chr(0): in single quotes, '\0' is the two-character string backslash-zero, not a NUL byte, so your index() calls never find a match and the loop body never runs. By changing to chr(0), your code drops dramatically in the rankings.
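A quick way to see the difference:

    my $s = "a\0b";                   # a real NUL byte between 'a' and 'b'
    print index( $s, '\0' ),   "\n";  # -1: '\0' is backslash + '0', never found
    print index( $s, "\0" ),   "\n";  # 1: "\0" in double quotes is a NUL byte
    print index( $s, chr(0) ), "\n";  # 1: chr(0) is the same NUL byte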

Different algorithms should scale quite differently given that 3-byte grouping. Character-based solutions (bit ops, tr///) would not profit, while string/regexp-based solutions such as index, s///, and split could gain speed. I'm not a regexp engine guru, but I assume the engine is intelligent enough not to compare char by char in this situation. In the most common case you get away with comparing the first char of the search string with only every third char of the long "text" string. That's a 3-fold speed increase.

Thanks to some very experienced monks, the regexp engine is not too bad in real-world use cases.

Edit: Even index() accepts a search string of more than one character.
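A rough way to check that claim (the string contents and sizes are invented for illustration):

    use Benchmark qw(cmpthese);

    # A long string that is mostly non-NUL, with NUL bytes in groups of 3.
    my $s = ( "\x01" x 997 . chr(0) x 3 ) x 1_000;

    cmpthese( -3, {
        # scan for single NUL bytes: 3000 matches per pass
        one_byte    => sub { my $p = -1; 1 while -1 < ( $p = index $s, chr(0),     $p + 1 ) },
        # scan for 3-byte NUL runs: 1000 matches, and the engine can skip ahead
        three_bytes => sub { my $p = -1; 1 while -1 < ( $p = index $s, chr(0) x 3, $p + 1 ) },
    } );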