in reply to Re: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
in thread Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)

It's not vec() that's slowing this down. According to my measurements, this version is one of the fastest pure-Perl ones so far:
sub mrm_4 { # from [bart]'s vec()
    my ( $s1, $s2 ) = @_;
    use bytes;
    my $pos = 0;
    vec( $$s1, $pos, 8 ) ||= vec( $s2, $pos, 8 )
        while -1 < ( $pos = index $$s1, '\0', $pos );
}

It must be the copying and the for loop that are slowing it down compared to working on the referenced scalar and using index().
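The measurements mentioned above look like Benchmark::cmpthese comparisons (the Rate tables further down have that shape). A minimal harness sketch - the data setup, sizes, and the fresh-copy-per-run detail are assumptions here, not the original benchmark - would be:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Hypothetical setup: $base has no NUL bytes; $holes gets a few
    # 3-byte NUL "holes" for the candidate subs to fill in.
    my $base  = join '', map { chr( 1 + int rand 255 ) } 1 .. 100_000;
    my $holes = $base;
    substr( $holes, $_ * 1_000, 3, "\0\0\0" ) for 0 .. 99;

    cmpthese( -5, {
        # work on a fresh copy so every iteration does the same amount of work
        mrm_4 => sub { my $copy = $holes; mrm_4( \$copy, $base ) },
        # ... one entry per candidate sub ...
    } );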

Re^3: Challenge: CPU-optimized byte-wise or-equals (for a meter of beer)
by dragonchild (Archbishop) on Sep 13, 2007 at 01:31 UTC
    This is very interesting. Your version rates at the same speed as moritz's and avar's once the bug is fixed - it's 0<, not -1<.

One thing that is odd is that the specific problem I'm working on has all of the chr(0)'s in groups of 3. So, I figured that I could use that and change the primary line to:

    vec( $$s1, $pos, 24 ) ||= vec( $$s2, $pos, 24 ) while 0 < ( $pos = index $$s1, chr(0)x3, $pos );
    Except, that slows it down by 20%. If I change it so that the 24's stay, but it goes back to being chr(0) without the x3, it's back to being the same speed. I wonder why that is. I also wonder why the knowledge of being able to work 3 bytes at a time doesn't speed things up at all.
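For what it's worth, perldoc -f vec documents BITS as a power of two, so 24 is not a supported width, and the 3-byte step may be easier to express with substr(). An untested sketch of that idea (the name mrm_rgb and the $pos += 3 skip are made up here, not from the thread):

    # Hypothetical 3-bytes-at-a-time variant using substr() instead of vec().
    sub mrm_rgb {
        my ( $s1, $s2 ) = @_;
        use bytes;
        my $nul3 = chr(0) x 3;    # a fully transparent 3-byte group
        my $pos  = 0;
        while ( -1 < ( $pos = index $$s1, $nul3, $pos ) ) {
            # copy the whole 3-byte group from the second string
            substr( $$s1, $pos, 3, substr( $s2, $pos, 3 ) );
            $pos += 3;    # step past it in case the copied bytes are NUL too
        }
    }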

    As for why this wasn't in the problem statement - I wanted to solve the general problem and was willing to pay a meter of beer to see the various solutions. That there's an additional constraint in what I'm actually using the solutions for doesn't change what I was willing to pay a meter of beer for. :-)



If you're working with 3-byte groups, that sounds suspiciously like you're processing RGB data, and even if you're not, there is a crazy yet highly efficient (I believe) way of manipulating such data if you have the hardware (and software) to do it:

      Using OpenGL and a chroma-key filter, you can "draw" the two strings over each other in parallel and then retrieve the resulting "image" from the massively parallel hardware again. It's not always certain that the parallelism you gain by offloading the work to the GPU outweighs the cost of transferring the data over the bus and the result back again, especially when benchmarked against the C versions you already have. See http://www.gpgpu.org/ for more information on the concept.

Yes, I am processing RGB data. Specifically, I'm drawing the background PNGs for KayudaMaps. Unfortunately, since this is run on a server, I don't have a GPU (though that can be remedied if it's determined that it's worth it).

The capability that this meter-of-beer is meant to give me is to overlay pictures with transparent bits on top of other pictures without having to use GL, ImageMagick, or Imager. It's not a rewrite of how any of those should do it - I'm just solving my specific problem in a very specialized manner. For example, since the images I'm working with don't have to be perfect, I've constrained my RGB values to 1-255, reserving 0 as transparent. The degradation is unnoticeable, and I can handle transparent overlays without any speed penalty. This is also due to the fact that, for me, a pixel is either fully transparent or fully opaque - another piece that's specific to my problem and isn't generalizable to PNGs.
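As a toy illustration of that scheme (the pixel values are invented), a byte-wise ||= makes the background show through exactly where the overlay is 0:

    # Two 2-pixel (6-byte) RGB strings; 0 is reserved to mean "transparent".
    my $overlay    = chr(0) x 3 . "\x80\x40\x20";  # pixel 1 transparent, pixel 2 opaque
    my $background = "\x11\x22\x33" x 2;

    # Byte-wise ||=: keep each overlay byte unless it is 0, else take the background's.
    my $pos = 0;
    while ( -1 < ( $pos = index $overlay, chr(0), $pos ) ) {
        substr( $overlay, $pos, 1, substr( $background, $pos, 1 ) );
        $pos++;    # move on even if the copied byte is also 0
    }
    # $overlay is now "\x11\x22\x33\x80\x40\x20": the background fills the
    # transparent pixel, and the opaque pixel is untouched.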



That sounds like what I've been doing to solve a similar problem: computing influences in a 2D environment, with influences radiating out over a given radius from a given pixel, and then finding the strongest one for any given point. I've achieved a 500x speedup using GPGPU solutions compared to solving it in a purely CPU-based way.

        Of course, it took longer to write and it needs capable hardware (vertex and fragment shaders are used), but it's fast.

      I think demerphq mentioned it already, but index() returns 0 if the 0-th character matches. It returns -1 if nothing in the string matches ( unless someone has played with $[ ).
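For example:

    print index( "abc", "a" ), "\n";   # 0: match at the very first character
    print index( "abc", "z" ), "\n";   # -1: no match anywhere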

I also had a couple of other thoughts. Surely the Inline::C solutions are faster, but I'm still trying to wring performance out of pure Perl. First, why not use $s2 through a reference as well? Second, what about use bytes (); and then explicitly calling bytes::substr and bytes::index instead of doing the import?

      This code tests both of those together, for a whopping improvement over my solutions using the import and just using $s1 by reference.

sub mrm_6 { # from mrm_5, calling the bytes:: functions explicitly instead of importing
    # also using $s2 in place through the reference
    my ( $s1, $s2 ) = @_;
    use bytes ();
    my $pos = 0;
    while ( -1 < ( $pos = bytes::index( $$s1, '\0', $pos ) ) ) {
        bytes::substr( $$s1, $pos, 1, bytes::substr( $$s2, $pos, 1 ) );
    }
}

      The results are impressive for such simple changes:

      Strawberry Perl 5.8.8 on WinXP, AthlonXP 2400+, 1Gig

                     Rate ikegami_s
ikegami_s          37.0/s        --
avar                188/s      409%
avar2               192/s      418%
avar2_pos           282/s      663%
ikegami_tr          342/s      825%
moritz              783/s     2018%
avar2_pos_inplace  1640/s     4338%
mrm_3              2436/s     6491%
mrm_4              2449/s     6524%
mrm_5              2459/s     6553%
mrm_1              2517/s     6709%
mrm_6              3372/s     9023%
      cygperl 5.8.6 on the same machine as above
                     Rate ikegami_s
ikegami_s          37.9/s        --
avar                285/s      650%
avar2               310/s      718%
ikegami_tr          316/s      733%
avar2_pos           709/s     1769%
moritz             1002/s     2543%
mrm_5              3631/s     9474%
mrm_3              3841/s    10027%
mrm_4              3913/s    10219%
mrm_1              4044/s    10562%
avar2_pos_inplace  4393/s    11484%
mrm_6              6237/s    16345%
      perl 5.8.7 on Mandriva Linux 2006 Athlon 1000, 512MB RAM
                     Rate ikegami_s
ikegami_s          17.2/s        --
avar                168/s      876%
avar2               187/s      983%
ikegami_tr          205/s     1091%
avar2_pos           307/s     1684%
moritz              620/s     3499%
mrm_4              1088/s     6218%
avar2_pos_inplace  1184/s     6775%
mrm_3              1184/s     6775%
mrm_5              1224/s     7006%
mrm_1              1240/s     7101%
mrm_6              1921/s    11052%
      ... and perhaps a sign of good things to come: perl 5.9.5 on the above-mentioned Linux box
                     Rate ikegami_s
ikegami_s          13.3/s        --
avar                172/s     1189%
ikegami_tr          176/s     1220%
avar2               182/s     1267%
avar2_pos           258/s     1837%
moritz              412/s     2987%
avar2_pos_inplace   776/s     5722%
mrm_3               780/s     5749%
mrm_1               785/s     5788%
mrm_5               798/s     5882%
mrm_4               806/s     5942%
mrm_6              2683/s    20026%
      ActivePerl 5.8.0 on the Windows box didn't fare so well under this one

      I killed the benchmark for ActivePerl at 7 minutes CPU time, over 23 million page faults, and over 175MB of memory usage. I guess there's probably a bug in bytes or in perl in that build that's causing the thrashing.

'\0' is not the same as chr(0): in single quotes, '\0' is the two-character string backslash-zero, not a NUL byte, so your index() calls never find a match and the loop body never runs. By changing to chr(0), your code drops dramatically in the rankings.
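A quick way to see the difference:

    my $s = "a\0b";                   # a real NUL byte between 'a' and 'b'
    print index( $s, '\0' ),   "\n";  # -1: '\0' is backslash + '0', never found
    print index( $s, "\0" ),   "\n";  # 1: "\0" in double quotes is a NUL byte
    print index( $s, chr(0) ), "\n";  # 1: chr(0) is the same NUL byte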

Different algorithms should scale quite differently given that 3-byte grouping. Character-based solutions (bit ops, tr///) would not profit, while string/regexp-based solutions such as index, s///, and split could gain speed. I'm not a regexp engine guru, but I assume the engine is intelligent enough not to compare char by char in this situation. In the most common case you get away with comparing the first char of the search string with only every third char of the long "text" string. That's a 3-fold speed increase.

Thanks to some very experienced monks, the regexp engine is not too bad in real-world use cases.

Edit: Even index() accepts a search string of more than one character.
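A rough way to check that claim (the string contents and sizes are invented for illustration):

    use Benchmark qw(cmpthese);

    # A long string that is mostly non-NUL, with NUL bytes in groups of 3.
    my $s = ( "\x01" x 997 . chr(0) x 3 ) x 1_000;

    cmpthese( -3, {
        # scan for single NUL bytes: 3000 matches per pass
        one_byte    => sub { my $p = -1; 1 while -1 < ( $p = index $s, chr(0),     $p + 1 ) },
        # scan for 3-byte NUL runs: 1000 matches, and the engine can skip ahead
        three_bytes => sub { my $p = -1; 1 while -1 < ( $p = index $s, chr(0) x 3, $p + 1 ) },
    } );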