in reply to Re^2: Risque Romantic Rosetta Roman Race
in thread Risque Romantic Rosetta Roman Race
> Is this the tybalt89 optimization? ... or is there another optimization I missed?
Yes.
> It takes 32 logical cores for your Perl/MCE version to catch up to my C++ version 1.0. Is that right?
That was done using 16 physical and 16 logical CPU cores via taskset -c 0-15,32-47. BTW, I captured the UNIX time to include any global cleanup. It now takes the entire CPU (64 logical threads) for Perl MCE 1.0 to run faster. :) The Perl time includes launching Perl, loading modules, spawning and reaping workers (~ 0.06 secs).
# captured UNIX time
C++ 1.0          : 0.450s
C++ fast_io      : 0.291s
Perl MCE 64 thds : 0.252s
I also tried an ARRAY for index-based lookups, but that ran slower. Edit: Tried unpack, a tip from tybalt89. The ARRAY lookup is now faster.
# HASH
my %rtoa = ( M=>1000, D=>500, C=>100, L=>50, X=>10, V=>5, I=>1 );

# ARRAY, characters:    M  D  C  L  X  V  I
my @rtoa;
@rtoa[qw( 77 68 67 76 88 86 73 )] = qw( 1000 500 100 50 10 5 1 );

Perl MCE 64 thds : 0.252s   @rtoa{ split //, uc($_) };
Perl MCE 64 thds : 0.297s   @rtoa[ map ord, split //, uc($_) ];
Perl MCE 64 thds : 0.192s   @rtoa[ unpack 'c*', uc($_) ];
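For comparison with the C++ side of the race, the same index-by-character-code idea maps to a plain int array indexed by the character value. A minimal, illustrative sketch (my own, not taken from any of the programs benchmarked here), including the usual subtractive-notation handling:

// Minimal sketch of the character-code-indexed lookup in C++
// (illustrative only; not the code from any of the versions benchmarked here).
#include <cctype>
#include <cstdio>
#include <string>

static int rtoa_val[256];                 // value table indexed by character code

static long roman_to_arabic(const std::string& s) {
    long total = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        int a = rtoa_val[(unsigned char)std::toupper((unsigned char)s[i])];
        int b = (i + 1 < s.size())
              ? rtoa_val[(unsigned char)std::toupper((unsigned char)s[i + 1])] : 0;
        total += (a < b) ? -a : a;        // subtractive notation: IV = 4, CM = 900
    }
    return total;
}

int main() {
    rtoa_val['M'] = 1000; rtoa_val['D'] = 500; rtoa_val['C'] = 100;
    rtoa_val['L'] = 50;   rtoa_val['X'] = 10;  rtoa_val['V'] = 5;  rtoa_val['I'] = 1;
    std::printf("%ld\n", roman_to_arabic("MCMXCIX"));   // prints 1999
}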
Re^4: Risque Romantic Rosetta Roman Race - MCE Array Reduce
by marioroy (Prior) on May 12, 2023 at 10:40 UTC
The Perl MCE code can be made faster by applying CPU affinity and enabling slurp IO. Updated on May 12, 2023 with tip from tybalt89 using unpack.
The UNIX real time includes ~ 0.06 seconds for launching Perl, loading modules, spawning and reaping workers.
Re^4: Risque Romantic Rosetta Roman Race
by tybalt89 (Monsignor) on May 12, 2023 at 16:11 UTC
Would @rtoa[ unpack 'c*', uc($_) ] run faster than @rtoa[ map ord, split //, uc($_) ]?
by marioroy (Prior) on May 12, 2023 at 18:24 UTC
Thanks, tybalt89. Yes, it runs faster :) completing in less than 0.2 seconds. I updated the MCE demonstration.
Re^4: Risque Romantic Rosetta Roman Race
by eyepopslikeamosquito (Archbishop) on May 13, 2023 at 09:03 UTC
> It now takes the entire CPU (64 logical threads) for Perl MCE 1.0 to run faster. :)

Now that's a challenge! Can I push the dial further? ... or will the ingenious tybalt89's unorthodox assistance from the side allow you to move the needle back towards 32? :)

Since I know how much you enjoyed my (anonymonk-provoked) MAX_STR_LEN_L hack in the long Long List is Long series, I've tried a similar stunt here in a desperate attempt to improve data locality and cache performance. I also added a (cheating) vector reserve and the total time at the end (thanks for pointing out this oversight).

Anyways, here are the timings of my latest version, rtoa-pgatram-fixed.cpp, using the fast_io library:
Update: with marioroy rtoa-pgatram-fixed2 below (without fast_io):
... with fast_io:
rtoa-pgatram-fixed.cpp
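As a rough illustration of the fixed-length-buffer idea described above (the constant name and size below are made up, not the ones used in rtoa-pgatram-fixed.cpp): keeping each numeral in a fixed-size char array stored inline in the vector avoids a heap allocation per string and keeps the records contiguous, while reserve() pre-sizes the vector so it never reallocates mid-run.

// Illustrative only: fixed-length inline storage instead of one
// heap-allocated std::string per line; MAX_ROMAN_LEN is a hypothetical constant.
#include <cstring>
#include <vector>

constexpr std::size_t MAX_ROMAN_LEN = 16;

struct RomanRec {
    char s[MAX_ROMAN_LEN];            // stored inline, so records sit contiguously in memory
};

int main() {
    std::vector<RomanRec> recs;
    recs.reserve(10'000'000);         // the "cheating" reserve: no reallocation while loading

    RomanRec r{};
    std::strncpy(r.s, "MCMXCIX", MAX_ROMAN_LEN - 1);
    recs.push_back(r);                // copies a small POD, no per-element allocation
}

Compared with a std::vector<std::string>, scanning these records walks one contiguous block of memory, which is where the data-locality and cache win comes from.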
by marioroy (Prior) on May 13, 2023 at 10:32 UTC
> ... move the needle back towards 32?

Ah, I missed sharing that it no longer takes the full CPU (64 threads) to run as fast as C++. Below, I specify t1.txt four times to increase the compute time. It takes 17 physical CPU cores for Perl to run faster than C++ :). Update: Using the faster MCE variant. See tybalt89's enhancement.
I modified rtoa-pgatram-fixed.cpp and removed the last vector, cstart3, and cend3, so results are written to standard output immediately. Perl now needs 4 more CPU cores to run faster. Crazy :)
The above results were captured on Fedora Linux 38. I also tried the Perl binary on Clear Linux for better performance :)
About the Perl MCE demonstration: I made it simply to showcase running parallel in Perl. It was a fun exercise to see how many CPU cores Perl needs to reach C++ using fast_io.
by eyepopslikeamosquito (Archbishop) on May 14, 2023 at 04:17 UTC
For fun, I combined your change to eliminate the last vector with some old OpenMP code I'm sure you'll recognize. :)
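As a rough, self-contained sketch of this kind of OpenMP parallel conversion (my own illustration, not the author's code; it parallelizes across lines for simplicity, whereas rtoa-pgatram-openmp apparently works across input files, per the "many files" remark further down):

// Illustrative OpenMP sketch (not the author's code). Compile with -fopenmp;
// without it the pragma is ignored and the loop simply runs serially.
#include <cctype>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> lines = { "XIV", "MMXXIII", "MCMXCIX" };   // stand-in input
    std::vector<long> values(lines.size());

    int val[256] = {};
    val['M'] = 1000; val['D'] = 500; val['C'] = 100;
    val['L'] = 50;   val['X'] = 10;  val['V'] = 5;  val['I'] = 1;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)lines.size(); ++i) {
        const std::string& s = lines[i];
        long total = 0;
        for (std::size_t j = 0; j < s.size(); ++j) {
            int a = val[(unsigned char)std::toupper((unsigned char)s[j])];
            int b = (j + 1 < s.size())
                  ? val[(unsigned char)std::toupper((unsigned char)s[j + 1])] : 0;
            total += (a < b) ? -a : a;
        }
        values[i] = total;              // indexed store keeps results in input order
    }

    for (long v : values) std::printf("%ld\n", v);   // single writer after the parallel loop
}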
As expected, it's a little bit faster:
References Added Later
by eyepopslikeamosquito (Archbishop) on May 15, 2023 at 03:33 UTC
I may have been overthinking this. :) Here's a simple all-in-one version with no interim storage in vectors at all.
As you can see, this is twice as fast as rtoa-pgatram-fixed.
Update: Oops, the above rtoa-pgatram-fixed timing figures were built without using fast_io. The timings with fast_io on my machine are: ... not twice as fast, but it is still faster when you don't store anything in a vector ... though rtoa-pgatram-openmp might be faster with many files ... so I probably need to find a way to make rtoa-pgatram-allinone concurrent somehow (e.g. via chunking). Will this all-in-one version, rtoa-pgatram-allinone, be deemed acceptable by marioroy?
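A rough sketch of what such an all-in-one structure can look like (my own illustration using plain iostream rather than fast_io, not rtoa-pgatram-allinone itself; the per-line output format is assumed): each line is converted and written immediately, so nothing is ever kept in a vector.

// Illustrative "all-in-one" shape (not the actual program): read a line,
// convert it, print it, keep nothing around; no interim vectors at all.
#include <cctype>
#include <cstdio>
#include <fstream>
#include <string>

int main(int argc, char* argv[]) {
    int val[256] = {};
    val['M'] = 1000; val['D'] = 500; val['C'] = 100;
    val['L'] = 50;   val['X'] = 10;  val['V'] = 5;  val['I'] = 1;

    for (int f = 1; f < argc; ++f) {
        std::ifstream in(argv[f]);
        std::string s;
        while (std::getline(in, s)) {
            long total = 0;
            for (std::size_t i = 0; i < s.size(); ++i) {
                int a = val[(unsigned char)std::toupper((unsigned char)s[i])];
                int b = (i + 1 < s.size())
                      ? val[(unsigned char)std::toupper((unsigned char)s[i + 1])] : 0;
                total += (a < b) ? -a : a;
            }
            std::printf("%ld\n", total);   // write immediately, store nothing
        }
    }
}

Making this concurrent via chunking would mean letting each worker read and convert its own byte range of the input, then emitting per-chunk output in chunk order, broadly the approach the Perl MCE demonstration takes.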
by marioroy (Prior) on May 15, 2023 at 12:23 UTC
by eyepopslikeamosquito (Archbishop) on May 16, 2023 at 00:36 UTC