in reply to Risque Romantic Rosetta Roman Race

I combined reading, processing, and output into a small chunking routine utilizing MCE. The parallel Perl code may run faster than C, depending on the number of logical CPU cores. This includes tybalt89's optimization.

# rtoa-pgatram-mce.pl # Example run: perl rtoa-pgatram-mce.pl t1.txt >mce.tmp # # Convert a "modern" Roman Numeral to its arabic (decimal) equivalent. # The alpabetic input string may be assumed to always contain a valid +Roman Numeral in the range 1-3999. # Roman numerals may be upper or lower case. # Error handling is not required. # For example: # input "XLII" should produce the arabic (decimal) value 42 # input "mi" should produce the arabic (decimal) value 1001 use strict; use warnings; use List::Util qw(reduce); use Time::HiRes qw(time); use MCE; # Function roman_to_arabic # Output a list of their arabic (decimal) values to standard output. # sub roman_to_arabic { my $files = shift; # in: reference to a list of files containin +g Roman Numerals (one per line) my %rtoa = ( M=>1000, D=>500, C=>100, L=>50, X=>10, V=>5, I=>1 ); # construct a pool of workers my $mce = MCE->new( max_workers => MCE::Util::get_ncpu(), chunk_size => 90 * 1024, init_relay => 1, # if defined, tells MCE to load MCE::Relay p +lus extra setup posix_exit => 1, user_func => sub { my ( $mce, $chunk_ref, $chunk_id, $output ) = ( @_, '' ); chomp @{$chunk_ref}; for ( @{$chunk_ref} ) { # $output .= reduce { $a+$b-$a%$b*2 } map { $rtoa{$_} } spli +t //, uc($_); $output .= reduce { $a+$b-$a%$b*2 } @rtoa{ split //, uc($_ +) }; $output .= "\n"; } # output orderly MCE::relay { print $output }; } ); # process a list of files for my $fname ( @{$files} ) { warn("$0: cannot access '$fname': No such file or directory\n"), + next unless -e $fname; warn("$0: cannot access '$fname': Permission denied\n"), next unless -r $fname; warn("$0: cannot process '$fname': Is a directory\n"), next if -d $fname; $mce->process({ input_data => $fname }); } # reap MCE workers $mce->shutdown; } @ARGV or die "usage: $0 file...\n"; my @rtoa_files = @ARGV; warn "rtoa pgatram start\n"; my $tstart = time; roman_to_arabic(\@rtoa_files); warn sprintf("time %0.03f secs\n", time - $tstart);

I updated rtoa-pgatram.pl to include the same optimization. What I find interesting is the time difference compared to real time for the C demonstration. The delta time is likely global cleanup. Parallelization involved running on physical and logical CPU cores via the taskset utility. Hence, the reason not seeing 8x comparing one thread and eight threads. For reference, running on eight physical cores completed in 0.814 seconds (7.2x).

$ time ./rtoa-pgatram t1.txt >cpp.tmp read_input_file : 3999000 items read file time : 0.136 secs roman_to_dec time : 0.116 secs output time : 0.187 secs real 0m0.450s user 0m0.416s sys 0m0.034s $ time perl rtoa-pgatram.pl t1.txt >pgatram.tmp rtoa pgatram start read_input_files : 1 secs roman_to_arabic : 4 secs output : 1 secs total : 6 secs real 0m6.259s user 0m6.131s sys 0m0.128s $ time perl rtoa-pgatram-mce.pl t1.txt >mce.tmp rtoa pgatram start time 1 thread : 5.808 secs time 8 threads : 1.151 secs ( 5.05x) time 16 threads : 0.643 secs ( 9.03x) time 32 threads : 0.347 secs (16.74x) time 64 threads : 0.225 secs (25.81x) 1 thd 8 thds 16 thds 32 thds 64 thds real 0m5.835s 0m1.178s 0m0.671s 0m0.375s 0m0.252s user 0m5.832s 0m9.064s 0m8.874s 0m8.935s 0m9.698s sys 0m0.008s 0m0.024s 0m0.030s 0m0.073s 0m0.136s

Thank you, for the interesting series. This allowed me to practice using MCE.

$ cksum cpp.tmp pgatram.tmp mce.tmp 1397892467 18888000 cpp.tmp 1397892467 18888000 pgatram.tmp 1397892467 18888000 mce.tmp

Replies are listed 'Best First'.
Re^2: Risque Romantic Rosetta Roman Race
by eyepopslikeamosquito (Archbishop) on May 12, 2023 at 07:56 UTC

    marioroy, I remain in awe of your MCE masterwork, the most impressive contribution to CPAN in the past ten years IMHO.

    > I updated rtoa-pgatram.pl to include the same optimization

    Is this the tybalt89 optimization? ... or is there another optimization I missed?

    It takes 32 logical cores for your Perl/MCE version to catch up to my C++ version 1.0. Is that right? If so, you might need to work a little harder, because I just updated my C++ version in the root node with an option to apply a one-line change utilizing the quirky C++ fast_io library (which I think you know about ;-) ... no surprise the output time dropped from 0.247 secs to 0.098 secs.

    After the long running Rosetta Code: Long List is Long saga (which I think you know about ;-), there are many more C++ tricks yet to be tried in this node, such as OpenMP, Boost, abseil, Judy Arrays ... though I honestly don't feel inclined to try for a repeat saga. :)

      Is this the tybalt89 optimization? ... or is there another optimization I missed?

      Yes.

      It takes 32 logical cores for your Perl/MCE version to catch up to my C++ version 1.0. Is that right?

      That was done using 16 physical and 16 logical CPU cores via taskset -c 0-15,32-47. BTW, I captured the UNIX time to include any global cleanup. It now takes the entire CPU (64 logical threads) for Perl MCE 1.0 to run faster. :) The Perl time includes launching Perl, loading modules, spawning and reaping workers (~ 0.06 secs).

      # captured UNIX time C++ 1.0 : 0.450s C++ fast_io : 0.291s Perl MCE 64 thds : 0.252s

      I tried also, an ARRAY for indexed-based lookups. But, that runs slower. Edit: Tried unpack, tip by tybalt89. ARRAY lookup is now faster.

      # HASH my %rtoa = ( M=>1000, D=>500, C=>100, L=>50, X=>10, V=>5, I=>1 ); # ARRAY, characters M D C L X V I my @rtoa; @rtoa[qw( 77 68 67 76 88 86 73 )] = qw( 1000 500 100 50 10 5 + 1 ); Perl MCE 64 thds : 0.252s @rtoa{ split //, uc($_) }; Perl MCE 64 thds : 0.297s @rtoa[ map ord, split //, uc($_) ]; Perl MCE 64 thds : 0.192s @rtoa[ unpack 'c*', uc($_) ];

        The Perl MCE code can be made faster by applying CPU affinity and enabling slurp IO.

        Updated on May 12, 2023 with tip from tybalt89 using unpack.
        Updated on May 13, 2023: fixed typo.

        The UNIX real time includes ~ 0.06 seconds for launching Perl, loading modules, spawning and reaping workers.

        # captured UNIX time C++ 1.0 : 0.450s C++ fast_io : 0.291s Perl MCE 64 thds : 0.252s # after tweaks: CPU affinity and slurp IO Perl MCE 64 thds : 0.218s # more tweaks: using an ARRAY and unpack Perl MCE 64 thds : 0.192s $ time perl rtoa-pgatram-mce.pl t1.txt >mce.tmp rtoa pgatram start time 0.165 secs real 0m0.192s user 0m7.596s sys 0m0.420s

        > It now takes the entire CPU (64 logical threads) for Perl MCE 1.0 to run faster. :)

        Now that's a challenge! Can I can push the dial further? ... or will the ingenious tybalt89's unorthodox assistance from the side allow you to move the needle back towards 32? :)

        Since I know how much you enjoyed my (anonymonk-provoked) MAX_STR_LEN_L hack in the long Long List is Long series, I've tried a similar stunt here in a desperate attempt to improve data locality and cache performance. I also added a (cheating) vector reserve and the total time at the end (thanks for pointing out this oversight).

        Anyways, here are the timings of my latest version, rtoa-pgatram-fixed.cpp, using the fast_io library:

        $ ./rtoa-pgatram-fixed t1.txt >f.tmp read_input_files : 3999000 items read file time : 0.164 secs roman_to_dec time : 0.088 secs output time : 0.100 secs total time : 0.353 secs $ diff f.tmp roman.tmp $ ./rtoa-pgatram-fixed t1.txt >f.tmp read_input_files : 3999000 items read file time : 0.167 secs roman_to_dec time : 0.086 secs output time : 0.098 secs total time : 0.353 secs $ diff f.tmp roman.tmp $ ./rtoa-pgatram-fixed t1.txt >f.tmp read_input_files : 3999000 items read file time : 0.173 secs roman_to_dec time : 0.086 secs output time : 0.093 secs total time : 0.353 secs $ diff f.tmp roman.tmp

        Update: with marioroy rtoa-pgatram-fixed2 below (without fast_io):

        $ ./rtoa-pgatram-fixed2 t1.txt >f.tmp read_input_files : 3999000 items read file time : 0.189 secs roman_to_dec time : 0.371 secs total time : 0.560 secs $ diff f.tmp roman.tmp

        ... with fast_io:

        $ ./rtoa-pgatram-fixed2 t1.txt >f.tmp read_input_files : 3999000 items read file time : 0.178 secs roman_to_dec time : 0.143 secs total time : 0.322 secs $ diff f.tmp roman.tmp

        rtoa-pgatram-fixed.cpp

        // rtoa-pgatram-fixed.cpp. Crude fixed length string version. // Compile with: // g++ -o rtoa-pgatram-fixed -std=c++20 -Wall -O3 rtoa-pgatram-fixed +.cpp // or: // clang++ -o rtoa-pgatram-fixed -std=c++20 -Wall -O3 rtoa-pgatram-f +ixed.cpp // or: // g++ -o rtoa-pgatram-fixed -std=c++20 -Wall -O3 -I "$HOME/local-fa +st_io/fast_io/include" rtoa-pgatram-fixed.cpp // to use the locally installed fast_io header-only library #include <cctype> #include <cstring> #include <iostream> #include <string> #include <vector> #include <numeric> #include <chrono> #include <iomanip> // See [id://11149504] for more info on the fast_io C++ library // #include <fast_io.h> // ------------------------------------------------------------------- +--------- typedef std::chrono::high_resolution_clock high_resolution_clock; typedef std::chrono::high_resolution_clock::time_point time_point; typedef std::chrono::milliseconds milliseconds; double elaspe_time( time_point cend, time_point cstart) { return double ( std::chrono::duration_cast<milliseconds>(cend - cstart).count() ) * 1e-3; } // ------------------------------------------------------------------- +--------- // Longest roman numeral is MMMDCCCLXXXVIII (3888) of length 15 // XXX: I'm off by one somewhere because 3888 fails with // MAX_STR_LEN_L of 16 but works with 17 #define MAX_STR_LEN_L 17 // The basic idea is to keep this struct small and without pointers to // improve data locality/cache performance when traversing the vector struct str_type { char slen; char str[MAX_STR_LEN_L]; }; using vec_str_type = std::vector<str_type>; using vec_int_type = std::vector<int>; // Read an input file of Roman Numerals and append them to a list // Return the number of Roman Numerals appended static int read_input_file( const char* fname, // in: file name containing a list of +Roman Numerals vec_str_type& vec_ret) // out: a vector of Roman Numeral strin +gs { FILE* fh; str_type line; int cnt = 0; fh = ::fopen(fname, "r"); if ( fh == NULL ) { std::cerr << "Error opening '" << fname << "' : " << strerror(er +rno) << "\n"; return 0; } while ( ::fgets( line.str, MAX_STR_LEN_L, fh ) != NULL ) { line.slen = ::strlen(line.str) - 1; // -1 to strip trailing n +ewline vec_ret.emplace_back(line); ++cnt; } ::fclose(fh); return cnt; } // --------------------------------------------------------------- // Though there are less than 256 initializers in this ascii table, // the others are guaranteed by ANSI C to be initialized to zero. static const int romtab[256] = { 0,0,0,0,0,0, 0, 0, 0, 0, // 00- 09 0,0,0,0,0,0, 0, 0, 0, 0, // 10- 19 0,0,0,0,0,0, 0, 0, 0, 0, // 20- 29 0,0,0,0,0,0, 0, 0, 0, 0, // 30- 39 0,0,0,0,0,0, 0, 0, 0, 0, // 40- 49 0,0,0,0,0,0, 0, 0, 0, 0, // 50- 59 0,0,0,0,0,0, 0, 100, 500, 0, // 60- 69 0,0,0,1,0,0, 50,1000, 0, 0, // 70- 79 0,0,0,0,0,0, 5, 0, 10, 0, // 80- 89 0,0,0,0,0,0, 0, 0, 0, 100, // 90- 99 500,0,0,0,0,1, 0, 0, 50,1000, // 100-109 0,0,0,0,0,0, 0, 0, 5, 0, // 110-119 10,0,0,0,0,0, 0, 0, 0, 0 // 120-129 }; // Return the arabic number for a roman letter c. // Return zero if the roman letter c is invalid. inline int urtoa(int c) { return romtab[c]; } inline int accfn(int t, char c) { return t + urtoa(c) - t % urtoa(c) * 2; } inline int roman_to_dec(const str_type& st) { return std::accumulate( st.str, st.str + st.slen, 0, accfn ); } int main(int argc, char* argv[]) { if (argc < 2) { std::cerr << "usage: rtoa-pgatram-fixed file...\n"; return 1; } // Get the list of input files from the command line int nfiles = argc - 1; char** fname = &argv[1]; std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fix +ed); time_point cstart1, cend1, cstart2, cend2, cstart3, cend3; // Read the input files into roman_list vec_str_type roman_list; roman_list.reserve(3999000); cstart1 = high_resolution_clock::now(); int cnt = 0; for (int i = 0; i < nfiles; ++i) { cnt += read_input_file( fname[i], roman_list ); } cend1 = high_resolution_clock::now(); double ctaken1 = elaspe_time(cend1, cstart1); std::cerr << "read_input_files : " << cnt << " items\n"; std::cerr << "read file time : " << std::setw(8) << ctaken1 << " + secs\n"; // Convert roman to decimal vec_int_type arabic_list; arabic_list.reserve(3999000); cstart2 = high_resolution_clock::now(); for ( auto const& r : roman_list ) { arabic_list.emplace_back( roman_to_dec(r) ); } cend2 = high_resolution_clock::now(); double ctaken2 = elaspe_time(cend2, cstart2); std::cerr << "roman_to_dec time : " << std::setw(8) << ctaken2 << " + secs\n"; // Output to stdout cstart3 = high_resolution_clock::now(); for ( auto const& i : arabic_list ) { std::cout << i << '\n'; // fast_io::io::println(i); } cend3 = high_resolution_clock::now(); double ctaken3 = elaspe_time(cend3, cstart3); std::cerr << "output time : " << std::setw(8) << ctaken3 << " + secs\n"; double ctaken = elaspe_time(cend3, cstart1); std::cerr << "total time : " << std::setw(8) << ctaken << +" secs\n"; return 0; }