Re: Risque Romantic Rosetta Roman Race

I combined reading, processing, and output into a small chunking routine utilizing MCE. The parallel Perl code may run faster than C, depending on the number of logical CPU cores. This includes tybalt89's optimization.

# rtoa-pgatram-mce.pl
# Example run: perl rtoa-pgatram-mce.pl t1.txt >mce.tmp
#
# Convert a "modern" Roman Numeral to its arabic (decimal) equivalent.
# The alpabetic input string may be assumed to always contain a valid 
+Roman Numeral in the range 1-3999.
# Roman numerals may be upper or lower case.
# Error handling is not required.
# For example:
#    input "XLII" should produce the arabic (decimal) value 42
#    input "mi"   should produce the arabic (decimal) value 1001

use strict;
use warnings;
use List::Util qw(reduce);
use Time::HiRes qw(time);
use MCE;

# Function roman_to_arabic
# Output a list of their arabic (decimal) values to standard output.
#
sub roman_to_arabic {
   my $files = shift;    # in:  reference to a list of files containin
+g Roman Numerals (one per line)
   my %rtoa = ( M=>1000, D=>500, C=>100, L=>50, X=>10, V=>5, I=>1 );

   # construct a pool of workers
   my $mce = MCE->new(
      max_workers => MCE::Util::get_ncpu(),
      chunk_size  => 90 * 1024,
      init_relay  => 1,   # if defined, tells MCE to load MCE::Relay p
+lus extra setup
      posix_exit  => 1,
      user_func   => sub {
         my ( $mce, $chunk_ref, $chunk_id, $output ) = ( @_, '' );
         chomp @{$chunk_ref};
         for ( @{$chunk_ref} ) {
          # $output .= reduce { $a+$b-$a%$b*2 } map { $rtoa{$_} } spli
+t //, uc($_);
            $output .= reduce { $a+$b-$a%$b*2 } @rtoa{ split //, uc($_
+) };
            $output .= "\n";
         }
         # output orderly
         MCE::relay { print $output };
      }
   );

   # process a list of files
   for my $fname ( @{$files} ) {
      warn("$0: cannot access '$fname': No such file or directory\n"),
+ next
         unless -e $fname;
      warn("$0: cannot access '$fname': Permission denied\n"), next
         unless -r $fname;
      warn("$0: cannot process '$fname': Is a directory\n"), next
         if -d $fname;
      $mce->process({ input_data => $fname });
   }

   # reap MCE workers
   $mce->shutdown;
}

@ARGV or die "usage: $0 file...\n";
my @rtoa_files = @ARGV;

warn "rtoa pgatram start\n";
my $tstart = time;
roman_to_arabic(\@rtoa_files);
warn sprintf("time %0.03f secs\n", time - $tstart);
[download]

I updated rtoa-pgatram.pl to include the same optimization. What I find interesting is the time difference compared to real time for the C demonstration. The delta time is likely global cleanup. Parallelization involved running on physical and logical CPU cores via the taskset utility. Hence, the reason not seeing 8x comparing one thread and eight threads. For reference, running on eight physical cores completed in 0.814 seconds (7.2x).

$ time ./rtoa-pgatram t1.txt >cpp.tmp
read_input_file   : 3999000 items
read file time    :    0.136 secs
roman_to_dec time :    0.116 secs
output time       :    0.187 secs

real    0m0.450s
user    0m0.416s
sys     0m0.034s

$ time perl rtoa-pgatram.pl t1.txt >pgatram.tmp
rtoa pgatram start
read_input_files : 1 secs
roman_to_arabic  : 4 secs
output           : 1 secs
total            : 6 secs

real    0m6.259s
user    0m6.131s
sys     0m0.128s

$ time perl rtoa-pgatram-mce.pl t1.txt >mce.tmp
rtoa pgatram start
time  1 thread   : 5.808 secs
time  8 threads  : 1.151 secs ( 5.05x)
time 16 threads  : 0.643 secs ( 9.03x)
time 32 threads  : 0.347 secs (16.74x)
time 64 threads  : 0.225 secs (25.81x)

          1 thd     8 thds   16 thds   32 thds   64 thds
real    0m5.835s  0m1.178s  0m0.671s  0m0.375s  0m0.252s
user    0m5.832s  0m9.064s  0m8.874s  0m8.935s  0m9.698s
sys     0m0.008s  0m0.024s  0m0.030s  0m0.073s  0m0.136s
[download]

Thank you, for the interesting series. This allowed me to practice using MCE.

$ cksum cpp.tmp pgatram.tmp mce.tmp
1397892467 18888000 cpp.tmp
1397892467 18888000 pgatram.tmp
1397892467 18888000 mce.tmp
[download]

Comment on Re: Risque Romantic Rosetta Roman Race - MCE Hash Reduce Select or Download Code

Replies are listed 'Best First'.
Re^2: Risque Romantic Rosetta Roman Race by eyepopslikeamosquito (Archbishop) on May 12, 2023 at 07:56 UTC
marioroy, I remain in awe of your MCE masterwork, the most impressive contribution to CPAN in the past ten years IMHO. > I updated rtoa-pgatram.pl to include the same optimization Is this the tybalt89 optimization? ... or is there another optimization I missed? It takes 32 logical cores for your Perl/MCE version to catch up to my C++ version 1.0. Is that right? If so, you might need to work a little harder, because I just updated my C++ version in the root node with an option to apply a one-line change utilizing the quirky C++ fast_io library (which I think you know about ;-) ... no surprise the output time dropped from 0.247 secs to 0.098 secs. After the long running Rosetta Code: Long List is Long saga (which I think you know about ;-), there are many more C++ tricks yet to be tried in this node, such as OpenMP, Boost, abseil, Judy Arrays ... though I honestly don't feel inclined to try for a repeat saga. :)	[reply]
Re^3: Risque Romantic Rosetta Roman Race - Tried Array Lookup by marioroy (Prior) on May 12, 2023 at 09:29 UTC
Is this the tybalt89 optimization? ... or is there another optimization I missed? Yes. It takes 32 logical cores for your Perl/MCE version to catch up to my C++ version 1.0. Is that right? That was done using 16 physical and 16 logical CPU cores via taskset -c 0-15,32-47. BTW, I captured the UNIX time to include any global cleanup. It now takes the entire CPU (64 logical threads) for Perl MCE 1.0 to run faster. :) The Perl time includes launching Perl, loading modules, spawning and reaping workers (~ 0.06 secs). `# captured UNIX time C++ 1.0 : 0.450s C++ fast_io : 0.291s Perl MCE 64 thds : 0.252s` [download] I tried also, an ARRAY for indexed-based lookups. But, that runs slower. Edit: Tried unpack, tip by tybalt89. ARRAY lookup is now faster. `# HASH my %rtoa = ( M=>1000, D=>500, C=>100, L=>50, X=>10, V=>5, I=>1 ); # ARRAY, characters M D C L X V I my @rtoa; @rtoa[qw( 77 68 67 76 88 86 73 )] = qw( 1000 500 100 50 10 5 + 1 ); Perl MCE 64 thds : 0.252s @rtoa{ split //, uc($_) }; Perl MCE 64 thds : 0.297s @rtoa[ map ord, split //, uc($_) ]; Perl MCE 64 thds : 0.192s @rtoa[ unpack 'c*', uc($_) ];` [download]	[reply] [d/l] [select]
Re^4: Risque Romantic Rosetta Roman Race - MCE Array Reduce by marioroy (Prior) on May 12, 2023 at 10:40 UTC
The Perl MCE code can be made faster by applying CPU affinity and enabling slurp IO. Updated on May 12, 2023 with tip from tybalt89 using unpack. Updated on May 13, 2023: fixed typo. Read more... (3 kB) The UNIX real time includes ~ 0.06 seconds for launching Perl, loading modules, spawning and reaping workers. `# captured UNIX time C++ 1.0 : 0.450s C++ fast_io : 0.291s Perl MCE 64 thds : 0.252s # after tweaks: CPU affinity and slurp IO Perl MCE 64 thds : 0.218s # more tweaks: using an ARRAY and unpack Perl MCE 64 thds : 0.192s $ time perl rtoa-pgatram-mce.pl t1.txt >mce.tmp rtoa pgatram start time 0.165 secs real 0m0.192s user 0m7.596s sys 0m0.420s` [download]	[reply] [d/l] [select]
Re^4: Risque Romantic Rosetta Roman Race by tybalt89 (Monsignor) on May 12, 2023 at 16:11 UTC
Would `unpack 'c*', uc($_)` [download] run faster than `map ord, split //, uc($_)` [download] ?	[reply] [d/l] [select]
Re^5: Risque Romantic Rosetta Roman Race - Testing Unpack by marioroy (Prior) on May 12, 2023 at 18:24 UTC
Re^4: Risque Romantic Rosetta Roman Race by eyepopslikeamosquito (Archbishop) on May 13, 2023 at 09:03 UTC
> It now takes the entire CPU (64 logical threads) for Perl MCE 1.0 to run faster. :) Now that's a challenge! Can I can push the dial further? ... or will the ingenious tybalt89's unorthodox assistance from the side allow you to move the needle back towards 32? :) Since I know how much you enjoyed my (anonymonk-provoked) `MAX_STR_LEN_L` hack in the long Long List is Long series, I've tried a similar stunt here in a desperate attempt to improve data locality and cache performance. I also added a (cheating) `vector` reserve and the total time at the end (thanks for pointing out this oversight). Anyways, here are the timings of my latest version, `rtoa-pgatram-fixed.cpp`, using the `fast_io` library: $ ./rtoa-pgatram-fixed t1.txt >f.tmp read_input_files : 3999000 items read file time : 0.164 secs roman_to_dec time : 0.088 secs output time : 0.100 secs total time : 0.353 secs $ diff f.tmp roman.tmp $ ./rtoa-pgatram-fixed t1.txt >f.tmp read_input_files : 3999000 items read file time : 0.167 secs roman_to_dec time : 0.086 secs output time : 0.098 secs total time : 0.353 secs $ diff f.tmp roman.tmp $ ./rtoa-pgatram-fixed t1.txt >f.tmp read_input_files : 3999000 items read file time : 0.173 secs roman_to_dec time : 0.086 secs output time : 0.093 secs total time : 0.353 secs $ diff f.tmp roman.tmp [download] Update: with marioroy `rtoa-pgatram-fixed2` below (without `fast_io`): `$ ./rtoa-pgatram-fixed2 t1.txt >f.tmp read_input_files : 3999000 items read file time : 0.189 secs roman_to_dec time : 0.371 secs total time : 0.560 secs $ diff f.tmp roman.tmp` [download] ... with `fast_io`: `$ ./rtoa-pgatram-fixed2 t1.txt >f.tmp read_input_files : 3999000 items read file time : 0.178 secs roman_to_dec time : 0.143 secs total time : 0.322 secs $ diff f.tmp roman.tmp` [download] rtoa-pgatram-fixed.cpp // rtoa-pgatram-fixed.cpp. Crude fixed length string version. // Compile with: // g++ -o rtoa-pgatram-fixed -std=c++20 -Wall -O3 rtoa-pgatram-fixed +.cpp // or: // clang++ -o rtoa-pgatram-fixed -std=c++20 -Wall -O3 rtoa-pgatram-f +ixed.cpp // or: // g++ -o rtoa-pgatram-fixed -std=c++20 -Wall -O3 -I "$HOME/local-fa +st_io/fast_io/include" rtoa-pgatram-fixed.cpp // to use the locally installed fast_io header-only library #include <cctype> #include <cstring> #include <iostream> #include <string> #include <vector> #include <numeric> #include <chrono> #include <iomanip> // See [id://11149504] for more info on the fast_io C++ library // #include <fast_io.h> // ------------------------------------------------------------------- +--------- typedef std::chrono::high_resolution_clock high_resolution_clock; typedef std::chrono::high_resolution_clock::time_point time_point; typedef std::chrono::milliseconds milliseconds; double elaspe_time( time_point cend, time_point cstart) { return double ( std::chrono::duration_cast<milliseconds>(cend - cstart).count() ) * 1e-3; } // ------------------------------------------------------------------- +--------- // Longest roman numeral is MMMDCCCLXXXVIII (3888) of length 15 // XXX: I'm off by one somewhere because 3888 fails with // MAX_STR_LEN_L of 16 but works with 17 #define MAX_STR_LEN_L 17 // The basic idea is to keep this struct small and without pointers to // improve data locality/cache performance when traversing the vector struct str_type { char slen; char str[MAX_STR_LEN_L]; }; using vec_str_type = std::vector<str_type>; using vec_int_type = std::vector<int>; // Read an input file of Roman Numerals and append them to a list // Return the number of Roman Numerals appended static int read_input_file( const char* fname, // in: file name containing a list of +Roman Numerals vec_str_type& vec_ret) // out: a vector of Roman Numeral strin +gs { FILE* fh; str_type line; int cnt = 0; fh = ::fopen(fname, "r"); if ( fh == NULL ) { std::cerr << "Error opening '" << fname << "' : " << strerror(er +rno) << "\n"; return 0; } while ( ::fgets( line.str, MAX_STR_LEN_L, fh ) != NULL ) { line.slen = ::strlen(line.str) - 1; // -1 to strip trailing n +ewline vec_ret.emplace_back(line); ++cnt; } ::fclose(fh); return cnt; } // --------------------------------------------------------------- // Though there are less than 256 initializers in this ascii table, // the others are guaranteed by ANSI C to be initialized to zero. static const int romtab[256] = { 0,0,0,0,0,0, 0, 0, 0, 0, // 00- 09 0,0,0,0,0,0, 0, 0, 0, 0, // 10- 19 0,0,0,0,0,0, 0, 0, 0, 0, // 20- 29 0,0,0,0,0,0, 0, 0, 0, 0, // 30- 39 0,0,0,0,0,0, 0, 0, 0, 0, // 40- 49 0,0,0,0,0,0, 0, 0, 0, 0, // 50- 59 0,0,0,0,0,0, 0, 100, 500, 0, // 60- 69 0,0,0,1,0,0, 50,1000, 0, 0, // 70- 79 0,0,0,0,0,0, 5, 0, 10, 0, // 80- 89 0,0,0,0,0,0, 0, 0, 0, 100, // 90- 99 500,0,0,0,0,1, 0, 0, 50,1000, // 100-109 0,0,0,0,0,0, 0, 0, 5, 0, // 110-119 10,0,0,0,0,0, 0, 0, 0, 0 // 120-129 }; // Return the arabic number for a roman letter c. // Return zero if the roman letter c is invalid. inline int urtoa(int c) { return romtab[c]; } inline int accfn(int t, char c) { return t + urtoa(c) - t % urtoa(c) * 2; } inline int roman_to_dec(const str_type& st) { return std::accumulate( st.str, st.str + st.slen, 0, accfn ); } int main(int argc, char* argv[]) { if (argc < 2) { std::cerr << "usage: rtoa-pgatram-fixed file...\n"; return 1; } // Get the list of input files from the command line int nfiles = argc - 1; char** fname = &argv[1]; std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fix +ed); time_point cstart1, cend1, cstart2, cend2, cstart3, cend3; // Read the input files into roman_list vec_str_type roman_list; roman_list.reserve(3999000); cstart1 = high_resolution_clock::now(); int cnt = 0; for (int i = 0; i < nfiles; ++i) { cnt += read_input_file( fname[i], roman_list ); } cend1 = high_resolution_clock::now(); double ctaken1 = elaspe_time(cend1, cstart1); std::cerr << "read_input_files : " << cnt << " items\n"; std::cerr << "read file time : " << std::setw(8) << ctaken1 << " + secs\n"; // Convert roman to decimal vec_int_type arabic_list; arabic_list.reserve(3999000); cstart2 = high_resolution_clock::now(); for ( auto const& r : roman_list ) { arabic_list.emplace_back( roman_to_dec(r) ); } cend2 = high_resolution_clock::now(); double ctaken2 = elaspe_time(cend2, cstart2); std::cerr << "roman_to_dec time : " << std::setw(8) << ctaken2 << " + secs\n"; // Output to stdout cstart3 = high_resolution_clock::now(); for ( auto const& i : arabic_list ) { std::cout << i << '\n'; // fast_io::io::println(i); } cend3 = high_resolution_clock::now(); double ctaken3 = elaspe_time(cend3, cstart3); std::cerr << "output time : " << std::setw(8) << ctaken3 << " + secs\n"; double ctaken = elaspe_time(cend3, cstart1); std::cerr << "total time : " << std::setw(8) << ctaken << +" secs\n"; return 0; } [download]	[reply] [d/l] [select]
Re^5: Risque Romantic Rosetta Roman Race - MCE Results on AMD Box by marioroy (Prior) on May 13, 2023 at 10:32 UTC
Re^6: Risque Romantic Rosetta Roman Race - OpenMP by eyepopslikeamosquito (Archbishop) on May 14, 2023 at 04:17 UTC
Re^6: Risque Romantic Rosetta Roman Race - All in One by eyepopslikeamosquito (Archbishop) on May 15, 2023 at 03:33 UTC
Some notes below your chosen depth have not been shown here