in reply to Re^8: Risque Romantic Rosetta Roman Race - All in One
in thread Risque Romantic Rosetta Roman Race

> Sadly, the documentation for fast_io seems to be non-existent and I always struggle to figure out how to use it.

Same here. This took a while. I've been grepping for clues inside the fast_io benchmark and examples folders. To break 0.5 seconds, I updated the allinone2 demonstration to use fast_io memory mapping.

# https://perlmonks.org/?node_id=11152186
$ ./rtoa-pgatram-allinone2 t1.txt t1.txt t1.txt t1.txt | cksum
do_it_all time : 0.515 secs    fast_io scan, line_get
do_it_all time : 0.490 secs    fast_io memory mapping
737201628 75552000

Memory mapping update:

// Read an input file of Roman Numerals and do it all
static void do_it_all(
    std::string_view fname  // in: file name containing a list of Roman Numerals
)
{
    try {
#if 1
        // Load entire file to memory through memory mapping.
        using file_loader_type = fast_io::native_file_loader;
        file_loader_type loader(fname, fast_io::open_mode::in | fast_io::open_mode::follow);

        // Loop through contiguous container of the file.
        for (char const *first{loader.data()}, *last{loader.data()+loader.size()}; first!=last; ) {
            auto start_ptr{first};
            first = fast_io::find_lf(first, last);
            auto end_ptr{first};
            if (start_ptr == end_ptr) { ++first; continue; }  // skip empty line
            int dec = roman_to_dec(std::string_view(start_ptr, end_ptr - start_ptr));
            fast_io::io::println(dec);
            if (first != last) ++first;  // step past the linefeed
        }
#else
        fast_io::filebuf_file fbf(fname, fast_io::open_mode::in | fast_io::open_mode::follow);
        for (std::string line; fast_io::io::scan<true>(fbf, fast_io::mnp::line_get(line)); ) {
            fast_io::io::println(roman_to_dec(line));
        }
#endif
    }
    catch (fast_io::error e) {
        fast_io::io::perrln("Error opening '", fname, "' : ", e);
    }
}

It requires running Perl on Clear Linux to keep up :)

# https://perlmonks.org/?node_id=11152186
$ time ./rtoa-pgatram-allinone2 t1.txt t1.txt t1.txt t1.txt | cksum
do_it_all time : 0.490 secs    fast_io memory mapping
737201628 75552000

real    0m0.492s
user    0m0.471s
sys     0m0.037s

# https://perlmonks.org/?node_id=11152168  max_workers => 32
$ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum
rtoa pgatram start
time 0.480 secs
737201628 75552000

real    0m0.504s
user    0m13.887s
sys     0m0.231s

# https://perlmonks.org/?node_id=11152168  max_workers => 64
$ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum
rtoa pgatram start
time 0.425 secs
737201628 75552000

real    0m0.449s
user    0m21.884s
sys     0m0.592s

Perl MCE is mind-boggling to me because the UNIX time includes everything: the time to launch Perl, load modules, spawn workers, notify workers for each input file, and finally reap workers. The overhead is 0.075 and 0.093 seconds for 32 and 64 workers, respectively. It also includes MCE::Relay, where workers must wait their turn, ordered by chunk_id value, behind the scenes.

I commented out a few lines of code to measure the overhead of running MCE, including MCE::Relay.

user_func => sub {
    my ( $mce, $slurp_ref, $chunk_id, $output ) = ( @_, '' );
  # open my $fh, '<', $slurp_ref;
  # while ( <$fh> ) {
  #     chomp;
  #     my $n = 0;
  #     $n += $_ - $n % $_ * 2 for @rtoa[ unpack 'c*', $_ ];
  #     $output .= "$n\n";
  # }
  # close $fh;

    # output orderly
    MCE::relay {
      # print $output;
    };
}

How much time does Perl MCE take to chunk four input files, including MCE::Relay, with compute and output factored out? Notice how doubling the workers adds less than two hundredths of a second to the UNIX real time. But look at the user and sys times: chunking an input file and relaying happen concurrently behind the scenes, while the relay block itself runs serially.

# max_workers => 32
$ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt
rtoa pgatram start
time 0.052 secs

real    0m0.075s
user    0m0.122s
sys     0m0.170s

# max_workers => 64
$ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt
rtoa pgatram start
time 0.070 secs

real    0m0.093s
user    0m0.195s
sys     0m0.441s

Re^10: Risque Romantic Rosetta Roman Race - All in One - OpenMP
by marioroy (Prior) on May 16, 2023 at 13:39 UTC

    The following is my first draft using OpenMP, dividing the work equally among threads (not chunking). I tried to keep the overhead low, achieving 5x faster compared to Perl MCE. Threads with id greater than 0 store results locally in a buffer for later output, which makes the application memory bound. I saw improvements up to 24 workers on the AMD box, which is what the memory controller can handle.
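
    Not the posted allinone2b source (named below), but a minimal sketch of the equal-division idea: each thread takes an equal slice of the loaded file, rounds its boundaries forward to a linefeed, converts its lines into a local std::string, and the buffers are flushed in thread order afterwards. The roman_to_dec here is a plain lookup version standing in for the table-driven one; compile with g++ -O2 -fopenmp.

    // sketch-equal-division.cpp -- illustration only, not the posted code
    #include <omp.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // plain stand-in for the post's table-driven roman_to_dec
    static int value(char c)
    {
        switch (c) {
            case 'I': return 1;    case 'V': return 5;
            case 'X': return 10;   case 'L': return 50;
            case 'C': return 100;  case 'D': return 500;
            case 'M': return 1000; default:  return 0;
        }
    }
    static int roman_to_dec(const char* s, size_t len)
    {
        int n = 0;
        for (size_t i = 0; i < len; ++i) {
            int v = value(s[i]);
            // subtract when a smaller numeral precedes a larger one (IV, IX, ...)
            n += (i + 1 < len && value(s[i + 1]) > v) ? -v : v;
        }
        return n;
    }

    static void process(const char* data, size_t size, int nthreads)
    {
        std::vector<std::string> out(nthreads);  // one local buffer per thread

        #pragma omp parallel num_threads(nthreads)
        {
            int tid = omp_get_thread_num();
            // round a raw offset forward to just past a linefeed (or 0 / size),
            // so adjacent threads agree on the shared boundary
            auto align = [&](size_t pos) {
                while (pos > 0 && pos < size && data[pos - 1] != '\n') ++pos;
                return pos;
            };
            size_t beg = align(size * size_t(tid) / nthreads);
            size_t end = align(size * size_t(tid + 1) / nthreads);

            std::string& buf = out[tid];
            for (const char* p = data + beg; p < data + end; ) {
                const char* nl = static_cast<const char*>(
                    std::memchr(p, '\n', size_t(data + end - p)));
                size_t len = nl ? size_t(nl - p) : size_t(data + end - p);
                if (len) {
                    buf += std::to_string(roman_to_dec(p, len));
                    buf += '\n';
                }
                p += len + 1;
            }
        }

        for (auto& s : out)  // orderly output: thread 0 first, then 1, 2, ...
            std::fwrite(s.data(), 1, s.size(), stdout);
    }

    int main(int argc, char** argv)
    {
        const char* env = std::getenv("NUM_THREADS");
        int nthreads = env ? std::atoi(env) : 4;
        if (nthreads < 1) nthreads = 1;
        for (int i = 1; i < argc; ++i) {
            std::ifstream in(argv[i], std::ios::binary);
            std::string data((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());
            process(data.data(), data.size(), nthreads);
        }
    }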

    C++ results:

    $ NUM_THREADS=1 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.498 secs
    737201628 75552000
    $ NUM_THREADS=4 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.176 secs
    737201628 75552000
    $ NUM_THREADS=8 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.124 secs
    737201628 75552000
    $ NUM_THREADS=16 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.096 secs
    737201628 75552000
    $ NUM_THREADS=24 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.090 secs
    737201628 75552000
    $ NUM_THREADS=32 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.122 secs
    737201628 75552000

    $ time NUM_THREADS=24 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.090 secs
    737201628 75552000

    real    0m0.093s
    user    0m1.858s
    sys     0m0.103s

    For comparison, Perl MCE results on Clear Linux:

    # https://perlmonks.org/?node_id=11152168  max_workers => 32
    $ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum
    rtoa pgatram start
    time 0.480 secs
    737201628 75552000

    real    0m0.504s
    user    0m13.887s
    sys     0m0.231s

    # https://perlmonks.org/?node_id=11152168  max_workers => 64
    $ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum
    rtoa pgatram start
    time 0.425 secs
    737201628 75552000

    real    0m0.449s
    user    0m21.884s
    sys     0m0.592s

    OpenMP-aware rtoa-pgatram-allinone2b.cpp:

    This is my 3rd revision. The 1st revision, using std::vector, achieved at most 2x. The 2nd revision, using std::deque, was slower. So I tried again, grepping for "concat" in the fast_io library for use with std::string, and reached 5x.
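
    For illustration, a hedged guess at the shape of that revision (not the posted code; the assumption is that fast_io::concat builds a std::string from its arguments): each thread appends converted values to one flat buffer instead of pushing lines onto a container. The roman_to_dec is the one from the listing above.

    #include <fast_io.h>
    #include <string>
    #include <string_view>

    int roman_to_dec(std::string_view rn);  // as defined in the post

    // Convert one thread's slice [first, last) into a single flat buffer.
    // Assumption: fast_io::concat(...) returns a std::string; appending to
    // one string avoids the per-element overhead of std::vector/std::deque.
    std::string convert_slice(char const* first, char const* last)
    {
        std::string buffer;
        while (first != last) {
            char const* nl = fast_io::find_lf(first, last);
            if (nl != first) {
                buffer.append(fast_io::concat(
                    roman_to_dec(std::string_view(first, nl - first))));
                buffer.push_back('\n');
            }
            first = (nl == last) ? last : nl + 1;
        }
        return buffer;
    }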

    Updated on May 19, 2023

      Thanks to eyepopslikeamosquito for planting the Roman Race saga. :)

      During this journey, I came across the article "Processing a File with OpenMP" on the CPU fun website. I reached out to the website authors and the fast_io author (and eyepopslikeamosquito), sharing results of my follow-on project, Grep Count C++ OpenMP demonstrations, which counts matching lines. While testing the grep-count-pmap variant, I discovered that a faster version is possible using the Portable Memory Mapping C++ class. It scales better than the fast_io memory mapping class.

      Before (fast_io memory mapping):

      $ NUM_THREADS=1 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.498 secs
      737201628 75552000
      $ NUM_THREADS=4 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.176 secs
      737201628 75552000
      $ NUM_THREADS=8 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.124 secs
      737201628 75552000
      $ NUM_THREADS=16 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.096 secs
      737201628 75552000

      After (portable memory mapping):

      $ NUM_THREADS=1 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.488 secs
      737201628 75552000
      $ NUM_THREADS=4 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.143 secs
      737201628 75552000
      $ NUM_THREADS=8 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.091 secs
      737201628 75552000
      $ NUM_THREADS=16 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.065 secs
      737201628 75552000

      rtoa-pgatram-allinone2c.cpp
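
      The interesting change in the 2c variant is the loader swap. A minimal sketch, assuming the MemoryMapped class from the Portable Memory Mapping library linked above (the constructor, isValid, getData, size, and the SequentialScan hint are that library's interface, quoted from memory):

      #include "MemoryMapped.h"  // Portable Memory Mapping C++ class
      #include <string>

      // Sketch of the loader swap: the pointer-walking loop stays the same
      // as the fast_io version above; only the mapping object changes.
      static void do_it_all(std::string const& fname)
      {
          MemoryMapped data(fname, MemoryMapped::WholeFile,
                            MemoryMapped::SequentialScan);
          if (!data.isValid())
              return;  // could not open or map the file

          char const* first = reinterpret_cast<char const*>(data.getData());
          char const* last  = first + data.size();
          // ... walk [first, last) line by line, exactly as before ...
      }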

        I wanted to come back and provide an MCE-like chunking variant for converting Roman numerals to decimal. It runs faster than the memory mapping solutions when consuming 8 or more threads; a sketch of the chunking idea follows the source link below.

        fast_io memory mapping:

        $ NUM_THREADS=1 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.498 secs
        737201628 75552000
        $ NUM_THREADS=4 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.176 secs
        737201628 75552000
        $ NUM_THREADS=8 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.124 secs
        737201628 75552000
        $ NUM_THREADS=16 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.096 secs
        737201628 75552000

        portable memory mapping:

        $ NUM_THREADS=1 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.488 secs
        737201628 75552000
        $ NUM_THREADS=4 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.143 secs
        737201628 75552000
        $ NUM_THREADS=8 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.091 secs
        737201628 75552000
        $ NUM_THREADS=16 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.065 secs
        737201628 75552000

        MCE-like chunking:

        $ NUM_THREADS=1 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.489 secs
        737201628 75552000
        $ NUM_THREADS=4 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.144 secs
        737201628 75552000
        $ NUM_THREADS=8 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.075 secs
        737201628 75552000
        $ NUM_THREADS=16 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.048 secs
        737201628 75552000

        rtoa-pgatram-allinone2d.cpp
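
        As a hedged sketch of what MCE-like chunking means here (my own illustration, not the 2d source; process_chunk is a hypothetical helper doing the per-line conversion as in the earlier listings): threads grab fixed-size chunks from a shared offset, and a ticket counter releases their output strictly in chunk order, much like MCE::Relay.

        // sketch of MCE-like chunking with relay-style ordered output
        #include <omp.h>
        #include <algorithm>
        #include <atomic>
        #include <cstdio>
        #include <string>

        // hypothetical helper: converts each line in [first, last), as earlier
        std::string process_chunk(const char* first, const char* last);

        constexpr size_t CHUNK_SIZE = 1u << 20;  // 1 MiB per grab (illustrative)

        void chunked(const char* data, size_t size, int nthreads)
        {
            size_t next_offset = 0;        // both guarded by the critical section
            long   next_ticket = 0;
            std::atomic<long> serving{0};  // relay: next ticket allowed to print

            #pragma omp parallel num_threads(nthreads)
            for (;;) {
                size_t beg, end;
                long ticket;
                // grab a chunk and a ticket together, so ticket order == chunk order
                #pragma omp critical(grab)
                {
                    beg = next_offset;
                    end = std::min(beg + CHUNK_SIZE, size);
                    while (end < size && data[end - 1] != '\n') ++end;  // whole lines
                    next_offset = end;
                    ticket = (beg < end) ? next_ticket++ : -1;
                }
                if (ticket < 0) break;  // input exhausted

                std::string out = process_chunk(data + beg, data + end);

                // wait our turn, like workers relaying by chunk_id in MCE::Relay
                while (serving.load() != ticket) { /* spin; real code would yield */ }
                std::fwrite(out.data(), 1, out.size(), stdout);
                serving.fetch_add(1);
            }
        }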

        The Portable Memory Mapping C++ class does not hamper scaling on a many-core box. To demonstrate this, I use the follow-on project, which counts lines matching a pattern. I begin by creating a 5 GB file in /tmp to compare the performance characteristics of the two memory mapping implementations: fast_io and the portable C++ class. The input file contains more than 100 million lines.

        $ for i in $(seq 1 210); do cat large.txt; done >/tmp/big.txt
        $ ls -lh /tmp/big.txt
        -rw-r--r-- 1 mario mario 5.0G May 19 11:19 /tmp/big.txt

        I compare the MCE egrep.pl example (for fun), grep-count-pcre2 (using fast_io mapping), and grep-count-pmap (using portable mapping). On my system, Perl requires 10 workers to run faster than the system grep command.

        $ time grep -c "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
        2195760

        real    0m8.721s
        user    0m8.317s
        sys     0m0.404s

        The Perl MCE solution catches up by scaling wonderfully.

        $ time ./egrep.pl --max-workers=2 -c "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
        2195760

        N-thds     2          5          10         20         30
        real       0m39.068s  0m16.088s  0m8.146s   0m4.229s   0m3.084s
        user       1m17.614s  1m19.624s  1m20.441s  1m21.823s  1m25.296s
        sys        0m0.419s   0m0.464s   0m0.434s   0m0.526s   0m0.645s

        Next is grep-count-pcre2 using the fast_io memory mapping C++ class. For some reason, this is unable to scale linearly.

        $ time NUM_THREADS=2 ./grep-count-pcre2 "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
        parallel (2) Total Lines: 105000000, Matching Lines: 2195760

        N-thds     2          5          10         20         30
        real       0m9.256s   0m4.382s   0m2.748s   0m1.929s   0m1.689s
        user       0m15.858s  0m16.563s  0m17.267s  0m18.391s  0m19.497s
        sys        0m1.678s   0m1.339s   0m1.186s   0m1.225s   0m1.260s

        Finally, grep-count-pmap (also PCRE2) uses the portable memory mapping C++ class. It scales better, achieving higher performance.

        $ time NUM_THREADS=2 ./grep-count-pmap "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
        parallel (2) Total Lines: 105000000, Matching Lines: 2195760

        N-thds     2          5          10         20         30
        real       0m8.113s   0m3.249s   0m1.675s   0m0.958s   0m0.707s
        user       0m15.333s  0m15.646s  0m15.809s  0m16.775s  0m18.110s
        sys        0m0.840s   0m0.266s   0m0.343s   0m0.489s   0m0.506s

        Consuming 30 CPU cores, grep-count-pcre2 is about 2 times faster than Perl. The portable map solution, grep-count-pmap, is more than 2 times faster than grep-count-pcre2, or over 4 times faster than Perl.

        The following is a Perl MCE-like chunking implementation in C++, including MCE::Relay-style logic for orderly output. The demonstration supports standard input as well as file arguments, and consumes very little memory compared to the memory mapping solutions. The results are similar to grep-count-pmap; notice the lower user time, due to using SIMD for counting the linefeed characters.

        $ time NUM_THREADS=2 ./grep-count-chunk "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
        parallel (2) Total Lines: 105000000, Matching Lines: 2195760

        N-thds     2          5          10         20         30
        real       0m8.017s   0m3.257s   0m1.537s   0m0.801s   0m0.618s
        user       0m14.524s  0m14.997s  0m14.906s  0m15.397s  0m17.519s
        sys        0m1.010s   0m1.019s   0m0.412s   0m0.515s   0m0.649s

        grep-count-chunk.cc
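
        Regarding the SIMD counting: I am not reproducing the grep-count-chunk internals here, but the effect can be approximated portably, since glibc's memchr is itself vectorized. A stand-in sketch:

        #include <cstddef>
        #include <cstring>

        // Count linefeeds in [s, s + n). glibc's memchr is SIMD-vectorized,
        // so this runs at vector speed without explicit intrinsics. A portable
        // stand-in for the counting mentioned above, not the posted code.
        std::size_t count_lf(char const* s, std::size_t n)
        {
            std::size_t count = 0;
            char const* end = s + n;
            while (char const* p = static_cast<char const*>(
                       std::memchr(s, '\n', std::size_t(end - s)))) {
                ++count;
                s = p + 1;
            }
            return count;
        }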