in reply to Re^8: Risque Romantic Rosetta Roman Race - All in One
in thread Risque Romantic Rosetta Roman Race

> Sadly, the documentation for fast_io seems to be non-existent and I always struggle to figure out how to use it.

Same here. This took a while. I've been grepping for clues inside the fast_io benchmark and examples folders. To break 0.5 seconds, I updated the allinone2 demonstration to use fast_io memory mapping.

# https://perlmonks.org/?node_id=11152186
$ ./rtoa-pgatram-allinone2 t1.txt t1.txt t1.txt t1.txt | cksum
do_it_all time : 0.515 secs    fast_io scan, line_get
do_it_all time : 0.490 secs    fast_io memory mapping
737201628 75552000

Memory mapping update:

// Read an input file of Roman Numerals and do it all
static void do_it_all(
    std::string_view fname  // in: file name containing a list of Roman Numerals
)
{
    try {
#if 1
        // Load entire file to memory through memory mapping.
        using file_loader_type = fast_io::native_file_loader;
        file_loader_type loader(fname, fast_io::open_mode::in | fast_io::open_mode::follow);

        // Loop through contiguous container of the file.
        for (char const *first{loader.data()}, *last{loader.data()+loader.size()}; first!=last; ) {
            auto start_ptr{first};
            first = fast_io::find_lf(first, last);
            auto end_ptr{first};
            if (start_ptr == end_ptr) { ++first; continue; }  // skip empty line
            int dec = roman_to_dec(std::string_view(start_ptr, end_ptr - start_ptr));
            fast_io::io::println(dec);
            if (first != last) ++first;  // step past the linefeed
        }
#else
        fast_io::filebuf_file fbf(fname, fast_io::open_mode::in | fast_io::open_mode::follow);
        for (std::string line; fast_io::io::scan<true>(fbf, fast_io::mnp::line_get(line)); ) {
            fast_io::io::println(roman_to_dec(line));
        }
#endif
    }
    catch (fast_io::error e) {
        fast_io::io::perrln("Error opening '", fname, "' : ", e);
    }
}

It requires running Perl on Clear Linux to keep up :)

# https://perlmonks.org/?node_id=11152186
$ time ./rtoa-pgatram-allinone2 t1.txt t1.txt t1.txt t1.txt | cksum
do_it_all time : 0.490 secs    fast_io memory mapping
737201628 75552000

real    0m0.492s
user    0m0.471s
sys     0m0.037s

# https://perlmonks.org/?node_id=11152168  max_workers => 32
$ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum
rtoa pgatram start
time 0.480 secs
737201628 75552000

real    0m0.504s
user    0m13.887s
sys     0m0.231s

# https://perlmonks.org/?node_id=11152168  max_workers => 64
$ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum
rtoa pgatram start
time 0.425 secs
737201628 75552000

real    0m0.449s
user    0m21.884s
sys     0m0.592s

Perl MCE is mind-boggling to me because the UNIX time includes everything: the time to launch Perl, load modules, spawn workers, notify workers for each input file, and finally reap workers. The overhead is 0.075 and 0.093 seconds for 32 and 64 workers, respectively. It also includes MCE::Relay, where workers must wait their turn, ordered by chunk_id value, behind the scenes.

I commented out a few lines of code to measure the overhead of running MCE, including MCE::Relay.

user_func => sub {
    my ( $mce, $slurp_ref, $chunk_id, $output ) = ( @_, '' );
  # open my $fh, '<', $slurp_ref;
  # while ( <$fh> ) {
  #     chomp;
  #     my $n = 0;
  #     $n += $_ - $n % $_ * 2 for @rtoa[ unpack 'c*', $_ ];
  #     $output .= "$n\n";
  # }
  # close $fh;

    # output orderly
    MCE::relay {
      # print $output;
    };
}

How much time does Perl MCE take to chunk four input files, including MCE::Relay, with compute and output factored out? Notice how doubling the workers adds less than two hundredths of a second to the UNIX real time. But look at the user and sys times: chunking an input file and relaying happen concurrently behind the scenes, while the relay block itself runs serially.

# max_workers => 32
$ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt
rtoa pgatram start
time 0.052 secs

real    0m0.075s
user    0m0.122s
sys     0m0.170s

# max_workers => 64
$ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt
rtoa pgatram start
time 0.070 secs

real    0m0.093s
user    0m0.195s
sys     0m0.441s

Re^10: Risque Romantic Rosetta Roman Race - All in One - OpenMP
by marioroy (Prior) on May 16, 2023 at 13:39 UTC

    The following is my first draft using OpenMP, dividing the work equally among threads (not chunking). I tried to keep the overhead low, achieving 5x faster compared to Perl MCE. Threads with id greater than 0 store results locally in a buffer for later output, which makes the application memory bound. I saw improvements up to 24 workers on the AMD box, which is what the memory controller can handle.
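
    Not the posted allinone2b source (named below), but a minimal sketch of the equal-division idea: each thread takes an equal slice of the loaded file, rounds its boundaries forward to a linefeed, converts its lines into a local std::string, and the buffers are flushed in thread order afterwards. The roman_to_dec here is a plain lookup version standing in for the table-driven one; compile with g++ -O2 -fopenmp.

    // sketch-equal-division.cpp -- illustration only, not the posted code
    #include <omp.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // plain stand-in for the post's table-driven roman_to_dec
    static int value(char c)
    {
        switch (c) {
            case 'I': return 1;    case 'V': return 5;
            case 'X': return 10;   case 'L': return 50;
            case 'C': return 100;  case 'D': return 500;
            case 'M': return 1000; default:  return 0;
        }
    }
    static int roman_to_dec(const char* s, size_t len)
    {
        int n = 0;
        for (size_t i = 0; i < len; ++i) {
            int v = value(s[i]);
            // subtract when a smaller numeral precedes a larger one (IV, IX, ...)
            n += (i + 1 < len && value(s[i + 1]) > v) ? -v : v;
        }
        return n;
    }

    static void process(const char* data, size_t size, int nthreads)
    {
        std::vector<std::string> out(nthreads);  // one local buffer per thread

        #pragma omp parallel num_threads(nthreads)
        {
            int tid = omp_get_thread_num();
            // round a raw offset forward to just past a linefeed (or 0 / size),
            // so adjacent threads agree on the shared boundary
            auto align = [&](size_t pos) {
                while (pos > 0 && pos < size && data[pos - 1] != '\n') ++pos;
                return pos;
            };
            size_t beg = align(size * size_t(tid) / nthreads);
            size_t end = align(size * size_t(tid + 1) / nthreads);

            std::string& buf = out[tid];
            for (const char* p = data + beg; p < data + end; ) {
                const char* nl = static_cast<const char*>(
                    std::memchr(p, '\n', size_t(data + end - p)));
                size_t len = nl ? size_t(nl - p) : size_t(data + end - p);
                if (len) {
                    buf += std::to_string(roman_to_dec(p, len));
                    buf += '\n';
                }
                p += len + 1;
            }
        }

        for (auto& s : out)  // orderly output: thread 0 first, then 1, 2, ...
            std::fwrite(s.data(), 1, s.size(), stdout);
    }

    int main(int argc, char** argv)
    {
        const char* env = std::getenv("NUM_THREADS");
        int nthreads = env ? std::atoi(env) : 4;
        if (nthreads < 1) nthreads = 1;
        for (int i = 1; i < argc; ++i) {
            std::ifstream in(argv[i], std::ios::binary);
            std::string data((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());
            process(data.data(), data.size(), nthreads);
        }
    }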

    C++ results:

    $ NUM_THREADS=1 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.498 secs
    737201628 75552000
    $ NUM_THREADS=4 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.176 secs
    737201628 75552000
    $ NUM_THREADS=8 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.124 secs
    737201628 75552000
    $ NUM_THREADS=16 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.096 secs
    737201628 75552000
    $ NUM_THREADS=24 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.090 secs
    737201628 75552000
    $ NUM_THREADS=32 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.122 secs
    737201628 75552000

    $ time NUM_THREADS=24 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time : 0.090 secs
    737201628 75552000

    real    0m0.093s
    user    0m1.858s
    sys     0m0.103s

    For comparison, Perl MCE results on Clear Linux:

    # https://perlmonks.org/?node_id=11152168  max_workers => 32
    $ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum
    rtoa pgatram start
    time 0.480 secs
    737201628 75552000

    real    0m0.504s
    user    0m13.887s
    sys     0m0.231s

    # https://perlmonks.org/?node_id=11152168  max_workers => 64
    $ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum
    rtoa pgatram start
    time 0.425 secs
    737201628 75552000

    real    0m0.449s
    user    0m21.884s
    sys     0m0.592s

    OpenMP-aware rtoa-pgatram-allinone2b.cpp:

    This is my 3rd revision. The 1st revision, using std::vector, achieved at most 2x. The 2nd revision, using std::deque, was slower. So I tried again, grepping for "concat" in the fast_io library for use with std::string, and reached 5x.
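
    For illustration, a hedged guess at the shape of that revision (not the posted code; the assumption is that fast_io::concat builds a std::string from its arguments): each thread appends converted values to one flat buffer instead of pushing lines onto a container. The roman_to_dec is the one from the listing above.

    #include <fast_io.h>
    #include <string>
    #include <string_view>

    int roman_to_dec(std::string_view rn);  // as defined in the post

    // Convert one thread's slice [first, last) into a single flat buffer.
    // Assumption: fast_io::concat(...) returns a std::string; appending to
    // one string avoids the per-element overhead of std::vector/std::deque.
    std::string convert_slice(char const* first, char const* last)
    {
        std::string buffer;
        while (first != last) {
            char const* nl = fast_io::find_lf(first, last);
            if (nl != first) {
                buffer.append(fast_io::concat(
                    roman_to_dec(std::string_view(first, nl - first))));
                buffer.push_back('\n');
            }
            first = (nl == last) ? last : nl + 1;
        }
        return buffer;
    }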

    Updated on May 19, 2023

      Thanks to eyepopslikeamosquito for planting the Roman Race saga. :)

      During this journey, I came across the article "Processing a File with OpenMP" on the CPU fun website. I reached out to the website authors and the fast_io author (and eyepopslikeamosquito), sharing results of my follow-on project, Grep Count C++ OpenMP demonstrations, which counts matching lines. While testing the grep-count-pmap variant, I discovered that a faster version is possible using the Portable Memory Mapping C++ class. It scales better than the fast_io memory mapping class.

      Before (fast_io memory mapping):

      $ NUM_THREADS=1 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.498 secs
      737201628 75552000
      $ NUM_THREADS=4 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.176 secs
      737201628 75552000
      $ NUM_THREADS=8 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.124 secs
      737201628 75552000
      $ NUM_THREADS=16 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.096 secs
      737201628 75552000

      After (portable memory mapping):

      $ NUM_THREADS=1 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.488 secs
      737201628 75552000
      $ NUM_THREADS=4 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.143 secs
      737201628 75552000
      $ NUM_THREADS=8 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.091 secs
      737201628 75552000
      $ NUM_THREADS=16 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time : 0.065 secs
      737201628 75552000

      rtoa-pgatram-allinone2c.cpp
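
      The interesting change in the 2c variant is the loader swap. A minimal sketch, assuming the MemoryMapped class from the Portable Memory Mapping library linked above (the constructor, isValid, getData, size, and the SequentialScan hint are that library's interface, quoted from memory):

      #include "MemoryMapped.h"  // Portable Memory Mapping C++ class
      #include <string>

      // Sketch of the loader swap: the pointer-walking loop stays the same
      // as the fast_io version above; only the mapping object changes.
      static void do_it_all(std::string const& fname)
      {
          MemoryMapped data(fname, MemoryMapped::WholeFile,
                            MemoryMapped::SequentialScan);
          if (!data.isValid())
              return;  // could not open or map the file

          char const* first = reinterpret_cast<char const*>(data.getData());
          char const* last  = first + data.size();
          // ... walk [first, last) line by line, exactly as before ...
      }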

        I wanted to come back and provide an MCE-like chunking variant for converting Roman numerals to decimal. It runs faster than the memory mapping solutions when consuming 8 or more threads; a sketch of the chunking idea follows the source link below.

        fast_io memory mapping:

        $ NUM_THREADS=1 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.498 secs
        737201628 75552000
        $ NUM_THREADS=4 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.176 secs
        737201628 75552000
        $ NUM_THREADS=8 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.124 secs
        737201628 75552000
        $ NUM_THREADS=16 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.096 secs
        737201628 75552000

        portable memory mapping:

        $ NUM_THREADS=1 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.488 secs
        737201628 75552000
        $ NUM_THREADS=4 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.143 secs
        737201628 75552000
        $ NUM_THREADS=8 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.091 secs
        737201628 75552000
        $ NUM_THREADS=16 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.065 secs
        737201628 75552000

        MCE-like chunking:

        $ NUM_THREADS=1 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.489 secs
        737201628 75552000
        $ NUM_THREADS=4 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.144 secs
        737201628 75552000
        $ NUM_THREADS=8 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.075 secs
        737201628 75552000
        $ NUM_THREADS=16 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt | cksum
        do_it_all time : 0.048 secs
        737201628 75552000

        rtoa-pgatram-allinone2d.cpp
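
        As a hedged sketch of what MCE-like chunking means here (my own illustration, not the 2d source; process_chunk is a hypothetical helper doing the per-line conversion as in the earlier listings): threads grab fixed-size chunks from a shared offset, and a ticket counter releases their output strictly in chunk order, much like MCE::Relay.

        // sketch of MCE-like chunking with relay-style ordered output
        #include <omp.h>
        #include <algorithm>
        #include <atomic>
        #include <cstdio>
        #include <string>

        // hypothetical helper: converts each line in [first, last), as earlier
        std::string process_chunk(const char* first, const char* last);

        constexpr size_t CHUNK_SIZE = 1u << 20;  // 1 MiB per grab (illustrative)

        void chunked(const char* data, size_t size, int nthreads)
        {
            size_t next_offset = 0;        // both guarded by the critical section
            long   next_ticket = 0;
            std::atomic<long> serving{0};  // relay: next ticket allowed to print

            #pragma omp parallel num_threads(nthreads)
            for (;;) {
                size_t beg, end;
                long ticket;
                // grab a chunk and a ticket together, so ticket order == chunk order
                #pragma omp critical(grab)
                {
                    beg = next_offset;
                    end = std::min(beg + CHUNK_SIZE, size);
                    while (end < size && data[end - 1] != '\n') ++end;  // whole lines
                    next_offset = end;
                    ticket = (beg < end) ? next_ticket++ : -1;
                }
                if (ticket < 0) break;  // input exhausted

                std::string out = process_chunk(data + beg, data + end);

                // wait our turn, like workers relaying by chunk_id in MCE::Relay
                while (serving.load() != ticket) { /* spin; real code would yield */ }
                std::fwrite(out.data(), 1, out.size(), stdout);
                serving.fetch_add(1);
            }
        }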

        The Portable Memory Mapping C++ class does not hamper scaling on a many-core box. To demonstrate this, I use the follow-on project, which counts lines matching a pattern. I begin by creating a 5 GB file in /tmp to compare the performance characteristics of the two memory mapping implementations: fast_io and the portable C++ class. The input file contains more than 100 million lines.

        $ for i in $(seq 1 210); do cat large.txt; done >/tmp/big.txt
        $ ls -lh /tmp/big.txt
        -rw-r--r-- 1 mario mario 5.0G May 19 11:19 /tmp/big.txt

        I compare the MCE egrep.pl example (for fun), grep-count-pcre2 (using fast_io mapping), and grep-count-pmap (using portable mapping). On my system, Perl requires 10 workers to run faster than the system grep command.

        $ time grep -c "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
        2195760

        real    0m8.721s
        user    0m8.317s
        sys     0m0.404s

        The Perl MCE solution catches up by scaling wonderfully.

        $ time ./egrep.pl --max-workers=2 -c "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
        2195760

        N-thds     2          5          10         20         30
        real       0m39.068s  0m16.088s  0m8.146s   0m4.229s   0m3.084s
        user       1m17.614s  1m19.624s  1m20.441s  1m21.823s  1m25.296s
        sys        0m0.419s   0m0.464s   0m0.434s   0m0.526s   0m0.645s

        Next is grep-count-pcre2 using the fast_io memory mapping C++ class. For some reason, this is unable to scale linearly.

        $ time NUM_THREADS=2 ./grep-count-pcre2 "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
        parallel (2) Total Lines: 105000000, Matching Lines: 2195760

        N-thds     2          5          10         20         30
        real       0m9.256s   0m4.382s   0m2.748s   0m1.929s   0m1.689s
        user       0m15.858s  0m16.563s  0m17.267s  0m18.391s  0m19.497s
        sys        0m1.678s   0m1.339s   0m1.186s   0m1.225s   0m1.260s

        Finally, grep-count-pmap (also PCRE2) uses the portable memory mapping C++ class. It scales better, achieving higher performance.

        $ time NUM_THREADS=2 ./grep-count-pmap "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
        parallel (2) Total Lines: 105000000, Matching Lines: 2195760

        N-thds     2          5          10         20         30
        real       0m8.113s   0m3.249s   0m1.675s   0m0.958s   0m0.707s
        user       0m15.333s  0m15.646s  0m15.809s  0m16.775s  0m18.110s
        sys        0m0.840s   0m0.266s   0m0.343s   0m0.489s   0m0.506s

        Consuming 30 CPU cores, grep-count-pcre2 is about 2 times faster than Perl. The portable map solution, grep-count-pmap, is more than 2 times faster than grep-count-pcre2, or over 4 times faster than Perl.

        The following is a Perl MCE-like chunking implementation in C++, including MCE::Relay-style logic for orderly output. The demonstration supports standard input as well as file arguments, and consumes very little memory compared to the memory mapping solutions. The results are similar to grep-count-pmap; notice the lower user time, due to using SIMD for counting the linefeed characters.

        $ time NUM_THREADS=2 ./grep-count-chunk "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
        parallel (2) Total Lines: 105000000, Matching Lines: 2195760

        N-thds     2          5          10         20         30
        real       0m8.017s   0m3.257s   0m1.537s   0m0.801s   0m0.618s
        user       0m14.524s  0m14.997s  0m14.906s  0m15.397s  0m17.519s
        sys        0m1.010s   0m1.019s   0m0.412s   0m0.515s   0m0.649s

        grep-count-chunk.cc
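
        Regarding the SIMD counting: I am not reproducing the grep-count-chunk internals here, but the effect can be approximated portably, since glibc's memchr is itself vectorized. A stand-in sketch:

        #include <cstddef>
        #include <cstring>

        // Count linefeeds in [s, s + n). glibc's memchr is SIMD-vectorized,
        // so this runs at vector speed without explicit intrinsics. A portable
        // stand-in for the counting mentioned above, not the posted code.
        std::size_t count_lf(char const* s, std::size_t n)
        {
            std::size_t count = 0;
            char const* end = s + n;
            while (char const* p = static_cast<char const*>(
                       std::memchr(s, '\n', std::size_t(end - s)))) {
                ++count;
                s = p + 1;
            }
            return count;
        }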