in reply to Re^6: Risque Romantic Rosetta Roman Race - All in One
in thread Risque Romantic Rosetta Roman Race

> be deemed acceptable by marioroy

Yikes! I'm a rookie when it comes to C++ and simply here for the fun and learning.

> Am I missing something?

There's no reason to confirm to fixed length, IMO. I gutted the fixed-length code. It runs faster, completing in 0.490 seconds.

C++ Results:

# https://perlmonks.org/?node_id=11152156 $ ./rtoa-pgatram-fixed t1.txt t1.txt t1.txt t1.txt | cksum read_input_files : 15996000 items read file time : 0.356 secs roman_to_dec time : 0.460 secs output time : 0.124 secs total time : 0.941 secs 737201628 75552000 # https://perlmonks.org/?node_id=11152177 $ NUM_THREADS=4 ./rtoa-pgatram-openmp t1.txt t1.txt t1.txt t1.txt | ck +sum use OpenMP read_input_files : 15996000 items read file time : 0.159 secs roman_to_dec time : 0.469 secs total time : 0.628 secs 737201628 75552000 # https://perlmonks.org/?node_id=11152182 $ ./rtoa-pgatram-allinone t1.txt t1.txt t1.txt t1.txt | cksum do_it_all time : 0.637 secs 737201628 75552000 # https://perlmonks.org/?node_id=11152186 $ ./rtoa-pgatram-allinone2 t1.txt t1.txt t1.txt t1.txt | cksum do_it_all time : 0.515 secs fast_io scan, line_get do_it_all time : 0.490 secs fast_io memory mapping 737201628 75552000

Perl Results:

# https://perlmonks.org/?node_id=11152168 max_workers => 26 $ perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum rtoa pgatram start time 0.658 secs Perl on Fedora Linux 38 time 0.574 secs Perl on Clear Linux 737201628 75552000 # https://perlmonks.org/?node_id=11152168 max_workers => 32 $ perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum rtoa pgatram start time 0.548 secs Perl on Fedora Linux 38 time 0.480 secs Perl on Clear Linux 737201628 75552000

rtoa-pgatram-allinone2.cpp

Updated on May 19, 2023

// rtoa-pgatram-allinone2.cpp. Crude allinone version. // based on rtoa-pgatram-allinone.cpp https://perlmonks.org/?node_id +=11152182 // // Obtain the fast_io library (required dependency): // git clone --depth=1 https://github.com/cppfastio/fast_io // // Compile with g++ or clang++: // clang++ -o rtoa-pgatram-allinone2 -std=c++20 -Wall -O3 -I fast_io +/include rtoa-pgatram-allinone2.cpp #include <cctype> #include <cstring> #include <string> #include <numeric> #include <chrono> #include <thread> #include <iostream> #include <iomanip> // See [id://11149504] for more info on the fast_io C++ library #include <fast_io.h> #include <fast_io_legacy.h> // --------------------------------------------------------------- typedef std::chrono::high_resolution_clock high_resolution_clock; typedef std::chrono::high_resolution_clock::time_point time_point; typedef std::chrono::milliseconds milliseconds; double elaspe_time( time_point cend, time_point cstart) { return double ( std::chrono::duration_cast<milliseconds>(cend - cstart).count() ) * 1e-3; } // --------------------------------------------------------------- // Though there are less than 256 initializers in this ascii table, // the others are guaranteed by ANSI C to be initialized to zero. static const int romtab[256] = { 0,0,0,0,0,0, 0, 0, 0, 0, // 00- 09 0,0,0,0,0,0, 0, 0, 0, 0, // 10- 19 0,0,0,0,0,0, 0, 0, 0, 0, // 20- 29 0,0,0,0,0,0, 0, 0, 0, 0, // 30- 39 0,0,0,0,0,0, 0, 0, 0, 0, // 40- 49 0,0,0,0,0,0, 0, 0, 0, 0, // 50- 59 0,0,0,0,0,0, 0, 100, 500, 0, // 60- 69 0,0,0,1,0,0, 50,1000, 0, 0, // 70- 79 0,0,0,0,0,0, 5, 0, 10, 0, // 80- 89 0,0,0,0,0,0, 0, 0, 0, 100, // 90- 99 500,0,0,0,0,1, 0, 0, 50,1000, // 100-109 0,0,0,0,0,0, 0, 0, 5, 0, // 110-119 10,0,0,0,0,0, 0, 0, 0, 0 // 120-129 }; // Return the arabic number for a roman letter c. // Return zero if the roman letter c is invalid. inline int urtoa(int c) { return romtab[c]; } inline int accfn(int t, char c) { return t + urtoa(c) - t % urtoa(c) * 2; } inline int roman_to_dec(std::string_view st) { return std::accumulate(st.begin(), st.end(), 0, accfn); } // Read an input file of Roman Numerals and do it all static void do_it_all( std::string_view fname // in: file name containing a list of + Roman Numerals ) { try { #if 1 // Load entire file to memory through memory mapping. using file_loader_type = fast_io::native_file_loader; file_loader_type loader(fname, fast_io::open_mode::in | fast_io: +:open_mode::follow); // Loop through contiguous container of the file. for (char const *first{loader.data()}, *last{loader.data()+loade +r.size()}; first<last; ) { auto start_ptr{first}; first = fast_io::find_lf(first, last); auto end_ptr{first}; int dec = roman_to_dec(std::string_view(start_ptr, end_ptr - +start_ptr)); fast_io::io::println(dec); ++first; } #else fast_io::filebuf_file fbf(fname, fast_io::open_mode::in | fast_i +o::open_mode::follow); for (std::string line; fast_io::io::scan<true>(fbf, fast_io::mnp +::line_get(line)); ) { fast_io::io::println(roman_to_dec(line)); } #endif } catch (fast_io::error e) { fast_io::io::perrln("Error opening '", fname, "' : ", e); }; } int main(int argc, char* argv[]) { if (argc < 2) { if (argc > 0) std::cerr << "usage: rtoa-pgatram-allinone2 file... >out.txt\ +n"; return 1; } // Get the list of input files from the command line int nfiles = argc - 1; char** fname = &argv[1]; std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fix +ed); time_point cstartall, cendall; cstartall = high_resolution_clock::now(); for (int i = 0; i < nfiles; ++i) do_it_all( fname[i] ); cendall = high_resolution_clock::now(); double ctakenall = elaspe_time(cendall, cstartall); std::cerr << "do_it_all time : " << std::setw(8) << ctakenall << + " secs\n"; return 0; }

Replies are listed 'Best First'.
Re^8: Risque Romantic Rosetta Roman Race - All in One
by eyepopslikeamosquito (Archbishop) on May 16, 2023 at 00:36 UTC

    Thanks so much for that! And for confirming it is not "cheating" in this performance race to remove all vectors from a solution and write to stdout directly.

    Sadly, the documentation for fast_io seems to be non-existent and I always struggle to figure out how to use it. Hope someone writes a nice fast_io user manual.

    I also get confused as to whether I've built with fast_io or not, so I've created a new version, rtoa-pgatram-allinone3.cpp below, based on your excellent rtoa-pgatram-allinone2.cpp, that prints whether it's using fastio or not (via a USE_FAST_IO_L define). When built to not use fastio this new version uses std C++11 I/O (which in my experience is significantly slower that std C I/O).

    Here are the timings on my laptop.

    $ time ./rtoa-pgatram-allinone3 t1.txt t1.txt t1.txt t1.txt >f4.tmp not using fast_io do_it_all time : 1.936 secs real 0m1.953s user 0m1.832s sys 0m0.121s $ cmp f4.tmp fixed4.tmp

    $ time ./rtoa-pgatram-allinone3 t1.txt t1.txt t1.txt t1.txt >f4.tmp using fast_io do_it_all time : 0.875 secs real 0m0.894s user 0m0.803s sys 0m0.091s $ cmp f4.tmp fixed4.tmp

    Original rtoa-pgatram-allinone from this node using C std I/O for reading files and fast_io for writing to stdout:

    $ time ./rtoa-pgatram-allinone t1.txt t1.txt t1.txt t1.txt >f4.tmp do_it_all time : 1.034 secs real 0m1.049s user 0m0.988s sys 0m0.061s

    // rtoa-pgatram-allinone3.cpp. Crude allinone version. // based on rtoa-pgatram-allinone2.cpp https://perlmonks.org/?node_i +d=11152186 // // Obtain the fast_io library (required dependency): // git clone --depth=1 https://github.com/cppfastio/fast_io // // Compile with g++ or clang++: // g++ -o rtoa-pgatram-allinone3 -std=c++20 -Wall -O3 -I fast_io/inc +lude rtoa-pgatram-allinone3.cpp // Uncomment to use the fast_io C++ library #define USE_FAST_IO_L 1 #include <cctype> #include <cstring> #include <string> #include <numeric> #include <chrono> #include <thread> #include <iostream> #include <iomanip> #ifdef USE_FAST_IO_L // See [id://11149504] for more info on the fast_io C++ library #include <fast_io.h> #include <fast_io_legacy.h> #else #include <fstream> #include <sstream> #endif // --------------------------------------------------------------- typedef std::chrono::high_resolution_clock high_resolution_clock; typedef std::chrono::high_resolution_clock::time_point time_point; typedef std::chrono::milliseconds milliseconds; double elaspe_time( time_point cend, time_point cstart) { return double ( std::chrono::duration_cast<milliseconds>(cend - cstart).count() ) * 1e-3; } // --------------------------------------------------------------- // Though there are less than 256 initializers in this ascii table, // the others are guaranteed by ANSI C to be initialized to zero. static const int romtab[256] = { 0,0,0,0,0,0, 0, 0, 0, 0, // 00- 09 0,0,0,0,0,0, 0, 0, 0, 0, // 10- 19 0,0,0,0,0,0, 0, 0, 0, 0, // 20- 29 0,0,0,0,0,0, 0, 0, 0, 0, // 30- 39 0,0,0,0,0,0, 0, 0, 0, 0, // 40- 49 0,0,0,0,0,0, 0, 0, 0, 0, // 50- 59 0,0,0,0,0,0, 0, 100, 500, 0, // 60- 69 0,0,0,1,0,0, 50,1000, 0, 0, // 70- 79 0,0,0,0,0,0, 5, 0, 10, 0, // 80- 89 0,0,0,0,0,0, 0, 0, 0, 100, // 90- 99 500,0,0,0,0,1, 0, 0, 50,1000, // 100-109 0,0,0,0,0,0, 0, 0, 5, 0, // 110-119 10,0,0,0,0,0, 0, 0, 0, 0 // 120-129 }; // Return the arabic number for a roman letter c. // Return zero if the roman letter c is invalid. inline int urtoa(int c) { return romtab[c]; } inline int accfn(int t, char c) { return t + urtoa(c) - t % urtoa(c) * 2; } inline int roman_to_dec(std::string_view st) { return std::accumulate(st.begin(), st.end(), 0, accfn); } #ifdef USE_FAST_IO_L // Read an input file of Roman Numerals and do it all (fast_io version +) static void do_it_all( std::string_view fname // in: file name containing a list of + Roman Numerals ) { try { fast_io::filebuf_file fbf(fname, fast_io::open_mode::in); for (std::string line; fast_io::io::scan<true>(fbf, fast_io::mnp +::line_get(line)); ) { fast_io::io::println(roman_to_dec(line)); } } catch (fast_io::error e) { fast_io::io::perrln("Error opening '", fname, "' : ", e); }; } #else // Read an input file of Roman Numerals and do it all (std C++ version +) static void do_it_all( const char* fname // in: file name containing a list of Roma +n Numerals ) { std::ifstream infile(fname); if (!infile.is_open()) { std::cerr << "Error opening '" << fname << "'\n"; return; } for (std::string line; std::getline(infile, line); ) { std::cout << roman_to_dec(line) << "\n"; } infile.close(); } #endif int main(int argc, char* argv[]) { if (argc < 2) { if (argc > 0) std::cerr << "usage: rtoa-pgatram-allinone3 file... >out.txt\ +n"; return 1; } #ifdef USE_FAST_IO_L std::cerr << "using fast_io\n"; #else std::cerr << "not using fast_io\n"; #endif // Get the list of input files from the command line int nfiles = argc - 1; char** fname = &argv[1]; std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fix +ed); time_point cstartall, cendall; cstartall = high_resolution_clock::now(); for (int i = 0; i < nfiles; ++i) do_it_all( fname[i] ); cendall = high_resolution_clock::now(); double ctakenall = elaspe_time(cendall, cstartall); std::cerr << "do_it_all time : " << std::setw(8) << ctakenall << + " secs\n"; return 0; }

      > Sadly, the documentation for fast_io seems to be non-existent and I always struggle to figure out how to use it.

      Same here. This took a team. I've been grepping for clues inside the fast_io benchmark and examples folders. To break 0.5 seconds, I updated the allinone2 demonstration with fast_io memory mapping.

      # https://perlmonks.org/?node_id=11152186 $ ./rtoa-pgatram-allinone2 t1.txt t1.txt t1.txt t1.txt | cksum do_it_all time : 0.515 secs fast_io scan, line_get do_it_all time : 0.490 secs fast_io memory mapping 737201628 75552000

      Memory mapping update:

      // Read an input file of Roman Numerals and do it all static void do_it_all( std::string_view fname // in: file name containing a list of + Roman Numerals ) { try { #if 1 // Load entire file to memory through memory mapping. using file_loader_type = fast_io::native_file_loader; file_loader_type loader(fname, fast_io::open_mode::in | fast_io: +:open_mode::follow); // Loop through contiguous container of the file. for (char const *first{loader.data()}, *last{loader.data()+loade +r.size()}; first!=last; ) { auto start_ptr{first}; first = fast_io::find_lf(first, last); + auto end_ptr{first}; if (start_ptr == end_ptr) continue; int dec = roman_to_dec(std::string_view(start_ptr, end_ptr - +start_ptr)); fast_io::io::println(dec); ++first; } #else fast_io::filebuf_file fbf(fname, fast_io::open_mode::in | fast_i +o::open_mode::follow); for (std::string line; fast_io::io::scan<true>(fbf, fast_io::mnp +::line_get(line)); ) { fast_io::io::println(roman_to_dec(line)); } #endif } catch (fast_io::error e) { fast_io::io::perrln("Error opening '", fname, "' : ", e); }; }

      It requires running Perl on Clear Linux to keep up :)

      # https://perlmonks.org/?node_id=11152186 $ time ./rtoa-pgatram-allinone2 t1.txt t1.txt t1.txt t1.txt | cksum do_it_all time : 0.490 secs fast_io memory mapping 737201628 75552000 real 0m0.492s user 0m0.471s sys 0m0.037s # https://perlmonks.org/?node_id=11152168 max_workers => 32 $ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum rtoa pgatram start time 0.480 secs 737201628 75552000 real 0m0.504s user 0m13.887s sys 0m0.231s # https://perlmonks.org/?node_id=11152168 max_workers => 64 $ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum rtoa pgatram start time 0.425 secs 737201628 75552000 real 0m0.449s user 0m21.884s sys 0m0.592s

      Perl MCE is mind-boggling to me. Because, the UNIX time includes all of it; the time to launch Perl, load modules, spawn workers, notify workers per each input file / run, and finally reap workers. The overhead is 0.075 and 0.093 seconds for 32 and 64 workers, respectively. The overhead also includes MCE::Relay where workers must wait their turn orderly by chunk_id value, behind the scene.

      I commented out a few lines of code to measure the overhead time running MCE including MCE::Relay.

      user_func => sub { my ( $mce, $slurp_ref, $chunk_id, $output ) = ( @_, '' ); # open my $fh, '<', $slurp_ref; # while ( <$fh> ) { # chomp; # my $n = 0; # $n += $_ - $n % $_ * 2 for @rtoa[ unpack 'c*', $_ ]; # $output .= "$n\n"; # } # close $fh; # output orderly MCE::relay { # print $output; }; }

      How much time does Perl MCE chunking four input files including MCE::Relay take, factoring out compute and output? Notice how doubling up on workers only adds less than 2 hundreds of a second to the UNIX real time. But notice the user and sys times. Chunking an input file and involving relay occur simultaneously behind the scene. The relay block runs serially.

      # max_workers => 32 $ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt rtoa pgatram start time 0.052 secs real 0m0.075s user 0m0.122s sys 0m0.170s # max_workers => 64 $ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt rtoa pgatram start time 0.070 secs real 0m0.093s user 0m0.195s sys 0m0.441s

        The following is my first draft using OpenMP, dividing the work equally (though, not chunking). I tried keeping the overhead low, achieving 5x faster compared to Perl MCE. Threads ids greater than 0 store locally to a buffer for later output, causing the application to become memory bound. Well, I saw improvements up to 24 workers on the AMD box which is what the memory controller can handle.

        C++ results:

        $ NUM_THREADS=1 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.498 secs 737201628 75552000 $ NUM_THREADS=4 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.176 secs 737201628 75552000 $ NUM_THREADS=8 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.124 secs 737201628 75552000 $ NUM_THREADS=16 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.096 secs 737201628 75552000 $ NUM_THREADS=24 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.090 secs 737201628 75552000 $ NUM_THREADS=32 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.122 secs 737201628 75552000 $ time NUM_THREADS=24 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t +1.txt | cksum do_it_all time : 0.090 secs 737201628 75552000 real 0m0.093s user 0m1.858s sys 0m0.103s

        For comparison, Perl MCE results on Clear Linux:

        # https://perlmonks.org/?node_id=11152168 max_workers => 32 $ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum rtoa pgatram start time 0.480 secs 737201628 75552000 real 0m0.504s user 0m13.887s sys 0m0.231s # https://perlmonks.org/?node_id=11152168 max_workers => 64 $ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum rtoa pgatram start time 0.425 secs Perl 737201628 75552000 real 0m0.449s user 0m21.884s sys 0m0.592s

        OpenMP-aware rtoa-pgatram-allinone2b.cpp:

        This is my 3rd revision. The 1st revision using std::vector achieved max 2x. The 2nd revision using std::deque was slower. So I tried again, grepping for "concat" in the fast_io library for use with std::string, reaching 5x.

        Updated on May 19, 2023