in reply to Re^9: Risque Romantic Rosetta Roman Race - All in One
in thread Risque Romantic Rosetta Roman Race

The following is my first draft using OpenMP, dividing the work equally among threads (though not chunking). I tried to keep the overhead low, reaching 5x the speed of Perl MCE. Thread ids greater than 0 store output locally to a buffer for later emission, which makes the application memory bound. I saw improvements up to 24 workers on the AMD box, which is what the memory controller can handle.

C++ results:

$ NUM_THREADS=1 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
do_it_all time :    0.498 secs
737201628 75552000
$ NUM_THREADS=4 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
do_it_all time :    0.176 secs
737201628 75552000
$ NUM_THREADS=8 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
do_it_all time :    0.124 secs
737201628 75552000
$ NUM_THREADS=16 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
do_it_all time :    0.096 secs
737201628 75552000
$ NUM_THREADS=24 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
do_it_all time :    0.090 secs
737201628 75552000
$ NUM_THREADS=32 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
do_it_all time :    0.122 secs
737201628 75552000
$ time NUM_THREADS=24 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
do_it_all time :    0.090 secs
737201628 75552000

real 0m0.093s
user 0m1.858s
sys 0m0.103s

For comparison, Perl MCE results on Clear Linux:

# https://perlmonks.org/?node_id=11152168  max_workers => 32
$ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum
rtoa pgatram start time 0.480 secs
737201628 75552000

real 0m0.504s
user 0m13.887s
sys 0m0.231s

# https://perlmonks.org/?node_id=11152168  max_workers => 64
$ time perl rtoa-pgatram-mce.pl t1.txt t1.txt t1.txt t1.txt | cksum
rtoa pgatram start time 0.425 secs
737201628 75552000

real 0m0.449s
user 0m21.884s
sys 0m0.592s

OpenMP-aware rtoa-pgatram-allinone2b.cpp:

This is my 3rd revision. The 1st revision, using std::vector, achieved at most 2x. The 2nd revision, using std::deque, was slower. So I tried again, grepping for "concat" in the fast_io library for use with std::string, and reached 5x.

Updated on May 19, 2023

// rtoa-pgatram-allinone2b.cpp. Crude allinone version.
// based on rtoa-pgatram-allinone2.cpp https://perlmonks.org/?node_id=11152186
//
// Obtain the fast_io library (required dependency):
//    git clone --depth=1 https://github.com/cppfastio/fast_io
//
// Compile with g++ or clang++:
//    clang++ -o rtoa-pgatram-allinone2b -std=c++20 -fopenmp -Wall -O3 -I fast_io/include rtoa-pgatram-allinone2b.cpp
//
// OpenMP Little Book:
//    https://nanxiao.gitbooks.io/openmp-little-book/content/

#include <chrono>
#include <cstdlib>
#include <cstring>
#include <string>
#include <string_view>
#include <numeric>
#include <thread>

#ifdef _OPENMP
#include <omp.h>
#endif

#include <iostream>
#include <iomanip>

// See [id://11149504] for more info on the fast_io C++ library
#include <fast_io.h>
#include <fast_io_legacy.h>

// ---------------------------------------------------------------

typedef std::chrono::high_resolution_clock high_resolution_clock;
typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::chrono::milliseconds milliseconds;

double elapse_time(time_point cend, time_point cstart) {
   return double(
      std::chrono::duration_cast<milliseconds>(cend - cstart).count()
   ) * 1e-3;
}

// ---------------------------------------------------------------

// Though there are fewer than 256 initializers in this ascii table,
// the rest are guaranteed by ANSI C to be initialized to zero.

static const int romtab[256] = {
     0,0,0,0,0,0,   0,   0,   0,   0,  //  00- 09
     0,0,0,0,0,0,   0,   0,   0,   0,  //  10- 19
     0,0,0,0,0,0,   0,   0,   0,   0,  //  20- 29
     0,0,0,0,0,0,   0,   0,   0,   0,  //  30- 39
     0,0,0,0,0,0,   0,   0,   0,   0,  //  40- 49
     0,0,0,0,0,0,   0,   0,   0,   0,  //  50- 59
     0,0,0,0,0,0,   0, 100, 500,   0,  //  60- 69
     0,0,0,1,0,0,  50,1000,   0,   0,  //  70- 79
     0,0,0,0,0,0,   5,   0,  10,   0,  //  80- 89
     0,0,0,0,0,0,   0,   0,   0, 100,  //  90- 99
   500,0,0,0,0,1,   0,   0,  50,1000,  // 100-109
     0,0,0,0,0,0,   0,   0,   5,   0,  // 110-119
    10,0,0,0,0,0,   0,   0,   0,   0   // 120-129
};

// Return the arabic number for a roman letter c.
// Return zero if the roman letter c is invalid.
inline int urtoa(int c) { return romtab[c]; }

inline int accfn(int t, char c) {
   return t + urtoa(c) - t % urtoa(c) * 2;
}

inline int roman_to_dec(std::string_view st) {
   return std::accumulate(st.begin(), st.end(), 0, accfn);
}

// ---------------------------------------------------------------

inline constexpr auto MIN_CHUNK_SIZE = 1048576;

// Helper function to round up the result of integer division.
static std::size_t divide_up(std::size_t dividend, std::size_t divisor) {
   if (dividend % divisor)
      return dividend / divisor + 1;
   else
      return dividend / divisor;
}

// Read an input file of Roman Numerals and do it all.
static void do_it_all(
   std::string_view fname,  // in: file name containing a list of Roman Numerals
   int nthds                // in: number of threads
)
{
   try {
      // Load the entire file to memory through memory mapping.
      using file_loader_type = fast_io::native_file_loader;
      file_loader_type loader(fname, fast_io::open_mode::in | fast_io::open_mode::follow);

      // Loop through the contiguous file container.
      // Each thread processes at most 1 chunk.
      std::size_t chunk_size = divide_up(loader.size(), nthds * 1);
      if (chunk_size < MIN_CHUNK_SIZE)
         chunk_size = MIN_CHUNK_SIZE;
      std::size_t num_chunks = divide_up(loader.size(), chunk_size);
      char const *end{loader.data() + loader.size() - 1};
      fast_io::out_buf_type obf{fast_io::out()};

      #pragma omp parallel for ordered schedule(static, 1)
      for (std::size_t chunk_id = 0; chunk_id < num_chunks; ++chunk_id) {
         char const *last{loader.data() + loader.size() - 1};
         char const *first{loader.data()};
         std::string buf;

         if (chunk_id < num_chunks - 1) {
            last = first + (chunk_size * (chunk_id + 1) - 1);
            if (*last != '\n')
               last = fast_io::find_lf(last, end);
         }
         if (chunk_id > 0) {
            first += (chunk_size * chunk_id - 1);
            if (*first != '\n')
               first = fast_io::find_lf(first, end);
            ++first;
         }

         while (first <= last) {
            auto start_ptr{first};
            first = fast_io::find_lf(first, last);
            auto end_ptr{first};
            // chunk ids greater than 0 store locally for later output
            int dec = roman_to_dec(std::string_view(start_ptr, end_ptr - start_ptr));
            if (chunk_id)
               buf.append(fast_io::concatln(dec));
            else
               fast_io::io::println(obf, dec);
            ++first;
         }

         #pragma omp ordered
         {
            if (chunk_id)
               fast_io::io::print(obf, buf);
         }
      }
   }
   catch (fast_io::error e) {
      fast_io::io::perrln("Error opening '", fname, "' : ", e);
   }
}

int main(int argc, char* argv[])
{
   if (argc < 2) {
      if (argc > 0)
         std::cerr << "usage: rtoa-pgatram-allinone2b file... >out.txt\n";
      return 1;
   }

   // Get the list of input files from the command line.
   int nfiles = argc - 1;
   char** fname = &argv[1];

   std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fixed);
   time_point cstartall, cendall;
   cstartall = high_resolution_clock::now();

#ifdef _OPENMP
   // Determine the number of threads.
   int nthds = std::thread::hardware_concurrency();
   const char* env_nthds1 = std::getenv("OMP_NUM_THREADS");
   const char* env_nthds2 = std::getenv("NUM_THREADS");
   if (env_nthds1 && strlen(env_nthds1))
      nthds = ::atoi(env_nthds1);
   else if (env_nthds2 && strlen(env_nthds2))
      nthds = ::atoi(env_nthds2);
   omp_set_dynamic(false);
   omp_set_num_threads(nthds);
#else
   int nthds = 1;
#endif

   for (int i = 0; i < nfiles; ++i)
      do_it_all(fname[i], nthds);

   cendall = high_resolution_clock::now();
   double ctakenall = elapse_time(cendall, cstartall);
   std::cerr << "do_it_all time : " << std::setw(8) << ctakenall << " secs\n";

   return 0;
}

Re^11: Risque Romantic Rosetta Roman Race - All in One - OpenMP
by marioroy (Prior) on May 19, 2023 at 13:31 UTC

    Thanks to eyepopslikeamosquito, for planting the Roman Race saga. :)

    During this journey, I came across the CPU fun article "Processing a File with OpenMP". I reached out to the website authors and the fast_io author (and to eyepopslikeamosquito), sharing results from my follow-on project, Grep Count C++ OpenMP demonstrations, which counts matching lines. I made a discovery while testing the grep-count-pmap variant: a faster version is possible using the Portable Memory Mapping C++ class. It scales better than the fast_io memory mapping class.

    Before (fast_io memory mapping):

    $ NUM_THREADS=1 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time :    0.498 secs
    737201628 75552000
    $ NUM_THREADS=4 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time :    0.176 secs
    737201628 75552000
    $ NUM_THREADS=8 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time :    0.124 secs
    737201628 75552000
    $ NUM_THREADS=16 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time :    0.096 secs
    737201628 75552000

    After (portable memory mapping):

    $ NUM_THREADS=1 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time :    0.488 secs
    737201628 75552000
    $ NUM_THREADS=4 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time :    0.143 secs
    737201628 75552000
    $ NUM_THREADS=8 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time :    0.091 secs
    737201628 75552000
    $ NUM_THREADS=16 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
    do_it_all time :    0.065 secs
    737201628 75552000

    rtoa-pgatram-allinone2c.cpp

      I wanted to come back and provide an MCE-like chunking variant for converting Roman Numerals to decimal. It runs faster than the memory-mapping solutions when consuming 8 or more threads.

      fast_io memory mapping:

      $ NUM_THREADS=1 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time :    0.498 secs
      737201628 75552000
      $ NUM_THREADS=4 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time :    0.176 secs
      737201628 75552000
      $ NUM_THREADS=8 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time :    0.124 secs
      737201628 75552000
      $ NUM_THREADS=16 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time :    0.096 secs
      737201628 75552000

      portable memory mapping:

      $ NUM_THREADS=1 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time :    0.488 secs
      737201628 75552000
      $ NUM_THREADS=4 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time :    0.143 secs
      737201628 75552000
      $ NUM_THREADS=8 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time :    0.091 secs
      737201628 75552000
      $ NUM_THREADS=16 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time :    0.065 secs
      737201628 75552000

      MCE-like chunking:

      $ NUM_THREADS=1 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time :    0.489 secs
      737201628 75552000
      $ NUM_THREADS=4 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time :    0.144 secs
      737201628 75552000
      $ NUM_THREADS=8 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time :    0.075 secs
      737201628 75552000
      $ NUM_THREADS=16 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt | cksum
      do_it_all time :    0.048 secs
      737201628 75552000

      rtoa-pgatram-allinone2d.cpp

      The Portable Memory Mapping C++ class does not hamper scaling on a many-core box. To demonstrate this, I prefer using the follow-on project, which counts lines matching a pattern. I begin by creating a 5 GB file in /tmp to better understand the performance characteristics of the two memory mapping implementations: fast_io and the portable C++ class. The input file contains more than 100 million lines.

      $ for i in $(seq 1 210); do cat large.txt; done >/tmp/big.txt
      $ ls -lh /tmp/big.txt
      -rw-r--r-- 1 mario mario 5.0G May 19 11:19 /tmp/big.txt

      I compare the MCE egrep.pl example (for fun), grep-count-pcre2 (using fast_io mapping), and grep-count-pmap (using portable mapping). On my system, Perl needs 10 workers to run faster than the system grep command.

      $ time grep -c "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
      2195760

      real 0m8.721s
      user 0m8.317s
      sys 0m0.404s

      The Perl MCE solution catches up because it scales wonderfully.

      $ time ./egrep.pl --max-workers=2 -c "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
      2195760

      N-thds   2          5          10         20         30
      real     0m39.068s  0m16.088s  0m8.146s   0m4.229s   0m3.084s
      user     1m17.614s  1m19.624s  1m20.441s  1m21.823s  1m25.296s
      sys      0m0.419s   0m0.464s   0m0.434s   0m0.526s   0m0.645s

      Next is grep-count-pcre2, using the fast_io memory mapping C++ class. For some reason, it is unable to scale linearly.

      $ time NUM_THREADS=2 ./grep-count-pcre2 "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
      parallel (2)
      Total Lines: 105000000, Matching Lines: 2195760

      N-thds   2          5          10         20         30
      real     0m9.256s   0m4.382s   0m2.748s   0m1.929s   0m1.689s
      user     0m15.858s  0m16.563s  0m17.267s  0m18.391s  0m19.497s
      sys      0m1.678s   0m1.339s   0m1.186s   0m1.225s   0m1.260s

      Finally, grep-count-pmap (also PCRE2), but using the portable memory mapping C++ class. It scales better, achieving higher performance.

      $ time NUM_THREADS=2 ./grep-count-pmap "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
      parallel (2)
      Total Lines: 105000000, Matching Lines: 2195760

      N-thds   2          5          10         20         30
      real     0m8.113s   0m3.249s   0m1.675s   0m0.958s   0m0.707s
      user     0m15.333s  0m15.646s  0m15.809s  0m16.775s  0m18.110s
      sys      0m0.840s   0m0.266s   0m0.343s   0m0.489s   0m0.506s

      Consuming 30 CPU cores, grep-count-pcre2 is about 2 times faster than Perl. The portable-map solution grep-count-pmap is more than 2 times faster than grep-count-pcre2, or about 4 times faster than Perl.

      The following is a Perl MCE-like chunking implementation, including MCE::Relay-like logic for orderly output. The C++ demonstration supports standard input and file arguments. It consumes very little memory compared to the memory-mapping solutions. The results are similar to grep-count-pmap. Notice the user time is lower, due to using SIMD for counting the linefeed characters.

      $ time NUM_THREADS=2 ./grep-count-chunk "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt
      parallel (2)
      Total Lines: 105000000, Matching Lines: 2195760

      N-thds   2          5          10         20         30
      real     0m8.017s   0m3.257s   0m1.537s   0m0.801s   0m0.618s
      user     0m14.524s  0m14.997s  0m14.906s  0m15.397s  0m17.519s
      sys      0m1.010s   0m1.019s   0m0.412s   0m0.515s   0m0.649s

      grep-count-chunk.cc