in reply to Re^10: Risque Romantic Rosetta Roman Race - All in One - OpenMP
in thread Risque Romantic Rosetta Roman Race

Thanks to eyepopslikeamosquito, for planting the Roman Race saga. :)

During this journey, I came across a website at CPU fun "Processing a File with OpenMP". I reached out to the website authors and fast_io author (including eyepopslikeamosquito), sharing results of my follow-on project Grep Count C++ OpenMP demonstrations, counting matching lines. A discovery was made while testing the grep-count-pmap variant. It turns out that a faster version is possible using the Portable Memory Mapping C++ class. This one scales better than the fast_io memory mapping class.

Before (fast_io memory mapping):

$ NUM_THREADS=1 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.498 secs 737201628 75552000 $ NUM_THREADS=4 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.176 secs 737201628 75552000 $ NUM_THREADS=8 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.124 secs 737201628 75552000 $ NUM_THREADS=16 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.096 secs 737201628 75552000

After (portable memory mapping):

$ NUM_THREADS=1 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.488 secs 737201628 75552000 $ NUM_THREADS=4 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.143 secs 737201628 75552000 $ NUM_THREADS=8 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.091 secs 737201628 75552000 $ NUM_THREADS=16 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.065 secs 737201628 75552000

rtoa-pgatram-allinone2c.cpp

// rtoa-pgatram-allinone2c.cpp. Crude allinone version. // based on rtoa-pgatram-allinone2.cpp https://perlmonks.org/?node_i +d=11152186 // // Obtain the fast_io library (required dependency): // git clone --depth=1 https://github.com/cppfastio/fast_io // // Obtain the Portable Memory Mapping C++ Class (required dependency): // git clone --depth=1 https://github.com/stbrumme/portable-memory-m +apping // Copy two files to your source directory: *.cpp and *.h // cp portable-memory-mapping/MemoryMapped.* . // // Compile with g++ or clang++: // clang++ -o rtoa-pgatram-allinone2c -std=c++20 -fopenmp -Wall -O3 +-I fast_io/include MemoryMapped.cpp rtoa-pgatram-allinone2c.cpp // // OpenMP Little Book: // https://nanxiao.gitbooks.io/openmp-little-book/content/ #include <chrono> #include <cstring> #include <string> #include <string_view> #include <numeric> #include <thread> #ifdef _OPENMP #include <omp.h> #endif #include <iostream> #include <iomanip> // See [id://11149504] for more info on the fast_io C++ library #include <fast_io.h> #include <fast_io_legacy.h> #include "MemoryMapped.h" // --------------------------------------------------------------- typedef std::chrono::high_resolution_clock high_resolution_clock; typedef std::chrono::high_resolution_clock::time_point time_point; typedef std::chrono::milliseconds milliseconds; double elaspe_time( time_point cend, time_point cstart) { return double ( std::chrono::duration_cast<milliseconds>(cend - cstart).count() ) * 1e-3; } // --------------------------------------------------------------- // Though there are less than 256 initializers in this ascii table, // the others are guaranteed by ANSI C to be initialized to zero. static const int romtab[256] = { 0,0,0,0,0,0, 0, 0, 0, 0, // 00- 09 0,0,0,0,0,0, 0, 0, 0, 0, // 10- 19 0,0,0,0,0,0, 0, 0, 0, 0, // 20- 29 0,0,0,0,0,0, 0, 0, 0, 0, // 30- 39 0,0,0,0,0,0, 0, 0, 0, 0, // 40- 49 0,0,0,0,0,0, 0, 0, 0, 0, // 50- 59 0,0,0,0,0,0, 0, 100, 500, 0, // 60- 69 0,0,0,1,0,0, 50,1000, 0, 0, // 70- 79 0,0,0,0,0,0, 5, 0, 10, 0, // 80- 89 0,0,0,0,0,0, 0, 0, 0, 100, // 90- 99 500,0,0,0,0,1, 0, 0, 50,1000, // 100-109 0,0,0,0,0,0, 0, 0, 5, 0, // 110-119 10,0,0,0,0,0, 0, 0, 0, 0 // 120-129 }; // Return the arabic number for a roman letter c. // Return zero if the roman letter c is invalid. inline int urtoa(int c) { return romtab[c]; } inline int accfn(int t, char c) { return t + urtoa(c) - t % urtoa(c) * 2; } inline int roman_to_dec(std::string_view st) { return std::accumulate(st.begin(), st.end(), 0, accfn); } // --------------------------------------------------------------- inline constexpr auto MIN_CHUNK_SIZE = 1048576; // Helper funtion to get the next up value for integer division. static std::size_t divide_up(std::size_t dividend, std::size_t divisor +) { if (dividend % divisor) return dividend / divisor + 1; else return dividend / divisor; } // Read an input file of Roman Numerals and do it all. static void do_it_all( std::string_view fname, // in: file name containing a list of + Roman Numerals int nthds // in: number of threads ) { // Map file to memory. MemoryMapped loader(fname.data(), MemoryMapped::WholeFile, MemoryMa +pped::SequentialScan); if (!loader.isValid()) { fast_io::io::perrln("Error opening '", fname, "' : ", std::errc{ +errno}); return; } // Get raw pointer to mapped memory. char* data = (char*) loader.getData(); std::size_t filesize = loader.size(); // Loop contiguous file container. Each thread process max 1 chunk. std::size_t chunk_size = divide_up(filesize, nthds * 1); if (chunk_size < MIN_CHUNK_SIZE) chunk_size = MIN_CHUNK_SIZE; std::size_t num_chunks = divide_up(filesize, chunk_size); char const *end{data + filesize - 1}; fast_io::out_buf_type obf{fast_io::out()}; #pragma omp parallel for ordered schedule(static, 1) for (std::size_t chunk_id = 0; chunk_id < num_chunks; ++chunk_id) { char const *last{data + filesize - 1}; char const *first{data}; std::string buf; if (chunk_id < num_chunks - 1) { last = first + (chunk_size * (chunk_id + 1) - 1); if (*last != '\n') last = fast_io::find_lf(last, end); } if (chunk_id > 0) { first += (chunk_size * chunk_id - 1); if (*first != '\n') first = fast_io::find_lf(first, end); ++first; } while (first <= last) { auto start_ptr{first}; first = fast_io::find_lf(first, last); auto end_ptr{first}; // chunk ids greater than 0 store locally for later output int dec = roman_to_dec(std::string_view(start_ptr, end_ptr - +start_ptr)); if (chunk_id) buf.append(fast_io::concatln(dec)); else fast_io::io::println(obf, dec); ++first; } #pragma omp ordered { if (chunk_id) fast_io::io::print(obf, buf); } } } int main(int argc, char* argv[]) { if (argc < 2) { if (argc > 0) std::cerr << "usage: rtoa-pgatram-allinone2c file... >out.txt +\n"; return 1; } // Get the list of input files from the command line int nfiles = argc - 1; char** fname = &argv[1]; std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fix +ed); time_point cstartall, cendall; cstartall = high_resolution_clock::now(); #ifdef _OPENMP // Determine the number of threads. int nthds = std::thread::hardware_concurrency(); const char* env_nthds1 = std::getenv("OMP_NUM_THREADS"); const char* env_nthds2 = std::getenv("NUM_THREADS"); if (env_nthds1 && strlen(env_nthds1)) nthds = ::atoi(env_nthds1); else if (env_nthds2 && strlen(env_nthds2)) nthds = ::atoi(env_nthds2); omp_set_dynamic(false); omp_set_num_threads(nthds); #else int nthds = 1; #endif for (int i = 0; i < nfiles; ++i) do_it_all(fname[i], nthds); cendall = high_resolution_clock::now(); double ctakenall = elaspe_time(cendall, cstartall); std::cerr << "do_it_all time : " << std::setw(8) << ctakenall << + " secs\n"; return 0; }

Replies are listed 'Best First'.
Re^12: Risque Romantic Rosetta Roman Race - All in One - OpenMP
by marioroy (Prior) on May 29, 2023 at 14:23 UTC

    I wanted to come back and provide a MCE-like chunking variant, for computing Roman Numerals to Decimal. It runs faster than the memory mapping solutions consuming 8 or more threads.

    fast_io memory mapping:

    $ NUM_THREADS=1 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.498 secs 737201628 75552000 $ NUM_THREADS=4 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.176 secs 737201628 75552000 $ NUM_THREADS=8 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.124 secs 737201628 75552000 $ NUM_THREADS=16 ./rtoa-pgatram-allinone2b t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.096 secs 737201628 75552000

    portable memory mapping:

    $ NUM_THREADS=1 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.488 secs 737201628 75552000 $ NUM_THREADS=4 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.143 secs 737201628 75552000 $ NUM_THREADS=8 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.091 secs 737201628 75552000 $ NUM_THREADS=16 ./rtoa-pgatram-allinone2c t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.065 secs 737201628 75552000

    MCE-like chunking:

    $ NUM_THREADS=1 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.489 secs 737201628 75552000 $ NUM_THREADS=4 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.144 secs 737201628 75552000 $ NUM_THREADS=8 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.075 secs 737201628 75552000 $ NUM_THREADS=16 ./rtoa-pgatram-allinone2d t1.txt t1.txt t1.txt t1.txt + | cksum do_it_all time : 0.048 secs 737201628 75552000

    rtoa-pgatram-allinone2d.cpp

Re^12: Risque Romantic Rosetta Roman Race - All in One - OpenMP
by marioroy (Prior) on May 19, 2023 at 17:04 UTC

    The Portable Memory Mapping C++ class does not hamper scaling on a many-core box. In order to demonstrate this, I prefer using the follow-on project, counting lines matching a pattern. I begin by creating a 5 GB file in /tmp to better understand the performance characteristic between the two memory mapping implementations; fast_io and the portable C++ class. The input file contains greater than 100 million lines.

    $ for i in $(seq 1 210); do cat large.txt; done >/tmp/big.txt $ ls -lh /tmp/big.txt -rw-r--r-- 1 mario mario 5.0G May 19 11:19 /tmp/big.txt

    I compare the MCE egrep.pl example (for fun), grep-count-pcre2 (using fast_io mapping), and grep-count-pmap (using portable mapping). On my system, it requires 10 workers for Perl to run faster than the system grep command.

    $ time grep -c "[aA].*[eE].*[iI].*[oO].*[uU]" /tmp/big.txt 2195760 real 0m8.721s user 0m8.317s sys 0m0.404s

    The Perl MCE solution attempts to catch up due to scaling wonderfully.

    $ time ./egrep.pl --max-workers=2 -c "[aA].*[eE].*[iI].*[oO].*[uU]" /t +mp/big.txt 2195760 N-thds 2 5 10 20 30 real 0m39.068s 0m16.088s 0m8.146s 0m4.229s 0m3.084s user 1m17.614s 1m19.624s 1m20.441s 1m21.823s 1m25.296s sys 0m0.419s 0m0.464s 0m0.434s 0m0.526s 0m0.645s

    Next is grep-count-pcre2 using the fast_io memory mapping C++ class. For some reason, this is unable to scale linearly.

    $ time NUM_THREADS=2 ./grep-count-pcre2 "[aA].*[eE].*[iI].*[oO].*[uU]" + /tmp/big.txt parallel (2) Total Lines: 105000000, Matching Lines: 2195760 N-thds 2 5 10 20 30 real 0m9.256s 0m4.382s 0m2.748s 0m1.929s 0m1.689s user 0m15.858s 0m16.563s 0m17.267s 0m18.391s 0m19.497s sys 0m1.678s 0m1.339s 0m1.186s 0m1.225s 0m1.260s

    Finally grep-count-pmap (also PCRE2), but using the portable memory mapping C++ class. This scales better, achieving better performance.

    $ time NUM_THREADS=2 ./grep-count-pmap "[aA].*[eE].*[iI].*[oO].*[uU]" +/tmp/big.txt parallel (2) Total Lines: 105000000, Matching Lines: 2195760 N-thds 2 5 10 20 30 real 0m8.113s 0m3.249s 0m1.675s 0m0.958s 0m0.707s user 0m15.333s 0m15.646s 0m15.809s 0m16.775s 0m18.110s sys 0m0.840s 0m0.266s 0m0.343s 0m0.489s 0m0.506s

    Consuming 30 CPU cores, grep-count-pcre2 is about 2 times faster than Perl. The portable map solution grep-count-pmap is beyond 2 times faster than grep-count-pcre2 or 4 times faster than Perl.

    The following is a Perl MCE-like chunking implementation including similar MCE::Relay logic for orderly output. The C++ demonstration supports standard input and file arguments. It consumes very little memory compared to the memory mapping solutions. The results are similar to grep-count-pmap. Notice the user time being faster, due to using SIMD for counting the linefeed characters.

    $ time NUM_THREADS=2 ./grep-count-chunk "[aA].*[eE].*[iI].*[oO].*[uU]" + /tmp/big.txt parallel (2) Total Lines: 105000000, Matching Lines: 2195760 N-thds 2 5 10 20 30 real 0m8.017s 0m3.257s 0m1.537s 0m0.801s 0m0.618s user 0m14.524s 0m14.997s 0m14.906s 0m15.397s 0m17.519s sys 0m1.010s 0m1.019s 0m0.412s 0m0.515s 0m0.649s

    grep-count-chunk.cc