PerlMonks  

Re^10: Rosetta Code: Long List is Long (faster - llil4vec - llil4judy - TBB code - llil5vec - boost)

by eyepopslikeamosquito (Archbishop)
on Jan 22, 2023 at 08:05 UTC ( [id://11149754] )


in reply to Re^9: Rosetta Code: Long List is Long (faster - llil4vec - TBB code)
in thread Rosetta Code: Long List is Long

G'day marioroy,

So much learning in this thread! ... OpenMP, Intel TBB, fast_io ... and now good old Boost.

Since Boost is mostly a header-only library, I installed the latest development release manually into my Ubuntu $HOME directory:

cd $HOME/local-boost
wget https://boostorg.jfrog.io/artifactory/main/release/1.81.0/source/boost_1_81_0_rc1.tar.gz.json
wget https://boostorg.jfrog.io/artifactory/main/release/1.81.0/source/boost_1_81_0_rc1.tar.gz
tar -xzf boost_1_81_0_rc1.tar.gz

Then adjusted my build command:

clang++ -o llil5vec-tbb -std=c++20 -Wall -O3 -I "$HOME/llil/cmdlinux/fast_io/include" -I "$HOME/local-boost/boost_1_81_0" -I "$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/include" -L "$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/lib" llil5vec-tbb.cpp -l tbb

I was then able to compile your program with both clang++ and g++. Though g++ spat out a number of warnings similar to:

block_indirect_sort.hpp:274:23: warning: implicit capture of ‘this’ via ‘[=]’ is deprecated in C++20 [-Wdeprecated]
The generated executable seemed to work fine, albeit a touch slower than the clang++ build.

As you might expect, I was then unable to restrain myself from copying your llil4vec-tbb.cpp masterwork to llil5vec-tbb.cpp to try out a few changes.

I was pleasantly surprised that my first attempted change, replacing the std::set emplace sort with a vector sort, gave a significant speed-up on my laptop, allowing me to break the 1.5 second barrier for the first time! Woo hoo! Never thought I'd break the 1.5 second barrier ... though surely the magical one second barrier will prove out of reach.

This little test proved again that when it comes to C++ performance, vector always wins. :) Or at least should always be tried. To be fair, std::map soundly beats std::vector when dealing with very long name strings of variable length (e.g. long1.txt, long2.txt, long3.txt).
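The vector-sort idea can be sketched as follows. This is a simplified illustration using std::string keys (the real code uses fixed-length char arrays and negative counts), and tally_by_name is a hypothetical helper name, not from the original programs: sort by name, then tally adjacent duplicates in one linear pass, instead of paying per-insert ordering cost in a std::set.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using prop_t = std::pair<std::string, long long>;

// Sort by name, then merge adjacent duplicate keys by summing counts.
std::vector<prop_t> tally_by_name(std::vector<prop_t> v) {
    std::sort(v.begin(), v.end(),
              [](const prop_t& a, const prop_t& b) { return a.first < b.first; });
    std::vector<prop_t> out;
    for (const auto& p : v) {
        if (!out.empty() && out.back().first == p.first)
            out.back().second += p.second;   // duplicate key: tally the count
        else
            out.push_back(p);
    }
    return out;
}
```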

You may also notice from the timings below that I've switched to running with NUM_THREADS=6 (it seemed hard to beat on my little laptop). Plus, thanks to your excellent std::chrono::high_resolution_clock improvements, I no longer bother with the Linux time command.
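For reference, the chrono-based timing boils down to something like this (elapsed_secs is a hypothetical stand-in for the elaspe_time helper in the full listings below): capture two time_points and convert the difference to seconds as a double.

```cpp
#include <chrono>

using clock_t_ = std::chrono::high_resolution_clock;

// Convert the difference between two time_points to seconds.
double elapsed_secs(clock_t_::time_point start, clock_t_::time_point end) {
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start)
               .count() * 1e-6;
}
```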

Timings

A baseline, using the original std::set emplace sort:

$ NUM_THREADS=6 ./llil4vec-tbb big1.txt big2.txt big3.txt >f.tmp
llil4vec-tbb (fixed string length=6) start
use TBB
get properties time : 0.399307 secs
sort properties time : 0.277303 secs
emplace set sort time : 0.764016 secs
write stdout time : 0.51597 secs
total time : 1.95691 secs

With the new vector sort:

$ NUM_THREADS=6 ./llil5vec-tbb big1.txt big2.txt big3.txt >big-5vec.tmp
llil5vec-tbb (fixed string length=6) start
use TBB
use boost sort
get properties time : 0.367148 secs
sort properties time : 0.274697 secs
vector stable sort time : 0.385357 secs
write stdout time : 0.390569 secs
total time : 1.4179 secs
$ diff big-5vec.tmp big-3vec.tmp

Updated timings from running the updated llil4vec2.cpp (shown below).

With vector reduce timed separately and non-negative hack in fast_atoll64:

$ NUM_THREADS=6 ./llil4vec2 big1.txt big2.txt big3.txt >f.tmp
llil4vec2 (fixed string length=6) start
use OpenMP
use boost sort
get properties time : 0.353 secs
sort properties time : 0.271 secs
vector reduce time : 0.074 secs
vector stable sort time : 0.197 secs
write stdout time : 0.322 secs
total time : 1.219 secs
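The non-negative hack is simply dropping the sign check from fast_atoll64, which the llil spec permits since every count matches \d+. A minimal sketch of the idea (given a hypothetical name here to avoid clashing with the full listing below):

```cpp
#include <cstdint>

// fast_atoll64 without sign handling: valid because the llil spec
// guarantees each line matches ^[a-z]+\t\d+$, so counts are never negative.
inline int64_t fast_atoll64_u(const char* str) {
    int64_t val = 0;
    uint8_t digit;
    // Stops at the first non-digit (e.g. '\t', '\n' or '\0'),
    // since (c - '0') wraps to a large uint8_t value for non-digits.
    while ((digit = uint8_t(*str++ - '0')) <= 9)
        val = val * 10 + digit;
    return val;
}
```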

With new get_properties function (using std::array and std::find):

$ NUM_THREADS=6 ./llil4vec2 big1.txt big2.txt big3.txt >f.tmp
llil4vec2 (fixed string length=6) start
use OpenMP
use boost sort
get properties time : 0.321 secs
sort properties time : 0.268 secs
vector reduce time : 0.078 secs
vector stable sort time : 0.194 secs
write stdout time : 0.329 secs
total time : 1.192 secs

Slight improvement to 1.165 secs after adding a new std::copy trick :) (4 Feb 2023):

Surprising new record of 1.120 secs (10 Feb 2023) after adding a new USE_MEMCMP_L define. The vector reduce time was halved to a ridiculous 0.03 secs.

$ NUM_THREADS=6 ./llil4vec2 big1.txt big2.txt big3.txt >f.tmp
llil4vec2 (fixed string length=6) use memcmp start
use OpenMP
use boost sort
get properties time : 0.311 secs
sort properties time : 0.261 secs
vector reduce time : 0.030 secs
vector stable sort time : 0.199 secs
write stdout time : 0.316 secs
total time : 1.120 secs
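The USE_MEMCMP_L define works because the fixed-length keys are zero-padded, so a raw ::memcmp over the whole array is a valid lexicographic comparator (memcmp compares as unsigned char, which agrees with the all-lowercase ASCII input). A minimal sketch, where kLen and key_less are illustrative names standing in for MAX_STR_LEN_L and the inline comparisons in the real code:

```cpp
#include <array>
#include <cstddef>
#include <cstring>

constexpr std::size_t kLen = 6;            // stand-in for MAX_STR_LEN_L
using key_t_ = std::array<char, kLen>;     // fixed-length, zero-padded key

// Lexicographic "less than" over the whole fixed-length array.
inline bool key_less(const key_t_& a, const key_t_& b) {
    return std::memcmp(a.data(), b.data(), kLen) < 0;
}
```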

Update: 18/03/2023. Thanks to marioroy, finally broke the magical one second barrier! ... via a combination of mimalloc and memory-mapped-io, as demonstrated in llil4vec. Precise details are complicated and will be added later:

$ LD_PRELOAD=/usr/local/lib/libmimalloc.so NUM_THREADS=6 ./llil4vec-new big1.txt big2.txt big3.txt >f.tmp
llil4vec (fixed string length=6) start
use OpenMP
use boost sort
nthrs=6
get properties 0.192 secs
sort properties 0.261 secs
vector reduce 0.033 secs
vector stable sort 0.186 secs
write stdout 0.311 secs
total time 0.986 secs
$ LD_PRELOAD=/usr/local/lib/libmimalloc.so NUM_THREADS=6 ./llil4vec-new big1.txt big2.txt big3.txt >f.tmp
llil4vec (fixed string length=6) start
use OpenMP
use boost sort
nthrs=6
get properties 0.201 secs
sort properties 0.257 secs
vector reduce 0.032 secs
vector stable sort 0.194 secs
write stdout 0.300 secs
total time 0.986 secs
$ cmp f.tmp big-3vec.tmp
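While the precise details of llil4vec are not shown here, the memory-mapped-io half of the combination can be sketched with plain POSIX calls (mimalloc needs no code changes, only the LD_PRELOAD). The helper name count_lines_mmap is hypothetical, just to show the map-then-scan-in-place pattern that replaces fgets-ing into a buffer:

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the whole input file read-only and scan it in place.
// Returns the number of '\n' characters seen, or -1 on error.
long count_lines_mmap(const char* fname)
{
    int fd = ::open(fname, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (::fstat(fd, &st) != 0) { ::close(fd); return -1; }
    if (st.st_size == 0)       { ::close(fd); return 0; }
    char* p = static_cast<char*>(
        ::mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    ::close(fd);   // the mapping remains valid after close
    if (p == MAP_FAILED) return -1;
    long n = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        if (p[i] == '\n') ++n;
    ::munmap(p, st.st_size);
    return n;
}
```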

Further Update: Timings with llil4map (31-Jan-2023).

$ NUM_THREADS=6 ./llil4map big1.txt big2.txt big3.txt >f.tmp
llil4map start
use phmap::parallel_flat_hash_map
use OpenMP
use boost sort
get properties time : 1.6778 secs
finish merging time : 0.754921 secs
vector stable sort time : 0.76791 secs
write stdout time : 0.382316 secs
total time : 3.5832 secs

Further Update: Timings with llil4judy (April-2023).

Built the Judy C library as described at Re: need help with judy array searching (Judy Array References).

$ NUM_THREADS=6 LD_LIBRARY_PATH=/usr/local/lib ./llil4judy big1.txt big2.txt big3.txt >f.tmp
llil4judy (fixed string length=6) start
use OpenMP
use boost sort
get properties 0.682 secs
finish merging 2.338 secs
vector stable sort 0.262 secs
write stdout 0.526 secs
total time 3.810 secs
$ cmp f.tmp big-3vec.tmp

Though slower, the Judy code uses less memory.

Further Update: Timings with 18 big files (1-Feb-2023)

Test run with big1.txt, big2.txt, big3.txt, big4.txt, big5.txt, big6.txt in my cwd.

$ NUM_THREADS=6 ./llil4vec2 big?.txt big?.txt big?.txt >f4vec.tmp
llil4vec2 (fixed string length=6) start
use OpenMP
use boost sort
get properties time : 1.9719 secs
sort properties time : 2.27273 secs
vector reduce time : 0.201546 secs
vector stable sort time : 0.500317 secs
write stdout time : 0.721466 secs
total time : 5.66814 secs
$ NUM_THREADS=6 ./llil4map big?.txt big?.txt big?.txt >f4map.tmp
llil4map start
use phmap::parallel_flat_hash_map
use OpenMP
use boost sort
get properties time : 5.33992 secs
finish merging time : 2.20465 secs
vector stable sort time : 1.60465 secs
write stdout time : 1.21258 secs
total time : 10.3622 secs
$ diff f4map.tmp f4vec.tmp
$ ls -l f4map.tmp f4vec.tmp
-rw-r--r-- 183490874 Feb  1 09:38 f4map.tmp
-rw-r--r-- 183490874 Feb  1 09:37 f4vec.tmp
$ ls -l big*.txt
-rw-r--r-- 31636800 Jan 16 18:26 big1.txt
-rw-r--r-- 31636800 Jan 16 18:26 big2.txt
-rw-r--r-- 31636800 Jan 16 18:26 big3.txt
-rw-r--r-- 31636800 Feb  1 09:33 big4.txt
-rw-r--r-- 31636800 Feb  1 09:31 big5.txt
-rw-r--r-- 31636800 Feb  1 09:31 big6.txt

Update: a later version with improved get_properties():

$ NUM_THREADS=6 ./llil4vec2 big?.txt big?.txt big?.txt big?.txt big?.txt big?.txt >f4vec.tmp
llil4vec2 (fixed string length=6) start
use OpenMP
use boost sort
get properties time : 4.392 secs
sort properties time : 4.706 secs
vector reduce time : 0.225 secs
vector stable sort time : 0.491 secs
write stdout time : 0.766 secs
total time : 10.593 secs
$ NUM_THREADS=6 ./llil4map big?.txt big?.txt big?.txt big?.txt big?.txt big?.txt >f4map.tmp
llil4map start
use phmap::parallel_flat_hash_map
use OpenMP
use boost sort
get properties time : 10.6884 secs
finish merging time : 2.11571 secs
vector stable sort time : 1.83454 secs
write stdout time : 2.55906 secs
total time : 17.198 secs
$ diff f4map.tmp f4vec.tmp

In summary, the vector sort is almost twice as fast as the set emplace sort: 0.4 secs vs 0.75 secs; it's also slightly faster to write a std::vector to stdout than a std::set.

The fastest sort on my laptop was the one you found, boost::sort::block_indirect_sort. As I have become sadly accustomed to during this long thread, stable sort (though theoretically looking good) was defeated yet again. :(
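For reference, boost::sort::block_indirect_sort takes the usual iterator pair plus a comparator, with an optional thread count. The two-level ordering used throughout (count descending, name ascending as tie-break) can be captured in a small comparator, sketched below with std::sort so the example stands alone without Boost; by_count_then_name and rec_t are illustrative names:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using rec_t = std::pair<long long, std::string>;

// Count descending, then name ascending as the tie-break.
// block_indirect_sort accepts this same comparator, plus a thread count.
inline bool by_count_then_name(const rec_t& l, const rec_t& r) {
    return l.first != r.first ? l.first > r.first : l.second < r.second;
}
```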

Update: Note that a DDR4 DIMM can hold up to 64 GB, while DDR5 octuples that to 512 GB ... which makes worrying about a program's memory usage seem much less important than in the good old days. :)

llil5vec-tbb.cpp

// llil5vec-tbb.cpp
// Based on llil4vec-tbb.cpp in https://perlmonks.com/?node_id=11149687
//
// Vector version using the Intel TBB library
// Note: TBB concurrent vector is best avoided, too much locking overhead
// based on llil3vec-tbb-a.cpp https://perlmonks.com/?node_id=11149622
//    1. Capture time and diff via chrono.
//    2. Key words null-terminated for MAX_STR_LEN_L.
//    3. Concat strings using fast_io::concatln during output.
//    4. Support NUM_THREADS environment variable.
//    5. Merge local vector inside the parallel_for block.
//       This allows threads, completing get_properties, to merge immediately.
//    6. Moved vec_loc instantiation inside for loop.
//    7. Support running one thread, writing to propvec (no merging).
//    8. Capture time for get and sort properties separately.
//    9. Added define for using boost's parallel sort; requires clang++.
//
// To obtain the fast_io library (required dependency):
//    git clone --depth=1 https://github.com/cppfastio/fast_io
//
// Compiled on Linux with:
//    clang++ -o llil5vec-tbb -std=c++20 -Wall -O3 -I "$HOME/local-fast_io/fast_io/include" -I "$HOME/local-boost/boost_1_81_0" -I "$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/include" -L "$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/lib" llil5vec-tbb.cpp -l tbb
// Also works with g++ but throws some compiler warnings about deprecated C++20 features.
// Seems to run slightly faster when compiled with clang++ rather than g++

// Uncomment to use Intel TBB library
#define USE_TBB_L 1

// Specify 0/1 to use boost's parallel sorting algorithm; faster than __gnu_parallel::sort.
// https://www.boost.org/doc/libs/1_81_0/libs/sort/doc/html/sort/parallel.html
// This requires the boost header files: e.g. devpkg-boost bundle on Clear Linux.
// Note: I was able to build this program just by downloading and unpacking Boost locally
// (no need to build it because the bits we use are header file only)
#define USE_BOOST_PARALLEL_SORT 1

#include <chrono>
#include <thread>

// The fast_io header must come after chrono, else build error:
// "no member named 'concatln' in namespace 'fast_io'"
#include <fast_io.h>

#include <cstdio>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>

#include <string>
#include <array>
#include <vector>
#include <utility>
#include <iterator>

#if USE_BOOST_PARALLEL_SORT > 0
#include <boost/sort/sort.hpp>
#else
#endif
#include <algorithm>
#include <execution>

#ifdef USE_TBB_L
#include <tbb/global_control.h>
#include <tbb/parallel_sort.h>
#include <tbb/parallel_for.h>
#include <tbb/spin_mutex.h>
#endif

#include <iostream>
#include <fstream>
#include <sstream>

static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, need a 64-bit compile");

// ----------------------------------------------------------------------------
// Crude hack to see Windows Private Bytes in Task Manager by sleeping at
// program end (see also sleep hack at end of main)
// #include <chrono>
// #include <thread>
// ----------------------------------------------------------------------------

typedef long long llil_int_type;

// Note: all words in big1.txt, big2.txt, big3.txt are <= 6 chars in length
// To use (limited length) fixed length strings uncomment the next line
//    big.txt  max word length is 6
//    long.txt max word length is 208

// The standard big1.txt, big2.txt, big3.txt files all contain 3,515,200 lines
#define N_LINES_BIG1_L 3515200

// Based on rough benchmarking, the short fixed string hack below is only
// worth trying for MAX_STR_LEN_L up to about 22.
// See also https://backlinko.com/google-keyword-study
#define MAX_STR_LEN_L 6

#ifdef MAX_STR_LEN_L
using str_type = std::array<char, MAX_STR_LEN_L + 1>;
#else
using str_type = std::string;
#endif
using str_int_type     = std::pair<str_type, llil_int_type>;
using int_str_type     = std::pair<llil_int_type, str_type>;
using vec_str_int_type = std::vector<str_int_type>;
using vec_int_str_type = std::vector<int_str_type>;

// fast_atoll64 --------------------------------------------------------
// https://stackoverflow.com/questions/16826422/
// c-most-efficient-way-to-convert-string-to-int-faster-than-atoi
inline int64_t fast_atoll64( const char* str )
{
   int64_t val  = 0;
   int     sign = 0;
   if ( *str == '-' ) {
      sign = 1, ++str;
   }
   uint8_t digit;
   while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10 + digit;
   return sign ? -val : val;
}

// Mimic the Perl get_properties subroutine ----------------------------
// Limit line length and use ANSI C functions to try to boost performance
#define MAX_LINE_LEN_L 255
static void get_properties(
   const char*        fname,     // in: the input file name
   vec_int_str_type&  vec_ret)   // out: a vector of properties
{
   FILE* fh;
   char line[MAX_LINE_LEN_L+1];
   char* found;
   llil_int_type count;

   fh = ::fopen(fname, "r");
   if (fh == NULL) {
      std::cerr << "Error opening '" << fname << "' : errno=" << errno << "\n";
      return;
   }
   while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) {
      found = ::strchr(line, '\t');
      // fast_atoll64 is a touch faster than ::atoll
      count = fast_atoll64( &line[found - line + 1] );
      line[found - line] = '\0';   // word
#ifdef MAX_STR_LEN_L
      str_type fixword { { '\0', '\0', '\0', '\0', '\0', '\0', '\0' } };
      ::memcpy( fixword.data(), line, found - line );
      vec_ret.emplace_back( -count, fixword );
#else
      vec_ret.emplace_back( -count, line );
#endif
   }
   ::fclose(fh);
}

double elaspe_time(
   std::chrono::high_resolution_clock::time_point cend,
   std::chrono::high_resolution_clock::time_point cstart)
{
   return double(
      std::chrono::duration_cast<std::chrono::microseconds>(cend - cstart).count()
   ) * 1e-6;
}

// ---------------------------------------------------------------------

int main(int argc, char* argv[])
{
   if (argc < 2) {
      std::cerr << "usage: llil5vec-tbb file1 file2 ... >out.txt\n";
      return 1;
   }

#ifdef MAX_STR_LEN_L
   std::cerr << "llil5vec-tbb (fixed string length=" << MAX_STR_LEN_L << ") start\n";
#else
   std::cerr << "llil5vec-tbb start\n";
#endif
#ifdef USE_TBB_L
   std::cerr << "use TBB\n";
#else
   std::cerr << "don't use TBB\n";
#endif
#if USE_BOOST_PARALLEL_SORT == 0
   std::cerr << "don't use boost sort\n";
#else
   std::cerr << "use boost sort\n";
#endif

   std::chrono::high_resolution_clock::time_point cstart1, cend1, cstart2, cend2, cstart3, cend3s, cend3;
   cstart1 = std::chrono::high_resolution_clock::now();

#ifdef USE_TBB_L
   // Determine the number of threads.
   const char* env_nthrs = std::getenv("NUM_THREADS");
   int nthrs = (env_nthrs && strlen(env_nthrs)) ? ::atoi(env_nthrs) : std::thread::hardware_concurrency();
   tbb::global_control global_limit(tbb::global_control::max_allowed_parallelism, nthrs);
#else
   int nthrs = 1;
#endif

   // Get the list of input files from the command line
   int    nfiles = argc - 1;
   char** fname  = &argv[1];

   // Create the vector of properties
   vec_int_str_type propvec;
   // propvec.reserve(N_LINES_BIG1_L * 3);   // doesn't make much difference

   // Run parallel, depending on the number of threads
   if ( nthrs == 1 ) {
      for (int i = 0; i < nfiles; ++i)
         get_properties( fname[i], propvec );
   }
#ifdef USE_TBB_L
   else {
      tbb::parallel_for(
         tbb::blocked_range<int>(0, nfiles, 1),
         [&](tbb::blocked_range<int> r) {
            for (int i = r.begin(); i < r.end(); ++i) {
               vec_int_str_type locvec;
               get_properties( fname[i], locvec );
               // A static mutex that is shared across all threads
               static tbb::spin_mutex mtx;
               // Acquire a scoped lock
               tbb::spin_mutex::scoped_lock lock(mtx);
               // Append local vector to propvec
               propvec.insert( propvec.end(), locvec.begin(), locvec.end() );
            }
         });
   }
#endif

   cend1 = std::chrono::high_resolution_clock::now();
   double ctaken1 = elaspe_time(cend1, cstart1);
   std::cerr << "get properties time : " << ctaken1 << " secs\n";

   cstart2 = std::chrono::high_resolution_clock::now();

   // Needs to be sorted by word for later sum of adjacent count fields to work
#if USE_BOOST_PARALLEL_SORT == 0
#ifdef USE_TBB_L
   tbb::parallel_sort(
#else
   std::sort(
#endif
      propvec.begin(), propvec.end(),
      [](const int_str_type& left, const int_str_type& right) { return left.second < right.second; }
   );
#else
   // Try building with clang++ if g++ emits errors
   boost::sort::block_indirect_sort(
      propvec.begin(), propvec.end(),
      [](const int_str_type& left, const int_str_type& right) { return left.second < right.second; },
      nthrs
   );
#endif

   cend2 = std::chrono::high_resolution_clock::now();
   double ctaken2 = elaspe_time(cend2, cstart2);
   std::cerr << "sort properties time : " << ctaken2 << " secs\n";

   cstart3 = std::chrono::high_resolution_clock::now();

   // To sort, replace building a std::set with sorting a std::vector
   // Note: negative count gives desired ordering
   // Aside: consider how this loop might be parallelised
   vec_int_str_type myvec;
   auto it = propvec.cbegin();
   str_type name_last = it->second;
   llil_int_type count = it->first;
   for (++it; it != propvec.cend(); ++it) {
      if ( it->second == name_last ) {
         count += it->first;
      }
      else {
         myvec.emplace_back( count, name_last );
         name_last = it->second;
         count = it->first;
      }
   }
   myvec.emplace_back( count, name_last );

   // As usual, stable_sort with a simpler sort function tends to be slower
#if USE_BOOST_PARALLEL_SORT == 0
#ifdef USE_TBB_L
   tbb::parallel_sort(
#else
   std::sort(
#endif
      myvec.begin(), myvec.end(),
      [](const int_str_type& left, const int_str_type& right) {
         return left.first != right.first ? left.first < right.first : left.second < right.second; }
      // use this one for std::stable_sort
      // [](const int_str_type& left, const int_str_type& right) { return left.first < right.first; }
   );
#else
   // block_indirect_sort seems to be the fastest
   // Note: spinsort takes only 3 parameters (no nthrs parameter, unlike the other two)
   boost::sort::block_indirect_sort(
   // boost::sort::parallel_stable_sort(
   // boost::sort::spinsort(
      myvec.begin(), myvec.end(),
      [](const int_str_type& left, const int_str_type& right) {
         return left.first != right.first ? left.first < right.first : left.second < right.second; }
      // [](const int_str_type& left, const int_str_type& right) { return left.first < right.first; }
      , nthrs
   );
#endif
   cend3s = std::chrono::high_resolution_clock::now();

   // Note: fix up negative count via -n.first
#ifdef MAX_STR_LEN_L
   for ( auto const& n : myvec )
      ::print(fast_io::concatln(std::string(n.second.data()), "\t", -n.first));
#else
   for ( auto const& n : myvec )
      ::print(fast_io::concatln(n.second, "\t", -n.first));
#endif

   cend3 = std::chrono::high_resolution_clock::now();
   double ctaken   = elaspe_time(cend3,  cstart1);
   double ctaken3s = elaspe_time(cend3s, cstart3);
   double ctaken3o = elaspe_time(cend3,  cend3s);
   std::cerr << "vector stable sort time : " << ctaken3s << " secs\n";
   std::cerr << "write stdout time : " << ctaken3o << " secs\n";
   std::cerr << "total time : " << ctaken << " secs\n";

   // Hack to see Private Bytes in Windows Task Manager (uncomment next line so process doesn't exit too quickly)
   // std::this_thread::sleep_for(std::chrono::milliseconds(90000000));

   return 0;
}

Update: llil4vec2.cpp

Note: many updates were made to this version long after originally posted. Latest update: 10-Feb-2023

This is just a version of mario's llil4vec.cpp, using OpenMP instead of TBB, and with a more general reduce_vec function replacing some mainline code. I wrote this new function to allow me to play around with parallelisation, but everything I tried was slower. :-( Still, I think the code is a bit cleaner with the reduce_vec function and seems to run around the same speed (see timings above).
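The merge pattern in that code (per-file local vector, then an append inside '#pragma omp critical') can be sketched with standard threads so it stands alone without OpenMP; gather is a hypothetical helper using ints in place of the real property records:

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Each worker fills a local vector, then appends it to the shared one
// under a lock (the role of '#pragma omp critical' in the real code).
std::vector<int> gather(const std::vector<std::vector<int>>& chunks) {
    std::vector<int> shared;
    std::mutex mtx;
    std::vector<std::thread> workers;
    for (const auto& c : chunks) {
        workers.emplace_back([&shared, &mtx, &c] {
            std::vector<int> local(c);            // stand-in for get_properties
            std::lock_guard<std::mutex> lk(mtx);  // serialize only the append
            shared.insert(shared.end(), local.begin(), local.end());
        });
    }
    for (auto& t : workers) t.join();
    return shared;
}
```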

// llil4vec2.cpp
// See also: perlmonks.com, node_id=11149754
// llil4vec.cpp with a reduce_vec function.
// Based on: https://perlmonks.com/?node_id=11149545
// OpenMP Little Book - https://nanxiao.gitbooks.io/openmp-little-book/content/
//
// Vector version of llil2grt.pl.
// based on llil3vec.cpp https://perlmonks.com/?node_id=11149482
//    1. Run get_properties in parallel.
//    2. Capture time and diff via chrono.
//    3. Threads: flush local vector periodically.
//    4. Key words null-terminated for MAX_STR_LEN_L.
//    5. Concat strings using fast_io::concatln during output.
//    6. Support NUM_THREADS environment variable.
//    7. Add FLUSH_VECTOR_PERIODICALLY define statement.
//    8. Removed periodically flush, not what I expected.
//    9. Simplified code, similar to the tbb implementation.
//    A. Capture time for get and sort properties separately.
//    B. Added define for using boost's parallel sort.
//    C. Replaced atoll with fast_atoll64.
//    D. Fast vector sorting - from llil5vec-tbb.
//    E. Reduce in-place, duplicate key names - tally count.
//    F. Exit early if no work; fast_io tweak writing to output;
//       fixword: ensure not more than MAX_LINE_LEN_L characters;
//       limit to 8 threads max for sorting.
//    G. Capture time for vector reduce separately.
//    I. Improved get_properties; set precision for timings.
//
// Obtain the fast_io library (required dependency):
//    git clone --depth=1 https://github.com/cppfastio/fast_io
//
// g++ compile on Linux: (boost header may need the -Wno-stringop-overflow gcc option)
//    g++ -o llil4vec2 -std=c++20 -Wall -O3 llil4vec2.cpp -I ./fast_io/include
//    g++ -o llil4vec2-omp -std=c++20 -fopenmp -Wall -O3 llil4vec2.cpp -I ./fast_io/include
//
// This g++ command also works with mingw C++ compiler (https://sourceforge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.exe).
//
// clang++ compile: same args, without the -Wno-stringop-overflow option
// Seems to run slightly faster when compiled with clang++ instead of g++
//
// An example compile on Ubuntu Linux with some 3rd party (header) libraries unpacked to $HOME:
//    clang++ -o llil4vec2 -std=c++20 -fopenmp -Wall -O3
//       -I "$HOME/local-fast_io/fast_io/include"
//       -I "$HOME/local-parallel-hashmap/parallel-hashmap"
//       -I "$HOME/local-boost/boost_1_81_0"
//       llil4vec2.cpp
//
// Obtain gen-llil.pl and gen-long-llil.pl from https://perlmonks.com/?node_id=11148681
//    perl gen-llil.pl big1.txt 200 3 1
//    perl gen-llil.pl big2.txt 200 3 1
//    perl gen-llil.pl big3.txt 200 3 1
//    perl gen-long-llil.pl long1.txt 600
//    perl gen-long-llil.pl long2.txt 600
//    perl gen-long-llil.pl long3.txt 600
//
// To make random input, obtain shuffle.pl from https://perlmonks.com/?node_id=11149800
//    perl shuffle.pl big1.txt >tmp && mv tmp big1.txt
//    perl shuffle.pl big2.txt >tmp && mv tmp big2.txt
//    perl shuffle.pl big3.txt >tmp && mv tmp big3.txt
//
// Example run:  llil4vec2 big1.txt big2.txt big3.txt >out.txt
//    NUM_THREADS=3 llil4vec2-omp ...
// ----------------------------------------------------------------------------

// Specify 0/1 to use boost's parallel sorting algorithm; faster than __gnu_parallel::sort.
// https://www.boost.org/doc/libs/1_81_0/libs/sort/doc/html/sort/parallel.html
// This requires the boost header files: e.g. devpkg-boost bundle on Clear Linux.
// Note: Another option is downloading and unpacking Boost locally.
// (no need to build it because the bits we use are header file only)
#define USE_BOOST_PARALLEL_SORT 1

#include <chrono>
#include <thread>

// The fast_io header must come after chrono, else build error:
// "no member named 'concatln' in namespace 'fast_io'"
#include <fast_io.h>

#include <cstdio>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>

#include <string>
#include <array>
#include <vector>
#include <utility>
#include <iterator>
#include <execution>
#include <algorithm>

#if USE_BOOST_PARALLEL_SORT > 0
#include <boost/sort/sort.hpp>
#endif

#ifdef _OPENMP
#include <omp.h>
#endif

#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>

static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, need a 64-bit compile");

// ----------------------------------------------------------------------------

typedef long long llil_int_type;

// Note: all words in big1.txt, big2.txt, big3.txt are <= 6 chars in length
//    big.txt  max word length is 6
//    long.txt max word length is 208

// Based on rough benchmarking, the short fixed string hack below is only
// worth trying for MAX_STR_LEN_L up to about 22.
// See also https://backlinko.com/google-keyword-study
// Note: if input data contains words longer than MAX_STR_LEN_L
//       the program may malfunction or even crash
// To use (limited length) fixed length strings uncomment the next line
#define MAX_STR_LEN_L 6

#ifdef MAX_STR_LEN_L
// Uncomment next line to use C memcmp function to compare fixed length strings
#define USE_MEMCMP_L 1
// using str_type = std::array<char, MAX_STR_LEN_L + 1>;
using str_type = std::array<char, MAX_STR_LEN_L>;
#else
// using str_type = std::string;
using str_type = std::basic_string<char>;
#endif
using int_str_type     = std::pair<llil_int_type, str_type>;
using vec_int_str_type = std::vector<int_str_type>;

// fast_atoll64
// https://stackoverflow.com/questions/16826422/
// c-most-efficient-way-to-convert-string-to-int-faster-than-atoi
// Commenting out done below because llil spec [id://11148465] states:
//    each line must match : ^[a-z]+\t\d+$
// i.e. you may assume count >= 0
inline int64_t fast_atoll64( const char* str )
{
   int64_t val = 0;
   // int sign = 0;
   // if ( *str == '-' ) {
   //    sign = 1, ++str;
   // }
   uint8_t digit;
   while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10 + digit;
   // return sign ? -val : val;
   return val;
}

// Mimic the Perl get_properties subroutine ----------------------------
// Limit line length and use ANSI C functions to try to boost performance
#define MAX_LINE_LEN_L 255
static void get_properties(
   const char*        fname,     // in: the input file name
   vec_int_str_type&  vec_ret)   // out: a vector of properties
{
   FILE* fh;
   std::array<char, MAX_LINE_LEN_L + 1> line;
   char* found;
   llil_int_type count;

   fh = ::fopen(fname, "r");
   if ( fh == NULL ) {
      std::cerr << "Error opening '" << fname << "' : " << strerror(errno) << "\n";
      return;
   }
   while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), fh ) != NULL ) {
      found = std::find( line.begin(), line.end(), '\t' );
      count = fast_atoll64(found + 1);
#ifdef MAX_STR_LEN_L
      str_type fixword {};   // Note: {} initializes all elements of fixword to '\0'
      std::copy( line.begin(), found, fixword.begin() );
      vec_ret.emplace_back( count, fixword );
#else
      *found = '\0';
      vec_ret.emplace_back( count, line.data() );
#endif
   }
   ::fclose(fh);
}

// Reduce a vector range (tally adjacent count fields of duplicate key names)
// Return the reduced length
static vec_int_str_type::size_type reduce_vec(
   vec_int_str_type::iterator it1,   // range of vector elements to reduce
   vec_int_str_type::iterator it2 )
{
   auto itr = it1;
   auto itw = it1;
   llil_int_type count = itr->first;
   str_type name_last = itr->second;
   for ( ++itr; itr != it2; ++itr ) {
#ifdef USE_MEMCMP_L
      if ( ::memcmp(itr->second.data(), name_last.data(), MAX_STR_LEN_L) == 0 ) {
#else
      if ( itr->second == name_last ) {
#endif
         count += itr->first;
      }
      else {
         itw->first  = count;
         itw->second = name_last;
         ++itw;
         count = itr->first;
         name_last = itr->second;
      }
   }
   itw->first  = count;
   itw->second = name_last;
   return std::distance(it1, ++itw);
}

typedef std::chrono::high_resolution_clock high_resolution_clock;
typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::chrono::milliseconds milliseconds;

double elaspe_time( time_point cend, time_point cstart)
{
   return double(
      std::chrono::duration_cast<milliseconds>(cend - cstart).count()
   ) * 1e-3;
}

// ---------------------------------------------------------------------

int main(int argc, char* argv[])
{
   if (argc < 2) {
      std::cerr << "usage: llil4vec2 file1 file2 ... >out.txt\n";
      return 1;
   }

   std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fixed);
#ifdef MAX_STR_LEN_L
#ifdef USE_MEMCMP_L
   std::cerr << "llil4vec2 (fixed string length=" << MAX_STR_LEN_L << ") use memcmp start\n";
#else
   std::cerr << "llil4vec2 (fixed string length=" << MAX_STR_LEN_L << ") start\n";
#endif
#else
   std::cerr << "llil4vec2 start\n";
#endif
#ifdef _OPENMP
   std::cerr << "use OpenMP\n";
#else
   std::cerr << "don't use OpenMP\n";
#endif
#if USE_BOOST_PARALLEL_SORT == 0
   std::cerr << "don't use boost sort\n";
#else
   std::cerr << "use boost sort\n";
#endif

   time_point cstart1, cend1, cstart2, cend2, cstart3, cend3r, cend3s, cend3;
   cstart1 = std::chrono::high_resolution_clock::now();

#ifdef _OPENMP
   // Determine the number of threads.
   const char* env_nthrs = std::getenv("NUM_THREADS");
   int nthrs = (env_nthrs && strlen(env_nthrs)) ? ::atoi(env_nthrs) : std::thread::hardware_concurrency();
   omp_set_dynamic(false);
   omp_set_num_threads(nthrs);
#else
   int nthrs = 1;
#endif

   int nthrs_sort = ( std::thread::hardware_concurrency() < 12 )
      ? std::thread::hardware_concurrency() : 12;

   // Get the list of input files from the command line
   int    nfiles = argc - 1;
   char** fname  = &argv[1];

   // Create the vector of properties
   vec_int_str_type propvec;

   // Run parallel, depending on the number of threads
   if ( nthrs == 1 || nfiles == 1 ) {
      for (int i = 0; i < nfiles; ++i)
         get_properties( fname[i], propvec );
   }
#ifdef _OPENMP
   else {
      #pragma omp parallel for schedule(static, 1)
      for (int i = 0; i < nfiles; ++i) {
         vec_int_str_type locvec;
         get_properties( fname[i], locvec );
         #pragma omp critical
         {
            // Append local vector to propvec
            propvec.insert( propvec.end(), locvec.begin(), locvec.end() );
         }
      }
   }
#endif
   if (!propvec.size()) {
      std::cerr << "No work, exiting...\n";
      return 1;
   }

   cend1 = std::chrono::high_resolution_clock::now();
   double ctaken1 = elaspe_time(cend1, cstart1);
   std::cerr << "get properties time     : " << std::setw(8) << ctaken1 << " secs\n";

   cstart2 = std::chrono::high_resolution_clock::now();

   // Needs to be sorted by word for later sum of adjacent count fields to work
#if USE_BOOST_PARALLEL_SORT == 0
   std::sort(
      propvec.begin(), propvec.end(),
      [](const int_str_type& left, const int_str_type& right) {
         return
#ifdef USE_MEMCMP_L
            ::memcmp(left.second.data(), right.second.data(), MAX_STR_LEN_L) < 0;
#else
            left.second < right.second;
#endif
      }
   );
#else
   boost::sort::block_indirect_sort(
      propvec.begin(), propvec.end(),
      [](const int_str_type& left, const int_str_type& right) {
         return
#ifdef USE_MEMCMP_L
            ::memcmp(left.second.data(), right.second.data(), MAX_STR_LEN_L) < 0;
#else
            left.second < right.second;
#endif
      },
      nthrs_sort
   );
#endif

   cend2 = std::chrono::high_resolution_clock::now();
   double ctaken2 = elaspe_time(cend2, cstart2);
   std::cerr << "sort properties time    : " << std::setw(8) << ctaken2 << " secs\n";

   cstart3 = std::chrono::high_resolution_clock::now();

   // Reduce in-place (tally adjacent count fields of duplicate key names)
   vec_int_str_type::size_type newsize = reduce_vec( propvec.begin(), propvec.end() );
   propvec.resize(newsize);
   cend3r = std::chrono::high_resolution_clock::now();

   // Sort the vector by (count) in reverse order, (name) in lexical order
#if USE_BOOST_PARALLEL_SORT == 0
   std::sort(                            // Standard sort
      propvec.begin(), propvec.end(),
      [](const int_str_type& left, const int_str_type& right) {
#ifdef USE_MEMCMP_L
         return left.first != right.first
            ? left.first > right.first
            : ::memcmp(left.second.data(), right.second.data(), MAX_STR_LEN_L) < 0;
#else
         return left.first != right.first
            ? left.first > right.first
            : left.second < right.second;
#endif
      }
   );
#else
   boost::sort::block_indirect_sort(     // Parallel sort
      propvec.begin(), propvec.end(),
      [](const int_str_type& left, const int_str_type& right) {
#ifdef USE_MEMCMP_L
         return left.first != right.first
            ? left.first > right.first
            : ::memcmp(left.second.data(), right.second.data(), MAX_STR_LEN_L) < 0;
#else
         return left.first != right.first
            ? left.first > right.first
            : left.second < right.second;
#endif
      },
      nthrs_sort
   );
#endif
   cend3s = std::chrono::high_resolution_clock::now();

   // Output the sorted vector
   for ( auto const& n : propvec ) {
#ifdef MAX_STR_LEN_L
      // Note: finding doco on mnp::os_c_str() is a challenge, but testing shows this
      // prints non null-terminated strings of length MAX_STR_LEN_L correctly
      // ::print(fast_io::concatln(fast_io::mnp::os_c_str(n.second.data(), MAX_STR_LEN_L), "\t", n.first));
      // ::println(fast_io::mnp::os_c_str(n.second.data(), MAX_STR_LEN_L), "\t", n.first);
      fast_io::io::println(fast_io::mnp::os_c_str(n.second.data(), MAX_STR_LEN_L), "\t", n.first);
#else
      // ::print(fast_io::concatln(n.second, "\t", n.first));
      // ::println(n.second, "\t", n.first);
      fast_io::io::println(n.second, "\t", n.first);
#endif
   }

   cend3 = std::chrono::high_resolution_clock::now();
   double ctaken   = elaspe_time(cend3,  cstart1);
   double ctaken3r = elaspe_time(cend3r, cstart3);
   double ctaken3s = elaspe_time(cend3s, cend3r);
   double ctaken3o = elaspe_time(cend3,  cend3s);

   std::cerr << "vector reduce time      : " << std::setw(8) << ctaken3r << " secs\n";
   std::cerr << "vector stable sort time : " << std::setw(8) << ctaken3s << " secs\n";
   std::cerr << "write stdout time       : " << std::setw(8) << ctaken3o << " secs\n";
   std::cerr << "total time              : " << std::setw(8) << ctaken << " secs\n";

   // Hack to see Private Bytes in Windows Task Manager (uncomment next line so process doesn't exit too quickly)
   // std::this_thread::sleep_for(std::chrono::milliseconds(90000000));

   return 0;
}

Added Later

  • 2024 update: marioroy noted that std::mutex is slower in gcc-14 than gcc-13; he found another way to use spin locks via a small C++ class on the web.

Updated: changed ::atoll to fast_atoll64 (thanks marioroy). 28 Jan 2023: added llil4vec2.cpp and updated the Timings section. Note: llil4vec2.cpp was updated multiple times, often based on improvements to llil4vec.cpp at Re^7: Rosetta Code: Long List is Long (faster - llil4vec - OpenMP code) and llil4vec-tbb.cpp at Re^9: Rosetta Code: Long List is Long (faster - llil4vec - TBB code) (full mario links at Re^2: Rosetta Code: Long List is Long - JudySL summary).

Replies are listed 'Best First'.
Re^11: Rosetta Code: Long List is Long (faster - llil4vec - TBB code - llil5vec)
by marioroy (Prior) on Jan 23, 2023 at 23:22 UTC
    I was then able to compile your program with both clang++ and g++. Though g++ spat out a number of warnings...

    The warnings began after switching to Boost. You can silence them with the gcc -Wno-stringop-overflow option. There are no such issues using clang++.
