Aloha eyepopslikeamosquito,

I made an update to your simpler llil3vec-tbb-a.cpp version. I captured results and posted them here. The llil4vec OpenMP impl can be found here, for comparison.

Before posting, I ran with 121 MB input files (many of them). I saw a 5 ~ 6 second gap in favor of OpenMP; e.g. 1 core running for a while between parallel_for and parallel_sort. This is resolved by having threads merge immediately after processing a file, similarly to the OpenMP implementation. Now, TBB performs well for small, big, or many input files.

llil4vec-tbb.cc

// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~ // llil4vec-tbb.cc // A std::vector demonstration using the Intel TBB library. // https://perlmonks.com/?node_id=11149687 // // April 25, 2024 // Based on llil3m.cpp https://perlmonks.com/?node_id=11149482 // Original challenge https://perlmonks.com/?node_id=11147822 // and summary https://perlmonks.com/?node_id=11150293 // Other demonstrations https://perlmonks.com/?node_id=11149907 // // Authors // Mario Roy - C++ demonstration with parallel capabilities // eyepopslikeamosquito - Co-author, learning C++ at PerlMonks.com // // See also, memory efficient variant // https://gist.github.com/marioroy/d02881b96b20fa1adde4388b3e21616 +3 // // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~ // Getting Started with Intel® Threading Building Blocks (Intel® TBB) // https://www.intel.com/content/www/us/en/developer/articles/guide/ge +t-started-with-tbb.html // // Pro TBB - C++ Parallel Programming with Threading Building Blocks // https://link.springer.com/book/10.1007/978-1-4842-4398-5 // // Compile on Linux (clang++ or g++): // clang++ -o llil4vec-tbb -std=c++20 -Wall -O3 llil4vec-tbb.cc -l +tbb // clang++ -o llil4vec-tbb -std=c++20 -Wall -O3 llil4vec-tbb.cc -I +"$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/include" -L "$HOME/local- +oneapi-tbb/oneapi-tbb-2021.7.0/lib" -l tbb // // The g++ command also works with mingw C++ compiler (https://sourcef +orge.net/projects/mingw-w64) // that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.ex +e). // // Obtain gen-llil.pl and gen-long-llil.pl from https://perlmonks.com/ +?node_id=11148681 // perl gen-llil.pl big1.txt 200 3 1 // perl gen-llil.pl big2.txt 200 3 1 // perl gen-llil.pl big3.txt 200 3 1 // // To make random input, obtain shuffle.pl from https://perlmonks.com/ +?node_id=11149800 // perl shuffle.pl big1.txt >tmp && mv tmp big1.txt // perl shuffle.pl big2.txt >tmp && mv tmp big2.txt // perl shuffle.pl big3.txt >tmp && mv tmp big3.txt // // Example run: llil4vec-tbb big1.txt big2.txt big3.txt >out.txt // NUM_THREADS=3 llil4vec-tbb ... // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~ #include <cassert> #include <cstdio> #include <cstddef> #include <cstdint> #include <cstdlib> #include <cstring> #include <ctime> #include <compare> #include <chrono> #include <string> #include <string_view> #include <array> #include <vector> #include <thread> #include <execution> #include <iomanip> #include <iostream> #include <fstream> static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, ne +ed a 64-bit compile"); // Specify 0/1 to use boost's parallel sorting algorithm; faster than +__gnu_parallel::sort. // https://www.boost.org/doc/libs/1_85_0/libs/sort/doc/html/sort/paral +lel.html // https://www.boost.org/doc/libs/1_85_0/libs/sort/doc/papers/block_in +direct_sort_en.pdf // This requires the boost header files: e.g. devpkg-boost bundle on C +lear Linux. // Note: Another option is downloading and unpacking Boost locally. // (no need to build it because the bits we use are header file only) #define USE_BOOST_PARALLEL_SORT 1 #if USE_BOOST_PARALLEL_SORT #ifdef __clang__ #pragma clang diagnostic push #pragma clang diagnostic ignored "-Wunused-parameter" #pragma clang diagnostic ignored "-Wshadow" #include <boost/sort/sort.hpp> #pragma clang diagnostic pop #else #include <boost/sort/sort.hpp> #endif #endif // Uncomment to use Intel TBB library #define USE_TBB_L 1 #ifdef USE_TBB_L #include <tbb/concurrent_priority_queue.h> #include <tbb/global_control.h> #include <tbb/parallel_sort.h> #include <tbb/parallel_for.h> #include <tbb/spin_mutex.h> #endif // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~ typedef uint32_t int_type; // All words in big1.txt, big2.txt, big3.txt are <= 6 chars in length. // big.txt max word length is 6 // long.txt max word length is 208 // // Based on rough benchmarking, the short fixed string hack below is o +nly // worth trying for MAX_STR_LEN_L up to about 30. // See also https://backlinko.com/google-keyword-study // // To use (limited length) fixed length strings uncomment the next lin +e. #define MAX_STR_LEN_L (size_t) 12 #ifdef MAX_STR_LEN_L struct str_type : std::array<char, MAX_STR_LEN_L> { bool operator==( const str_type& o ) const { return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) == 0; } bool operator<( const str_type& o ) const { return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) < 0; } }; #else using str_type = std::basic_string<char>; #endif using str_int_type = std::pair<str_type, int_type>; using vec_str_int_type = std::vector<str_int_type>; // Mimic the Perl get_properties subroutine ~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~ // convert positive number from string to uint32_t inline uint32_t fast_atoll64(const char* str) { uint32_t val = 0; uint8_t digit; while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10 + digit; return val; } // Helper function to find a character. inline char* find_char(char* first, char* last, char c) { while (first != last) { if (*first == c) break; ++first; } return first; } static int64_t get_properties( const char* fname, // in: the input file name vec_str_int_type& vec_ret) // out: a vector of properties { int64_t num_lines = 0; std::ifstream fin(fname, std::ios::binary); if (!fin.is_open()) { std::cerr << "Error opening '" << fname << "' : " << strerror(er +rno) << '\n'; return num_lines; } fin.seekg(0, std::ios::end); // Get the size of the file std::streampos fsize = fin.tellg(); fin.seekg(0, std::ios::beg); std::vector<char> buf(1 + fsize); // Read the whole file into th +e buffer fin.read(&buf[0], fsize); fin.close(); char* first = &buf[0]; char* last = &buf[fsize]; // Iterate through the buffer while (first < last) { char* beg_ptr{first}; char* end_ptr{find_char(first, last, '\n')}; char* found = find_char(beg_ptr, end_ptr, '\t'); first = end_ptr + 1; if (found == end_ptr) continue; assert(*found == '\t'); int_type count = fast_atoll64(found + 1); size_t klen = found - beg_ptr; #ifdef MAX_STR_LEN_L str_type s {}; // {} initializes all elements of s to '\0' ::memcpy(s.data(), beg_ptr, std::min(MAX_STR_LEN_L, klen)); #else str_type s(beg_ptr, klen); #endif vec_ret.emplace_back(std::move(s), count); ++num_lines; } return num_lines; } // Output subroutine ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~ inline constexpr size_t CHUNK_SIZE = 32768; size_t divide_up(size_t dividend, size_t divisor) { if (dividend % divisor) return (size_t)(dividend / divisor) + 1; else return (size_t)(dividend / divisor); } static void out_properties( const int nthds, // in : the number of threads vec_str_int_type& vec) // in : the vector to output { size_t num_chunks = divide_up(vec.size(), CHUNK_SIZE); #ifdef USE_TBB_L int nthds_out = std::min(nthds, 32); tbb::global_control global_limit(tbb::global_control::max_allowed_p +arallelism, nthds_out); tbb::concurrent_priority_queue< std::pair<size_t, std::string>, decltype( [](const auto& lhs, const auto& rhs) { return lhs.firs +t > rhs.first; } ) > queue; tbb::parallel_for( tbb::blocked_range<int>(0, num_chunks, 1), [&](tbb::blocked_range<int> r) { for (size_t chunk_id = r.begin(); chunk_id < (size_t) r.end(); + ++chunk_id) { #else for (size_t chunk_id = 0; chunk_id < num_chunks; ++chunk_id) { #endif std::string str(""); str.reserve(2048 * 1024); auto it = vec.begin() + chunk_id * CHUNK_SIZE; auto it2 = vec.begin() + std::min(vec.size(), (chunk_id + 1) +* CHUNK_SIZE); for (; it != it2; ++it) { #ifdef MAX_STR_LEN_L str.append(it->first.data()); #else str.append(it->first.data(), it->first.size()); #endif str.append("\t", 1); str.append(std::to_string(it->second)); str.append("\n", 1); } #ifdef USE_TBB_L queue.push( std::make_pair(chunk_id, std::move(str)) ); #else std::cout << str << std::flush; #endif } #ifdef USE_TBB_L }); std::pair<size_t, std::string> data; while ( queue.try_pop(data) ) std::cout << data.second << std::flush; #endif } // Reduce a vector range (tally adjacent count fields of duplicate key + names) static void reduce_vec( auto& vec // vector elements to reduce ) { if (vec.size() > 0) { auto it1 = vec.begin(); auto itr = it1; auto itw = it1; auto it2 = vec.end(); auto curr = itr; for (++itr; itr != it2; ++itr) { if (itr->first == curr->first) { curr->second += itr->second; } else { if (itw != curr) *itw = std::move(*curr); ++itw; curr = itr; } } if (itw != curr) *itw = std::move(*curr); vec.resize(std::distance(it1, ++itw)); } } typedef std::chrono::high_resolution_clock high_resolution_clock; typedef std::chrono::high_resolution_clock::time_point time_point; typedef std::chrono::milliseconds milliseconds; double elaspe_time(time_point cend, time_point cstart) { return double ( std::chrono::duration_cast<milliseconds>(cend - cstart).count() ) * 1e-3; } // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~ int main(int argc, char* argv[]) { if (argc < 2) { if (argc > 0) std::cerr << "usage: llil4vec-tbb file1 file2 ... >out.txt\n" +; return 1; } std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fix +ed); #ifdef MAX_STR_LEN_L std::cerr << "llil4vec-tbb (fixed string length=" << MAX_STR_LEN_L +<< ") start\n"; #else std::cerr << "llil4vec-tbb start\n"; #endif #ifdef USE_TBB_L std::cerr << "use TBB\n"; #else std::cerr << "don't use TBB\n"; #endif #if USE_BOOST_PARALLEL_SORT == 0 std::cerr << "don't use boost sort\n"; #else std::cerr << "use boost sort\n"; #endif time_point cstart1, cend1, cstart2, cend2, cstart3, cend3r, cend3s, + cend3; cstart1 = high_resolution_clock::now(); #ifdef USE_TBB_L // Determine the number of threads. const char* env_nthds = std::getenv("NUM_THREADS"); int nthds = ( env_nthds && strlen(env_nthds) ) ? ::atoi(env_nthds) : std::thread::hardware_concurrency(); tbb::global_control global_limit(tbb::global_control::max_allowed_p +arallelism, nthds); #else int nthds = 1; #endif // Get the list of input files from the command line int nfiles = argc - 1; char** fname = &argv[1]; // Create the vector of properties vec_str_int_type propvec; int64_t num_lines = 0; // Run parallel, depending on the number of threads if (nthds == 1 || nfiles == 1) { for (int i = 0; i < nfiles; ++i) num_lines += get_properties(fname[i], propvec); } #ifdef USE_TBB_L else { num_lines = tbb::parallel_reduce( tbb::blocked_range<int>(0, nfiles, 1), 0, [&](tbb::blocked_range<int> r, int64_t running_num_lines) { for (int i = r.begin(); i < r.end(); ++i) { vec_str_int_type locvec; running_num_lines += get_properties(fname[i], locvec); // A static mutex that is shared across all threads static tbb::spin_mutex mtx; // Acquire a scoped lock tbb::spin_mutex::scoped_lock lock(mtx); // Append local vector to propvec (consumes more memory) // propvec.insert( // propvec.end(), // std::make_move_iterator(locvec.begin()), // std::make_move_iterator(locvec.end()) // ); // Append local vector to propvec (faster) auto it = locvec.begin(); auto it2 = locvec.end(); for (; it != it2; ++it) propvec.emplace_back(std::move(*it)); } return running_num_lines; }, std::plus<int64_t>() ); } #endif cend1 = high_resolution_clock::now(); double ctaken1 = elaspe_time(cend1, cstart1); std::cerr << "get properties " << std::setw(8) << ctaken1 << " + secs\n"; if (!propvec.size()) { std::cerr << "No work, exiting...\n"; return 1; } cstart2 = high_resolution_clock::now(); // Needs to be sorted by word for later sum of adjacent count field +s to work #if USE_BOOST_PARALLEL_SORT == 0 std::sort( propvec.begin(), propvec.end(), [](const str_int_type& left, const str_int_type& right) { return left.first < right.first; } ); #else boost::sort::block_indirect_sort( propvec.begin(), propvec.end(), [](const str_int_type& left, const str_int_type& right) { return left.first < right.first; }, std::min(nthds, 32) ); #endif cend2 = high_resolution_clock::now(); double ctaken2 = elaspe_time(cend2, cstart2); std::cerr << "sort properties " << std::setw(8) << ctaken2 << " + secs\n"; cstart3 = high_resolution_clock::now(); // Reduce in-place (tally adjacent count fields of duplicate key na +mes) reduce_vec(propvec); cend3r = high_resolution_clock::now(); // Sort the vector by (count) in reverse order, (name) in lexical o +rder auto reverse_order = [](const str_int_type& left, const str_int_typ +e& right) { return left.second != right.second ? left.second > right.second : left.first < right.first; }; #if USE_BOOST_PARALLEL_SORT == 0 // Standard sort std::sort(propvec.begin(), propvec.end(), reverse_order); #else // Parallel sort boost::sort::block_indirect_sort( propvec.begin(), propvec.end(), reverse_order, std::min(nthds, 32) ); #endif cend3s = high_resolution_clock::now(); // Output the sorted vector out_properties(nthds, propvec); cend3 = high_resolution_clock::now(); double ctaken = elaspe_time(cend3, cstart1); double ctaken3r = elaspe_time(cend3r, cstart3); double ctaken3s = elaspe_time(cend3s, cend3r); double ctaken3o = elaspe_time(cend3, cend3s); std::cerr << "vector reduce " << std::setw(8) << ctaken3r << +" secs\n"; std::cerr << "vector stable sort " << std::setw(8) << ctaken3s << +" secs\n"; std::cerr << "write stdout " << std::setw(8) << ctaken3o << +" secs\n"; std::cerr << "total time " << std::setw(8) << ctaken << +" secs\n"; std::cerr << " count lines " << num_lines << "\n"; std::cerr << " count unique " << propvec.size() << "\n"; // Hack to see Private Bytes in Windows Task Manager // (uncomment next line so process doesn't exit too quickly) // std::this_thread::sleep_for(milliseconds(9000)); return 0; }


In reply to Re^9: Rosetta Code: Long List is Long (faster - llil4vec - TBB code) by marioroy
in thread Rosetta Code: Long List is Long by eyepopslikeamosquito

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.