comment on

Aloha eyepopslikeamosquito,

I made an update to your simpler llil3vec-tbb-a.cpp version. I captured results and posted them here. The llil4vec OpenMP impl can be found here, for comparison.

Before posting, I ran with 121 MB input files (many of them). I saw a 5 ~ 6 second gap in favor of OpenMP; e.g. 1 core running for a while between parallel_for and parallel_sort. This is resolved by having threads merge immediately after processing a file, similarly to the OpenMP implementation. Now, TBB performs well for small, big, or many input files.

llil4vec-tbb.cc

// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~
// llil4vec-tbb.cc
// A std::vector demonstration using the Intel TBB library.
// https://perlmonks.com/?node_id=11149687
//
// April 25, 2024
//    Based on llil3m.cpp  https://perlmonks.com/?node_id=11149482
//    Original challenge   https://perlmonks.com/?node_id=11147822
//         and summary     https://perlmonks.com/?node_id=11150293
//    Other demonstrations https://perlmonks.com/?node_id=11149907
//
// Authors
//    Mario Roy - C++ demonstration with parallel capabilities
//    eyepopslikeamosquito - Co-author, learning C++ at PerlMonks.com
//
// See also, memory efficient variant
//    https://gist.github.com/marioroy/d02881b96b20fa1adde4388b3e21616
+3
//
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~
// Getting Started with Intel® Threading Building Blocks (Intel® TBB)
// https://www.intel.com/content/www/us/en/developer/articles/guide/ge
+t-started-with-tbb.html
//
// Pro TBB - C++ Parallel Programming with Threading Building Blocks
// https://link.springer.com/book/10.1007/978-1-4842-4398-5
//
// Compile on Linux (clang++ or g++):
//    clang++ -o llil4vec-tbb -std=c++20 -Wall -O3 llil4vec-tbb.cc -l 
+tbb
//    clang++ -o llil4vec-tbb -std=c++20 -Wall -O3 llil4vec-tbb.cc -I 
+"$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/include" -L "$HOME/local-
+oneapi-tbb/oneapi-tbb-2021.7.0/lib" -l tbb
//
// The g++ command also works with mingw C++ compiler (https://sourcef
+orge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.ex
+e).
//
// Obtain gen-llil.pl and gen-long-llil.pl from https://perlmonks.com/
+?node_id=11148681
//    perl gen-llil.pl big1.txt 200 3 1
//    perl gen-llil.pl big2.txt 200 3 1
//    perl gen-llil.pl big3.txt 200 3 1
//
// To make random input, obtain shuffle.pl from https://perlmonks.com/
+?node_id=11149800
//    perl shuffle.pl big1.txt >tmp && mv tmp big1.txt
//    perl shuffle.pl big2.txt >tmp && mv tmp big2.txt
//    perl shuffle.pl big3.txt >tmp && mv tmp big3.txt
//
// Example run:  llil4vec-tbb big1.txt big2.txt big3.txt >out.txt
// NUM_THREADS=3 llil4vec-tbb ...
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~

#include <cassert>
#include <cstdio>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <compare>
#include <chrono>

#include <string>
#include <string_view>

#include <array>
#include <vector>

#include <thread>
#include <execution>

#include <iomanip>
#include <iostream>
#include <fstream>

static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, ne
+ed a 64-bit compile");

// Specify 0/1 to use boost's parallel sorting algorithm; faster than 
+__gnu_parallel::sort.
// https://www.boost.org/doc/libs/1_85_0/libs/sort/doc/html/sort/paral
+lel.html
// https://www.boost.org/doc/libs/1_85_0/libs/sort/doc/papers/block_in
+direct_sort_en.pdf
// This requires the boost header files: e.g. devpkg-boost bundle on C
+lear Linux.
// Note: Another option is downloading and unpacking Boost locally.
// (no need to build it because the bits we use are header file only)
#define USE_BOOST_PARALLEL_SORT 1

#if USE_BOOST_PARALLEL_SORT
   #ifdef __clang__
      #pragma clang diagnostic push
      #pragma clang diagnostic ignored "-Wunused-parameter"
      #pragma clang diagnostic ignored "-Wshadow"
      #include <boost/sort/sort.hpp>
      #pragma clang diagnostic pop
   #else
      #include <boost/sort/sort.hpp>
   #endif
#endif

// Uncomment to use Intel TBB library
#define USE_TBB_L 1

#ifdef USE_TBB_L
    #include <tbb/concurrent_priority_queue.h>
    #include <tbb/global_control.h>
    #include <tbb/parallel_sort.h>
    #include <tbb/parallel_for.h>
    #include <tbb/spin_mutex.h>
#endif

// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~

typedef uint32_t int_type;

// All words in big1.txt, big2.txt, big3.txt are <= 6 chars in length.
// big.txt  max word length is 6
// long.txt max word length is 208
//
// Based on rough benchmarking, the short fixed string hack below is o
+nly
// worth trying for MAX_STR_LEN_L up to about 30.
// See also https://backlinko.com/google-keyword-study
//
// To use (limited length) fixed length strings uncomment the next lin
+e.
#define MAX_STR_LEN_L (size_t) 12

#ifdef MAX_STR_LEN_L
struct str_type : std::array<char, MAX_STR_LEN_L> {
   bool operator==( const str_type& o ) const {
      return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) == 0;
   }
   bool operator<( const str_type& o ) const {
      return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) < 0;
   }
};
#else
using str_type = std::basic_string<char>;
#endif

using str_int_type     = std::pair<str_type, int_type>;
using vec_str_int_type = std::vector<str_int_type>;

// Mimic the Perl get_properties subroutine ~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~

// convert positive number from string to uint32_t
inline uint32_t fast_atoll64(const char* str)
{
   uint32_t val = 0;
   uint8_t digit;
   while ((digit = uint8_t(*str++ - '0')) <= 9)
      val = val * 10 + digit;
   return val;
}

// Helper function to find a character.
inline char* find_char(char* first, char* last, char c)
{
   while (first != last) {
      if (*first == c) break;
      ++first;
   }
   return first;
}

static int64_t get_properties(
   const char*        fname,       //  in: the input file name
   vec_str_int_type&  vec_ret)     // out: a vector of properties
{
   int64_t num_lines = 0;
   std::ifstream fin(fname, std::ios::binary);
   if (!fin.is_open()) {
      std::cerr << "Error opening '" << fname << "' : " << strerror(er
+rno) << '\n';
      return num_lines;
   }
   fin.seekg(0, std::ios::end);         // Get the size of the file
   std::streampos fsize = fin.tellg();
   fin.seekg(0, std::ios::beg);

   std::vector<char> buf(1 + fsize);    // Read the whole file into th
+e buffer
   fin.read(&buf[0], fsize);
   fin.close();

   char* first = &buf[0];
   char* last = &buf[fsize];
                                        // Iterate through the buffer
   while (first < last) {
      char* beg_ptr{first};
      char* end_ptr{find_char(first, last, '\n')};
      char* found = find_char(beg_ptr, end_ptr, '\t');

      first = end_ptr + 1;
      if (found == end_ptr)
         continue;

      assert(*found == '\t');
      int_type count = fast_atoll64(found + 1);
      size_t klen = found - beg_ptr;

#ifdef MAX_STR_LEN_L
      str_type s {};  // {} initializes all elements of s to '\0'
      ::memcpy(s.data(), beg_ptr, std::min(MAX_STR_LEN_L, klen));
#else
      str_type s(beg_ptr, klen);
#endif
      vec_ret.emplace_back(std::move(s), count);

      ++num_lines;
   }

   return num_lines;
}

// Output subroutine ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~

inline constexpr size_t CHUNK_SIZE = 32768;

size_t divide_up(size_t dividend, size_t divisor)
{
   if (dividend % divisor)
      return (size_t)(dividend / divisor) + 1;
   else
      return (size_t)(dividend / divisor);
}

static void out_properties(
   const int          nthds,     // in   : the number of threads
   vec_str_int_type&  vec)       // in   : the vector to output
{
   size_t num_chunks = divide_up(vec.size(), CHUNK_SIZE); 

#ifdef USE_TBB_L
   int nthds_out = std::min(nthds, 32);
   tbb::global_control global_limit(tbb::global_control::max_allowed_p
+arallelism, nthds_out);

   tbb::concurrent_priority_queue<
      std::pair<size_t, std::string>,
      decltype( [](const auto& lhs, const auto& rhs) { return lhs.firs
+t > rhs.first; } )
   > queue;

   tbb::parallel_for( tbb::blocked_range<int>(0, num_chunks, 1),
                       [&](tbb::blocked_range<int> r)
   {
      for (size_t chunk_id = r.begin(); chunk_id < (size_t) r.end(); +
++chunk_id) {
#else
      for (size_t chunk_id = 0; chunk_id < num_chunks; ++chunk_id) {
#endif
         std::string str(""); str.reserve(2048 * 1024);
         auto it  = vec.begin() + chunk_id * CHUNK_SIZE;
         auto it2 = vec.begin() + std::min(vec.size(), (chunk_id + 1) 
+* CHUNK_SIZE);

         for (; it != it2; ++it) {
            #ifdef MAX_STR_LEN_L
            str.append(it->first.data());
            #else
            str.append(it->first.data(), it->first.size());
            #endif
            str.append("\t", 1);
            str.append(std::to_string(it->second));
            str.append("\n", 1);
         }

         #ifdef USE_TBB_L
         queue.push( std::make_pair(chunk_id, std::move(str)) );
         #else
         std::cout << str << std::flush;
         #endif
      }
#ifdef USE_TBB_L
   });

   std::pair<size_t, std::string> data;
   while ( queue.try_pop(data) )
      std::cout << data.second << std::flush;
#endif
}

// Reduce a vector range (tally adjacent count fields of duplicate key
+ names)
static void reduce_vec(
   auto& vec    // vector elements to reduce
)
{
   if (vec.size() > 0) {
      auto it1 = vec.begin(); auto itr = it1; auto itw = it1;
      auto it2 = vec.end();
      auto curr = itr;

      for (++itr; itr != it2; ++itr) {
         if (itr->first == curr->first) {
            curr->second += itr->second;
         }
         else {
            if (itw != curr)
               *itw = std::move(*curr);
            ++itw;
            curr = itr;
         }
      }

      if (itw != curr)
         *itw = std::move(*curr);

      vec.resize(std::distance(it1, ++itw));
   }
}

typedef std::chrono::high_resolution_clock high_resolution_clock;
typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::chrono::milliseconds milliseconds;

double elaspe_time(time_point cend, time_point cstart) {
   return double (
      std::chrono::duration_cast<milliseconds>(cend - cstart).count()
   ) * 1e-3;
}

// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~

int main(int argc, char* argv[])
{
   if (argc < 2) {
      if (argc > 0)
         std::cerr << "usage: llil4vec-tbb file1 file2 ... >out.txt\n"
+;
      return 1;
   }

   std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fix
+ed);
#ifdef MAX_STR_LEN_L
   std::cerr << "llil4vec-tbb (fixed string length=" << MAX_STR_LEN_L 
+<< ") start\n";
#else
   std::cerr << "llil4vec-tbb start\n";
#endif
#ifdef USE_TBB_L
   std::cerr << "use TBB\n";
#else
   std::cerr << "don't use TBB\n";
#endif
#if USE_BOOST_PARALLEL_SORT == 0
   std::cerr << "don't use boost sort\n";
#else
   std::cerr << "use boost sort\n";
#endif
   time_point cstart1, cend1, cstart2, cend2, cstart3, cend3r, cend3s,
+ cend3;
   cstart1 = high_resolution_clock::now();

#ifdef USE_TBB_L
   // Determine the number of threads.
   const char* env_nthds = std::getenv("NUM_THREADS");
   int nthds = ( env_nthds && strlen(env_nthds) )
      ? ::atoi(env_nthds)
      : std::thread::hardware_concurrency();
   tbb::global_control global_limit(tbb::global_control::max_allowed_p
+arallelism, nthds);
#else
   int nthds = 1;
#endif

   // Get the list of input files from the command line
   int    nfiles = argc - 1;
   char** fname  = &argv[1];

   // Create the vector of properties
   vec_str_int_type propvec;
   int64_t num_lines = 0;

   // Run parallel, depending on the number of threads
   if (nthds == 1 || nfiles == 1) {
      for (int i = 0; i < nfiles; ++i)
         num_lines += get_properties(fname[i], propvec);
   }
#ifdef USE_TBB_L
   else {
      num_lines = tbb::parallel_reduce(
         tbb::blocked_range<int>(0, nfiles, 1), 0,
         [&](tbb::blocked_range<int> r, int64_t running_num_lines)
      {
         for (int i = r.begin(); i < r.end(); ++i) {
            vec_str_int_type locvec;
            running_num_lines += get_properties(fname[i], locvec);

            // A static mutex that is shared across all threads
            static tbb::spin_mutex mtx;

            // Acquire a scoped lock
            tbb::spin_mutex::scoped_lock lock(mtx);

            // Append local vector to propvec (consumes more memory)
            // propvec.insert(
            //    propvec.end(),
            //    std::make_move_iterator(locvec.begin()),
            //    std::make_move_iterator(locvec.end())
            // );

            // Append local vector to propvec (faster)
            auto it  = locvec.begin();
            auto it2 = locvec.end();
            for (; it != it2; ++it)
               propvec.emplace_back(std::move(*it));
         }

         return running_num_lines;
      }, std::plus<int64_t>() );
   }
#endif

   cend1 = high_resolution_clock::now();
   double ctaken1 = elaspe_time(cend1, cstart1);
   std::cerr << "get properties      " << std::setw(8) << ctaken1 << "
+ secs\n";

   if (!propvec.size()) {
      std::cerr << "No work, exiting...\n";
      return 1;
   }

   cstart2 = high_resolution_clock::now();

   // Needs to be sorted by word for later sum of adjacent count field
+s to work
#if USE_BOOST_PARALLEL_SORT == 0
   std::sort(
      propvec.begin(), propvec.end(),
      [](const str_int_type& left, const str_int_type& right) {
         return left.first < right.first;
      }
   );
#else
   boost::sort::block_indirect_sort(
      propvec.begin(), propvec.end(),
      [](const str_int_type& left, const str_int_type& right) {
         return left.first < right.first;
      },
      std::min(nthds, 32)
   );
#endif

   cend2 = high_resolution_clock::now();
   double ctaken2 = elaspe_time(cend2, cstart2);
   std::cerr << "sort properties     " << std::setw(8) << ctaken2 << "
+ secs\n";

   cstart3 = high_resolution_clock::now();

   // Reduce in-place (tally adjacent count fields of duplicate key na
+mes)
   reduce_vec(propvec);

   cend3r = high_resolution_clock::now();

   // Sort the vector by (count) in reverse order, (name) in lexical o
+rder
   auto reverse_order = [](const str_int_type& left, const str_int_typ
+e& right) {
      return left.second != right.second
         ? left.second > right.second
         : left.first  < right.first;
   };
#if USE_BOOST_PARALLEL_SORT == 0
   // Standard sort
   std::sort(propvec.begin(), propvec.end(), reverse_order);
#else
   // Parallel sort
   boost::sort::block_indirect_sort(
      propvec.begin(), propvec.end(), reverse_order,
      std::min(nthds, 32)
   );
#endif

   cend3s = high_resolution_clock::now();

   // Output the sorted vector
   out_properties(nthds, propvec);
   cend3 = high_resolution_clock::now();

   double ctaken   = elaspe_time(cend3, cstart1);
   double ctaken3r = elaspe_time(cend3r, cstart3);
   double ctaken3s = elaspe_time(cend3s, cend3r);
   double ctaken3o = elaspe_time(cend3, cend3s);

   std::cerr << "vector reduce       " << std::setw(8) << ctaken3r << 
+" secs\n";
   std::cerr << "vector stable sort  " << std::setw(8) << ctaken3s << 
+" secs\n";
   std::cerr << "write stdout        " << std::setw(8) << ctaken3o << 
+" secs\n";
   std::cerr << "total time          " << std::setw(8) << ctaken   << 
+" secs\n";
   std::cerr << "    count lines     " << num_lines << "\n";
   std::cerr << "    count unique    " << propvec.size() << "\n";

   // Hack to see Private Bytes in Windows Task Manager
   // (uncomment next line so process doesn't exit too quickly)
   // std::this_thread::sleep_for(milliseconds(9000));

   return 0;
}
[download]

In reply to Re^9: Rosetta Code: Long List is Long (faster - llil4vec - TBB code) by marioroy
in thread Rosetta Code: Long List is Long by eyepopslikeamosquito

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.