Re^9: Rosetta Code: Long List is Long (faster - llil4vec

Aloha eyepopslikeamosquito,

I made an update to your simpler llil3vec-tbb-a.cpp version. I captured results and posted them here. The llil4vec OpenMP impl can be found here, for comparison.

Before posting, I ran with 121 MB input files (many of them). I saw a 5 ~ 6 second gap in favor of OpenMP; e.g. 1 core running for a while between parallel_for and parallel_sort. This is resolved by having threads merge immediately after processing a file, similarly to the OpenMP implementation. Now, TBB performs well for small, big, or many input files.

llil4vec-tbb.cc

// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~
// llil4vec-tbb.cc
// A std::vector demonstration using the Intel TBB library.
// https://perlmonks.com/?node_id=11149687
//
// April 25, 2024
//    Based on llil3m.cpp  https://perlmonks.com/?node_id=11149482
//    Original challenge   https://perlmonks.com/?node_id=11147822
//         and summary     https://perlmonks.com/?node_id=11150293
//    Other demonstrations https://perlmonks.com/?node_id=11149907
//
// Authors
//    Mario Roy - C++ demonstration with parallel capabilities
//    eyepopslikeamosquito - Co-author, learning C++ at PerlMonks.com
//
// See also, memory efficient variant
//    https://gist.github.com/marioroy/d02881b96b20fa1adde4388b3e21616
+3
//
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~
// Getting Started with Intel® Threading Building Blocks (Intel® TBB)
// https://www.intel.com/content/www/us/en/developer/articles/guide/ge
+t-started-with-tbb.html
//
// Pro TBB - C++ Parallel Programming with Threading Building Blocks
// https://link.springer.com/book/10.1007/978-1-4842-4398-5
//
// Compile on Linux (clang++ or g++):
//    clang++ -o llil4vec-tbb -std=c++20 -Wall -O3 llil4vec-tbb.cc -l 
+tbb
//    clang++ -o llil4vec-tbb -std=c++20 -Wall -O3 llil4vec-tbb.cc -I 
+"$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/include" -L "$HOME/local-
+oneapi-tbb/oneapi-tbb-2021.7.0/lib" -l tbb
//
// The g++ command also works with mingw C++ compiler (https://sourcef
+orge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.ex
+e).
//
// Obtain gen-llil.pl and gen-long-llil.pl from https://perlmonks.com/
+?node_id=11148681
//    perl gen-llil.pl big1.txt 200 3 1
//    perl gen-llil.pl big2.txt 200 3 1
//    perl gen-llil.pl big3.txt 200 3 1
//
// To make random input, obtain shuffle.pl from https://perlmonks.com/
+?node_id=11149800
//    perl shuffle.pl big1.txt >tmp && mv tmp big1.txt
//    perl shuffle.pl big2.txt >tmp && mv tmp big2.txt
//    perl shuffle.pl big3.txt >tmp && mv tmp big3.txt
//
// Example run:  llil4vec-tbb big1.txt big2.txt big3.txt >out.txt
// NUM_THREADS=3 llil4vec-tbb ...
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~

#include <cassert>
#include <cstdio>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <compare>
#include <chrono>

#include <string>
#include <string_view>

#include <array>
#include <vector>

#include <thread>
#include <execution>

#include <iomanip>
#include <iostream>
#include <fstream>

static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, ne
+ed a 64-bit compile");

// Specify 0/1 to use boost's parallel sorting algorithm; faster than 
+__gnu_parallel::sort.
// https://www.boost.org/doc/libs/1_85_0/libs/sort/doc/html/sort/paral
+lel.html
// https://www.boost.org/doc/libs/1_85_0/libs/sort/doc/papers/block_in
+direct_sort_en.pdf
// This requires the boost header files: e.g. devpkg-boost bundle on C
+lear Linux.
// Note: Another option is downloading and unpacking Boost locally.
// (no need to build it because the bits we use are header file only)
#define USE_BOOST_PARALLEL_SORT 1

#if USE_BOOST_PARALLEL_SORT
   #ifdef __clang__
      #pragma clang diagnostic push
      #pragma clang diagnostic ignored "-Wunused-parameter"
      #pragma clang diagnostic ignored "-Wshadow"
      #include <boost/sort/sort.hpp>
      #pragma clang diagnostic pop
   #else
      #include <boost/sort/sort.hpp>
   #endif
#endif

// Uncomment to use Intel TBB library
#define USE_TBB_L 1

#ifdef USE_TBB_L
    #include <tbb/concurrent_priority_queue.h>
    #include <tbb/global_control.h>
    #include <tbb/parallel_sort.h>
    #include <tbb/parallel_for.h>
    #include <tbb/spin_mutex.h>
#endif

// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~

typedef uint32_t int_type;

// All words in big1.txt, big2.txt, big3.txt are <= 6 chars in length.
// big.txt  max word length is 6
// long.txt max word length is 208
//
// Based on rough benchmarking, the short fixed string hack below is o
+nly
// worth trying for MAX_STR_LEN_L up to about 30.
// See also https://backlinko.com/google-keyword-study
//
// To use (limited length) fixed length strings uncomment the next lin
+e.
#define MAX_STR_LEN_L (size_t) 12

#ifdef MAX_STR_LEN_L
struct str_type : std::array<char, MAX_STR_LEN_L> {
   bool operator==( const str_type& o ) const {
      return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) == 0;
   }
   bool operator<( const str_type& o ) const {
      return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) < 0;
   }
};
#else
using str_type = std::basic_string<char>;
#endif

using str_int_type     = std::pair<str_type, int_type>;
using vec_str_int_type = std::vector<str_int_type>;

// Mimic the Perl get_properties subroutine ~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~

// convert positive number from string to uint32_t
inline uint32_t fast_atoll64(const char* str)
{
   uint32_t val = 0;
   uint8_t digit;
   while ((digit = uint8_t(*str++ - '0')) <= 9)
      val = val * 10 + digit;
   return val;
}

// Helper function to find a character.
inline char* find_char(char* first, char* last, char c)
{
   while (first != last) {
      if (*first == c) break;
      ++first;
   }
   return first;
}

static int64_t get_properties(
   const char*        fname,       //  in: the input file name
   vec_str_int_type&  vec_ret)     // out: a vector of properties
{
   int64_t num_lines = 0;
   std::ifstream fin(fname, std::ios::binary);
   if (!fin.is_open()) {
      std::cerr << "Error opening '" << fname << "' : " << strerror(er
+rno) << '\n';
      return num_lines;
   }
   fin.seekg(0, std::ios::end);         // Get the size of the file
   std::streampos fsize = fin.tellg();
   fin.seekg(0, std::ios::beg);

   std::vector<char> buf(1 + fsize);    // Read the whole file into th
+e buffer
   fin.read(&buf[0], fsize);
   fin.close();

   char* first = &buf[0];
   char* last = &buf[fsize];
                                        // Iterate through the buffer
   while (first < last) {
      char* beg_ptr{first};
      char* end_ptr{find_char(first, last, '\n')};
      char* found = find_char(beg_ptr, end_ptr, '\t');

      first = end_ptr + 1;
      if (found == end_ptr)
         continue;

      assert(*found == '\t');
      int_type count = fast_atoll64(found + 1);
      size_t klen = found - beg_ptr;

#ifdef MAX_STR_LEN_L
      str_type s {};  // {} initializes all elements of s to '\0'
      ::memcpy(s.data(), beg_ptr, std::min(MAX_STR_LEN_L, klen));
#else
      str_type s(beg_ptr, klen);
#endif
      vec_ret.emplace_back(std::move(s), count);

      ++num_lines;
   }

   return num_lines;
}

// Output subroutine ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~

inline constexpr size_t CHUNK_SIZE = 32768;

size_t divide_up(size_t dividend, size_t divisor)
{
   if (dividend % divisor)
      return (size_t)(dividend / divisor) + 1;
   else
      return (size_t)(dividend / divisor);
}

static void out_properties(
   const int          nthds,     // in   : the number of threads
   vec_str_int_type&  vec)       // in   : the vector to output
{
   size_t num_chunks = divide_up(vec.size(), CHUNK_SIZE); 

#ifdef USE_TBB_L
   int nthds_out = std::min(nthds, 32);
   tbb::global_control global_limit(tbb::global_control::max_allowed_p
+arallelism, nthds_out);

   tbb::concurrent_priority_queue<
      std::pair<size_t, std::string>,
      decltype( [](const auto& lhs, const auto& rhs) { return lhs.firs
+t > rhs.first; } )
   > queue;

   tbb::parallel_for( tbb::blocked_range<int>(0, num_chunks, 1),
                       [&](tbb::blocked_range<int> r)
   {
      for (size_t chunk_id = r.begin(); chunk_id < (size_t) r.end(); +
++chunk_id) {
#else
      for (size_t chunk_id = 0; chunk_id < num_chunks; ++chunk_id) {
#endif
         std::string str(""); str.reserve(2048 * 1024);
         auto it  = vec.begin() + chunk_id * CHUNK_SIZE;
         auto it2 = vec.begin() + std::min(vec.size(), (chunk_id + 1) 
+* CHUNK_SIZE);

         for (; it != it2; ++it) {
            #ifdef MAX_STR_LEN_L
            str.append(it->first.data());
            #else
            str.append(it->first.data(), it->first.size());
            #endif
            str.append("\t", 1);
            str.append(std::to_string(it->second));
            str.append("\n", 1);
         }

         #ifdef USE_TBB_L
         queue.push( std::make_pair(chunk_id, std::move(str)) );
         #else
         std::cout << str << std::flush;
         #endif
      }
#ifdef USE_TBB_L
   });

   std::pair<size_t, std::string> data;
   while ( queue.try_pop(data) )
      std::cout << data.second << std::flush;
#endif
}

// Reduce a vector range (tally adjacent count fields of duplicate key
+ names)
static void reduce_vec(
   auto& vec    // vector elements to reduce
)
{
   if (vec.size() > 0) {
      auto it1 = vec.begin(); auto itr = it1; auto itw = it1;
      auto it2 = vec.end();
      auto curr = itr;

      for (++itr; itr != it2; ++itr) {
         if (itr->first == curr->first) {
            curr->second += itr->second;
         }
         else {
            if (itw != curr)
               *itw = std::move(*curr);
            ++itw;
            curr = itr;
         }
      }

      if (itw != curr)
         *itw = std::move(*curr);

      vec.resize(std::distance(it1, ++itw));
   }
}

typedef std::chrono::high_resolution_clock high_resolution_clock;
typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::chrono::milliseconds milliseconds;

double elaspe_time(time_point cend, time_point cstart) {
   return double (
      std::chrono::duration_cast<milliseconds>(cend - cstart).count()
   ) * 1e-3;
}

// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~

int main(int argc, char* argv[])
{
   if (argc < 2) {
      if (argc > 0)
         std::cerr << "usage: llil4vec-tbb file1 file2 ... >out.txt\n"
+;
      return 1;
   }

   std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fix
+ed);
#ifdef MAX_STR_LEN_L
   std::cerr << "llil4vec-tbb (fixed string length=" << MAX_STR_LEN_L 
+<< ") start\n";
#else
   std::cerr << "llil4vec-tbb start\n";
#endif
#ifdef USE_TBB_L
   std::cerr << "use TBB\n";
#else
   std::cerr << "don't use TBB\n";
#endif
#if USE_BOOST_PARALLEL_SORT == 0
   std::cerr << "don't use boost sort\n";
#else
   std::cerr << "use boost sort\n";
#endif
   time_point cstart1, cend1, cstart2, cend2, cstart3, cend3r, cend3s,
+ cend3;
   cstart1 = high_resolution_clock::now();

#ifdef USE_TBB_L
   // Determine the number of threads.
   const char* env_nthds = std::getenv("NUM_THREADS");
   int nthds = ( env_nthds && strlen(env_nthds) )
      ? ::atoi(env_nthds)
      : std::thread::hardware_concurrency();
   tbb::global_control global_limit(tbb::global_control::max_allowed_p
+arallelism, nthds);
#else
   int nthds = 1;
#endif

   // Get the list of input files from the command line
   int    nfiles = argc - 1;
   char** fname  = &argv[1];

   // Create the vector of properties
   vec_str_int_type propvec;
   int64_t num_lines = 0;

   // Run parallel, depending on the number of threads
   if (nthds == 1 || nfiles == 1) {
      for (int i = 0; i < nfiles; ++i)
         num_lines += get_properties(fname[i], propvec);
   }
#ifdef USE_TBB_L
   else {
      num_lines = tbb::parallel_reduce(
         tbb::blocked_range<int>(0, nfiles, 1), 0,
         [&](tbb::blocked_range<int> r, int64_t running_num_lines)
      {
         for (int i = r.begin(); i < r.end(); ++i) {
            vec_str_int_type locvec;
            running_num_lines += get_properties(fname[i], locvec);

            // A static mutex that is shared across all threads
            static tbb::spin_mutex mtx;

            // Acquire a scoped lock
            tbb::spin_mutex::scoped_lock lock(mtx);

            // Append local vector to propvec (consumes more memory)
            // propvec.insert(
            //    propvec.end(),
            //    std::make_move_iterator(locvec.begin()),
            //    std::make_move_iterator(locvec.end())
            // );

            // Append local vector to propvec (faster)
            auto it  = locvec.begin();
            auto it2 = locvec.end();
            for (; it != it2; ++it)
               propvec.emplace_back(std::move(*it));
         }

         return running_num_lines;
      }, std::plus<int64_t>() );
   }
#endif

   cend1 = high_resolution_clock::now();
   double ctaken1 = elaspe_time(cend1, cstart1);
   std::cerr << "get properties      " << std::setw(8) << ctaken1 << "
+ secs\n";

   if (!propvec.size()) {
      std::cerr << "No work, exiting...\n";
      return 1;
   }

   cstart2 = high_resolution_clock::now();

   // Needs to be sorted by word for later sum of adjacent count field
+s to work
#if USE_BOOST_PARALLEL_SORT == 0
   std::sort(
      propvec.begin(), propvec.end(),
      [](const str_int_type& left, const str_int_type& right) {
         return left.first < right.first;
      }
   );
#else
   boost::sort::block_indirect_sort(
      propvec.begin(), propvec.end(),
      [](const str_int_type& left, const str_int_type& right) {
         return left.first < right.first;
      },
      std::min(nthds, 32)
   );
#endif

   cend2 = high_resolution_clock::now();
   double ctaken2 = elaspe_time(cend2, cstart2);
   std::cerr << "sort properties     " << std::setw(8) << ctaken2 << "
+ secs\n";

   cstart3 = high_resolution_clock::now();

   // Reduce in-place (tally adjacent count fields of duplicate key na
+mes)
   reduce_vec(propvec);

   cend3r = high_resolution_clock::now();

   // Sort the vector by (count) in reverse order, (name) in lexical o
+rder
   auto reverse_order = [](const str_int_type& left, const str_int_typ
+e& right) {
      return left.second != right.second
         ? left.second > right.second
         : left.first  < right.first;
   };
#if USE_BOOST_PARALLEL_SORT == 0
   // Standard sort
   std::sort(propvec.begin(), propvec.end(), reverse_order);
#else
   // Parallel sort
   boost::sort::block_indirect_sort(
      propvec.begin(), propvec.end(), reverse_order,
      std::min(nthds, 32)
   );
#endif

   cend3s = high_resolution_clock::now();

   // Output the sorted vector
   out_properties(nthds, propvec);
   cend3 = high_resolution_clock::now();

   double ctaken   = elaspe_time(cend3, cstart1);
   double ctaken3r = elaspe_time(cend3r, cstart3);
   double ctaken3s = elaspe_time(cend3s, cend3r);
   double ctaken3o = elaspe_time(cend3, cend3s);

   std::cerr << "vector reduce       " << std::setw(8) << ctaken3r << 
+" secs\n";
   std::cerr << "vector stable sort  " << std::setw(8) << ctaken3s << 
+" secs\n";
   std::cerr << "write stdout        " << std::setw(8) << ctaken3o << 
+" secs\n";
   std::cerr << "total time          " << std::setw(8) << ctaken   << 
+" secs\n";
   std::cerr << "    count lines     " << num_lines << "\n";
   std::cerr << "    count unique    " << propvec.size() << "\n";

   // Hack to see Private Bytes in Windows Task Manager
   // (uncomment next line so process doesn't exit too quickly)
   // std::this_thread::sleep_for(milliseconds(9000));

   return 0;
}
[download]

Comment on Re^9: Rosetta Code: Long List is Long (faster - llil4vec - TBB code) Download Code

Replies are listed 'Best First'.
Re^10: Rosetta Code: Long List is Long (faster - llil4vec - llil4judy - TBB code - llil5vec - boost) by eyepopslikeamosquito (Archbishop) on Jan 22, 2023 at 08:05 UTC
G'day marioroy, So much learning in this thread! ... OpenMp, Intel TBB, fast_io ... and now good old Boost. Since boost is mostly a header-only library, I installed the latest Dev version manually into my Ubuntu `$HOME` directory: `cd $HOME/local-boost wget https://boostorg.jfrog.io/artifactory/main/release/1.81.0/source/ +boost_1_81_0_rc1.tar.gz.json wget https://boostorg.jfrog.io/artifactory/main/release/1.81.0/source/ +boost_1_81_0_rc1.tar.gz tar -xzf boost_1_81_0_rc1.tar.gz` [download] Then adjusted my build command: `clang++ -o llil5vec-tbb -std=c++20 -Wall -O3 -I "$HOME/llil/cmdlinux/f +ast_io/include" -I "$HOME/local-boost/boost_1_81_0" -I "$HOME/local-o +neapi-tbb/oneapi-tbb-2021.7.0/include" -L "$HOME/local-oneapi-tbb/one +api-tbb-2021.7.0/lib" llil5vec-tbb.cpp -l tbb` [download] I was then able to compile your program with both `clang++` and `g++`. Though `g++` spat out a number of warnings similar to: `block_indirect_sort.hpp:274:23: warning: implicit capture of ‘this’ vi +a ‘[=]’ is deprecated in C++20 [-Wdeprecated]` [download] the generated executable seemed to work fine, albeit a touch slower than `clang++`. As you might expect, I was then unable to restrain myself from copying your `llil4vec-tbb.cpp` masterwork to `llil5vec-tbb.cpp` to try out a few changes. I was pleasantly surprised that my first attempted change, replacing the `emplace std::set` sort with a `vector` sort gave a significant speed up on my laptop, allowing me to break the 1.5 second barrier for the first time! Woo hoo! Never thought I'd break the 1.5 second barrier ... though surely the magical one second barrier will prove out of reach. This little test proved again that when it comes to C++ performance, vector always wins. :) Or at least should always be tried, it's true that `std::map` soundly beats `std::vector` when dealing with very long name strings of variable length (e.g. `long1.txt long2.txt long3.txt`). You may also notice from the timings below, that I've switched to running with `NUM_THREADS=6` (seemed hard to beat on my little laptop). Plus, thanks to your excellent `std::chrono::high_resolution_clock` improvements, I don't bother with the Linux `time` command any more. Timings A baseline, using the original `std::set` emplace sort: `$ NUM_THREADS=6 ./llil4vec-tbb big1.txt big2.txt big3.txt >f.tmp llil4vec-tbb (fixed string length=6) start use TBB get properties time : 0.399307 secs sort properties time : 0.277303 secs emplace set sort time : 0.764016 secs write stdout time : 0.51597 secs total time : 1.95691 secs` [download] With the new vector sort: `$ NUM_THREADS=6 ./llil5vec-tbb big1.txt big2.txt big3.txt >big-5vec.tm +p llil5vec-tbb (fixed string length=6) start use TBB use boost sort get properties time : 0.367148 secs sort properties time : 0.274697 secs vector stable sort time : 0.385357 secs write stdout time : 0.390569 secs total time : 1.4179 secs $ diff big-5vec.tmp big-3vec.tmp` [download] Updated timings of running updated `llil4vec2.cpp` below With vector reduce timed separately and non-negative hack in `fast_atoll64`: `$ NUM_THREADS=6 ./llil4vec2 big1.txt big2.txt big3.txt >f.tmp llil4vec2 (fixed string length=6) start use OpenMP use boost sort get properties time : 0.353 secs sort properties time : 0.271 secs vector reduce time : 0.074 secs vector stable sort time : 0.197 secs write stdout time : 0.322 secs total time : 1.219 secs` [download] With new get_properties function (using `std::array` and `std::find`): `$ NUM_THREADS=6 ./llil4vec2 big1.txt big2.txt big3.txt >f.tmp llil4vec2 (fixed string length=6) start use OpenMP use boost sort get properties time : 0.321 secs sort properties time : 0.268 secs vector reduce time : 0.078 secs vector stable sort time : 0.194 secs write stdout time : 0.329 secs total time : 1.192 secs` [download] Slight improvement to 1.165 secs after adding new `std::copy` trick :) (4 Feb 2023): Surprising new record of 1.120 secs (10 Feb 2023) after adding new `USE_MEMCMP_L` define. The vector reduce time was halved to a ridiculous 0.03 secs. `$ NUM_THREADS=6 ./llil4vec2 big1.txt big2.txt big3.txt >f.tmp llil4vec2 (fixed string length=6) use memcmp start use OpenMP use boost sort get properties time : 0.311 secs sort properties time : 0.261 secs vector reduce time : 0.030 secs vector stable sort time : 0.199 secs write stdout time : 0.316 secs total time : 1.120 secs` [download] Update: 18/03/2023. Thanks to marioroy, finally broke the magical one second barrier! ... via a combination of mimalloc and memory-mapped-io, as demonstrated in llil4vec. Precise details are complicated and will be added later: $ LD_PRELOAD=/usr/local/lib/libmimalloc.so NUM_THREADS=6 ./llil4vec-ne +w big1.txt big2.txt big3.txt >f.tmp llil4vec (fixed string length=6) start use OpenMP use boost sort nthrs=6 get properties 0.192 secs sort properties 0.261 secs vector reduce 0.033 secs vector stable sort 0.186 secs write stdout 0.311 secs total time 0.986 secs $ LD_PRELOAD=/usr/local/lib/libmimalloc.so NUM_THREADS=6 ./llil4vec-ne +w big1.txt big2.txt big3.txt >f.tmp llil4vec (fixed string length=6) start use OpenMP use boost sort nthrs=6 get properties 0.201 secs sort properties 0.257 secs vector reduce 0.032 secs vector stable sort 0.194 secs write stdout 0.300 secs total time 0.986 secs $ cmp f.tmp big-3vec.tmp [download] Further Update: Timings with llil4map (31-Jan-2023). `$ NUM_THREADS=6 ./llil4map big1.txt big2.txt big3.txt >f.tmp llil4map start use phmap::parallel_flat_hash_map use OpenMP use boost sort get properties time : 1.6778 secs finish merging time : 0.754921 secs vector stable sort time : 0.76791 secs write stdout time : 0.382316 secs total time : 3.5832 secs` [download] Further Update: Timings with llil4judy (April-2023). Built Judy C Library as described at Re: need help with judy array searching (Judy Array References). `$ NUM_THREADS=6 LD_LIBRARY_PATH=/usr/local/lib ./llil4judy big1.txt bi +g2.txt big3.txt >f.tmp llil4judy (fixed string length=6) start use OpenMP use boost sort get properties 0.682 secs finish merging 2.338 secs vector stable sort 0.262 secs write stdout 0.526 secs total time 3.810 secs $ cmp f.tmp big-3vec.tmp` [download] Though slower, the Judy code uses less memory. Further Update: Timings with 18 big files (1-Feb-2023) Test run with big1.txt, big2.txt, big3.txt, big4.txt, big5.txt, big6.txt in my cwd. $ NUM_THREADS=6 ./llil4vec2 big?.txt big?.txt big?.txt >f4vec.tmp llil4vec2 (fixed string length=6) start use OpenMP use boost sort get properties time : 1.9719 secs sort properties time : 2.27273 secs vector reduce time : 0.201546 secs vector stable sort time : 0.500317 secs write stdout time : 0.721466 secs total time : 5.66814 secs $ NUM_THREADS=6 ./llil4map big?.txt big?.txt big?.txt >f4map.tmp llil4map start use phmap::parallel_flat_hash_map use OpenMP use boost sort get properties time : 5.33992 secs finish merging time : 2.20465 secs vector stable sort time : 1.60465 secs write stdout time : 1.21258 secs total time : 10.3622 secs $ diff f4map.tmp f4vec.tmp $ ls -l f4map.tmp f4vec.tmp -rw-r--r-- 183490874 Feb 1 09:38 f4map.tmp -rw-r--r-- 183490874 Feb 1 09:37 f4vec.tmp $ ls -l big.txt -rw-r--r-- 31636800 Jan 16 18:26 big1.txt -rw-r--r-- 31636800 Jan 16 18:26 big2.txt -rw-r--r-- 31636800 Jan 16 18:26 big3.txt -rw-r--r-- 31636800 Feb 1 09:33 big4.txt -rw-r--r-- 31636800 Feb 1 09:31 big5.txt -rw-r--r-- 31636800 Feb 1 09:31 big6.txt [download] Update: a later version with improved get_properties(): $ NUM_THREADS=6 ./llil4vec2 big?.txt big?.txt big?.txt big?.txt big?.t +xt big?.txt >f4vec.tmp llil4vec2 (fixed string length=6) start use OpenMP use boost sort get properties time : 4.392 secs sort properties time : 4.706 secs vector reduce time : 0.225 secs vector stable sort time : 0.491 secs write stdout time : 0.766 secs total time : 10.593 secs $ NUM_THREADS=6 ./llil4map big?.txt big?.txt big?.txt big?.txt big?.tx +t big?.txt >f4map.tmp llil4map start use phmap::parallel_flat_hash_map use OpenMP use boost sort get properties time : 10.6884 secs finish merging time : 2.11571 secs vector stable sort time : 1.83454 secs write stdout time : 2.55906 secs total time : 17.198 secs $ diff f4map.tmp f4vec.tmp [download] In summary, the vector sort is almost twice as fast as the set emplace sort: 0.4 secs vs 0.75 secs; it's also slightly faster to write a `std::vector` to stdout than a `std::set`. The fastest sort on my laptop was the one you found, `boost::sort::block_indirect_sort`. As I have become sadly accustomed to during this long thread, stable sort (though theoretically looking good) was defeated yet again. :( Update: Note that a DDR4 DIMM can hold up to 64 GB, while DDR5 octuples that to 512 GB ... which makes worrying about a program's memory usage seem much less important than in the good old days. :) llil5vec-tbb.cpp* // llil5vec-tbb.cpp // Based on llil4vec-tbb.cpp in https://perlmonks.com/?node_id=1114968 +7 // // Vector version using the Intel TBB library // Note: TBB concurrent vector is best avoided, too much locking overh +ead // based on llil3vec-tbb-a.cpp https://perlmonks.com/?node_id=11149622 // 1. Capture time and diff via chrono. // 2. Key words null-terminated for MAX_STR_LEN_L. // 3. Concat strings using fast_io::concatln during output. // 4. Support NUM_THREADS environment variable. // 5. Merge local vector inside the parallel_for block. // This allows threads, completing get_properties, to merge imme +diately. // 6. Moved vec_loc instantiation inside for loop. // 7. Support running one thread, writing to propvec (no merging). // 8. Capture time for get and sort properties separately. // 9. Added define for using boost's parallel sort; requires clang+ ++. // // To obtain the fast_io library (required dependency): // git clone --depth=1 https://github.com/cppfastio/fast_io // // Compiled on Linux with: // clang++ -o llil5vec-tbb -std=c++20 -Wall -O3 -I "$HOME/local-fast +_io/fast_io/include" -I "$HOME/local-boost/boost_1_81_0" -I "$HOME/lo +cal-oneapi-tbb/oneapi-tbb-2021.7.0/include" -L "$HOME/local-oneapi-tb +b/oneapi-tbb-2021.7.0/lib" llil5vec-tbb.cpp -l tbb // Also works with g++ but throws some compiler warnings about depreca +ted C++20 features. // Seems to run slightly faster when compiled with clang++ rather than + g++ // Uncomment to use Intel TBB library #define USE_TBB_L 1 // Specify 0/1 to use boost's parallel sorting algorithm; faster than +__gnu_parallel::sort. // https://www.boost.org/doc/libs/1_81_0/libs/sort/doc/html/sort/paral +lel.html // This requires the boost header files: e.g. devpkg-boost bundle on C +lear Linux. // Note: I was able to build this program just by downloading and unpa +cking Boost locally // (no need to build it because the bits we use are header file only) #define USE_BOOST_PARALLEL_SORT 1 #include <chrono> #include <thread> // The fast_io header must come after chrono, else build error: // "no member named 'concatln' in namespace 'fast_io'" #include <fast_io.h> #include <cstdio> #include <cstddef> #include <cstdint> #include <cstdlib> #include <cstring> #include <ctime> #include <string> #include <array> #include <vector> #include <utility> #include <iterator> #if USE_BOOST_PARALLEL_SORT > 0 #include <boost/sort/sort.hpp> #else #endif #include <algorithm> #include <execution> #ifdef USE_TBB_L #include <tbb/global_control.h> #include <tbb/parallel_sort.h> #include <tbb/parallel_for.h> #include <tbb/spin_mutex.h> #endif #include <iostream> #include <fstream> #include <sstream> static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, ne +ed a 64-bit compile"); // ------------------------------------------------------------------- +--------- // Crude hack to see Windows Private Bytes in Task Manager by sleeping + at // program end (see also sleep hack at end of main) // #include <chrono> // #include <thread> // ------------------------------------------------------------------- +--------- typedef long long llil_int_type; // Note: all words in big1.txt, big2.txt, big3.txt are <= 6 chars in l +ength // To use (limited length) fixed length strings uncomment the next lin +e // big.txt max word length is 6 // long.txt max word length is 208 // The standard big1.txt, big2.txt, big3.txt files all contain 3,515,2 +00 lines #define N_LINES_BIG1_L 3515200 // Based on rough benchmarking, the short fixed string hack below is o +nly // worth trying for MAX_STR_LEN_L up to about 22. // See also https://backlinko.com/google-keyword-study #define MAX_STR_LEN_L 6 #ifdef MAX_STR_LEN_L using str_type = std::array<char, MAX_STR_LEN_L + 1>; #else using str_type = std::string; #endif using str_int_type = std::pair<str_type, llil_int_type>; using int_str_type = std::pair<llil_int_type, str_type>; using vec_str_int_type = std::vector<str_int_type>; using vec_int_str_type = std::vector<int_str_type>; // fast_atoll64 ------------------------------------------------------ +-- // // https://stackoverflow.com/questions/16826422/ // c-most-efficient-way-to-convert-string-to-int-faster-than-atoi inline int64_t fast_atoll64( const char* str ) { int64_t val = 0; int sign = 0; if ( str == '-' ) { sign = 1, ++str; } uint8_t digit; while ((digit = uint8_t(str++ - '0')) <= 9) val = val * 10 + digit +; return sign ? -val : val; } // Mimic the Perl get_properties subroutine -------------------------- +-- // Limit line length and use ANSI C functions to try to boost performa +nce #define MAX_LINE_LEN_L 255 static void get_properties( const char* fname, // in: the input file name vec_int_str_type& vec_ret) // out: a vector of properties { FILE* fh; char line[MAX_LINE_LEN_L+1]; char* found; llil_int_type count; fh = ::fopen(fname, "r"); if (fh == NULL) { std::cerr << "Error opening '" << fname << "' : errno=" << errno + << "\n"; return; } while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) { found = ::strchr(line, '\t'); // fast_atoll64 is a touch faster than ::atoll count = fast_atoll64( &line[found - line + 1] ); line[found - line] = '\0'; // word #ifdef MAX_STR_LEN_L str_type fixword { { '\0', '\0', '\0', '\0', '\0', '\0', '\0' } +}; ::memcpy( fixword.data(), line, found - line ); vec_ret.emplace_back( -count, fixword ); #else vec_ret.emplace_back( -count, line ); #endif } ::fclose(fh); } double elaspe_time( std::chrono::high_resolution_clock::time_point cend, std::chrono::high_resolution_clock::time_point cstart) { return double( std::chrono::duration_cast<std::chrono::microseconds>(cend - cst +art).count() ) * 1e-6; } // ------------------------------------------------------------------- +-- int main(int argc, char* argv[]) { if (argc < 2) { std::cerr << "usage: llil5vec-tbb file1 file2 ... >out.txt\n"; return 1; } #ifdef MAX_STR_LEN_L std::cerr << "llil5vec-tbb (fixed string length=" << MAX_STR_LEN_L +<< ") start\n"; #else std::cerr << "llil5vec-tbb start\n"; #endif #ifdef USE_TBB_L std::cerr << "use TBB\n"; #else std::cerr << "don't use TBB\n"; #endif #if USE_BOOST_PARALLEL_SORT == 0 std::cerr << "don't use boost sort\n"; #else std::cerr << "use boost sort\n"; #endif std::chrono::high_resolution_clock::time_point cstart1, cend1, csta +rt2, cend2, cstart3, cend3s, cend3; cstart1 = std::chrono::high_resolution_clock::now(); #ifdef USE_TBB_L // Determine the number of threads. const char* env_nthrs = std::getenv("NUM_THREADS"); int nthrs = (env_nthrs && strlen(env_nthrs)) ? ::atoi(env_nthrs) : +std::thread::hardware_concurrency(); tbb::global_control global_limit(tbb::global_control::max_allowed_p +arallelism, nthrs); #else int nthrs = 1; #endif // Get the list of input files from the command line int nfiles = argc - 1; char** fname = &argv[1]; // Create the vector of properties vec_int_str_type propvec; // propvec.reserve(N_LINES_BIG1_L * 3); // doesn't make much dif +ference // Run parallel, depending on the number of threads if ( nthrs == 1 ) { for (int i = 0; i < nfiles; ++i) get_properties( fname[i], propvec ); } #ifdef USE_TBB_L else { tbb::parallel_for( tbb::blocked_range<int>(0, nfiles, 1), [&](tbb::blocked_range<int> r) { for (int i = r.begin(); i < r.end(); ++i) { vec_int_str_type locvec; get_properties( fname[i], locvec ); // A static mutex that is shared across all threads static tbb::spin_mutex mtx; // Acquire a scoped lock tbb::spin_mutex::scoped_lock lock(mtx); // Append local vector to propvec propvec.insert( propvec.end(), locvec.begin(), locvec.end( +) ); } }); } #endif cend1 = std::chrono::high_resolution_clock::now(); double ctaken1 = elaspe_time(cend1, cstart1); std::cerr << "get properties time : " << ctaken1 << " secs\n"; cstart2 = std::chrono::high_resolution_clock::now(); // Needs to be sorted by word for later sum of adjacent count field +s to work #if USE_BOOST_PARALLEL_SORT == 0 #ifdef USE_TBB_L tbb::parallel_sort( #else std::sort( #endif propvec.begin(), propvec.end(), [](const int_str_type& left, const int_str_type& right) { return + left.second < right.second; } ); #else // Try building with clang++ if g++ emits errors boost::sort::block_indirect_sort( propvec.begin(), propvec.end(), [](const int_str_type& left, const int_str_type& right) { return + left.second < right.second; }, nthrs ); #endif cend2 = std::chrono::high_resolution_clock::now(); double ctaken2 = elaspe_time(cend2, cstart2); std::cerr << "sort properties time : " << ctaken2 << " secs\n"; cstart3 = std::chrono::high_resolution_clock::now(); // To sort, replace building a std::set with sorting a std::vector // Note: negative count gives desired ordering // Aside: consider how this loop might be parallelised vec_int_str_type myvec; auto it = propvec.cbegin(); str_type name_last = it->second; llil_int_type count = it->first; for (++it; it != propvec.cend(); ++it) { if ( it->second == name_last ) { count += it->first; } else { myvec.emplace_back( count, name_last ); name_last = it->second; count = it->first; } } myvec.emplace_back( count, name_last ); // As usual, stable_sort with a simpler sort function tends to be s +lower #if USE_BOOST_PARALLEL_SORT == 0 #ifdef USE_TBB_L tbb::parallel_sort( #else std::sort( #endif myvec.begin(), myvec.end(), [](const int_str_type& left, const int_str_type& right) { return + left.first != right.first ? left.first < right.first : left.second < + right.second; } // use this one for std::stable_sort // [](const int_str_type& left, const int_str_type& right) { r +eturn left.first < right.first; } ); #else // block_indirect_sort seems to be the fastest // Note: spinsort takes only 3 parameters (no nthrs parameter, unli +ke the other two) boost::sort::block_indirect_sort( // boost::sort::parallel_stable_sort( // boost::sort::spinsort( myvec.begin(), myvec.end(), [](const int_str_type& left, const int_str_type& right) { return + left.first != right.first ? left.first < right.first : left.second < + right.second; } // [](const int_str_type& left, const int_str_type& right) { ret +urn left.first < right.first; } , nthrs ); #endif cend3s = std::chrono::high_resolution_clock::now(); // Note: fix up negative count via -n.first #ifdef MAX_STR_LEN_L for ( auto const& n : myvec ) ::print(fast_io::concatln(std::string(n.second.data()), "\t", -n +.first)); #else for ( auto const& n : myvec ) ::print(fast_io::concatln(n.second, "\t", -n.first)); #endif cend3 = std::chrono::high_resolution_clock::now(); double ctaken = elaspe_time(cend3, cstart1); double ctaken3s = elaspe_time(cend3s, cstart3); double ctaken3o = elaspe_time(cend3, cend3s); std::cerr << "vector stable sort time : " << ctaken3s << " secs\n" +; std::cerr << "write stdout time : " << ctaken3o << " secs\n" +; std::cerr << "total time : " << ctaken << " secs\n" +; // Hack to see Private Bytes in Windows Task Manager (uncomment nex +t line so process doesn't exit too quickly) // std::this_thread::sleep_for(std::chrono::milliseconds(90000000 +)); return 0; } [download] Update: llil4vec2.cpp Note: many updates were made to this version long after originally posted. Latest update: 10-Feb-2023 This is just a version of mario's llil4vec.cpp, using OpenMP instead of TBB, and with a more general `reduce_vec` function replacing some mainline code. I wrote this new function to allow me to play around with parallelisation, but everything I tried was slower. :-( Still, I think the code is a bit cleaner with the `reduce_vec` function and seems to run around the same speed (see timings above). // llil4vec2.cpp // See also: perlmonks.com, node_id=11149754 // llil4vec.cpp with a reduce_vec function. // Based on: https://perlmonks.com/?node_id=11149545 // OpenMP Little Book - https://nanxiao.gitbooks.io/openmp-little-book +/content/ // // Vector version of llil2grt.pl. // based on llil3vec.cpp https://perlmonks.com/?node_id=11149482 // 1. Run get_properties in parallel. // 2. Capture time and diff via chrono. // 3. Threads: flush local vector periodically. // 4. Key words null-terminated for MAX_STR_LEN_L. // 5. Concat strings using fast_io::concatln during output. // 6. Support NUM_THREADS environment variable. // 7. Add FLUSH_VECTOR_PERIODICALLY define statement. // 8. Removed periodically flush, not what I expected. // 9. Simplified code, similar to the tbb implementation. // A. Capture time for get and sort properties separately. // B. Added define for using boost's parallel sort. // C. Replaced atoll with fast_atoll64. // D. Fast vector sorting - from llil5vec-tbb. // E. Reduce in-place, duplicate key names - tally count. // F. Exit early if no work; fast_io tweak writing to output; // fixword: ensure not more than MAX_LINE_LEN_L characters; // limit to 8 threads max for sorting. // G. Capture time for vector reduce separately. // I. Improved get_properties; set precision for timings. // // Obtain the fast_io library (required dependency): // git clone --depth=1 https://github.com/cppfastio/fast_io // // g++ compile on Linux: (boost header may need the -Wno-stringop-over +flow gcc option) // g++ -o llil4vec2 -std=c++20 -Wall -O3 llil4vec2.cpp -I ./fast_io +/include // g++ -o llil4vec2-omp -std=c++20 -fopenmp -Wall -O3 llil4vec2.cpp + -I ./fast_io/include // // This g++ command also works with mingw C++ compiler (https://source +forge.net/projects/mingw-w64) // that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.ex +e). // // clang++ compile: same args, without the -Wno-stringop-overflow opti +on // Seems to run slightly faster when compiled with clang++ instead of +g++ // // An example compile on Ubuntu Linux with some 3rd party (header) lib +raries unpacked to $HOME: // clang++ -o llil4vec2 -std=c++20 -fopenmp -Wall -O3 // -I "$HOME/local-fast_io/fast_io/include" // -I "$HOME/local-parallel-hashmap/parallel-hashmap" // -I "$HOME/local-boost/boost_1_81_0" // llil4vec2.cpp // // Obtain gen-llil.pl and gen-long-llil.pl from https://perlmonks.com/ +?node_id=11148681 // perl gen-llil.pl big1.txt 200 3 1 // perl gen-llil.pl big2.txt 200 3 1 // perl gen-llil.pl big3.txt 200 3 1 // perl gen-long-llil.pl long1.txt 600 // perl gen-long-llil.pl long2.txt 600 // perl gen-long-llil.pl long3.txt 600 // // To make random input, obtain shuffle.pl from https://perlmonks.com/ +?node_id=11149800 // perl shuffle.pl big1.txt >tmp && mv tmp big1.txt // perl shuffle.pl big2.txt >tmp && mv tmp big2.txt // perl shuffle.pl big3.txt >tmp && mv tmp big3.txt // // Example run: llil4vec2 big1.txt big2.txt big3.txt >out.txt // NUM_THREADS=3 llil4vec2-omp ... // ------------------------------------------------------------------- +--------- // Specify 0/1 to use boost's parallel sorting algorithm; faster than +__gnu_parallel::sort. // https://www.boost.org/doc/libs/1_81_0/libs/sort/doc/html/sort/paral +lel.html // This requires the boost header files: e.g. devpkg-boost bundle on C +lear Linux. // Note: Another option is downloading and unpacking Boost locally. // (no need to build it because the bits we use are header file only) #define USE_BOOST_PARALLEL_SORT 1 #include <chrono> #include <thread> // The fast_io header must come after chrono, else build error: // "no member named 'concatln' in namespace 'fast_io'" #include <fast_io.h> #include <cstdio> #include <cstddef> #include <cstdint> #include <cstdlib> #include <cstring> #include <ctime> #include <string> #include <array> #include <vector> #include <utility> #include <iterator> #include <execution> #include <algorithm> #if USE_BOOST_PARALLEL_SORT > 0 #include <boost/sort/sort.hpp> #endif #ifdef _OPENMP #include <omp.h> #endif #include <iostream> #include <iomanip> #include <fstream> #include <sstream> static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, ne +ed a 64-bit compile"); // ------------------------------------------------------------------- +--------- typedef long long llil_int_type; // Note: all words in big1.txt, big2.txt, big3.txt are <= 6 chars in l +ength // big.txt max word length is 6 // long.txt max word length is 208 // Based on rough benchmarking, the short fixed string hack below is o +nly // worth trying for MAX_STR_LEN_L up to about 22. // See also https://backlinko.com/google-keyword-study // Note: if input data contains words longer than MAX_STR_LEN_L // the program may malfunction or even crash // To use (limited length) fixed length strings uncomment the next lin +e #define MAX_STR_LEN_L 6 #ifdef MAX_STR_LEN_L // Uncomment next line to use C memcmp function to compare fixed lengt +h strings #define USE_MEMCMP_L 1 // using str_type = std::array<char, MAX_STR_LEN_L + 1>; using str_type = std::array<char, MAX_STR_LEN_L>; #else // using str_type = std::string; using str_type = std::basic_string<char>; #endif using int_str_type = std::pair<llil_int_type, str_type>; using vec_int_str_type = std::vector<int_str_type>; // fast_atoll64 // https://stackoverflow.com/questions/16826422/ // c-most-efficient-way-to-convert-string-to-int-faster-than-atoi // Commenting out done below because llil spec [id://11148465] states: // each line must match : ^[a-z]+\t\d+$ // i.e. you may assume count >=0 inline int64_t fast_atoll64( const char* str ) { int64_t val = 0; // int sign = 0; // if ( str == '-' ) { // sign = 1, ++str; // } uint8_t digit; while ((digit = uint8_t(str++ - '0')) <= 9) val = val * 10 + digi +t; // return sign ? -val : val; return val; } // Mimic the Perl get_properties subroutine -------------------------- +-- // Limit line length and use ANSI C functions to try to boost performa +nce #define MAX_LINE_LEN_L 255 static void get_properties( const char* fname, // in: the input file name vec_int_str_type& vec_ret) // out: a vector of properties { FILE* fh; std::array<char, MAX_LINE_LEN_L + 1> line; char* found; llil_int_type count; fh = ::fopen(fname, "r"); if ( fh == NULL ) { std::cerr << "Error opening '" << fname << "' : " << strerror(er +rno) << "\n"; return; } while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), fh +) != NULL ) { found = std::find( line.begin(), line.end(), '\t' ); count = fast_atoll64(found+1); #ifdef MAX_STR_LEN_L str_type fixword {}; // Note: {} initializes all elements of +fixword to '\0' std::copy( line.begin(), found, fixword.begin() ); vec_ret.emplace_back( count, fixword ); #else found = '\0'; vec_ret.emplace_back( count, line.data() ); #endif } ::fclose(fh); } // Reduce a vector range (tally adjacent count fields of duplicate key + names) // Return the reduced length static vec_int_str_type::size_type reduce_vec( vec_int_str_type::iterator it1, // range of vector elements to +reduce vec_int_str_type::iterator it2 ) { auto itr = it1; auto itw = it1; llil_int_type count = itr->first; str_type name_last = itr->second; for ( ++itr; itr != it2; ++itr ) { #ifdef USE_MEMCMP_L if ( ::memcmp(itr->second.data(), name_last.data(), MAX_STR_LEN_ +L) == 0 ) { #else if ( itr->second == name_last ) { #endif count += itr->first; } else { itw->first = count; itw->second = name_last; ++itw; count = itr->first; name_last = itr->second; } } itw->first = count; itw->second = name_last; return std::distance(it1, ++itw); } typedef std::chrono::high_resolution_clock high_resolution_clock; typedef std::chrono::high_resolution_clock::time_point time_point; typedef std::chrono::milliseconds milliseconds; double elaspe_time( time_point cend, time_point cstart) { return double ( std::chrono::duration_cast<milliseconds>(cend - cstart).count() ) 1e-3; } // ------------------------------------------------------------------- +-- int main(int argc, char* argv[]) { if (argc < 2) { std::cerr << "usage: llil4vec2 file1 file2 ... >out.txt\n"; return 1; } std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fix +ed); #ifdef MAX_STR_LEN_L #ifdef USE_MEMCMP_L std::cerr << "llil4vec2 (fixed string length=" << MAX_STR_LEN_L << +") use memcmp start\n"; #else std::cerr << "llil4vec2 (fixed string length=" << MAX_STR_LEN_L << +") start\n"; #endif #else std::cerr << "llil4vec2 start\n"; #endif #ifdef _OPENMP std::cerr << "use OpenMP\n"; #else std::cerr << "don't use OpenMP\n"; #endif #if USE_BOOST_PARALLEL_SORT == 0 std::cerr << "don't use boost sort\n"; #else std::cerr << "use boost sort\n"; #endif time_point cstart1, cend1, cstart2, cend2, cstart3, cend3r, cend3s, + cend3; cstart1 = std::chrono::high_resolution_clock::now(); #ifdef _OPENMP // Determine the number of threads. const char* env_nthrs = std::getenv("NUM_THREADS"); int nthrs = (env_nthrs && strlen(env_nthrs)) ? ::atoi(env_nthrs) : +std::thread::hardware_concurrency(); omp_set_dynamic(false); omp_set_num_threads(nthrs); #else int nthrs = 1; #endif int nthrs_sort = ( std::thread::hardware_concurrency() < 12 ) ? std::thread::hardware_concurrency() : 12; // Get the list of input files from the command line int nfiles = argc - 1; char fname = &argv[1]; // Create the vector of properties vec_int_str_type propvec; // Run parallel, depending on the number of threads if ( nthrs == 1 \|\| nfiles == 1 ) { for (int i = 0; i < nfiles; ++i) get_properties( fname[i], propvec ); } #ifdef _OPENMP else { #pragma omp parallel for schedule(static, 1) for (int i = 0; i < nfiles; ++i) { vec_int_str_type locvec; get_properties( fname[i], locvec ); #pragma omp critical { // Append local vector to propvec propvec.insert( propvec.end(), locvec.begin(), locvec.end( +) ); } } } #endif if (!propvec.size()) { std::cerr << "No work, exiting...\n"; return 1; } cend1 = std::chrono::high_resolution_clock::now(); double ctaken1 = elaspe_time(cend1, cstart1); std::cerr << "get properties time : " << std::setw(8) << ctake +n1 << " secs\n"; cstart2 = std::chrono::high_resolution_clock::now(); // Needs to be sorted by word for later sum of adjacent count field +s to work #if USE_BOOST_PARALLEL_SORT == 0 std::sort( propvec.begin(), propvec.end(), [](const int_str_type& left, const int_str_type& right) { return #ifdef USE_MEMCMP_L ::memcmp(left.second.data(), right.second.data(), MAX_STR_ +LEN_L) < 0; #else left.second < right.second; #endif } ); #else boost::sort::block_indirect_sort( propvec.begin(), propvec.end(), [](const int_str_type& left, const int_str_type& right) { return #ifdef USE_MEMCMP_L ::memcmp(left.second.data(), right.second.data(), MAX_STR_ +LEN_L) < 0; #else left.second < right.second; #endif }, nthrs_sort ); #endif cend2 = std::chrono::high_resolution_clock::now(); double ctaken2 = elaspe_time(cend2, cstart2); std::cerr << "sort properties time : " << std::setw(8) << ctake +n2 << " secs\n"; cstart3 = std::chrono::high_resolution_clock::now(); // Reduce in-place (tally adjacent count fields of duplicate key na +mes) vec_int_str_type::size_type newsize = reduce_vec( propvec.begin(), +propvec.end() ); propvec.resize(newsize); cend3r = std::chrono::high_resolution_clock::now(); // Sort the vector by (count) in reverse order, (name) in lexical o +rder #if USE_BOOST_PARALLEL_SORT == 0 std::sort( // Standard sort propvec.begin(), propvec.end(), [](const int_str_type& left, const int_str_type& right) { #ifdef USE_MEMCMP_L return left.first != right.first ? left.first > right.first : ::memcmp(left.second.data(), right.second.data(), MAX_ST +R_LEN_L) < 0; #else return left.first != right.first ? left.first > right.first : left.second < right.second; #endif } ); #else boost::sort::block_indirect_sort( // Parallel sort propvec.begin(), propvec.end(), [](const int_str_type& left, const int_str_type& right) { #ifdef USE_MEMCMP_L return left.first != right.first ? left.first > right.first : ::memcmp(left.second.data(), right.second.data(), MAX_ST +R_LEN_L) < 0; #else return left.first != right.first ? left.first > right.first : left.second < right.second; #endif }, nthrs_sort ); #endif cend3s = std::chrono::high_resolution_clock::now(); // Output the sorted vector for ( auto const& n : propvec ) { #ifdef MAX_STR_LEN_L // Note: finding doco on mnp::os_c_str() is a challenge, but tes +ting shows this // prints non null-terminated strings of length MAX_STR_LEN_L co +rrectly // ::print(fast_io::concatln(fast_io::mnp::os_c_str(n.second.dat +a(), MAX_STR_LEN_L), "\t", n.first)); // ::println(fast_io::mnp::os_c_str(n.second.data(), MAX_STR_LEN +_L), "\t", n.first); fast_io::io::println(fast_io::mnp::os_c_str(n.second.data(), MAX +_STR_LEN_L), "\t", n.first); #else // ::print(fast_io::concatln(n.second, "\t", n.first)); // ::println(n.second, "\t", n.first); fast_io::io::println(n.second, "\t", n.first); #endif } cend3 = std::chrono::high_resolution_clock::now(); double ctaken = elaspe_time(cend3, cstart1); double ctaken3r = elaspe_time(cend3r, cstart3); double ctaken3s = elaspe_time(cend3s, cend3r); double ctaken3o = elaspe_time(cend3, cend3s); std::cerr << "vector reduce time : " << std::setw(8) << ctake +n3r << " secs\n"; std::cerr << "vector stable sort time : " << std::setw(8) << ctake +n3s << " secs\n"; std::cerr << "write stdout time : " << std::setw(8) << ctake +n3o << " secs\n"; std::cerr << "total time : " << std::setw(8) << ctake +n << " secs\n"; // Hack to see Private Bytes in Windows Task Manager (uncomment nex +t line so process doesn't exit too quickly) // std::this_thread::sleep_for(std::chrono::milliseconds(90000000 +)); return 0; } [download] Added Later** 2024 update: marioroy noted that `std::mutex` is slower in gcc-14 than gcc-13; he found another way to use spin locks via a small C++ class on the web. Updated: changed `::atoll` to `fast_atoll64`. Thanks marioroy. 28 Jan 2023: Added `llil4vec2.cpp`. Updated Timings section. Note: `llil4vec2.cpp` was updated multiple times, often based on improvements to `llil4vec.cpp` at Re^7: Rosetta Code: Long List is Long (faster - llil4vec - OpenMP code) and `llil4vec-tbb.cpp` at Re^9: Rosetta Code: Long List is Long (faster - llil4vec - TBB code) (full mario links at Re^2: Rosetta Code: Long List is Long - JudySL summary).	[reply] [d/l] [select]
Re^11: Rosetta Code: Long List is Long (faster - llil4vec - TBB code - llil5vec) by marioroy (Prior) on Jan 23, 2023 at 23:22 UTC
I was then able to compile your program with both clang++ and g++. Though g++ spat out a number of warnings... The warnings began since using Boost. You can silent the warnings with the gcc -Wno-stringop-overflow option. Though, no issues using clang++.	[reply]


Welcome to the Monastery
	PerlMonks

Re^9: Rosetta Code: Long List is Long (faster - llil4vec - TBB code)