Re^5: Rosetta Code: Long List is Long (faster - llil4map

Nice work Mario! As you might expect, I couldn't restrain myself from playing around with a few minor changes:
I gave boost::hash_range a try because there's less syntactic claptrap around the specialization of std::hash -- I've #if 0'ed it out though because it seems to be a touch slower than good old std::hash.
Generally, I pull a face whenever I see a function updating a non-const global variable, so I changed the get_properties interface to pass map_upd as a function parameter.
My usual simplification of parsing in get_properties by assuming the input file/s match the llil spec ^[a-z]+\t\d+$ (and without any long names when using MAX_STR_LEN :).
// llil4map.cpp
// https://perlmonks.com/?node_id=11149643
// OpenMP Little Book - https://nanxiao.gitbooks.io/openmp-little-book
//
// C++ version inspired by Mario's famous dualvar two-sort Perl soluti
+on.
// based on llil3m.cpp https://perlmonks.com/?node_id=11149482
//    1. Run get_properties in parallel.
//    2. Capture time and diff via chrono.
//    3. Threads: Flush local vector periodically; speed-up for std::m
+ap.
//    4. Key words null-terminated for MAX_STR_LEN_L.
//    5. Concat strings using fast_io::concatln during output.
//    6. Support NUM_THREADS environment variable.
//    7. Added define for using boost's parallel sort.
//    8. Removed parallel get_properties, not helpful due to overhead.
//    9. Removed MAX_STR_LEN_L for simply std::map.
//    A. Replaced atoll with fast_atoll64.
//    B. Fast vector sorting - from llil5vec-tbb.
//    C. Re-enabled parallel get_properties.
//    D. Exit early if no work; fast_io tweak writing to output;
//       limit to 8 threads max for sorting.
//    E. Switch to parallel hashmap; using SSE2 instructions.
//    F. Support std::map, std::unordered_map, or parallel-hashmap via
+ define.
//    G. Refactored OpenMP code to merge immediately.
//    H. Improved get_properties; set precision for timings.
//    I. Support fixed-length key words using basic_static_string.
//    J. Remove support for std::map and std::unordered_map.
//    K. Opt-in to use (limited length) fixed length strings.
//       Uncomment line #define MAX_STR_LEN_L and update value.
//    L. Improved the fixed length path with custom default_hash/eq fu
+nctions.
//    M. Define str_type struct as std::array, overriding == and < ope
+rators.
//
// Obtain the fast_io library (required dependency):
//    git clone --depth=1 https://github.com/cppfastio/fast_io
//
// Obtain the parallel hashmap library (required dependency):
//    git clone --depth=1 https://github.com/greg7mdp/parallel-hashmap
//
// g++ compile on Linux: (boost header may need the -Wno-stringop-over
+flow gcc option)
//    g++ -o llil4map -std=c++20 -Wall -O3 llil4map.cpp -I ./fast_io/i
+nclude -I ./parallel-hashmap
//    g++ -o llil4map-omp -std=c++20 -fopenmp -Wall -O3 llil4map.cpp -
+I ./fast_io/include -I ./parallel-hashmap
//
// This g++ command also works with mingw C++ compiler (https://source
+forge.net/projects/mingw-w64)
// that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.ex
+e).
//
// clang++ compile: same args, without the -Wno-stringop-overflow opti
+on
// Seems to run slightly faster when compiled with clang++ instead of 
+g++
//
// Obtain gen-llil.pl and gen-long-llil.pl from https://perlmonks.com/
+?node_id=11148681
//    perl gen-llil.pl big1.txt 200 3 1
//    perl gen-llil.pl big2.txt 200 3 1
//    perl gen-llil.pl big3.txt 200 3 1
//    perl gen-long-llil.pl long1.txt 600
//    perl gen-long-llil.pl long2.txt 600
//    perl gen-long-llil.pl long3.txt 600
//
// To make random input, obtain shuffle.pl from https://perlmonks.com/
+?node_id=11149800
//    perl shuffle.pl big1.txt >tmp && mv tmp big1.txt
//    perl shuffle.pl big2.txt >tmp && mv tmp big2.txt
//    perl shuffle.pl big3.txt >tmp && mv tmp big3.txt
//
// Example run:  llil4map big1.txt big2.txt big3.txt >out.txt
// NUM_THREADS=3 llil4map-omp ...
//         Note: Lesser NUM_THREADS is preferred, else merge time incr
+eases
//               Start with MIN( CPU cores, 1 + log2( number of input 
+files ) )
// -------------------------------------------------------------------
+---------

// Specify 0/1 to use boost's parallel sorting algorithm; faster than 
+__gnu_parallel::sort.
// https://www.boost.org/doc/libs/1_81_0/libs/sort/doc/html/sort/paral
+lel.html
// This requires the boost header files: e.g. devpkg-boost bundle on C
+lear Linux.
// Note: Another option is downloading and unpacking Boost locally.
// (no need to build it because the bits we use are header file only)
#define USE_BOOST_PARALLEL_SORT  1

#include <chrono>
#include <thread>

// The fast_io header must come after chrono, else build error:
// "no member named 'concatln' in namespace 'fast_io'"
#include <fast_io.h>
#include <parallel_hashmap/phmap.h>

#include <cstdio>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <ctime>

#include <string>
#include <array>
#include <vector>

#include <utility>
#include <iterator>
#include <execution>
#include <algorithm>

#include <boost/container_hash/hash.hpp>

#if USE_BOOST_PARALLEL_SORT > 0
#include <boost/sort/sort.hpp>
#endif

#ifdef _OPENMP
#include <omp.h>
#endif

#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>

static_assert(sizeof(size_t) == sizeof(int64_t), "size_t too small, ne
+ed a 64-bit compile");

// -------------------------------------------------------------------
+---------

typedef long long llil_int_type;

// All words in big1.txt, big2.txt, big3.txt are <= 6 chars in length.
// big.txt  max word length is 6
// long.txt max word length is 208
//
// Based on rough benchmarking, the short fixed string hack below is o
+nly
// worth trying for MAX_STR_LEN_L up to about 22.
// See also https://backlinko.com/google-keyword-study
//
// To use (limited length) fixed length strings uncomment the next lin
+e.
#define MAX_STR_LEN_L 6

#ifdef MAX_STR_LEN_L
struct str_type : std::array<char, MAX_STR_LEN_L> {
   bool operator==( const str_type& o ) const {
      return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) == 0;
   }
   bool operator<( const str_type& o ) const {
      return ::memcmp(this->data(), o.data(), MAX_STR_LEN_L) < 0;
   }
};
#else
struct str_type : std::basic_string<char> {
   bool operator==( const str_type& o ) const {
      return ::strcmp(this->data(), o.data()) == 0;
   }
   bool operator<( const str_type& o ) const {
      return ::strcmp(this->data(), o.data()) < 0;
   }
};
#endif

using str_int_type     = std::pair<str_type, llil_int_type>;
using vec_str_int_type = std::vector<str_int_type>;

// inject specialization of std::hash for str_type into namespace std
namespace std {
   template<> struct hash<str_type> {
      std::size_t operator()( str_type const& v ) const noexcept {
#if 0
         return boost::hash_range( v.cbegin(), v.cend() );
#else
         std::basic_string_view<char> bv {
            reinterpret_cast<const char*>(v.data()), v.size() * sizeof
+(char) };
         return std::hash<std::basic_string_view<char>>()(bv);
#endif
      }
   };
}

// create the parallel_flat_hash_map without internal mutexes
using map_str_int_type = phmap::parallel_flat_hash_map<
   str_type, llil_int_type,
   phmap::priv::hash_default_hash<str_type>,
   phmap::priv::hash_default_eq<str_type>,
   phmap::priv::Allocator<phmap::priv::Pair<const str_type, llil_int_t
+ype>>,
   8, phmap::NullMutex
>;

// fast_atoll64
// https://stackoverflow.com/questions/16826422/
// c-most-efficient-way-to-convert-string-to-int-faster-than-atoi

inline int64_t fast_atoll64( const char* str )
{
   int64_t val  = 0;
   // int     sign = 0;
   // if ( *str == '-' ) {
   //    sign = 1, ++str;
   // }
   uint8_t digit;
   while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10 + digit
+;
   // return sign ? -val : val;
   return val;
}

// Mimic the Perl get_properties subroutine --------------------------
+--

// Limit line length and use ANSI C functions to try to boost performa
+nce
#define MAX_LINE_LEN_L 255
#define MAX_THREADS    32

static void get_properties(
   const char*        fname,     // in   : the input file name
   map_str_int_type&  map_upd    // inout: the map to be updated
)
{
   FILE* fh;
   std::array<char, MAX_LINE_LEN_L + 1> line;
   char* found;
   llil_int_type count;

   fh = ::fopen(fname, "r");
   if ( fh == NULL ) {
      std::cerr << "Error opening '" << fname << "' : " << strerror(er
+rno) << "\n";
      return;
   }
   while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), fh 
+) != NULL ) {
      found = std::find( line.begin(), line.end(), '\t' );
      count = fast_atoll64( found + 1 );
      // Note: using emplace() is faster than map_upd[fixword] += coun
+t;
#ifdef MAX_STR_LEN_L
      str_type fixword {};  // {} initializes all elements of fixword 
+to '\0'
      std::copy( line.begin(), found, fixword.begin() );
      const auto [it, success] = map_upd.emplace(fixword, count);
#else
      *found = '\0';
      const auto [it, success] = map_upd.emplace(str_type{ line.data()
+ }, count);
#endif
      if ( !success ) it->second += count;
   }
   ::fclose(fh);
}

typedef std::chrono::high_resolution_clock high_resolution_clock;
typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::chrono::milliseconds milliseconds;

double elaspe_time(
   time_point cend,
   time_point cstart)
{
   return double (
      std::chrono::duration_cast<milliseconds>(cend - cstart).count()
   ) * 1e-3;
}

// -------------------------------------------------------------------
+--

static map_str_int_type Map[MAX_THREADS];

int main(int argc, char* argv[])
{
   if (argc < 2) {
      std::cerr << "usage: llil4map file1 file2 ... >out.txt\n";
      return 1;
   }

   std::cerr << std::setprecision(3) << std::setiosflags(std::ios::fix
+ed);
#ifdef MAX_STR_LEN_L
   std::cerr << "llil4map (fixed string length=" << MAX_STR_LEN_L << "
+) start\n";
#else
   std::cerr << "llil4map start\n";
#endif
#ifdef _OPENMP
   std::cerr << "use OpenMP\n";
#else
   std::cerr << "don't use OpenMP\n";
#endif
#if USE_BOOST_PARALLEL_SORT == 0
   std::cerr << "don't use boost sort\n";
#else
   std::cerr << "use boost sort\n";
#endif
   time_point cstart1, cend1, cstart2, cend2, cstart3, cend3s, cend3;
   cstart1 = high_resolution_clock::now();

#ifdef _OPENMP
   // Determine the number of threads.
   const char* env_nthrs = std::getenv("NUM_THREADS");
   int nthrs = ( env_nthrs && strlen(env_nthrs) )
      ? ::atoi(env_nthrs)
      : std::thread::hardware_concurrency();
   if ( nthrs > MAX_THREADS ) nthrs = MAX_THREADS;
   omp_set_dynamic(false);
   omp_set_num_threads(nthrs);
#else
   int nthrs = 1;
#endif

   int nthrs_sort = ( std::thread::hardware_concurrency() < 12 )
      ? std::thread::hardware_concurrency()
      : 12;

   // Get the list of input files from the command line
   int    nfiles = argc - 1;
   char** fname  = &argv[1];
   int  merge_id = 0;

   // Run parallel, depending on the number of threads
   if ( nthrs == 1 || nfiles == 1 ) {
      for ( int i = 0; i < nfiles; ++i ) {
         get_properties(fname[i], Map[0]);
      }
      cend1 = high_resolution_clock::now();
      double ctaken1 = elaspe_time(cend1, cstart1);
      std::cerr << "get properties      " << std::setw(8) << ctaken1 <
+< " secs\n";
      cstart2 = high_resolution_clock::now();
   }
#ifdef _OPENMP
   else {
      merge_id = -1;

      int nfiles_processed = 0;
      int gettime_rendered = 0;

      #pragma omp parallel
      {
         int tid = omp_get_thread_num();

         #pragma omp for nowait schedule(static, 1)
         for ( int i = 0; i < nfiles; ++i ) {
            get_properties(fname[i], Map[tid]);
            #pragma omp atomic
            nfiles_processed += 1;
         }

         #pragma omp critical
         {
            // report time for get properties
            if ( nfiles_processed == nfiles && !gettime_rendered ) {
               cend1 = high_resolution_clock::now();
               double ctaken1 = elaspe_time(cend1, cstart1);
               std::cerr << "get properties      " << std::setw(8) << 
+ctaken1 << " secs\n";
               cstart2 = high_resolution_clock::now();
               gettime_rendered = 1;
            }
            // merge array
            if ( Map[tid].size() > 0 ) {
               if ( merge_id < 0 ) {
                  // container id other threads merge to
                  merge_id = tid;
               }
               else {
                  for ( auto const& x : Map[tid] ) {
                     // Map[merge_id][x.first] += x.second;
                     const auto [it, success] = Map[merge_id].emplace(
+x.first, x.second);
                     if ( !success ) it->second += x.second;
                  }
                  Map[tid].clear();
               }
            }
         }
      }
   }
#endif

   if ( merge_id < 0 || Map[merge_id].size() == 0 ) {
      std::cerr << "No work, exiting...\n";
      return 1;
   }

   // Store the properties into a vector
   vec_str_int_type propvec( Map[merge_id].begin(), Map[merge_id].end(
+) );
   Map[merge_id].clear();

   cend2 = high_resolution_clock::now();
   double ctaken2 = elaspe_time(cend2, cstart2);
   std::cerr << "finish merging      " << std::setw(8) << ctaken2 << "
+ secs\n";

   cstart3 = high_resolution_clock::now();

   // Sort the vector by (count) in reverse order, (name) in lexical o
+rder
#if USE_BOOST_PARALLEL_SORT == 0
   std::sort(
      // Standard sort
      propvec.begin(), propvec.end(),
      [](const str_int_type& left, const str_int_type& right) {
         return left.second != right.second
            ? left.second > right.second
            : left.first  < right.first;
      }
   );
#else
   boost::sort::block_indirect_sort(
      // Parallel sort
      propvec.begin(), propvec.end(),
      [](const str_int_type& left, const str_int_type& right) {
         return left.second != right.second
            ? left.second > right.second
            : left.first  < right.first;
      },
      nthrs_sort
   );
#endif

   cend3s = high_resolution_clock::now();

   // Output the sorted vector
#ifdef MAX_STR_LEN_L
   for ( auto const& n : propvec )
      ::print(fast_io::concatln(fast_io::mnp::os_c_str(n.first.data(),
+ MAX_STR_LEN_L), "\t", n.second));
#else
   for ( auto const& n : propvec )
      ::print(fast_io::concatln(n.first, "\t", n.second));
#endif

   cend3 = high_resolution_clock::now();

   double ctaken   = elaspe_time(cend3, cstart1);
   double ctaken3s = elaspe_time(cend3s, cstart3);
   double ctaken3o = elaspe_time(cend3, cend3s);

   std::cerr << "vector stable sort  " << std::setw(8) << ctaken3s << 
+" secs\n";
   std::cerr << "write stdout        " << std::setw(8) << ctaken3o << 
+" secs\n";
   std::cerr << "total time          " << std::setw(8) << ctaken   << 
+" secs\n";

   // Hack to see Private Bytes in Windows Task Manager
   // (uncomment next line so process doesn't exit too quickly)
   // std::this_thread::sleep_for(milliseconds(9000));

   return 0;
}
[download]
Parallel HashMap References Added Later
Solving the Long List is Long challenge, finally? by marioroy (July 2023)
Re^12: Risque Romantic Rosetta Roman Race - All in One - OpenMP by marioroy (May 2023)
Re^2: Rosetta Code: Long List is Long - JudySL summary by marioroy (Jan 2023)
parallel-hashmap question from mario with replies from Gregory Popovitch (June 2023)
2024 update: marioroy noted that std::mutex is slower in gcc-14 than gcc-13; he found another way to use spin locks via a small C++ class on the web.
Comment on Re^5: Rosetta Code: Long List is Long (faster - llil4map - OpenMP code) Select or Download Code

Don't ask to ask, just ask
	PerlMonks
Re^5: Rosetta Code: Long List is Long (faster - llil4map - OpenMP code)