Re^6: Rosetta Code: Long List is Long (faster - vec

So far I haven't found any statistically significant speed-ups from employing this library ...

After closer look, it was the -std=c++20 language mode that enables faster vectors by 0.2 ~ 0.4 seconds versus -std=c++11.

Update 1: Support variable length words.
Update 2: Enable parallel sort. See Using Parallel Mode.

I tried OpenMP. Unfortunately, strtok is not thread-safe e.g. strtok(NULL, "\n") causing segfault. So I factored out strtok. The OpenMP result improved by 0.1 seconds. That's because the actual reading is already fast. It takes 2 threads minimally to run faster than non-OpenMP results due to populating vec_rec from local copies.

Building:

clang++ -o llil4vec -std=c++11 -Wall -O3 llil4vec.cpp
clang++ -o llil4vec -std=c++20 -Wall -O3 llil4vec.cpp  # faster

# enable parallel via -fopenmp
clang++ -o llil4vec-omp -std=c++11 -fopenmp -Wall -O3 llil4vec.cpp
clang++ -o llil4vec-omp -std=c++20 -fopenmp -Wall -O3 llil4vec.cpp  # 
+faster
[download]

Running - Real time results:

$ time ./llil4vec big1.txt big2.txt big3.txt >out.txt

std:c++11:  2.901 secs
std:c++20:  2.850 secs

$ time OMP_NUM_THREADS=3 ./llil4vec-omp big1.txt big2.txt big3.txt >ou
+t.txt

std:c++11:  2.308 secs
std:c++20:  2.267 secs
[download]

llil4vec.cpp modification, OpenMP-aware:

#if defined(_OPENMP)
#include <omp.h>
#include <parallel/algorithm>
#else
#include <algorithm>
#endif

...

static void get_properties(
   int                nfiles,      //  in: the number of input files
   char*              fname[],     //  in: the input file names
   vec_int_str_type&  vec_ret)     // out: a vector of properties
{
#if defined(_OPENMP)
   omp_set_dynamic(0);
   omp_set_max_active_levels(1);

   #pragma omp parallel
   {
      vec_int_str_type vec_loc;    // thread local copy

      #pragma omp for nowait schedule(static,1)
#endif
      for (int i = 0; i < nfiles; ++i) {
         char line[MAX_LINE_LEN_L+1];
         char* found;
         llil_int_type count;
         FILE* fh = ::fopen(fname[i], "r");
         if (fh == NULL) {
            std::cerr << "Error opening '" << fname[i] << "' : errno="
+ << errno << "\n";
            continue;
         }
         while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) {
            found = ::strchr(line, '\t');
            count = ::atoll( &line[found - line + 1] );
            line[found - line] = '\0';  // word
#ifdef MAX_STR_LEN_L
            str_type fixword { { '\0', '\0', '\0', '\0', '\0', '\0' } 
+};
            ::memcpy( fixword.data(), line, found - line );
           #if defined(_OPENMP)
            vec_loc.emplace_back( -count, fixword );
           #else
            vec_ret.emplace_back( -count, fixword );
           #endif
#else
           #if defined(_OPENMP)
            vec_loc.emplace_back( -count, line );
           #else
            vec_ret.emplace_back( -count, line );
           #endif
#endif
         }
         ::fclose(fh);
      }

#if defined(_OPENMP)
      #pragma omp critical
      vec_ret.insert(vec_ret.end(), vec_loc.begin(), vec_loc.end());
   }
#endif

   // Needs to be sorted by word for later sum of adjacent count field
+s to work
#if defined(_OPENMP)
   __gnu_parallel::sort( vec_ret.begin(), vec_ret.end(),
      [](const int_str_type& left, const int_str_type& right) { return
+ left.second < right.second; }
   );
#else
   std::sort( vec_ret.begin(), vec_ret.end(),
      [](const int_str_type& left, const int_str_type& right) { return
+ left.second < right.second; }
   );
#endif
}
[download]

Comment on Re^6: Rosetta Code: Long List is Long (faster - vec - OpenMP) Select or Download Code

Replies are listed 'Best First'.
Re^7: Rosetta Code: Long List is Long (faster - llil4vec - OpenMP code) by marioroy (Prior) on Jan 12, 2023 at 09:55 UTC
Helpful, OpenMP Little Book which introduces OpenMP C/C++ concurrency programming. OpenMP spawns a thread per logical core by default. Therefore, I set the desired number of threads via the OMP_NUM_THREADS environment variable (... or NUM_THREADS ...) to minimize extra threads creation and reaping. `$ NUM_THREADS=3 ./llil4vec big1.txt big2.txt big3.txt >out.txt llil4vec (fixed string length=12) start use OpenMP use boost sort get properties 0.131 secs sort properties 0.191 secs vector reduce 0.036 secs vector stable sort 0.267 secs write stdout 0.191 secs total time 0.818 secs` [download] llil4vec.cc: Read more... (17 kB)	[reply] [d/l] [select]
Re^8: Rosetta Code: Long List is Long (faster - llil4vec - OpenMP code) by marioroy (Prior) on Jan 12, 2023 at 11:59 UTC
Which is more efficient w.r.t. CPU utilization between g++ and clang++? Let's find out with 30 input files and pay close attention to user times. g++ wins for parallel sort as lesser user time is better considering real times are similar. cp big1.txt big4.txt cp big2.txt big5.txt cp big3.txt big6.txt cp big1.txt big7.txt cp big2.txt big8.txt cp big3.txt big9.txt cp big1.txt biga.txt cp big2.txt bigb.txt cp big3.txt bigc.txt cp big1.txt bigd.txt cp big2.txt bige.txt cp big3.txt bigf.txt cp big1.txt bigg.txt cp big2.txt bigh.txt cp big3.txt bigi.txt cp big1.txt bigj.txt cp big2.txt bigk.txt cp big3.txt bigl.txt cp big1.txt bigm.txt cp big2.txt bign.txt cp big3.txt bigo.txt cp big1.txt bigp.txt cp big2.txt bigq.txt cp big3.txt bigr.txt cp big1.txt bigs.txt cp big2.txt bigt.txt cp big3.txt bigu.txt [download] g++ `$ g++ -o llil4vec-omp -std=c++20 -fopenmp -Wall -O3 llil4vec.cpp -I./f +ast_io/include $ time OMP_NUM_THREADS=30 ./llil4vec-omp big?.txt >out.txt llil4vec (fixed string length=6) start use OpenMP get properties time : 0.937948 secs sort properties time : 0.494592 secs emplace set sort time : 1.02668 secs write stdout time : 0.50569 secs total time : 2.96497 secs real 0m3.271s <-- total wall time, performs similarly user 0m16.823s <-- nothing unusual, fans seem quiet sys 0m4.188s` [download] clang++ `$ clang++ -o llil4vec-omp -std=c++20 -fopenmp -Wall -O3 llil4vec.cpp - +I./fast_io/include $ time OMP_NUM_THREADS=30 ./llil4vec-omp big?.txt >out.txt llil4vec (fixed string length=6) start use OpenMP get properties time : 1.07873 secs sort properties time : 0.717261 secs emplace set sort time : 0.606499 secs write stdout time : 0.437287 secs total time : 2.83984 secs real 0m3.201s <-- total wall time, performs similarly user 0m45.889s <-- system fans noticeably louder sys 0m5.934s` [download] Next, I tried J Script with AVX2 enabled for comparison. 30 threads, specified in the script. `$ time ~/j904/bin/jconsole llil4_p.ijs big?.txt out.txt Read and parse input: 2.43996 Classify, sum, sort: 4.63914 Format and write output: 0.848965 Total time: 7.92807 real 0m7.934s <-- total wall time user 0m24.782s <-- this too, system fans roaring sys 0m26.764s <-- because user and sys combined` [download] I find it interesting for clang++'s user time to be nearly 3x that of g++.	[reply] [d/l] [select]
Re^8: Rosetta Code: Long List is Long (faster - vec - OpenMP code - tbb) by eyepopslikeamosquito (Archbishop) on Jan 17, 2023 at 03:23 UTC
Excellent work as always from marioroy. Much appreciated. While struggling to learn OpenMP I stumbled upon Intel's OneAPI Threading Building Blocks (aka `oneTBB`). For some reason, I found this library easier to understand, so decided to give it a try. After downloading the `oneapi-tbb-2021.7.0-lin.tgz` release package from oneTBB 2021.7.0 and unpacking it under my Ubuntu `$HOME` dir and running: `. $HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/env/vars.sh` [download] to set the oneTBB variables, I was up and running. This is the command I used to compile C++ programs: `g++ -o llil3vec-tbb -std=c++20 -Wall -O3 -I "$HOME/local-oneapi-tbb/on +eapi-tbb-2021.7.0/include" -L "$HOME/local-oneapi-tbb/oneapi-tbb-2021 +.7.0/lib" llil3vec-tbb.cpp -l tbb` [download] Update: Much later, I hit a problem with locales: `$ locale -a C C.utf8 POSIX` [download] Fixed crashing par1.cpp in pgatram dir by changing: `// std::cout.imbue(std::locale{"en_US.UTF8"}); std::cout.imbue(std::locale{"C.utf8"});` [download] in sample code from transform_reduce. What attracted to me to `TBB` was the ease of trying out updating a `std::map` from multiple threads with minimal changes simply by changing from: `using map_str_int_type = std::map<str_type, llil_int_type>;` [download] to: `using map_str_int_type = tbb::concurrent_map<str_type, llil_int_type>;` [download] While that worked, it ran a little bit slower, presumably due to the locking overhead associated with updating the `tbb::concurrent_map` variable (`hash_ret[word] -= count`) from multiple threads. To avoid crashes, I further needed to break down the `get_properties` function so that each thread operated on a different input file (see `get_properties_one_file` function below). I was able to get a minor speedup (similar to what I saw with OpenMP) by using this library without any locking on a vector, as shown in the sample code below: Read more... (12 kB) Timings of OpenMP vs OneTbb on my machine `OpenMp version: $ time ./llil4vec_p big1.txt big2.txt big3.txt >vec.tmp llil2vec (fixed string length=6) start get_properties time : 0.853868 secs emplace set sort time : 0.972116 secs write stdout time : 0.875946 secs total time : 2.70229 secs real 0m3.041s user 0m5.553s sys 0m0.745s` [download] `OneTbb version: $ time ./llil3vec-tbb big1.txt big2.txt big3.txt >f.tmp llil3vec-tbb (fixed string length=6) start use TBB get_properties CPU time : 3.30477 secs emplace set sort CPU time : 0.825981 secs write stdout CPU time : 0.866964 secs total CPU time : 4.99808 secs total wall clock time : 3 secs real 0m2.890s user 0m4.687s sys 0m0.646s` [download] The real time reported by the Linux `time` command when running the `tbb` version of `2.890s` compares favourably with `3.041s` of the OpenMP version. Apart from performing better on modern CPU caches, `std::vector` seems to also outperform `std::map` in multi-threaded programs, due to the locking overhead of updating a global map from multiple threads. Update: Timings for `llil3vec-tbb-a.cpp` below, built with `clang++` and `fast_io` are slightly faster: `$ time ./llil3vec-tbb-a big1.txt big2.txt big3.txt >f.tmp llil3vec-tbb-a (fixed string length=6) start use TBB get_properties CPU time : 2.9722 secs emplace set sort CPU time : 0.61156 secs write stdout CPU time : 0.828023 secs total CPU time : 4.41257 secs total wall clock time : 3 secs real 0m2.552s user 0m4.304s sys 0m0.438s` [download] Updated Simpler Version llil3vec-tbb-a.cpp Update: built with: `clang++ -o llil3vec-tbb-a -std=c++20 -Wall -O3 -I "$HOME/llil/cmdlinux +/fast_io/include" -I "$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/incl +ude" -L "$HOME/local-oneapi-tbb/oneapi-tbb-2021.7.0/lib" llil3vec-tbb +-a.cpp -l tbb` [download] Oh, and see `llil4vec-tbb.cpp` in Re^9: Rosetta Code: Long List is Long (faster - llil4vec - TBB code) by marioroy for a cleaner way to merge the local array `locvec[i]` into `vec_ret`, via a scoped lock and a mutex, thus eliminating the ugly `locvec[MAX_INPUT_FILES_L]` array. Updated timings for `llil4vec-tbb` on my machine can be found here. Read more... (12 kB) Updated: Added simpler `llil3vec-tbb-a.cpp` version.	[reply] [d/l] [select]
Re^9: Rosetta Code: Long List is Long (faster - vec - OpenMP code - tbb) by marioroy (Prior) on Jan 17, 2023 at 10:33 UTC
Thank you, eyepopslikeamosquito. How nice to have a TBB implementation! OS vendors may have the TBB development package. I'm running Clear Linux and the TBB runtime libraries are installed with the OS. All I needed was the TBB dev bundle.	[reply]
Re^9: Rosetta Code: Long List is Long (faster - llil4vec - TBB code) by marioroy (Prior) on Jan 19, 2023 at 15:42 UTC
Aloha eyepopslikeamosquito, I made an update to your simpler llil3vec-tbb-a.cpp version. I captured results and posted them here. The llil4vec OpenMP impl can be found here, for comparison. Before posting, I ran with 121 MB input files (many of them). I saw a 5 ~ 6 second gap in favor of OpenMP; e.g. 1 core running for a while between parallel_for and parallel_sort. This is resolved by having threads merge immediately after processing a file, similarly to the OpenMP implementation. Now, TBB performs well for small, big, or many input files. llil4vec-tbb.cc Read more... (19 kB)	[reply] [d/l]
Re^10: Rosetta Code: Long List is Long (faster - llil4vec - llil4judy - TBB code - llil5vec - boost) by eyepopslikeamosquito (Archbishop) on Jan 22, 2023 at 08:05 UTC
Re^11: Rosetta Code: Long List is Long (faster - llil4vec - TBB code - llil5vec) by marioroy (Prior) on Jan 23, 2023 at 23:22 UTC
Re^7: Rosetta Code: Long List is Long (faster - vec - OpenMP) by eyepopslikeamosquito (Archbishop) on Jan 11, 2023 at 20:59 UTC
Very interesting! This will help me get started with OpenMP, which I've been meaning to do for a while now. Unfortunately, `strtok` is not thread-safe e.g. `strtok(NULL, "\n")` causing segfault. So I factored out `strtok`. Ha ha, that I used `strtok` at all is an indication of how desperate I was, given I singled this function out for a special dishonourable mention at On Interfaces and APIs. :) To round out this section, notice that the ANSI C strtok function has a shocker of an interface: it surprises by writing NULLs into the input string being tokenized; the first call has different semantics to subsequent calls (distinguished by special-case logic on the value of its first parameter); and it stores state between calls so that only one sequence of calls can be active at a time (bad enough in a single-threaded environment, but so intolerable in *multi-threaded and signal handler* (reentrant) environments that the POSIX threads committee invented a replacement `strtok_r` function).	[reply] [d/l] [select]
Re^8: Rosetta Code: Long List is Long (faster - vec - OpenMP) by marioroy (Prior) on Jan 11, 2023 at 22:07 UTC
It was weird experiencing the segfault due to strtok(NULL, "\n"). Great! What is strtok_r() function in C language? I compared strtok_r with context against strchr. After testing, I updated the OpenMP demonstration to find the tab character within the string. Now passing: limited length and variable length words, including OpenMP. Update 2 enables parallel sort. `# 15 input files, 1 thread, clang++ -std=c++20 strtok_r 9.655 seconds strchr 8.713 seconds` [download] `while ( ::fgets(line, MAX_LINE_LEN_L, fh) != NULL ) { found = ::strchr(line, '\t'); count = ::atoll( &line[found - line + 1] ); line[found - line] = '\0'; // word ... }` [download] Also, I replaced strlen(...) with found - line. `::memcpy( fixword.data(), line, found - line );` [download]	[reply] [d/l] [select]