in reply to Re^4: Rosetta Code: Long List is Long (faster - vec - fast_io)
in thread Rosetta Code: Long List is Long
Thanks so much for finding this GitHub fast_io library and for posting such clear instructions on how to use it! This made it ridiculously easy for me to play around with it.
So far I haven't found any statistically significant speed-ups from employing this library (hardly surprising given the mega man-years of effort poured into g++ and clang++), but I will keep my eye on it.
Re^6: Rosetta Code: Long List is Long (faster - vec - OpenMP)
by marioroy (Prior) on Jan 11, 2023 at 16:51 UTC
"So far I haven't found any statistically significant speed-ups from employing this library ..."
After a closer look, it was the -std=c++20 language mode that made the vector code faster, by 0.2 to 0.4 seconds compared with -std=c++11.
Update 1: Support variable-length words.
I tried OpenMP. Unfortunately, strtok is not thread-safe, e.g. strtok(NULL, "\n") caused a segfault, so I factored out strtok. The OpenMP result improved by 0.1 seconds. The gain is small because the actual reading is already fast. It takes at least 2 threads to run faster than the non-OpenMP version, due to the cost of populating vec_rec from local copies.
Building:
Running - Real time results:
llil4vec.cpp modification, OpenMP-aware:
by marioroy (Prior) on Jan 12, 2023 at 09:55 UTC
A helpful resource is the OpenMP Little Book, which introduces OpenMP C/C++ concurrency programming. By default, OpenMP spawns one thread per logical core. I therefore set the desired number of threads via the OMP_NUM_THREADS environment variable (... or NUM_THREADS ...) to minimize the creation and reaping of extra threads.
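For readers new to OpenMP, here is a minimal sketch of the "populate vec_rec from local copies" idea described earlier in the thread. It is not the llil4vec code: build_records and rec_t are hypothetical names, the input is an in-memory list of lines rather than files, and it assumes every count is a valid integer. Each thread collects records in a private vector and appends them to the shared vector once, inside a critical section.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using rec_t = std::pair<std::string, long>;   // hypothetical record type

// Hypothetical sketch: turn "word\tcount" lines into records in parallel.
// Each thread fills a private vector, then appends it to the shared result.
std::vector<rec_t> build_records(const std::vector<std::string>& lines) {
    std::vector<rec_t> vec_rec;
    #pragma omp parallel
    {
        std::vector<rec_t> local;                     // private to this thread
        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(lines.size()); ++i) {
            const std::string& s = lines[i];
            const std::size_t tab = s.find('\t');
            if (tab == std::string::npos) continue;   // skip lines without a tab
            // Assumes the text after the tab is a valid integer.
            local.emplace_back(s.substr(0, tab), std::stol(s.substr(tab + 1)));
        }
        // Only one thread at a time may touch the shared vector.
        #pragma omp critical
        vec_rec.insert(vec_rec.end(), local.begin(), local.end());
    }
    return vec_rec;   // record order depends on thread scheduling
}
```

Built with g++ -fopenmp, the thread count can then be chosen at run time through OMP_NUM_THREADS. Without -fopenmp the pragmas are ignored and the same code runs serially, which is handy for comparing the two.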
llil4vec.cc:
by marioroy (Prior) on Jan 12, 2023 at 11:59 UTC
Which is more efficient in terms of CPU utilization, g++ or clang++? Let's find out with 30 input files, paying close attention to user times. g++ wins for the parallel sort: since the real times are similar, the lower user time is better.
g++
clang++
Next, I tried J Script with AVX2 enabled for comparison. 30 threads, specified in the script.
I find it interesting that clang++'s user time is nearly 3x that of g++.
by eyepopslikeamosquito (Archbishop) on Jan 17, 2023 at 03:23 UTC
Excellent work as always from marioroy. Much appreciated.
While struggling to learn OpenMP I stumbled upon Intel's oneAPI Threading Building Blocks (aka oneTBB). For some reason, I found this library easier to understand, so I decided to give it a try. After downloading the oneapi-tbb-2021.7.0-lin.tgz release package from oneTBB 2021.7.0 and unpacking it under my Ubuntu $HOME dir and running: to set the oneTBB variables, I was up and running. This is the command I used to compile C++ programs:
Update: Much later, I hit a problem with locales: Fixed crashing par1.cpp in pgatram dir by changing: in sample code from transform_reduce.
What attracted me to TBB was the ease of trying out updating a std::map from multiple threads with minimal changes, simply by changing from: to:
While that worked, it ran a little bit slower, presumably due to the locking overhead associated with updating the tbb::concurrent_map variable (hash_ret[word] -= count) from multiple threads. To avoid crashes, I further needed to break down the get_properties function so that each thread operated on a different input file (see get_properties_one_file function below). I was able to get a minor speedup (similar to what I saw with OpenMP) by using this library without any locking on a vector, as shown in the sample code below:
Timings of OpenMP vs OneTbb on my machine
The real time reported by the Linux time command when running the tbb version of 2.890s compares favourably with 3.041s of the OpenMP version. Apart from performing better on modern CPU caches, std::vector seems to also outperform std::map in multi-threaded programs, due to the locking overhead of updating a global map from multiple threads. Update: Timings for llil3vec-tbb-a.cpp below, built with clang++ and fast_io are slightly faster:
Updated simpler version: llil3vec-tbb-a.cpp
Update: built with:
Oh, and see llil4vec-tbb.cpp in Re^9: Rosetta Code: Long List is Long (faster - llil4vec - TBB code) by marioroy for a cleaner way to merge the local array locvec[i] into vec_ret, via a scoped lock and a mutex, thus eliminating the ugly locvec[MAX_INPUT_FILES_L] array. Updated timings for llil4vec-tbb on my machine can be found here.
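To illustrate the merge pattern just mentioned without depending on TBB, here is a rough sketch that swaps in std::thread and std::lock_guard from the standard library in place of TBB's task scheduler and scoped lock. It is not the llil4vec-tbb.cpp code: merge_per_file and rec_t are hypothetical names, and the per-file parsing is replaced by copying pre-built records. One worker handles one input, and each worker appends its local vector to vec_ret while holding a mutex, so no fixed-size array of local vectors is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using rec_t = std::pair<std::string, long>;   // hypothetical record type

// Hypothetical sketch: one worker per input "file"; each worker builds a
// local vector and appends it to the shared vec_ret under a mutex.
std::vector<rec_t> merge_per_file(const std::vector<std::vector<rec_t>>& per_file) {
    std::vector<rec_t> vec_ret;
    std::mutex mtx;
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < per_file.size(); ++i) {
        workers.emplace_back([&, i] {
            // Stand-in for reading and parsing input file i.
            std::vector<rec_t> local = per_file[i];
            // The guard releases the mutex when it goes out of scope.
            std::lock_guard<std::mutex> guard(mtx);
            vec_ret.insert(vec_ret.end(), local.begin(), local.end());
        });
    }
    for (std::thread& t : workers) t.join();
    return vec_ret;   // record order depends on which worker locked first
}
```

The lock is taken once per file rather than once per record, which is presumably why this kind of merge costs far less than updating a shared map on every word. On older toolchains this needs -pthread at link time.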
Updated: Added simpler llil3vec-tbb-a.cpp version.
by marioroy (Prior) on Jan 17, 2023 at 10:33 UTC
by marioroy (Prior) on Jan 19, 2023 at 15:42 UTC
by eyepopslikeamosquito (Archbishop) on Jan 22, 2023 at 08:05 UTC
by eyepopslikeamosquito (Archbishop) on Jan 11, 2023 at 20:59 UTC
Very interesting! This will help me get started with OpenMP, which I've been meaning to do for a while now.
"Unfortunately, strtok is not thread-safe e.g. strtok(NULL, "\n") causing segfault. So I factored out strtok."
Ha ha, that I used strtok at all is an indication of how desperate I was, given I singled this function out for a special dishonourable mention at On Interfaces and APIs. :)
To round out this section, notice that the ANSI C strtok function has a shocker of an interface: it surprises by writing NULLs into the input string being tokenized; the first call has different semantics to subsequent calls (distinguished by special-case logic on the value of its first parameter); and it stores state between calls so that only one sequence of calls can be active at a time (bad enough in a single-threaded environment, but so intolerable in multi-threaded and signal handler (reentrant) environments that the POSIX threads committee invented a replacement strtok_r function).
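As a small illustration of the "only one sequence of calls can be active at a time" point, here is a sketch of nested tokenization with strtok_r, something plain strtok cannot do. The function name split_lines_and_fields is hypothetical, and the code assumes a POSIX system where strtok_r is declared in the C string header (as on Linux with glibc). Each loop owns its context pointer, so the inner field loop does not disturb the outer line loop.

```cpp
#include <cassert>
#include <cstring>   // strtok_r is POSIX; assumed available here (Linux/glibc)
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split text into lines, and each line into
// tab-separated fields, using two independent strtok_r contexts.
// The text is taken by value because strtok_r writes into its input.
std::vector<std::vector<std::string>> split_lines_and_fields(std::string text) {
    std::vector<std::vector<std::string>> rows;
    char* line_ctx = nullptr;   // state for the outer (line) tokenizer
    for (char* line = strtok_r(&text[0], "\n", &line_ctx); line != nullptr;
         line = strtok_r(nullptr, "\n", &line_ctx)) {
        std::vector<std::string> fields;
        char* field_ctx = nullptr;   // separate state for the inner tokenizer
        for (char* f = strtok_r(line, "\t", &field_ctx); f != nullptr;
             f = strtok_r(nullptr, "\t", &field_ctx)) {
            fields.emplace_back(f);
        }
        rows.push_back(std::move(fields));
    }
    return rows;
}
```

Because the state lives in caller-owned variables rather than a hidden static, each thread can also run its own tokenization safely, provided the threads work on different buffers.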
by marioroy (Prior) on Jan 11, 2023 at 22:07 UTC
It was weird experiencing the segfault due to strtok(NULL, "\n"). Great! See "What is strtok_r() function in C language?". I compared strtok_r (with its context argument) against strchr. After testing, I updated the OpenMP demonstration to find the tab character within the string. It now passes with both limited-length and variable-length words, including under OpenMP. Update 2 enables parallel sort.
Also, I replaced strlen(...) with found - line.
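For anyone following along, the strchr approach can be sketched roughly as below. This is not the actual llil4vec change: parse_line is a hypothetical helper, and it assumes each line has the form word, tab, count. The pointer returned by strchr gives the word length directly as found - line, so no strlen call is needed, and nothing is written into the input buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical helper: split one "word\tcount" line without strtok.
// Returns false if the line contains no tab character.
bool parse_line(const char* line, std::pair<std::string, long>& out) {
    const char* found = std::strchr(line, '\t');   // first tab, or nullptr
    if (found == nullptr) return false;
    // The word length is a pointer difference; strlen is not needed.
    out.first.assign(line, static_cast<std::size_t>(found - line));
    // strtol stops at the first non-digit, e.g. a trailing newline.
    out.second = std::strtol(found + 1, nullptr, 10);
    return true;
}
```

Since strchr keeps no hidden state and leaves the buffer untouched, this version can be called from several threads at once on different lines.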