Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Re: Rosetta Code: Long List is Long - GNU Parallel

by marioroy (Prior)
on Feb 09, 2023 at 09:05 UTC ( [id://11150254]=note: print w/replies, xml ) Need Help??


in reply to Rosetta Code: Long List is Long

The SQLite demonstrations were interesting. This is about GNU Parallel, which includes parsort. Something like this is possible:

Linux box:

Note: See this thread for generating the big input files.

LC_ALL=C sort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C s +ort -k2nr >out.txt real time: 12.671 seconds LC_ALL=C parsort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL= +C parsort -k2nr >out.txt real time: 6.561 seconds LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort + -k2nr >out.txt real time: 3.615 seconds

Bigger box:

LC_ALL=C parsort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL= +C parsort -k2nr >out.txt real time: 6.520 seconds LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort + -k2nr >out.txt real time: 2.398 seconds

The big files are two column key-value pairs delimited by a tab. The output from the first sort command contains duplicate key names. Next, the tally-count command sums adjacent count fields of duplicate names. Its output contains unique names. Finally, parsort is called again to sort by sum in descending order and keyname in ascending order.

q - Text as Data:

I came across q - Run SQL directly on delimited files and multi-file sqlite databases on GitHub. Specify -tT for tab-delimited input and output. The dash indicates to read from standard input. The q tool is interesting.

cat big{1,2,3}.txt | ./linux-q -tT "select c1, sum(c2) from - group by + c1 order by sum(c2) desc, c1" >out.txt real time: 84.027 seconds ./linux-q -tT "select c1, sum(c2) from big[1-3].txt group by c1 order +by sum(c2) desc, c1" >out.txt real time: 89.539 seconds

tally-count.awk:

The Awk script becomes a bottleneck for parallel sort. Notice above, no improvement on a bigger machine.

# Tally adjacent count fields of duplicate key names. BEGIN { OFS = FS = "\t" key = "" sum = 0 flg = 1 } { if ( flg ) { flg = 0 key = $1 } if ( $1 == key ) { sum += $2 } else { print key, sum key = $1 sum = $2 } } END { print key, sum }

tally-count.cpp:

The C++ solution is noticeably faster. The great thing about the parsort/tally-count combination is being able to process hundreds of list files and not worry about exhausting memory capacity.

// ------------------------------------------------------------------- +--------- // tally-count.cpp // Tally adjacent count fields of duplicate key names. // // Obtain the fast_io library (required dependency): // git clone --depth=1 https://github.com/cppfastio/fast_io // // Obtain GNU Parallel (check first, your OS vendor may have installer + package): // https://www.gnu.org/software/parallel/ // https://www.gnu.org/software/parallel/sphinx.html // // g++ compile on Linux: // g++ -o tally-count -std=c++20 -Wall -O3 tally-count.cpp -I ./fas +t_io/include // // This g++ command also works with mingw C++ compiler (https://source +forge.net/projects/mingw-w64) // that comes bundled with Strawberry Perl (C:\Strawberry\c\bin\g++.ex +e). // // clang++ compile: // Seems to run slightly faster when compiled with clang++ instead of +g++ // // Example run: // LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C p +arsort -k2nr > out.txt // ------------------------------------------------------------------- +--------- #include <chrono> // The fast_io header must come after chrono, else build error: // "no member named 'concatln' in namespace 'fast_io'" #include <fast_io.h> #include <cstdio> #include <cstddef> #include <cstdint> #include <cstdlib> #include <cstring> #include <array> #include <execution> // fast_atoll64 // https://stackoverflow.com/questions/16826422/ // c-most-efficient-way-to-convert-string-to-int-faster-than-atoi inline int64_t fast_atoll64( const char* str ) { int64_t val = 0; int sign = 0; if ( *str == '-' ) { sign = 1, ++str; } uint8_t digit; while ((digit = uint8_t(*str++ - '0')) <= 9) val = val * 10 + digit +; return sign ? -val : val; } #define MAX_LINE_LEN_L 255 int main(int argc, char* argv[]) { std::array<char, MAX_LINE_LEN_L + 1> line, name; char* found; long long count, sum; int flag = 0; // obtain the first key-value pair delimited by tab while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), ::s +tdin ) != NULL ) { if ( ( found = std::find( line.begin(), line.end(), '\t' ) ) == +line.end() ) continue; sum = fast_atoll64( found + 1 ); *found = '\0'; // key name ::memcpy( name.data(), line.data(), found - line.data() + 1 ); flag = 1; break; } // process the rest of standard input while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), ::s +tdin ) != NULL ) { if ( ( found = std::find( line.begin(), line.end(), '\t' ) ) == +line.end() ) continue; count = fast_atoll64( found + 1 ); *found = '\0'; // key name if ( ! ::strcmp( line.data(), name.data() ) ) { sum += count; } else { ::print( fast_io::concatln( fast_io::mnp::os_c_str(name.data( +)), "\t", sum ) ); ::memcpy( name.data(), line.data(), found - line.data() + 1 ) +; sum = count; } } if ( flag ) ::print( fast_io::concatln( fast_io::mnp::os_c_str(name.data()), + "\t", sum ) ); return 0; }

See also Grouped aggregate utility (like SQL GROUP BY)? on Stack Exchange.

GNU Parallel Citation

Tange, O. (2023, January 22). GNU Parallel 20230122 ('Bolsonaristas').
Zenodo. https://doi.org/10.5281/zenodo.7558957

Replies are listed 'Best First'.
Re^2: Rosetta Code: Long List is Long - GNU Parallel
by marioroy (Prior) on Feb 09, 2023 at 09:59 UTC

    At the time of testing, I was hoping to specify the number of threads using parsort but was unable to in a consistent fashion. So, I created a wrapper script named "parallel" that is placed in a path (/usr/local/bin) before (/usr/bin) where "parallel" itself resides.

    #!/usr/bin/env bash # Wrapper script for parallel. # Whoa!!! GNU Parallel assumes you want to consume all CPU cores. # Unfortunately, one cannot specify the number of threads for parsort. CMD="/usr/bin/parallel" if [[ -z "$PARALLEL_NUM_THREADS" ]]; then exec "$CMD" "$@" elif [[ "$#" -eq 1 && "$1" == "--number-of-threads" ]]; then echo $PARALLEL_NUM_THREADS; exit 0 elif [[ "$1" == "-j" ]]; then shift; shift; exec "$CMD" -j $PARALLEL_NUM_THREADS "$@" else exec "$CMD" -j $PARALLEL_NUM_THREADS "$@" fi

    The environment variable prevents parsort (/usr/bin/parallel, behind the scene) from going semi-wild on a big machine with many CPU cores.

    export PARALLEL_NUM_THREADS=12 LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort + -k2nr >out.txt

    The GNU Parallel "parsort" / "tally-count" combination may be useful for Chuma in the originating Long list is long thread. Chuma wrote, "the files are pretty large (up to a couple hundred MB), there are quite a few of them (2064)."

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11150254]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (5)
As of 2024-04-23 16:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found