The SQLite demonstrations were interesting. This post is about GNU Parallel, which includes parsort. Something like the following is possible:

Linux box:

Note: See this thread for generating the big input files.

LC_ALL=C sort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C sort -k2nr >out.txt
   real time: 12.671 seconds

LC_ALL=C parsort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C parsort -k2nr >out.txt
   real time: 6.561 seconds

LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort -k2nr >out.txt
   real time: 3.615 seconds

Bigger box:

LC_ALL=C parsort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C parsort -k2nr >out.txt
   real time: 6.520 seconds

LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort -k2nr >out.txt
   real time: 2.398 seconds

The big files contain two-column key-value pairs delimited by a tab. The output from the first sort command contains duplicate key names. Next, the tally-count command sums adjacent count fields of duplicate names, so its output contains unique names. Finally, parsort is called again to sort by sum in descending order and key name in ascending order.
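For illustration, here is a miniature run of the same pipeline on a three-line sample file (the key names and counts below are invented), showing what each stage produces:

printf 'banana\t7\napple\t3\napple\t2\n' > small.txt

LC_ALL=C sort -k1 small.txt
#   apple   2
#   apple   3
#   banana  7

LC_ALL=C sort -k1 small.txt | awk -f tally-count.awk
#   apple   5
#   banana  7

LC_ALL=C sort -k1 small.txt | awk -f tally-count.awk | LC_ALL=C sort -k2nr
#   banana  7
#   apple   5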

q - Text as Data:

I came across q - Run SQL directly on delimited files and multi-file sqlite databases on GitHub. Specify -tT for tab-delimited input and output; the dash in the query tells q to read from standard input. The q tool is interesting.

cat big{1,2,3}.txt | ./linux-q -tT "select c1, sum(c2) from - group by c1 order by sum(c2) desc, c1" >out.txt
   real time: 84.027 seconds

./linux-q -tT "select c1, sum(c2) from big[1-3].txt group by c1 order by sum(c2) desc, c1" >out.txt
   real time: 89.539 seconds

tally-count.awk:

The Awk script becomes a bottleneck for parallel sort. Notice above that the awk-based pipeline shows no improvement on the bigger machine.

# Tally adjacent count fields of duplicate key names.

BEGIN {
    OFS = FS = "\t"
    key = ""
    sum = 0
    flg = 1
}
{
    if ( flg ) {
        flg = 0
        key = $1
    }
    if ( $1 == key ) {
        sum += $2
    }
    else {
        print key, sum
        key = $1
        sum = $2
    }
}
END {
    print key, sum
}

tally-count.cpp:

The C++ solution is noticeably faster. The great thing about the parsort/tally-count combination is being able to process hundreds of list files without worrying about exhausting memory.

// -----------------------------------------------------------------------------
// tally-count.cpp
// Tally adjacent count fields of duplicate key names.
//
// Obtain the fast_io library (required dependency):
//   git clone --depth=1 https://github.com/cppfastio/fast_io
//
// clang++ compile on Linux:
//   clang++ -o tally-count -std=c++20 -Wall -O3 tally-count.cpp
//
// Example run:
//   LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count |\
//   LC_ALL=C parsort -k2nr > out.txt
// -----------------------------------------------------------------------------

#include <chrono>

// The fast_io header must come after chrono, else build error:
// "no member named 'concatln' in namespace 'fast_io'"
#include "fast_io/include/fast_io.h"

#include <cstdio>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

#include <algorithm>   // std::find
#include <array>
#include <execution>

// fast_atoll64
// https://stackoverflow.com/questions/16826422/
// c-most-efficient-way-to-convert-string-to-int-faster-than-atoi
inline int64_t fast_atoll64( const char* str )
{
    int64_t val = 0;
    int sign = 0;
    if ( *str == '-' ) { sign = 1, ++str; }
    uint8_t digit;
    while ((digit = uint8_t(*str++ - '0')) <= 9)
        val = val * 10 + digit;
    return sign ? -val : val;
}

#define MAX_LINE_LEN_L 255

int main(int argc, char* argv[])
{
    std::array<char, MAX_LINE_LEN_L + 1> line, name;
    char* found;
    long long count, sum;
    int flag = 0;

    // obtain the first key-value pair delimited by tab
    while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), ::stdin ) != NULL ) {
        if ( ( found = std::find( line.begin(), line.end(), '\t' ) ) == line.end() )
            continue;
        sum = fast_atoll64( found + 1 );
        *found = '\0';  // key name
        ::memcpy( name.data(), line.data(), found - line.data() + 1 );
        flag = 1;
        break;
    }

    // process the rest of standard input
    while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), ::stdin ) != NULL ) {
        if ( ( found = std::find( line.begin(), line.end(), '\t' ) ) == line.end() )
            continue;
        count = fast_atoll64( found + 1 );
        *found = '\0';  // key name
        if ( ! ::strcmp( line.data(), name.data() ) ) {
            sum += count;
        }
        else {
            fast_io::io::println( fast_io::mnp::os_c_str(name.data()), "\t", sum );
            ::memcpy( name.data(), line.data(), found - line.data() + 1 );
            sum = count;
        }
    }

    if ( flag )
        fast_io::io::println( fast_io::mnp::os_c_str(name.data()), "\t", sum );

    return 0;
}

See also Grouped aggregate utility (like SQL GROUP BY)? on Stack Exchange.

GNU Parallel Citation

Tange, O. (2023, January 22). GNU Parallel 20230122 ('Bolsonaristas').
Zenodo. https://doi.org/10.5281/zenodo.7558957

