The SQLite demonstrations were interesting. This post is about GNU Parallel, which includes parsort. Something like the following is possible:

Linux box:

Note: See this thread for generating the big input files.

LC_ALL=C sort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C sort -k2nr >out.txt
   real time: 12.671 seconds

LC_ALL=C parsort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C parsort -k2nr >out.txt
   real time: 6.561 seconds

LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort -k2nr >out.txt
   real time: 3.615 seconds

Bigger box:

LC_ALL=C parsort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C parsort -k2nr >out.txt
   real time: 6.520 seconds

LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort -k2nr >out.txt
   real time: 2.398 seconds

The big files contain two-column key-value pairs delimited by a tab. The output from the first sort command contains duplicate key names. Next, the tally-count command sums adjacent count fields of duplicate names, so its output contains unique names. Finally, parsort is called again to sort by sum in descending order and key name in ascending order.
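For illustration, here is a miniature run of the same pipeline on a three-line sample file (the key names and counts below are invented), showing what each stage produces:

printf 'banana\t7\napple\t3\napple\t2\n' > small.txt

LC_ALL=C sort -k1 small.txt
#   apple   2
#   apple   3
#   banana  7

LC_ALL=C sort -k1 small.txt | awk -f tally-count.awk
#   apple   5
#   banana  7

LC_ALL=C sort -k1 small.txt | awk -f tally-count.awk | LC_ALL=C sort -k2nr
#   banana  7
#   apple   5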

q - Text as Data:

I came across q - Run SQL directly on delimited files and multi-file sqlite databases on GitHub. Specify -tT for tab-delimited input and output; the dash in the query tells q to read from standard input. The q tool is interesting.

cat big{1,2,3}.txt | ./linux-q -tT "select c1, sum(c2) from - group by c1 order by sum(c2) desc, c1" >out.txt
   real time: 84.027 seconds

./linux-q -tT "select c1, sum(c2) from big[1-3].txt group by c1 order by sum(c2) desc, c1" >out.txt
   real time: 89.539 seconds

tally-count.awk:

The Awk script becomes a bottleneck for parallel sort. Notice above that the awk-based pipeline shows no improvement on the bigger machine.

# Tally adjacent count fields of duplicate key names.

BEGIN {
    OFS = FS = "\t"
    key = ""
    sum = 0
    flg = 1
}
{
    if ( flg ) {
        flg = 0
        key = $1
    }
    if ( $1 == key ) {
        sum += $2
    }
    else {
        print key, sum
        key = $1
        sum = $2
    }
}
END {
    print key, sum
}

tally-count.cpp:

The C++ solution is noticeably faster. The great thing about the parsort/tally-count combination is being able to process hundreds of list files without worrying about exhausting memory.

// -----------------------------------------------------------------------------
// tally-count.cpp
// Tally adjacent count fields of duplicate key names.
//
// Obtain the fast_io library (required dependency):
//   git clone --depth=1 https://github.com/cppfastio/fast_io
//
// clang++ compile on Linux:
//   clang++ -o tally-count -std=c++20 -Wall -O3 tally-count.cpp
//
// Example run:
//   LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count |\
//   LC_ALL=C parsort -k2nr > out.txt
// -----------------------------------------------------------------------------

#include <chrono>

// The fast_io header must come after chrono, else build error:
// "no member named 'concatln' in namespace 'fast_io'"
#include "fast_io/include/fast_io.h"

#include <cstdio>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

#include <algorithm>   // std::find
#include <array>
#include <execution>

// fast_atoll64
// https://stackoverflow.com/questions/16826422/
// c-most-efficient-way-to-convert-string-to-int-faster-than-atoi
inline int64_t fast_atoll64( const char* str )
{
    int64_t val = 0;
    int sign = 0;
    if ( *str == '-' ) { sign = 1, ++str; }
    uint8_t digit;
    while ((digit = uint8_t(*str++ - '0')) <= 9)
        val = val * 10 + digit;
    return sign ? -val : val;
}

#define MAX_LINE_LEN_L 255

int main(int argc, char* argv[])
{
    std::array<char, MAX_LINE_LEN_L + 1> line, name;
    char* found;
    long long count, sum;
    int flag = 0;

    // obtain the first key-value pair delimited by tab
    while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), ::stdin ) != NULL ) {
        if ( ( found = std::find( line.begin(), line.end(), '\t' ) ) == line.end() )
            continue;
        sum = fast_atoll64( found + 1 );
        *found = '\0';  // key name
        ::memcpy( name.data(), line.data(), found - line.data() + 1 );
        flag = 1;
        break;
    }

    // process the rest of standard input
    while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), ::stdin ) != NULL ) {
        if ( ( found = std::find( line.begin(), line.end(), '\t' ) ) == line.end() )
            continue;
        count = fast_atoll64( found + 1 );
        *found = '\0';  // key name
        if ( ! ::strcmp( line.data(), name.data() ) ) {
            sum += count;
        }
        else {
            fast_io::io::println( fast_io::mnp::os_c_str(name.data()), "\t", sum );
            ::memcpy( name.data(), line.data(), found - line.data() + 1 );
            sum = count;
        }
    }

    if ( flag )
        fast_io::io::println( fast_io::mnp::os_c_str(name.data()), "\t", sum );

    return 0;
}

See also Grouped aggregate utility (like SQL GROUP BY)? on Stack Exchange.

GNU Parallel Citation

Tange, O. (2023, January 22). GNU Parallel 20230122 ('Bolsonaristas').
Zenodo. https://doi.org/10.5281/zenodo.7558957

