The SQLite demonstrations were interesting. This is about GNU Parallel, which includes parsort. Something like this is possible:
Linux box:
Note: See this thread for generating the big input files.
LC_ALL=C sort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C sort -k2nr >out.txt
real time: 12.671 seconds
LC_ALL=C parsort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C parsort -k2nr >out.txt
real time: 6.561 seconds
LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort -k2nr >out.txt
real time: 3.615 seconds
Bigger box:
LC_ALL=C parsort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C parsort -k2nr >out.txt
real time: 6.520 seconds
LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort -k2nr >out.txt
real time: 2.398 seconds
The big files are two-column key-value pairs delimited by a tab. The output of the first sort command still contains duplicate key names, now adjacent. Next, the tally-count command sums the count fields of adjacent duplicate names, so its output contains unique names. Finally, parsort is called again to sort by sum in descending order and by key name in ascending order.
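Here is a minimal end-to-end sketch on toy data, assuming the tally-count.awk script listed later in this post (the sample rows and the big1.txt name are just for illustration):

printf 'apple\t2\nbanana\t7\napple\t3\n' > big1.txt
LC_ALL=C sort -k1 big1.txt | awk -f tally-count.awk | LC_ALL=C sort -k2nr
# expected output (tab-delimited): apple's counts tally to 2+3=5
# banana  7
# apple   5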
q - Text as Data:
I came across q - Run SQL directly on delimited files and multi-file sqlite databases on GitHub. Specify -tT for tab-delimited input and output. The dash in the FROM clause tells q to read from standard input, and with no header row the columns are addressed as c1, c2, and so on. The q tool is interesting.
cat big{1,2,3}.txt | ./linux-q -tT "select c1, sum(c2) from - group by c1 order by sum(c2) desc, c1" >out.txt
real time: 84.027 seconds
./linux-q -tT "select c1, sum(c2) from big[1-3].txt group by c1 order by sum(c2) desc, c1" >out.txt
real time: 89.539 seconds
tally-count.awk:
The Awk script becomes the bottleneck once the sorts run in parallel. Notice above that the awk pipeline shows no improvement on the bigger machine; a timing sketch for isolating each stage follows the script.
# Tally adjacent count fields of duplicate key names.
BEGIN {
    OFS = FS = "\t"
    key = ""
    sum = 0
    flg = 1
}
{
    if ( flg ) {
        flg = 0
        key = $1
    }
    if ( $1 == key ) {
        sum += $2
    } else {
        print key, sum
        key = $1
        sum = $2
    }
}
END {
    # guard against empty input, where no key was ever seen
    if ( ! flg )
        print key, sum
}
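One way to verify that the Awk stage dominates is to time each stage in isolation (a sketch, reusing the file names from the runs above):

time LC_ALL=C parsort -k1 big{1,2,3}.txt >/dev/null
time LC_ALL=C parsort -k1 big{1,2,3}.txt | awk -f tally-count.awk >/dev/null
time LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count >/dev/null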
tally-count.cpp:
The C++ solution is noticeably faster. The great thing about the parsort/tally-count combination is being able to process hundreds of list files without worrying about exhausting memory capacity. A quick way to check that the Awk and C++ implementations agree follows the listing.
// ----------------------------------------------------------------------------
// tally-count.cpp
// Tally adjacent count fields of duplicate key names.
//
// Obtain the fast_io library (required dependency):
//    git clone --depth=1 https://github.com/cppfastio/fast_io
//
// Obtain GNU Parallel (check first, your OS vendor may have an installer package):
//    https://www.gnu.org/software/parallel/
//    https://www.gnu.org/software/parallel/sphinx.html
//
// g++ compile on Linux:
//    g++ -o tally-count -std=c++20 -Wall -O3 tally-count.cpp -I ./fast_io/include
//
// This g++ command also works with the MinGW C++ compiler
// (https://sourceforge.net/projects/mingw-w64) that comes bundled with
// Strawberry Perl (C:\Strawberry\c\bin\g++.exe).
//
// clang++ compile:
//    Seems to run slightly faster when compiled with clang++ instead of g++.
//
// Example run:
//    LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort -k2nr > out.txt
// ----------------------------------------------------------------------------
#include <chrono>
// The fast_io header must come after chrono, else build error:
// "no member named 'concatln' in namespace 'fast_io'"
#include <fast_io.h>
#include <cstdio>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <array>

// fast_atoll64
// https://stackoverflow.com/questions/16826422/
// c-most-efficient-way-to-convert-string-to-int-faster-than-atoi
inline int64_t fast_atoll64( const char* str )
{
    int64_t val = 0;
    int sign = 0;
    if ( *str == '-' ) {
        sign = 1, ++str;
    }
    uint8_t digit;
    // stops at the first non-digit character (e.g. the trailing newline)
    while ( (digit = uint8_t(*str++ - '0')) <= 9 )
        val = val * 10 + digit;
    return sign ? -val : val;
}
#define MAX_LINE_LEN_L 255
int main(int argc, char* argv[])
{
    std::array<char, MAX_LINE_LEN_L + 1> line, name;
    char* found;
    long long count, sum = 0;
    int flag = 0;

    // obtain the first key-value pair delimited by tab
    while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), ::stdin ) != NULL ) {
        // search only the NUL-terminated string; skip lines without a tab
        if ( ( found = ::strchr( line.data(), '\t' ) ) == NULL )
            continue;
        sum = fast_atoll64( found + 1 );
        *found = '\0';  // key name
        ::memcpy( name.data(), line.data(), found - line.data() + 1 );
        flag = 1;
        break;
    }

    // process the rest of standard input
    while ( ::fgets( line.data(), static_cast<int>(MAX_LINE_LEN_L), ::stdin ) != NULL ) {
        if ( ( found = ::strchr( line.data(), '\t' ) ) == NULL )
            continue;
        count = fast_atoll64( found + 1 );
        *found = '\0';  // key name
        if ( ! ::strcmp( line.data(), name.data() ) ) {
            sum += count;
        } else {
            ::print( fast_io::concatln( fast_io::mnp::os_c_str(name.data()), "\t", sum ) );
            ::memcpy( name.data(), line.data(), found - line.data() + 1 );
            sum = count;
        }
    }

    // flush the final tally, unless the input was empty
    if ( flag )
        ::print( fast_io::concatln( fast_io::mnp::os_c_str(name.data()), "\t", sum ) );

    return 0;
}
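As a quick sanity check, both implementations should produce identical output for the same input. A sketch, with out-awk.txt and out-cpp.txt as illustrative names:

LC_ALL=C parsort -k1 big{1,2,3}.txt | awk -f tally-count.awk | LC_ALL=C parsort -k2nr > out-awk.txt
LC_ALL=C parsort -k1 big{1,2,3}.txt | ./tally-count | LC_ALL=C parsort -k2nr > out-cpp.txt
cmp out-awk.txt out-cpp.txt && echo identical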
See also Grouped aggregate utility (like SQL GROUP BY)? on Stack Exchange.
GNU Parallel Citation
Tange, O. (2023, January 22). GNU Parallel 20230122 ('Bolsonaristas'). Zenodo. https://doi.org/10.5281/zenodo.7558957