Re: What is the most memory efficient way to (sort) and print a hash?

Greetings, a_salway! This is 10 years late! Recently, chuma requested similarly the Long List is Long challenge. For testing, I modified eyepopslikeamosquito's file generator to emit 16 million unique key-value pairs, in parallel. The key-names length are 8 chars minimum ~ 19 chars maximum, sufficient for uniqueness.

use v5.30;
use MCE;

{
  # Generate a random word, inspired by eyepopslikeamosquito's
  # gen-llil.pl generator https://perlmonks.org/?node_id=11148681
  my $ordmin = ord('A');
  my $ordmax = ord('z') + 1;

  sub gen_random_word {
    my $nchar = shift;  # the number of random chars to append
    my $word  = "";
    for (1 .. $nchar) {
      my $ord = $ordmin + int( rand($ordmax - $ordmin) );
      # try again if 91 [, 92 \, 93 ], 94 ^, 95 _, or 96 `
      while ($ord > 90 && $ord < 97) {
        $ord = $ordmin + int( rand($ordmax - $ordmin) );
      }
      $word .= chr($ord);
    }
    return $word;
  }
}

srand(42);

MCE->new(
  max_workers => MCE::Util::get_ncpu(),
  sequence    => [ 1, 16e6 ],
  bounds_only => 1,
  chunk_size  => 8e3,
  init_relay  => 1,
  posix_exit  => 1,

  user_func   => sub {
    my ($mce, $chunk_ref, $chunk_id) = @_;
    my $seq_beg = $chunk_ref->[0];
    my $seq_end = $chunk_ref->[1];
    my $out = "";

    # Worker seeds generator using internal seed and wid value.
    # Compute similarly using the chunk_id value.
    my $seed = abs(MCE->seed - ($chunk_id * 1e5)) % 1073741780;
    srand($seed);

    for ($seq_beg .. $seq_end) {
      $out .= gen_random_word(int(rand(12) + 8));
      $out .= "\t" . int(rand(1e4) + 1) . "\n";
    }

    # Write output directly and orderly via relay.
    MCE::relay { print $out };
  }
)->run;
[download]

I tried three programs for sorting the output similarly to the Long List is Long challenge. That is count descending and key names ascending. Parsort is a program included with GNU parallel. Another variant is mcesort using mini-MCE integrated into the code.

Testing was captured on an AMD Ryzen Threadripper 3970X machine, 3600 MHz DDR4. The output file md5sum is c6da8fd098664155c190782f1dc0ca7f.

$ rm -f out; time perl demo.pl | LC_ALL=C sort -k2rn > out
real   0m34.779s
user   1m39.925s
sys    0m00.555s

$ rm -f out; time perl demo.pl | LC_ALL=C parsort -k2rn > out
real   0m04.484s
user   1m17.538s
sys    0m00.575s

$ rm -f out; time perl demo.pl | LC_ALL=C mcesort -k2rn > out
real   0m04.287s
user   1m54.859s
sys    0m01.748s
[download]

Next, a C++ demonstration running llil4map_buf2.cc. Set SSO_LIMIT to 20 for best performance, ~ line 131. Key names plus null character greater than SSO_LIMIT are dynamically allocated using blocks of memory.

$ time perl demo.pl > out.in
real   0m01.396s
user   1m16.659s
sys    0m00.287s

$ rm -f out; ./llil4map2 out.in > out
llil4map start
use OpenMP            SSO_LIMIT 12  SSO_LIMIT 20
use boost sort        
get properties         0.166 secs    0.159 secs
map to vector          0.055 secs    0.077 secs
vector stable sort     0.201 secs    0.113 secs
write stdout           0.097 secs    0.068 secs
total time             0.521 secs    0.419 secs
    count lines     16000000
    count unique    16000000
[download]

Comment on Re: What is the most memory efficient way to (sort) and print a hash? Select or Download Code