Chuma has asked for the wisdom of the Perl Monks concerning the following question:
Dear monks,
I have a set of files, basically in this format:
foo 73
bar 35
word 27
blah 23
...
Now I need to combine them into one big file, in the same format, so that the number following "foo" is the sum of all the "foo" numbers in the individual files, and the lines are sorted by value.
My attempt is something like:
    for(@files){
        open IN, $_;
        while(<IN>){
            /^(.*)\t(.*)$/;
            $f{$1}+=$2;
        }
    }
    @p=values %f;
    @q=keys %f;
    @i=sort{$p[$b]<=>$p[$a]} 0..$#p;
    open OUT, '>', $outpath;
    for(@i){
        print OUT "$q[$_]\t$p[$_]\n";
    }
But the files are pretty large (up to a couple hundred MB), there are quite a few of them (2064), and things get pretty slow. Reading in the files seems to sometimes move along nicely, and then suddenly nothing happens for a long time – out of memory, or something? The processor doesn't seem to be very busy, so I'm guessing there's some other bottleneck. And then the step after the loop takes a surprisingly long time, just writing the arrays – I tested it on 12 files, and it took the better part of an hour just for those two lines.
Is there a better way?
If it matters – the original lists are sorted, they don't all contain the same words, all numbers are positive integers.
Re: Long list is long
by kcott (Archbishop) on Oct 30, 2022 at 04:12 UTC
G'day Chuma,

A number of problems with the code you posted, starting with the missing strict and warnings. From your description, I'd say the bottleneck lies with the population of the three arrays: @p, @q and @i. This is all unnecessary work and those arrays are not even needed. See "perlperf - Perl Performance and Optimization Techniques" for benchmarking and profiling techniques, to get a clearer picture of where the problems lie.

I created these three dummy test files. In case you're unfamiliar with cat -vet, ^I represents a tab and $ represents a newline.
Then this test code:
Output (raw and showing special characters):
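Neither the code block nor its output has survived in this copy of the thread; a minimal sketch along the lines described (not kcott's actual program: strict and warnings, a single hash, keys sorted directly by their summed values, assuming tab-separated word/count lines as in the OP's example) might look like this:

    #!/usr/bin/env perl
    # Sketch only -- not kcott's original code.  Assumes tab-separated
    # "word<TAB>count" lines in the files named on the command line.
    use strict;
    use warnings;
    use autodie;

    my %sum;

    for my $file (@ARGV) {
        open my $in, '<', $file;
        while (my $line = <$in>) {
            chomp $line;
            my ($word, $count) = split /\t/, $line, 2;
            $sum{$word} += $count;
        }
        close $in;
    }

    # Sort the keys by their summed value, highest first, and print directly.
    for my $word (sort { $sum{$b} <=> $sum{$a} } keys %sum) {
        print "$word\t$sum{$word}\n";
    }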
Try that with your real files. I suspect it should be faster and not have the bottlenecks. Let us know if you still have problems: show your new code and profiling output (in <readmore> or <spoiler> tags). — Ken
by eyepopslikeamosquito (Archbishop) on Oct 30, 2022 at 10:11 UTC
A number of problems with the code you posted: Missing strict ... Missing warnings ...

Kudos to kcott for patiently showing by example yet again how to write excellent, clean and succinct Perl code. Unfortunately, it seems unlikely that the monk in question will follow your sage advice, despite a gentle nudge from haukex a year ago. Given the Bod's recent posting history on this topic, I'm trusting that a once similarly recalcitrant Bod has now seen the light and might be persuaded to share his experiences ... along with why (or why not) he uses strict and warnings nowadays.

Oh, and just in case it helps, a non-Perl Monk reference on this topic: Always use strict and warnings in your Perl code (perlmaven).

Update: I also keep a list of references on this topic: use strict and warnings References
by kcott (Archbishop) on Oct 30, 2022 at 13:34 UTC
G'day eyepopslikeamosquito,

Thanks for the compliment. I usually look at an OP's previous posts. This gives me an idea of the OP's level of Perl knowledge and how best to frame my response. I did check on this occasion; wondered if I was flogging a dead horse; but chose to proceed anyway.

To Chuma: I've been writing Perl code for almost 30 years. A substantial part of my paid employment involves writing Perl code. I use these pragmata for personal, $work and PM code. I don't do this because it's trendy, expected, or for any other frivolous reasons; I do it because they're extremely useful and save me a lot of time. Even with decades of experience, I still make typos and other silly mistakes (just like everyone else does) — I'd like Perl to tell me about these problems as soon as possible, instead of getting weird or unexpected output and spending a lot of time tracking down the source of the bug. I encourage you to use these pragmata for the same reasons.

— Ken
Re: Long list is long
by jwkrahn (Abbot) on Oct 30, 2022 at 03:49 UTC
You don't really need the arrays @p, @q and @i, you could just do this:
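The code itself isn't preserved here; a short sketch of that replacement (not jwkrahn's exact code), reusing the OP's %f and $outpath: once %f holds the summed counts, the three arrays can be dropped and the hash keys sorted by value while printing.

    # Sketch only -- sort the hash keys directly by their values.
    open OUT, '>', $outpath or die "Cannot open '$outpath': $!";
    for my $word ( sort { $f{$b} <=> $f{$a} } keys %f ) {
        print OUT "$word\t$f{$word}\n";
    }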
by eyepopslikeamosquito (Archbishop) on Oct 30, 2022 at 05:44 UTC
One of the posted lines won't compile ("Can't modify logical and (&&) in addition (+)") and needs a small precedence fix. Apart from that, I believe your code is correct -- and a big improvement on the OP's original code (which I couldn't look at without putting on sunglasses ;-).
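That error message appears when the left-hand side of += parses as an && expression, because += binds more tightly than &&. A purely hypothetical line of that shape (not the actual line from the reply), and two ways such a line could be fixed:

    # Perl parses the following as (/^(.*)\t(.*)$/ && $f{$1}) += $2,
    # which is the compile-time error quoted above:
    #     /^(.*)\t(.*)$/ && $f{$1} += $2;

    # The low-precedence 'and', or a statement modifier, keeps the intended meaning:
    /^(.*)\t(.*)$/ and $f{$1} += $2;
    $f{$1} += $2 if /^(.*)\t(.*)$/;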
by Chuma (Scribe) on Oct 30, 2022 at 12:49 UTC
Thanks, looks good! I tried it with a few benchmarks, and it takes... twice as long. Huh! That's weird.
Re: Long list is long
by tybalt89 (Monsignor) on Oct 30, 2022 at 14:28 UTC
If I were on a linux system, I'd try it this way. The perl takes almost no memory :)
Outputs:
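Neither tybalt89's program nor its output is preserved in this copy. One way to get "almost no memory" in the Perl part on Linux is to lean on the external sort(1): one sort groups identical words together, Perl sums each adjacent run while holding only a single running total, and a second sort orders the summed lines by value. A guess at that shape (assumptions: GNU sort is available, lines are tab-separated word/count pairs, and the result goes to a made-up out.txt):

    #!/usr/bin/perl
    # Sketch only -- not tybalt89's original code.
    use strict;
    use warnings;

    my @files = @ARGV;    # the input list files

    # sort(1) groups identical words (external merge sort, very little RAM)
    open my $in, '-|', 'sort', '-t', "\t", '-k1,1', @files
        or die "sort (read): $!";
    # a second sort orders the summed lines by count, highest first
    open my $out, '|-', 'sort', '-t', "\t", '-k2,2rn', '-o', 'out.txt'
        or die "sort (write): $!";

    my ($word, $sum) = (undef, 0);
    while (<$in>) {
        chomp;
        my ($w, $n) = split /\t/;
        if (defined $word and $w ne $word) {
            print {$out} "$word\t$sum\n";    # finished one word's run
            $sum = 0;
        }
        ($word, $sum) = ($w, $sum + $n);
    }
    print {$out} "$word\t$sum\n" if defined $word;
    close $out or die "sort (write) failed: $!";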
by Chuma (Scribe) on Oct 30, 2022 at 15:18 UTC
by tybalt89 (Monsignor) on Oct 30, 2022 at 17:08 UTC
I don't have a mac so I can't test this, but I assume the mac has a "sort" app somewhat like the *nix app (try "which sort" at a command line, and then "man sort"), then my guess is something like this:
Assuming macs are sort of built on BSD and still have a command line and have a more or less "sort" app.
Re: Long list is long
by eyepopslikeamosquito (Archbishop) on Oct 31, 2022 at 07:19 UTC
kcott already provided a nice clean Perl solution. If his solution is running too slowly, please post a reply with your new code, along with system profiling output, just in case we can spot any booboos you might have made in implementing his suggestions. Given you have "2064 files (up to a couple hundred MB)", it would not surprise me if you've hit a chronic thrashing problem. If so, I can see a few options you might try; alternative suggestions welcome.
Re: Long list is long
by marioroy (Prior) on Feb 10, 2023 at 03:46 UTC
Greetings, Chuma

Eyepopslikeamosquito started a thread here. The thread is quite long and contains many solutions. Most of them compute and store the map table in memory. You mentioned having 2,064 list files, up to a couple hundred MB each. If your data set contains billions of unique key-value pairs, most of the solutions may fail due to insufficient memory capacity. We created a backup plan (well, there are two). One solution requires less memory and can handle hundreds of list files, possibly thousands. That is parsort from GNU Parallel and tally-count. The instructions for building tally-count.cpp are in the source. Do not run the default sort utility on the box; instead run parsort for minimum memory consumption. At the time of writing, it is best to have parsort read STDIN when processing hundreds of files. Files specified as arguments cause parsort to create two processes per file.

The output from the first sort command contains duplicate key names. Next, the tally-count command sums adjacent count fields of duplicate names. Its output contains unique names. Finally, parsort is called again to sort by sum in descending order and key name in ascending order.

TMPDIR available space: processing two thousand list files via parsort may cause $TMPDIR (/tmp) to fill up. You can set the TMPDIR environment variable to a location with ample storage or specify the -T option. Ensure available temp storage of 1.2x to 2x the combined input size.
I made a parsort variant named mcesort.
For extra performance, the --tally option integrates the tally command with the merge. The effect is less work for the subsequent merge, which is beneficial when there are many duplicate keys. The -A option sets LC_ALL=C. The -j24 option is short for --parallel=24.

Noteworthy C++ solutions. You can run llil4map, llil4map2 (memory efficient), llil4hmap, or llil4emh. Temporary storage isn't used and there's no limit to the number of list files. Are the keys of limited length, say 8 to 20 characters? Then uncomment the line "#define MAX_STR_LEN_L" and adjust the value accordingly. That will minimize memory consumption.

If you're processing a hundred or fewer files (~ 200MB each; again, depending on memory capacity ~ 50GB), the llil4vec solution favors time over space, based on llil3vec. Be sure to uncomment the line "#define MAX_STR_LEN_L" for limited-length keys, e.g. 20 or less. See also llil4vec2.
July 2023 alternative Perl/C++ solutions. For results, refer to C++ vs Perl.

llilsql.pl - Tie::Hash::DBD SQLite database
llildbf.pl - DB_File B-tree database
llilmma.pl - IPC::MMA shared memory hash
lliltch.pl - Tokyo Cabinet hash database
llilkch.pl - Kyoto Cabinet hash database
llil4tkh.cc - tkrzw::ShardDBM demonstration
llil4tkh2.cc - tkrzw::HashDBM demonstration

March 2024 updates.
1. Reported issues to NVIDIA regarding nvc++. Greg fixed warnings in his phmap library. No more warnings using nvc++.
2. Use spinlock_mutex found on the web, versus std::mutex. This improves llil4map, llil4hmap, and llil4emh. Tested g++, clang++, and nvc++. https://github.com/clearlinux/distribution/issues/3036
3. Greg, the author of the phmap C++ library, provided a suggestion for ensuring minimum memory consumption during "map to vector" in llil4map. https://github.com/greg7mdp/parallel-hashmap/issues/238
4. The llil4map demonstration lives at the phmap C++ repository: examples/llil4map.cc and the associated llil_utils folder. https://github.com/greg7mdp/parallel-hashmap
5. Applied Greg's minor simplification for parallel sort to the C++ demonstrations. Updated "map to vector" in the Gist llil4map.cc to the simplified version. https://github.com/greg7mdp/parallel-hashmap/commit/38dd6b52940937cae06191b7fb39ad0d6aeea4d2

April 2024 updates.
1. Added llil4map2, a memory efficient version. https://github.com/greg7mdp/parallel-hashmap/issues/238#issuecomment-2036284574
2. A more memory efficient version, by Gregory Popovitch, author of the C++ parallel-hashmap library. https://github.com/greg7mdp/parallel-hashmap/issues/238#issuecomment-2039591745

See also, The Long List is Long resurrected.
Re: Long list is long
by NERDVANA (Priest) on Oct 31, 2022 at 02:26 UTC
I believe there is actually an internal optimization for sorting an array in-place, so @foo = sort { ... } @foo; should execute faster than your example because the array never actually goes onto the perl argument stack. Of course, that doesn't really fit your data structure too well. You could try emptying the hash to free up memory using undef %f; after you pull out all the keys and values. That should cut your RAM usage by half before the sort step (but maybe too late to avoid running into disk swap). I'd be curious whether you could make use of Tree::RB::XS here, to avoid ever having to expand a full list of the items.

I apologize that I'm too lazy to generate some hundred-gigabyte files to benchmark this theory on my own. If none of these keep the memory usage low enough to be useful, I would suggest trying a database. You could then run the query that sorts them and iterate the rows using DBI. There's a huge overhead in putting them into a database, but if you re-use this data then the database index would save you a lot of computing power in the long run.
Re: Long list is long
by Fletch (Bishop) on Oct 31, 2022 at 11:21 UTC
Since (as I'm reading it . . .) the ordering is determined by the final sum of all values from the different files, I'd try to leverage DBI and an SQLite DB. Read each file into a table with word and cur_sum columns, updating the latter with each word's value. Once you've read all the files, use a SELECT word, cur_sum FROM foo ORDER BY cur_sum DESC to pull out the final results.
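A minimal sketch of that approach, assuming DBD::SQLite is installed (and SQLite 3.24+ for the upsert), using the foo table and word/cur_sum columns described above; the words.db filename is made up:

    #!/usr/bin/perl
    # Sketch only -- one possible shape for the SQLite approach.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=words.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do('CREATE TABLE IF NOT EXISTS foo (word TEXT PRIMARY KEY, cur_sum INTEGER)');

    my $upsert = $dbh->prepare(
        'INSERT INTO foo (word, cur_sum) VALUES (?, ?)
         ON CONFLICT(word) DO UPDATE SET cur_sum = cur_sum + excluded.cur_sum'
    );

    for my $file (@ARGV) {
        open my $in, '<', $file or die "Cannot open '$file': $!";
        while (<$in>) {
            chomp;
            my ($word, $n) = split /\t/;
            $upsert->execute($word, $n);
        }
        $dbh->commit;    # one transaction per file keeps the inserts fast
    }

    # Pull out the final results, largest sums first, streaming row by row.
    my $sth = $dbh->prepare('SELECT word, cur_sum FROM foo ORDER BY cur_sum DESC');
    $sth->execute;
    while (my ($word, $sum) = $sth->fetchrow_array) {
        print "$word\t$sum\n";
    }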
Alternately, DB_File could be used in much the same way. EDIT: wording. ENOCAFFEINE.
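A guess at what that DB_File alternative might look like (not Fletch's code): tie the hash to an on-disk B-tree so the running totals live in a file rather than in RAM; the counts.db filename is made up.

    #!/usr/bin/perl
    # Sketch only -- DB_File used as an on-disk hash of running totals.
    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    tie my %sum, 'DB_File', 'counts.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
        or die "Cannot tie counts.db: $!";

    while (<>) {
        chomp;
        my ($word, $n) = split /\t/;
        $sum{$word} += $n;
    }

    # The by-value sort still has to walk all the keys at the end.
    for my $word (sort { $sum{$b} <=> $sum{$a} } keys %sum) {
        print "$word\t$sum{$word}\n";
    }
    untie %sum;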
The cake is a lie.
Re: Long list is long
by dsheroh (Monsignor) on Oct 31, 2022 at 12:52 UTC
"the original lists are sorted"

This suggests a potential optimization, which would greatly reduce memory requirements and completely eliminate the need to sort your output, but at the cost of requiring a large number of filehandles (a sketch follows below). If you aren't able to allocate enough filehandles at once to process the entire list of input files in one go, this technique can also be used to do the work in multiple stages: divide the files into N groups, get the combined sums for each group, and then use those combined sums as your new set of sorted input files.

For my testing, I used the input files from kcott's reply, sorted and renamed to 'in1', 'in2', and 'in3'. This gave me the expected output.
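dsheroh's actual program isn't shown here; a rough sketch of a streaming N-way merge of that kind, assuming every input file is sorted by word and holds tab-separated word/count lines, might be:

    #!/usr/bin/perl
    # Rough sketch of a streaming N-way merge -- not dsheroh's original code.
    # Assumes each input file is sorted by word (ASCII order) and holds
    # tab-separated "word<TAB>count" lines.  Memory use: one pending line per file.
    use strict;
    use warnings;

    my @files = @ARGV;
    my (@fh, @cur);                  # per-file handle and its current [word, count]

    sub read_pair {
        my ($fh) = @_;
        defined( my $line = readline $fh ) or return undef;
        chomp $line;
        return [ split /\t/, $line, 2 ];
    }

    for my $i (0 .. $#files) {
        open $fh[$i], '<', $files[$i] or die "Cannot open '$files[$i]': $!";
        $cur[$i] = read_pair($fh[$i]);
    }

    while (1) {
        my @live = grep { defined $cur[$_] } 0 .. $#files;
        last unless @live;

        # The alphabetically smallest word currently waiting on any handle.
        my ($word) = sort map { $cur[$_][0] } @live;

        # Sum its counts from every file positioned on that word, advancing them.
        my $sum = 0;
        for my $i (@live) {
            next unless $cur[$i][0] eq $word;
            $sum += $cur[$i][1];
            $cur[$i] = read_pair($fh[$i]);
        }
        print "$word\t$sum\n";       # word-sorted totals; sort by value later if needed
    }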
by hv (Prior) on Oct 31, 2022 at 17:12 UTC
It is not fully spelled out, but I think it likely that "the original lists are sorted" was intended by the OP to mean that "the lines are sorted by value" (as the output is required to be), not that the words appear in ASCII order. This is additionally hinted at by the initial 4-line example input.
by dsheroh (Monsignor) on Nov 01, 2022 at 08:33 UTC
Re: Long list is long
by eyepopslikeamosquito (Archbishop) on Mar 11, 2023 at 11:11 UTC
Dear Chuma,

Like paco's celebrated question of 1999, your humble question has generated an extraordinary level of excitement, over a period of many months, since you sadly left us thirteen weeks ago. Especially noteworthy is marioroy's terrifyingly beautiful analysis. Update: [OT] The Long List is Long resurrected (April 2024).

I wish I could upvote your node multiple times and have added it to the list at Terrifying Beauty. Like paco, we patiently await your return.
Re: Long list is long
by eyepopslikeamosquito (Archbishop) on Nov 02, 2022 at 10:17 UTC
I really like your "Long list is long" node title, even though I don't fully understand its meaning. I long for the long-winded replies you received to go a long way towards a long lasting solution before long that works for you in the long run. :)
by haukex (Archbishop) on Nov 06, 2022 at 08:17 UTC
I really like your "Long list is long" node title, even though I don't fully understand its meaning.

I suspect a play on the "longcat is long" meme.
by Athanasius (Archbishop) on Nov 03, 2022 at 12:36 UTC
And here is some good advice for any new monk wanting to make a good impression. Be friendly in your replies: The long and the short of it is, don't be short if you want to belong.

eyepopslikeamosquito: Are we both now eligible for membership in the ancient and venerable order of punsters?