Long list is long

by Chuma (Scribe)
on Oct 30, 2022 at 02:31 UTC ( [id://11147822] )

Chuma has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I have a set of files, basically in the format

foo	73
bar	35
word	27
blah	23
...

Now I need to combine them into one big file, in the same format, so that the number following "foo" is the sum of all the "foo" numbers in the individual files, and the lines are sorted by value.

My attempt is something like:

for(@files){
  open IN, $_;
  while(<IN>){
    /^(.*)\t(.*)$/;
    $f{$1}+=$2;
  }
}
@p=values %f;
@q=keys %f;
@i=sort{$p[$b]<=>$p[$a]} 0..$#p;
open OUT, '>', $outpath;
for(@i){
  print OUT "$q[$_]\t$p[$_]\n";
}

But the files are pretty large (up to a couple hundred MB), there are quite a few of them (2064), and things get pretty slow. Reading in the files seems to sometimes move along nicely, and then suddenly nothing happens for a long time – out of memory, or something? The processor doesn't seem to be very busy, so I'm guessing there's some other bottleneck. And then the step after the loop takes a surprisingly long time, just writing the arrays – I tested it on 12 files, and it took the better part of an hour just for those two lines.

Is there a better way?

If it matters – the original lists are sorted, they don't all contain the same words, all numbers are positive integers.

Re: Long list is long
by kcott (Archbishop) on Oct 30, 2022 at 04:12 UTC

    G'day Chuma,

    A number of problems with the code you posted:

    • Missing strict.
    • Missing warnings.
    • Missing I/O checking and exception handling. I suggest autodie.
    • Package variables used throughout. Use lexical (my) variables instead. This applies to your filehandles as well.
    • Use the 3-argument form of open.

    From your description, I'd say the bottleneck lies with the population of the three arrays: @p, @q and @i. This is all unnecessary work and those arrays are not even needed. See "perlperf - Perl Performance and Optimization Techniques" for benchmarking and profiling techniques, to get a clearer picture of where problems lie.
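
    If you want to measure the cost of those intermediate arrays for yourself, a throwaway comparison with the core Benchmark module is one way to do it. This is only a sketch with synthetic data (the word list and counts are made up); your real data will behave differently:

#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic data: 100_000 made-up words with random counts.
my %f = map { ( "word$_" => int rand 1e6 ) } 1 .. 100_000;

cmpthese( -5, {
    # The OP's approach: parallel arrays plus an index sort.
    parallel_arrays => sub {
        my @p = values %f;
        my @q = keys %f;
        my @i = sort { $p[$b] <=> $p[$a] } 0 .. $#p;
        my @out = map { "$q[$_]\t$p[$_]" } @i;
    },
    # Sorting the hash keys by value directly.
    direct_hash_sort => sub {
        my @out = map { "$_\t$f{$_}" }
                  sort { $f{$b} <=> $f{$a} } keys %f;
    },
} );

    cmpthese() prints a rate table, so you can see at a glance which approach wins on your hardware.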

    I created these three dummy test files. In case you're unfamiliar with cat -vet, ^I represents a tab and $ represents a newline.

$ for i in A B C; do echo -e "\n*** $i"; cat $i; echo '----'; cat -vet $i; done

*** A
foo	73
bar	35
word	27
blah	23
----
foo^I73$
bar^I35$
word^I27$
blah^I23$

*** B
bar	35
yada	3
word	27
blah	23
----
bar^I35$
yada^I3$
word^I27$
blah^I23$

*** C
foo	73
word	27
blah	23
life	42
----
foo^I73$
word^I27$
blah^I23$
life^I42$

    Then this test code:

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my @in_files = qw{A B C};
my $outfile  = 'merge_count.out';

my %data;
my $out_fmt = "%s\t%d\n";

for my $infile (@in_files) {
    open my $fh, '<', $infile;
    while (<$fh>) {
        my ($word, $count) = split;
        $data{$word} += $count;
    }
}

open my $fh, '>', $outfile;

for my $key (sort { $data{$a} <=> $data{$b} } keys %data) {
    printf $fh $out_fmt, $key, $data{$key};
}

    Output (raw and showing special characters):

$ cat merge_count.out
yada	3
life	42
blah	69
bar	70
word	81
foo	146
$ cat -vet merge_count.out
yada^I3$
life^I42$
blah^I69$
bar^I70$
word^I81$
foo^I146$

    Try that with your real files. I suspect it should be faster and not have the bottlenecks. Let us know if you still have problems: show your new code and profiling output (in <readme> or <spoiler> tags).

    — Ken

      A number of problems with the code you posted: Missing strict ... Missing warnings ...

      Kudos to kcott for patiently showing by example yet again how to write excellent, clean and succinct Perl code. Unfortunately, it seems unlikely that the monk in question will follow your sage advice, despite a gentle nudge from haukex a year ago.

      Given Bod's recent posting history on this topic, I'm trusting that a once similarly recalcitrant Bod has now seen the light and might be persuaded to share his experiences ... along with why (or why not) he uses strict and warnings nowadays.

      Oh, and just in case it helps, a non-Perl Monk reference on this topic: Always use strict and warnings in your Perl code (perlmaven)

      Update: I also keep a list of references on this topic: use strict and warnings References

        G'day eyepopslikeamosquito,

        Thanks for the compliment.

        I usually look at an OP's previous posts. This gives me an idea of the OP's level of Perl knowledge and how best to frame my response. I did check on this occasion; wondered if I was flogging a dead horse; but chose to proceed anyway.

        To Chuma: I've been writing Perl code for almost 30 years. A substantial part of my paid employment involves writing Perl code. I use these pragmata for personal, $work and PM code. I don't do this because it's trendy, expected, or for any other frivolous reasons; I do it because they're extremely useful and save me a lot of time.

        Even with decades of experience, I still make typos and other silly mistakes (just like everyone else does) — I'd like Perl to tell me about these problems as soon as possible, instead of getting weird or unexpected output and spending a lot of time tracking down the source of the bug. I encourage you to use these pragmata for the same reasons.
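
        A tiny, contrived example of the sort of mistake strict catches immediately (the variable name is made up):

use strict;
use warnings;

my $total = 0;
$tota1 += 5;    # typo: digit "1" instead of letter "l"
print "$total\n";

# Without strict this runs silently and prints 0.
# With strict it aborts at compile time:
#   Global symbol "$tota1" requires explicit package name ...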

        — Ken

Re: Long list is long
by jwkrahn (Abbot) on Oct 30, 2022 at 03:49 UTC

    You don't really need the arrays @p, @q and @i, you could just do this:

my %f;

for my $file ( @files ) {
    open my $IN, '<', $file or die "Cannot open '$file' because: $!";
    while ( <$IN> ) {
        /^(.*)\t(.*)$/ && $f{ $1 } += $2;
    }
}

open my $OUT, '>', $outpath or die "Cannot open '$outpath' because: $!";

for my $key ( sort { $f{ $b } <=> $f{ $a } } keys %f ) {
    print $OUT "$key\t$f{$key}\n";
}

      This line:

      /^(.*)\t(.*)$/ && $f{ $1 } += $2;
      won't compile ("Can't modify logical and (&&) in addition (+)"). Of course, it should read:
      /^(.*)\t(.*)$/ and $f{ $1 } += $2;
      Apart from that, I believe your code is correct -- and a big improvement on the OP's original code (which I couldn't look at without putting on sunglasses ;-).
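
      For anyone wondering why that one-word change matters: && binds more tightly than +=, while and binds more loosely, so the two versions parse differently. Roughly:

# The && version parses as an attempt to assign to the result of &&:
( /^(.*)\t(.*)$/ && $f{$1} ) += $2;    # compile-time error

# The 'and' version parses as intended:
/^(.*)\t(.*)$/ and ( $f{$1} += $2 );

# Explicit parentheses would also make the && form legal:
/^(.*)\t(.*)$/ && ( $f{$1} += $2 );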

      Thanks, looks good!

      I tried it with a few benchmarks, and it takes... twice as long. Huh! That's weird.

Re: Long list is long
by tybalt89 (Monsignor) on Oct 30, 2022 at 14:28 UTC

    If I were on a linux system, I'd try it this way. The perl takes almost no memory :)

#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11147822
use warnings;
use Path::Tiny;

path("/tmp/11147822.$_.txt")->spew( <<END ) for 1 .. 100; # FIXME for testing only
foo	73
bar	35
word	27
blah	23
END

open my $fh, '-|', 'sort /tmp/11147822.*.txt' or die;  # FIXME for your file list
open my $tosort, '|-', 'sort -n -k2' or die;

<$fh> =~ /^(\w+)\s+(\d+)/ or die;
my ($name, $value) = ($1, $2);
while( <$fh> )
  {
  if( /^(\w+)\s+(\d+)/ )
    {
    if( $1 eq $name )
      {
      $value += $2;
      }
    else
      {
      print $tosort "$name\t$value\n";
      ($name, $value) = ($1, $2);
      }
    }
  }
print $tosort "$name\t$value\n";    # be sure to print last one
close $tosort or die;

    Outputs:

blah	2300
word	2700
bar	3500
foo	7300
      I have no idea what this does, but it looks cool! I'm using a Mac; would that work? Supposing I have the file list in @files and the output file name in $outpath, how would you integrate that?

        I don't have a Mac so I can't test this, but I assume the Mac has a "sort" app somewhat like the *nix app (try "which sort" at a command line, and then "man sort"), in which case my guess is something like this:

#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11147822
use warnings;

my @files = map "/tmp/11147822.$_.txt", 1 .. 100;   # FIXME
my $outpath = "/tmp/11147822.outpath";              # FIXME

open my $fh, '-|', "sort @files" or die;
open my $tosort, '|-', "sort -n -k2 >$outpath" or die;

<$fh> =~ /^(\w+)\s+(\d+)/ or die;
my ($name, $value) = ($1, $2);
while( <$fh> )
  {
  if( /^(\w+)\s+(\d+)/ )
    {
    if( $1 eq $name )
      {
      $value += $2;
      }
    else
      {
      print $tosort "$name\t$value\n";
      ($name, $value) = ($1, $2);
      }
    }
  }
print $tosort "$name\t$value\n";    # be sure to print last one
close $tosort or die;

        Assuming Macs are more or less built on BSD, still have a command line, and have a more or less standard "sort" app.

Re: Long list is long
by eyepopslikeamosquito (Archbishop) on Oct 31, 2022 at 07:19 UTC

    kcott already provided a nice clean Perl solution. If his solution is running too slowly, please post a reply with your new code, along with system profiling output, just in case we can spot any booboos you might have made in implementing his suggestions.

    Given you have "2064 files (up to a couple hundred MB)", it would not surprise me if you've hit a chronic thrashing problem. If so, I can see three options you might try:

    • The approach taken by tybalt89 using an external sort command.
    • Replace Perl's internal sort function with an external sort such as Sort::External (a rough sketch follows after this list) - see also the "External Sorting" section at Sorting References.
    • Depending on your hardware/budget, a simple solution requiring no coding changes at all may be to just buy more memory! :) You can buy 32 GB of PC memory for around $200 nowadays. Update: Note that a DDR4 DIMM can hold up to 64 GB, while DDR5 octuples that to 512 GB ... which makes worrying about a program's memory usage seem much less important than in the good old days. :)
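
    As a rough, untested sketch of the second option: assuming a hash %sum already holds the combined word => count pairs (built as in the hash-based solutions above), the final sort could be delegated to Sort::External (using its feed/finish/fetch interface) rather than Perl's in-memory sort:

use strict;
use warnings;
use Sort::External;

# Assumption for this sketch: %sum already holds word => combined count.
my %sum = ( foo => 146, word => 81, bar => 70, blah => 69 );

# Sort "word\tcount" lines numerically descending on the count field,
# spilling to temporary files instead of holding one giant list in memory.
my $sortex = Sort::External->new(
    mem_threshold => 64 * 1024 * 1024,
    sortsub       => sub {
        ( split /\t/, $Sort::External::b )[1]
            <=>
        ( split /\t/, $Sort::External::a )[1];
    },
);

$sortex->feed("$_\t$sum{$_}") for keys %sum;
$sortex->finish;

while ( defined( my $line = $sortex->fetch ) ) {
    print "$line\n";
}

    The sortsub splits each "word\tcount" line and compares the count fields numerically, descending, so the fetched lines come back already in the required order.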

    Alternative suggestions welcome.

Re: Long list is long
by marioroy (Prior) on Feb 10, 2023 at 03:46 UTC

    Greetings, Chuma

    Eyepopslikeamosquito started a thread here. The thread is quite long, containing many solutions. Most of them compute and store the map table in memory. You mentioned having 2,064 list files, up to a couple hundred MB each. If your data set contains billions of unique key-value pairs, most of the solutions may fail due to insufficient memory capacity.

    We created a backup plan (well, there are two). One solution requires less memory and can handle hundreds of list files, possibly thousands: parsort from GNU Parallel plus tally-count. The instructions for building tally-count.cpp are in the source.

    Do not run the default sort utility on the box; instead run parsort (as shown) for minimum memory consumption. At the time of writing, it is best to have parsort read STDIN when processing hundreds of files. Files specified as arguments cause parsort to create two processes per file.

cat list1.txt list2.txt ... listN.txt |\
LC_ALL=C parsort --parallel=24 -k1 | /path/to/tally-count |\
LC_ALL=C parsort --parallel=24 -k2nr > out.txt

    The output from the first sort command contains duplicate key names. Next, the tally-count command sums adjacent count fields of duplicate names. Its output contains unique names. Finally, parsort is called again to sort by sum in descending order and keyname in ascending order.
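
    If you want to experiment without building the C++ tool first, a rough Perl stand-in for the tally step (my own sketch, not the actual tally-count program) could look like this. It assumes its input is already sorted by key so that duplicates are adjacent, which is what the first parsort pass guarantees:

#!/usr/bin/env perl
# Rough stand-in for tally-count: reads "key count" lines on STDIN,
# assumes duplicate keys are adjacent, sums their counts and prints
# each key exactly once.
use strict;
use warnings;

my ( $name, $sum );
while (<STDIN>) {
    my ( $key, $count ) = split;
    if ( defined $name && $key eq $name ) {
        $sum += $count;
    }
    else {
        print "$name\t$sum\n" if defined $name;
        ( $name, $sum ) = ( $key, $count );
    }
}
print "$name\t$sum\n" if defined $name;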

    TMPDIR available space.

    Processing two thousand list files via parsort may cause $TMPDIR (/tmp) to fill up. You can set the TMPDIR environment variable to a location with ample storage or specify the -T option. Ensure 1.2x to 2x available temp-storage of the combined input size.

cat list1.txt list2.txt ... listN.txt |\
LC_ALL=C parsort --parallel=24 -T /path/to/tmpdir -k1 | /path/to/tally-count |\
LC_ALL=C parsort --parallel=24 -T /path/to/tmpdir -k2nr > final.txt

    I made a parsort variant named mcesort.

cat list1.txt list2.txt ... listN.txt |\
LC_ALL=C mcesort --parallel=24 -T /path/to/tmpdir -k1 | /path/to/tally-count |\
LC_ALL=C mcesort --parallel=24 -T /path/to/tmpdir -k2nr > final.txt

    For extra performance, the --tally option integrates the tally command with the merge step. The effect is less work for the subsequent merge, which is beneficial when there are many duplicate keys. The -A option sets LC_ALL=C. The -j24 option is short for --parallel=24.

cat list1.txt list2.txt ... listN.txt |\
mcesort -A -j24 -T /path/to/tmpdir --tally=/path/to/tally-count -k1 |\
mcesort -A -j24 -T /path/to/tmpdir -k2nr > final.txt

    Noteworthy C++ solutions.

    You can run llil4map, llil4map2 (memory efficient), llil4hmap, or llil4emh. Temporary storage isn't used and there's no limit to the number of list files. Are the keys of limited length, say 8 to 20 characters? If so, uncomment the line "#define MAX_STR_LEN_L" and adjust the value accordingly. That will minimize memory consumption.

NUM_THREADS=24 llil4map  list1.txt list2.txt ... listN.txt > final.txt
NUM_THREADS=24 llil4hmap list1.txt list2.txt ... listN.txt > final.txt
NUM_THREADS=24 llil4emh  list1.txt list2.txt ... listN.txt > final.txt

    If you're processing a hundred or fewer files (~ 200MB each; again, depending on memory capacity ~ 50GB), the llil4vec solution favors time over space, based on llil3vec. Be sure to uncomment the line "#define MAX_STR_LEN_L" for limited-length keys; e.g. 20 or less. See also, llil4vec2.

    NUM_THREADS=24 llil4vec list1.txt list2.txt ... listN.txt > final.txt

    July 2023 alternative Perl/C++ solutions. For results, refer to C++ vs Perl.

    llilsql.pl   - Tie::Hash::DBD SQLite database
    llildbf.pl   - DB_File B-tree database
    llilmma.pl   - IPC::MMA shared memory hash
    lliltch.pl   - Tokyo Cabinet hash database
    llilkch.pl   - Kyoto Cabinet hash database
    llil4tkh.cc  - tkrzw::ShardDBM demonstration
    llil4tkh2.cc - tkrzw::HashDBM demonstration
    

    March 2024 updates.

    1. Reported issues to NVIDIA regarding nvc++. Greg fixed warnings
       in his phmap library. No more warnings using nvc++.
    
    2. Use spinlock_mutex found on the web, versus std::mutex.
       This improves llil4map, llil4hmap, and llil4emh.
       Tested g++, clang++, and nvc++.
       https://github.com/clearlinux/distribution/issues/3036
    
    3. Greg, the author of the phmap C++ library, provided suggestion for
       ensuring minimum memory consumption during "map to vector", in llil4map.
       https://github.com/greg7mdp/parallel-hashmap/issues/238
    
    4. The llil4map demonstration lives at the phmap C++ repository.
       examples/llil4map.cc, and associated llil_utils folder
       https://github.com/greg7mdp/parallel-hashmap
    
    5. Applied Greg's minor simplification for parallel sort to C++ demonstrations.
       Updated "map to vector" in Gist llil4map.cc to simplified version.
       https://github.com/greg7mdp/parallel-hashmap/commit/38dd6b52940937cae06191b7fb39ad0d6aeea4d2
    

    April 2024 updates.

    1. Added llil4map2, a memory efficient version.
       https://github.com/greg7mdp/parallel-hashmap/issues/238#issuecomment-2036284574
    
    2. A more memory efficient version, by Gregory Popovitch, author of the C++ parallel-hashmap library.
       https://github.com/greg7mdp/parallel-hashmap/issues/238#issuecomment-2039591745
    

    See also, The Long List is Long resurrected.

Re: Long list is long
by NERDVANA (Deacon) on Oct 31, 2022 at 02:26 UTC
    The performance problem all sounds to me like you have an insane number of words being passed to sort. Unfortunately, sort takes a list, and that list has to be on Perl's argument stack all at once. But, you should double-check that: print the count of words at the end of the script, for reference. Also you should check your system memory usage as it runs; if the script runs you into disk swap, then yes it is going to run very slowly.

    I believe there is actually an internal optimization for sorting an array in-place, so @foo= sort { ... } @foo; should execute faster than your example because the array never actually goes onto the perl argument stack. Of course, that doesn't really fit your data structure too well.

    You could try emptying the hash to free up memory using undef %f; after you pull out all the keys and values. That should cut your RAM usage by half before the sort step (but maybe too late to avoid running into disk swap).

    I'd be curious if you could make use of Tree::RB::XS here, to avoid ever having to expand a full list of the items:

use Tree::RB::XS;

for(@files){
  open IN, $_;
  while(<IN>){
    /^( [^\t\n]+ )\t( [0-9]+ )$/x;   # avoid backtracking
    $f{$1}+=$2;
  }
}
my $t= Tree::RB::XS->new(compare_fn => 'int', allow_duplicates => 1);
while (my ($word, $count)= each %f) {
  $t->insert($count, $word);
}
my $iter= $t->iter;
printf OUT "%s\t%s\n", reverse $iter->next_kv while !$iter->done;

    I apologize that I'm too lazy to generate some hundred-gigabyte files to benchmark this theory on my own.

    If none of these keep the memory usage low enough to be useful, I would suggest trying a database. You could then run the query that sorts them and iterate the rows using DBI. There's a huge overhead in putting them into a database, but if you re-use this data then the database index would save you a lot of computing power in the long run.

Re: Long list is long
by Fletch (Bishop) on Oct 31, 2022 at 11:21 UTC

    Since (as I'm reading it . . .) the ordering is determined by the final sum of all values from the different files, I'd try to leverage DBI and an SQLite DB. Read each file into a table with word and cur_sum columns, updating the latter with each word's value. Once you've read all files, use a SELECT word, cur_sum FROM foo ORDER BY cur_sum DESC to pull out the final results.
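
    A rough, untested sketch of that idea with DBI and DBD::SQLite (table and column names as above; the upsert syntax needs SQLite 3.24 or later, and @files/$outpath are assumed to already exist):

#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

my @files   = glob 'list*.txt';   # assumption: the input file list
my $outpath = 'merged.out';       # assumption: the output file

my $dbh = DBI->connect( 'dbi:SQLite:dbname=counts.db', '', '',
    { RaiseError => 1, AutoCommit => 0 } );

$dbh->do('CREATE TABLE IF NOT EXISTS foo (word TEXT PRIMARY KEY, cur_sum INTEGER)');

my $upsert = $dbh->prepare(
    'INSERT INTO foo (word, cur_sum) VALUES (?, ?)
     ON CONFLICT(word) DO UPDATE SET cur_sum = cur_sum + excluded.cur_sum'
);

for my $file (@files) {
    open my $in, '<', $file or die "Cannot open '$file': $!";
    while (<$in>) {
        my ( $word, $count ) = split;
        $upsert->execute( $word, $count );
    }
    $dbh->commit;    # one transaction per file keeps the inserts fast
}

open my $out, '>', $outpath or die "Cannot open '$outpath': $!";
my $sth = $dbh->prepare('SELECT word, cur_sum FROM foo ORDER BY cur_sum DESC');
$sth->execute;
while ( my ( $word, $sum ) = $sth->fetchrow_array ) {
    print {$out} "$word\t$sum\n";
}
$dbh->disconnect;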

    Alternately, DB_File could be used to do something similar to create a summed-values file (read each input file and increment the counts in the tied DB; then walk the keys of the DB to create an (unsorted) output file) and then you could try using sort(1) to create the final file from that. If that is too much for your box then use split (or write multiple files), sort those component files, then write an aggregator to merge them (implement the merge of merge sort: read one line from all the (now sorted) sub-files, output the highest and replace it, lather rinse repeat).
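
    And a rough sketch of the DB_File variant, with the final ordering left to sort(1) as suggested (file names here are made up):

#!/usr/bin/env perl
# Rough sketch: counts live in an on-disk B-tree, then the unsorted
# dump is handed to sort(1) outside Perl.
use strict;
use warnings;
use DB_File;
use Fcntl;

my @files = glob 'list*.txt';    # assumption: the input file list

tie my %sum, 'DB_File', 'sums.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie sums.db: $!";

for my $file (@files) {
    open my $in, '<', $file or die "Cannot open '$file': $!";
    while (<$in>) {
        my ( $word, $count ) = split;
        $sum{$word} += $count;
    }
}

open my $out, '>', 'unsorted.tsv' or die "Cannot open 'unsorted.tsv': $!";
while ( my ( $word, $count ) = each %sum ) {
    print {$out} "$word\t$count\n";
}
close $out;
untie %sum;

# Then, outside Perl:
#   sort -t "$(printf '\t')" -k2,2nr unsorted.tsv > merged.out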

    EDIT: wording. ENOCAFFEINE.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: Long list is long
by dsheroh (Monsignor) on Oct 31, 2022 at 12:52 UTC
    the original lists are sorted
    This suggests a potential optimization, which would greatly reduce memory requirements and completely eliminate the need to sort your output, but at the cost of requiring a large number of filehandles:
    • Open all the files at once.
    • Grab the first line of each file and store it (along with an annotation to tell you which file it came from) in a heap or similar data structure. You could even use a plain array here, splicing each line into its proper (sorted) position as it's added.
    • Retrieve the first (i.e., earliest-sorting) element from the heap. Parse it into the word and the number. If it's the word you're currently working on, add the number to the total; if it's a new word, write it to your output along with its total and start a new total for the new word.
    • Read the next line from the file which provided the element you just handled and store that line in the heap.
    • Repeat until all files run out of lines.
    Since all input files are sorted and you're always processing the unprocessed line which sorts earliest, the output will automatically be sorted, without ever needing to have more than one line per input file in memory concurrently.

    If you aren't able to allocate enough file handles at once to process the entire list of input files in one go, this technique can also be used to do the work in multiple stages - divide the files into N groups, get the combined sums for each group, and then use those combined sums as your new set of sorted input files.

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;

use Array::Heap::PriorityQueue::String;

my @file_names = qw( in1 in2 in3 );

my @fh;
my $heap = Array::Heap::PriorityQueue::String->new;

for (0 .. $#file_names) {
    my $fn = $file_names[$_];
    open my $new_fh, '<', $fn or die "Unable to open $fn: $!\n";
    $fh[$_] = $new_fh;
    add_from_fh_id($_);
}

my ($current_word) = split /\s+/, $heap->peek;
my $total = 0;

while (my $item = $heap->get) {
    my ($word, $count, $fh_id) = split /\s+/, $item;
    if ($word eq $current_word) {
        $total += $count;
    }
    else {
        say "$current_word\t$total";
        $current_word = $word;
        $total = $count;
    }
    add_from_fh_id($fh_id);
}
say "$current_word\t$total";

sub add_from_fh_id {
    my $fh_id = shift;
    my $read_from = $fh[$fh_id];
    my $next_line = <$read_from>;
    if ($next_line) {
        chomp $next_line;
        $heap->add("$next_line $fh_id");
    }
}
    For my testing, I used the input files from kcott's reply, sorted and renamed to 'in1', 'in2', and 'in3'. This gave me the output:
bar	70
blah	69
foo	146
life	42
word	81
yada	3

      It is not fully spelled out, but I think it likely that "the original lists are sorted" was intended by the OP to mean that "the lines are sorted by value" (as the output is required to be), not that the words appear in ASCII order.

      This is additionally hinted at by the initial 4-line example input.

        *facepalm* Good catch. I was so busy being clever that I forgot to read the specs properly. Thanks for pointing it out!
Re: Long list is long
by eyepopslikeamosquito (Archbishop) on Mar 11, 2023 at 11:11 UTC
Re: Long list is long
by eyepopslikeamosquito (Archbishop) on Nov 02, 2022 at 10:17 UTC

    I really like your "Long list is long" node title, even though I don't fully understand its meaning.

    I long for the long-winded replies you received to go a long way towards a long lasting solution before long that works for you in the long run. :)

      I really like your "Long list is long" node title, even though I don't fully understand its meaning.

      I suspect a play on the "longcat is long" meme.

      And here is some good advice for any new monk wanting to make a good impression. Be friendly in your replies:

      The long and the short of it is, don’t be short if you want to belong.

      eyepopslikeamosquito: Are we both now eligible for membership in the ancient and venerable order of punsters?

      Athanasius <°(((>< contra mundum, סתם עוד האקר של פרל ("just another Perl hacker")
