Chuma has asked for the wisdom of the Perl Monks concerning the following question:
Dear monks,
I have a set of files, basically in this format:
foo 73
bar 35
word 27
blah 23
...
Now I need to combine them into one big file, in the same format, so that the number following "foo" is the sum of all the "foo" numbers in the individual files, and the lines are sorted by value.
My attempt is something like:
    for(@files){
        open IN, $_;
        while(<IN>){
            /^(.*)\t(.*)$/;
            $f{$1}+=$2;
        }
    }
    @p=values %f;
    @q=keys %f;
    @i=sort{$p[$b]<=>$p[$a]} 0..$#p;
    open OUT, '>', $outpath;
    for(@i){
        print OUT "$q[$_]\t$p[$_]\n";
    }
But the files are pretty large (up to a couple hundred MB), there are quite a few of them (2064), and things get pretty slow. Reading in the files seems to sometimes move along nicely, and then suddenly nothing happens for a long time – out of memory, or something? The processor doesn't seem to be very busy, so I'm guessing there's some other bottleneck. And then the step after the loop takes a surprisingly long time, just writing the arrays – I tested it on 12 files, and it took the better part of an hour just for those two lines.
Is there a better way?
If it matters – the original lists are sorted, they don't all contain the same words, all numbers are positive integers.
Re: Long list is long
by kcott (Archbishop) on Oct 30, 2022 at 04:12 UTC
G'day Chuma,

A number of problems with the code you posted, starting with the missing strict and warnings. From your description, I'd say the bottleneck lies with the population of the three arrays: @p, @q and @i. This is all unnecessary work and those arrays are not even needed. See "perlperf - Perl Performance and Optimization Techniques" for benchmarking and profiling techniques, to get a clearer picture of where the problems lie.

I created these three dummy test files. In case you're unfamiliar with cat -vet, ^I represents a tab and $ represents a newline.
Then this test code:
Output (raw and showing special characters):
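Neither the code block nor its output has survived in this copy of the thread; a minimal sketch along the lines described (not kcott's actual program: strict and warnings, a single hash, keys sorted directly by their summed values, assuming tab-separated word/count lines as in the OP's example) might look like this:

    #!/usr/bin/env perl
    # Sketch only -- not kcott's original code.  Assumes tab-separated
    # "word<TAB>count" lines in the files named on the command line.
    use strict;
    use warnings;
    use autodie;

    my %sum;

    for my $file (@ARGV) {
        open my $in, '<', $file;
        while (my $line = <$in>) {
            chomp $line;
            my ($word, $count) = split /\t/, $line, 2;
            $sum{$word} += $count;
        }
        close $in;
    }

    # Sort the keys by their summed value, highest first, and print directly.
    for my $word (sort { $sum{$b} <=> $sum{$a} } keys %sum) {
        print "$word\t$sum{$word}\n";
    }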
Try that with your real files. I suspect it should be faster and not have the bottlenecks. Let us know if you still have problems: show your new code and profiling output (in <readmore> or <spoiler> tags). — Ken
by eyepopslikeamosquito (Archbishop) on Oct 30, 2022 at 10:11 UTC
A number of problems with the code you posted: Missing strict ... Missing warnings ...

Kudos to kcott for patiently showing by example yet again how to write excellent, clean and succinct Perl code. Unfortunately, it seems unlikely that the monk in question will follow your sage advice, despite a gentle nudge from haukex a year ago. Given the Bod's recent posting history on this topic, I'm trusting that a once similarly recalcitrant Bod has now seen the light and might be persuaded to share his experiences ... along with why (or why not) he uses strict and warnings nowadays.

Oh, and just in case it helps, a non-Perl Monk reference on this topic: Always use strict and warnings in your Perl code (perlmaven).

Update: I also keep a list of references on this topic: use strict and warnings References
by kcott (Archbishop) on Oct 30, 2022 at 13:34 UTC
G'day eyepopslikeamosquito,

Thanks for the compliment. I usually look at an OP's previous posts. This gives me an idea of the OP's level of Perl knowledge and how best to frame my response. I did check on this occasion; wondered if I was flogging a dead horse; but chose to proceed anyway.

To Chuma: I've been writing Perl code for almost 30 years. A substantial part of my paid employment involves writing Perl code. I use these pragmata for personal, $work and PM code. I don't do this because it's trendy, expected, or for any other frivolous reasons; I do it because they're extremely useful and save me a lot of time. Even with decades of experience, I still make typos and other silly mistakes (just like everyone else does) — I'd like Perl to tell me about these problems as soon as possible, instead of getting weird or unexpected output and spending a lot of time tracking down the source of the bug. I encourage you to use these pragmata for the same reasons.

— Ken
Re: Long list is long
by jwkrahn (Abbot) on Oct 30, 2022 at 03:49 UTC
You don't really need the arrays @p, @q and @i, you could just do this:
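The code itself isn't preserved here; a short sketch of that replacement (not jwkrahn's exact code), reusing the OP's %f and $outpath: once %f holds the summed counts, the three arrays can be dropped and the hash keys sorted by value while printing.

    # Sketch only -- sort the hash keys directly by their values.
    open OUT, '>', $outpath or die "Cannot open '$outpath': $!";
    for my $word ( sort { $f{$b} <=> $f{$a} } keys %f ) {
        print OUT "$word\t$f{$word}\n";
    }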
by eyepopslikeamosquito (Archbishop) on Oct 30, 2022 at 05:44 UTC
One of the posted lines won't compile ("Can't modify logical and (&&) in addition (+)") and needs a small precedence fix. Apart from that, I believe your code is correct -- and a big improvement on the OP's original code (which I couldn't look at without putting on sunglasses ;-).
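That error message appears when the left-hand side of += parses as an && expression, because += binds more tightly than &&. A purely hypothetical line of that shape (not the actual line from the reply), and two ways such a line could be fixed:

    # Perl parses the following as (/^(.*)\t(.*)$/ && $f{$1}) += $2,
    # which is the compile-time error quoted above:
    #     /^(.*)\t(.*)$/ && $f{$1} += $2;

    # The low-precedence 'and', or a statement modifier, keeps the intended meaning:
    /^(.*)\t(.*)$/ and $f{$1} += $2;
    $f{$1} += $2 if /^(.*)\t(.*)$/;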
by Chuma (Scribe) on Oct 30, 2022 at 12:49 UTC
Thanks, looks good! I tried it with a few benchmarks, and it takes... twice as long. Huh! That's weird.
Re: Long list is long
by tybalt89 (Monsignor) on Oct 30, 2022 at 14:28 UTC
If I were on a linux system, I'd try it this way. The perl takes almost no memory :)
Outputs:
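Neither tybalt89's program nor its output is preserved in this copy. One way to get "almost no memory" in the Perl part on Linux is to lean on the external sort(1): one sort groups identical words together, Perl sums each adjacent run while holding only a single running total, and a second sort orders the summed lines by value. A guess at that shape (assumptions: GNU sort is available, lines are tab-separated word/count pairs, and the result goes to a made-up out.txt):

    #!/usr/bin/perl
    # Sketch only -- not tybalt89's original code.
    use strict;
    use warnings;

    my @files = @ARGV;    # the input list files

    # sort(1) groups identical words (external merge sort, very little RAM)
    open my $in, '-|', 'sort', '-t', "\t", '-k1,1', @files
        or die "sort (read): $!";
    # a second sort orders the summed lines by count, highest first
    open my $out, '|-', 'sort', '-t', "\t", '-k2,2rn', '-o', 'out.txt'
        or die "sort (write): $!";

    my ($word, $sum) = (undef, 0);
    while (<$in>) {
        chomp;
        my ($w, $n) = split /\t/;
        if (defined $word and $w ne $word) {
            print {$out} "$word\t$sum\n";    # finished one word's run
            $sum = 0;
        }
        ($word, $sum) = ($w, $sum + $n);
    }
    print {$out} "$word\t$sum\n" if defined $word;
    close $out or die "sort (write) failed: $!";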
by Chuma (Scribe) on Oct 30, 2022 at 15:18 UTC
by tybalt89 (Monsignor) on Oct 30, 2022 at 17:08 UTC
I don't have a mac so I can't test this, but I assume the mac has a "sort" app somewhat like the *nix app (try "which sort" at a command line, and then "man sort"), then my guess is something like this:
Assuming macs are sort of built on BSD and still have a command line and have a more or less "sort" app.
Re: Long list is long
by eyepopslikeamosquito (Archbishop) on Oct 31, 2022 at 07:19 UTC
kcott already provided a nice clean Perl solution. If his solution is running too slowly, please post a reply with your new code, along with system profiling output, just in case we can spot any booboos you might have made in implementing his suggestions. Given you have "2064 files (up to a couple hundred MB)", it would not surprise me if you've hit a chronic thrashing problem. If so, I can see a few options you might try; alternative suggestions welcome.
Re: Long list is long
by marioroy (Prior) on Feb 10, 2023 at 03:46 UTC
Greetings, Chuma

Eyepopslikeamosquito started a thread here. The thread is quite long and contains many solutions. Most of them compute and store the map table in memory. You mentioned having 2,064 list files, up to a couple hundred MB each. If your data set contains billions of unique key-value pairs, most of the solutions may fail due to insufficient memory capacity. We created a backup plan (well, there are two). One solution requires less memory and can handle hundreds of list files, possibly thousands. That is parsort from GNU Parallel and tally-count. The instructions for building tally-count.cpp are in the source. Do not run the default sort utility on the box; instead run parsort for minimum memory consumption. At the time of writing, it is best to have parsort read STDIN when processing hundreds of files. Files specified as arguments cause parsort to create two processes per file.

The output from the first sort command contains duplicate key names. Next, the tally-count command sums adjacent count fields of duplicate names. Its output contains unique names. Finally, parsort is called again to sort by sum in descending order and key name in ascending order.

TMPDIR available space: processing two thousand list files via parsort may cause $TMPDIR (/tmp) to fill up. You can set the TMPDIR environment variable to a location with ample storage or specify the -T option. Ensure available temp storage of 1.2x to 2x the combined input size.
I made a parsort variant named mcesort.
For extra performance, the --tally option integrates the tally command with the merge. The effect is less work for the subsequent merge, which is beneficial when there are many duplicate keys. The -A option sets LC_ALL=C. The -j24 option is short for --parallel=24.

Noteworthy C++ solutions. You can run llil4map, llil4map2 (memory efficient), llil4hmap, or llil4emh. Temporary storage isn't used and there's no limit to the number of list files. Are the keys of limited length, say 8 to 20 characters? Then uncomment the line "#define MAX_STR_LEN_L" and adjust the value accordingly. That will minimize memory consumption.

If you're processing a hundred or fewer files (~ 200MB each; again, depending on memory capacity ~ 50GB), the llil4vec solution favors time over space, based on llil3vec. Be sure to uncomment the line "#define MAX_STR_LEN_L" for limited-length keys, e.g. 20 or less. See also llil4vec2.
July 2023 alternative Perl/C++ solutions. For results, refer to C++ vs Perl.

llilsql.pl - Tie::Hash::DBD SQLite database
llildbf.pl - DB_File B-tree database
llilmma.pl - IPC::MMA shared memory hash
lliltch.pl - Tokyo Cabinet hash database
llilkch.pl - Kyoto Cabinet hash database
llil4tkh.cc - tkrzw::ShardDBM demonstration
llil4tkh2.cc - tkrzw::HashDBM demonstration

March 2024 updates.
1. Reported issues to NVIDIA regarding nvc++. Greg fixed warnings in his phmap library. No more warnings using nvc++.
2. Use spinlock_mutex found on the web, versus std::mutex. This improves llil4map, llil4hmap, and llil4emh. Tested g++, clang++, and nvc++. https://github.com/clearlinux/distribution/issues/3036
3. Greg, the author of the phmap C++ library, provided a suggestion for ensuring minimum memory consumption during "map to vector" in llil4map. https://github.com/greg7mdp/parallel-hashmap/issues/238
4. The llil4map demonstration lives at the phmap C++ repository: examples/llil4map.cc and the associated llil_utils folder. https://github.com/greg7mdp/parallel-hashmap
5. Applied Greg's minor simplification for parallel sort to the C++ demonstrations. Updated "map to vector" in the Gist llil4map.cc to the simplified version. https://github.com/greg7mdp/parallel-hashmap/commit/38dd6b52940937cae06191b7fb39ad0d6aeea4d2

April 2024 updates.
1. Added llil4map2, a memory efficient version. https://github.com/greg7mdp/parallel-hashmap/issues/238#issuecomment-2036284574
2. A more memory efficient version, by Gregory Popovitch, author of the C++ parallel-hashmap library. https://github.com/greg7mdp/parallel-hashmap/issues/238#issuecomment-2039591745

See also, The Long List is Long resurrected.
Re: Long list is long
by NERDVANA (Priest) on Oct 31, 2022 at 02:26 UTC
I believe there is actually an internal optimization for sorting an array in-place, so @foo = sort { ... } @foo; should execute faster than your example because the array never actually goes onto the perl argument stack. Of course, that doesn't really fit your data structure too well. You could try emptying the hash to free up memory using undef %f; after you pull out all the keys and values. That should cut your RAM usage by half before the sort step (but maybe too late to avoid running into disk swap). I'd be curious whether you could make use of Tree::RB::XS here, to avoid ever having to expand a full list of the items.

I apologize that I'm too lazy to generate some hundred-gigabyte files to benchmark this theory on my own. If none of these keep the memory usage low enough to be useful, I would suggest trying a database. You could then run the query that sorts them and iterate the rows using DBI. There's a huge overhead in putting them into a database, but if you re-use this data then the database index would save you a lot of computing power in the long run.
Re: Long list is long
by Fletch (Bishop) on Oct 31, 2022 at 11:21 UTC
Since (as I'm reading it . . .) the ordering is determined by the final sum of all values from the different files, I'd try to leverage DBI and an SQLite DB. Read each file into a table with word and cur_sum columns, updating the latter with each word's value. Once you've read all the files, use a SELECT word, cur_sum FROM foo ORDER BY cur_sum DESC to pull out the final results.
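A minimal sketch of that approach, assuming DBD::SQLite is installed (and SQLite 3.24+ for the upsert), using the foo table and word/cur_sum columns described above; the words.db filename is made up:

    #!/usr/bin/perl
    # Sketch only -- one possible shape for the SQLite approach.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=words.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do('CREATE TABLE IF NOT EXISTS foo (word TEXT PRIMARY KEY, cur_sum INTEGER)');

    my $upsert = $dbh->prepare(
        'INSERT INTO foo (word, cur_sum) VALUES (?, ?)
         ON CONFLICT(word) DO UPDATE SET cur_sum = cur_sum + excluded.cur_sum'
    );

    for my $file (@ARGV) {
        open my $in, '<', $file or die "Cannot open '$file': $!";
        while (<$in>) {
            chomp;
            my ($word, $n) = split /\t/;
            $upsert->execute($word, $n);
        }
        $dbh->commit;    # one transaction per file keeps the inserts fast
    }

    # Pull out the final results, largest sums first, streaming row by row.
    my $sth = $dbh->prepare('SELECT word, cur_sum FROM foo ORDER BY cur_sum DESC');
    $sth->execute;
    while (my ($word, $sum) = $sth->fetchrow_array) {
        print "$word\t$sum\n";
    }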
Alternately, DB_File could be used in much the same way. EDIT: wording. ENOCAFFEINE.
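A guess at what that DB_File alternative might look like (not Fletch's code): tie the hash to an on-disk B-tree so the running totals live in a file rather than in RAM; the counts.db filename is made up.

    #!/usr/bin/perl
    # Sketch only -- DB_File used as an on-disk hash of running totals.
    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    tie my %sum, 'DB_File', 'counts.db', O_CREAT | O_RDWR, 0644, $DB_BTREE
        or die "Cannot tie counts.db: $!";

    while (<>) {
        chomp;
        my ($word, $n) = split /\t/;
        $sum{$word} += $n;
    }

    # The by-value sort still has to walk all the keys at the end.
    for my $word (sort { $sum{$b} <=> $sum{$a} } keys %sum) {
        print "$word\t$sum{$word}\n";
    }
    untie %sum;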
The cake is a lie.
Re: Long list is long
by dsheroh (Monsignor) on Oct 31, 2022 at 12:52 UTC
"the original lists are sorted"

This suggests a potential optimization, which would greatly reduce memory requirements and completely eliminate the need to sort your output, but at the cost of requiring a large number of filehandles (a sketch follows below). If you aren't able to allocate enough filehandles at once to process the entire list of input files in one go, this technique can also be used to do the work in multiple stages: divide the files into N groups, get the combined sums for each group, and then use those combined sums as your new set of sorted input files.

For my testing, I used the input files from kcott's reply, sorted and renamed to 'in1', 'in2', and 'in3'. This gave me the expected output.
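dsheroh's actual program isn't shown here; a rough sketch of a streaming N-way merge of that kind, assuming every input file is sorted by word and holds tab-separated word/count lines, might be:

    #!/usr/bin/perl
    # Rough sketch of a streaming N-way merge -- not dsheroh's original code.
    # Assumes each input file is sorted by word (ASCII order) and holds
    # tab-separated "word<TAB>count" lines.  Memory use: one pending line per file.
    use strict;
    use warnings;

    my @files = @ARGV;
    my (@fh, @cur);                  # per-file handle and its current [word, count]

    sub read_pair {
        my ($fh) = @_;
        defined( my $line = readline $fh ) or return undef;
        chomp $line;
        return [ split /\t/, $line, 2 ];
    }

    for my $i (0 .. $#files) {
        open $fh[$i], '<', $files[$i] or die "Cannot open '$files[$i]': $!";
        $cur[$i] = read_pair($fh[$i]);
    }

    while (1) {
        my @live = grep { defined $cur[$_] } 0 .. $#files;
        last unless @live;

        # The alphabetically smallest word currently waiting on any handle.
        my ($word) = sort map { $cur[$_][0] } @live;

        # Sum its counts from every file positioned on that word, advancing them.
        my $sum = 0;
        for my $i (@live) {
            next unless $cur[$i][0] eq $word;
            $sum += $cur[$i][1];
            $cur[$i] = read_pair($fh[$i]);
        }
        print "$word\t$sum\n";       # word-sorted totals; sort by value later if needed
    }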
by hv (Prior) on Oct 31, 2022 at 17:12 UTC
It is not fully spelled out, but I think it likely that "the original lists are sorted" was intended by the OP to mean that "the lines are sorted by value" (as the output is required to be), not that the words appear in ASCII order. This is additionally hinted at by the initial 4-line example input.
by dsheroh (Monsignor) on Nov 01, 2022 at 08:33 UTC
Re: Long list is long
by eyepopslikeamosquito (Archbishop) on Mar 11, 2023 at 11:11 UTC
Dear Chuma,

Like paco's celebrated question of 1999, your humble question has generated an extraordinary level of excitement, over a period of many months, since you sadly left us thirteen weeks ago. Especially noteworthy is marioroy's terrifyingly beautiful analysis. Update: [OT] The Long List is Long resurrected (April 2024).

I wish I could upvote your node multiple times and have added it to the list at Terrifying Beauty. Like paco, we patiently await your return.
Re: Long list is long
by eyepopslikeamosquito (Archbishop) on Nov 02, 2022 at 10:17 UTC
I really like your "Long list is long" node title, even though I don't fully understand its meaning. I long for the long-winded replies you received to go a long way towards a long lasting solution before long that works for you in the long run. :)
by haukex (Archbishop) on Nov 06, 2022 at 08:17 UTC
I really like your "Long list is long" node title, even though I don't fully understand its meaning.

I suspect a play on the "longcat is long" meme.
by Athanasius (Archbishop) on Nov 03, 2022 at 12:36 UTC
And here is some good advice for any new monk wanting to make a good impression. Be friendly in your replies: The long and the short of it is, don't be short if you want to belong.

eyepopslikeamosquito: Are we both now eligible for membership in the ancient and venerable order of punsters?