Dear monks,
I have a set of files, basically in the format
foo 73
bar 35
word 27
blah 23
...
Now I need to combine them into one big file, in the same format, so that the number following "foo" is the sum of all the "foo" numbers in the individual files, and the lines are sorted by value.
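To make the goal concrete, here is a minimal sketch on toy data. The two in-memory inputs, the helper names `merge_counts` and `sorted_lines`, and the alphabetical tie-break are all made up for illustration; it assumes one tab-separated word and count per line, and it says nothing about how to handle the real file sizes.

```perl
use strict;
use warnings;

# Two made-up inputs standing in for the real files (word<TAB>count per line).
my @inputs = (
    "foo\t73\nbar\t35\n",
    "foo\t7\nword\t27\n",
);

# Sum the counts for each word across all inputs.
sub merge_counts {
    my @texts = @_;
    my %total;
    for my $text (@texts) {
        for my $line (split /\n/, $text) {
            # Skip anything that is not "word<TAB>non-negative integer".
            next unless $line =~ /^(.*)\t(\d+)$/;
            $total{$1} += $2;
        }
    }
    return \%total;
}

# Format the totals as lines, highest count first.
# Ties are broken alphabetically (an assumption, not a requirement).
sub sorted_lines {
    my ($total) = @_;
    return map  { "$_\t$total->{$_}" }
           sort { $total->{$b} <=> $total->{$a} or $a cmp $b } keys %$total;
}

print "$_\n" for sorted_lines(merge_counts(@inputs));
```

With these two inputs the desired result is foo 80, bar 35, word 27, in that order.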
My attempt is something like:
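for my $file (@files) {
    open my $in, '<', $file or die "Can't open $file: $!";
    while (<$in>) {
        next unless /^(.*)\t(.*)$/;    # skip lines without a tab
        $f{$1} += $2;
    }
    close $in;
}
@p = values %f;
@q = keys %f;
@i = sort { $p[$b] <=> $p[$a] } 0..$#p;    # indices, highest count first
open my $out, '>', $outpath or die "Can't open $outpath: $!";
for (@i) {
    print $out "$q[$_]\t$p[$_]\n";
}
close $out;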
But the files are pretty large (up to a couple hundred MB), there are quite a few of them (2064), and things get pretty slow. Reading in the files seems to sometimes move along nicely, and then suddenly nothing happens for a long time – out of memory, or something? The processor doesn't seem to be very busy, so I'm guessing there's some other bottleneck. And then the step after the loop takes a surprisingly long time, just writing the arrays – I tested it on 12 files, and it took the better part of an hour just for those two lines.
Is there a better way?
If it matters – the original lists are sorted, they don't all contain the same words, all numbers are positive integers.
In reply to Long list is long by Chuma