Dear monks,

I have a set of files, basically in the format

foo	73
bar	35
word	27
blah	23
...

Now I need to combine them into one big file, in the same format, so that the number following "foo" is the sum of all the "foo" numbers in the individual files, and the lines are sorted by value.

My attempt is something like:

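for my $file (@files) {
    open IN, '<', $file or die "Can't open $file: $!";
    while (<IN>) {
        next unless /^(.*)\t(.*)$/;   # skip lines that don't match
        $f{$1} += $2;
    }
}
@p = values %f;
@q = keys %f;
@i = sort { $p[$b] <=> $p[$a] } 0 .. $#p;
open OUT, '>', $outpath or die "Can't open $outpath: $!";
for (@i) {
    print OUT "$q[$_]\t$p[$_]\n";
}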

But the files are pretty large (up to a couple hundred MB), there are quite a few of them (2064), and things get pretty slow. Reading the files sometimes moves along nicely, and then suddenly nothing happens for a long time – out of memory, or something? The processor doesn't seem to be very busy, so I'm guessing there's some other bottleneck. And then the step after the loop takes a surprisingly long time, just writing the arrays – I tested it on 12 files, and it took the better part of an hour just for those two lines.

Is there a better way?

If it matters – the original lists are sorted, they don't all contain the same words, all numbers are positive integers.
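
For reference, here is a minimal sketch of the same summing and sorting that skips the parallel @p, @q and @i arrays by sorting the hash keys directly on their summed values. The sub names sum_counts and write_sorted are made up for illustration, and it assumes every line is a word, a tab, and a non-negative integer. It still keeps one hash entry per distinct word in memory, so it may well not help if the hash itself is what runs out of RAM.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the counts for each word across tab-separated files.
sub sum_counts {
    my (@paths) = @_;
    my %total;
    for my $path (@paths) {
        open my $in, '<', $path or die "Cannot open $path: $!";
        while (my $line = <$in>) {
            chomp $line;
            # Assumed line format: word, tab, non-negative integer.
            # Lines that do not match are skipped.
            my ($word, $count) = $line =~ /^(.*)\t(\d+)$/ or next;
            $total{$word} += $count;
        }
        close $in;
    }
    return \%total;
}

# Hypothetical helper: write "word<TAB>count" lines, largest count first.
sub write_sorted {
    my ($total, $out_path) = @_;
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";
    # Sort the keys by their summed value; ties are broken alphabetically.
    for my $word (sort { $total->{$b} <=> $total->{$a} or $a cmp $b } keys %$total) {
        print {$out} "$word\t$total->{$word}\n";
    }
    close $out or die "Cannot close $out_path: $!";
}
```

With @files and $outpath as in my attempt, the call would be write_sorted(sum_counts(@files), $outpath).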


In reply to Long list is long by Chuma
