I assume you've tried it and discovered it's too slow ?

200 files at 6.9M lines each, each line three fields tab delimited -- if that's 60 bytes per line, you're reading ~80G bytes and writing ~27G (assuming roughly equal field sizes). On my little Linux box, I just timed cp foo bar for an ~13G file and it took ~15mins -- so 80G + 27G looks like about two hours work ? Of course, with faster drives and what-not, you may do better.

I've never tried opening 200 files at once...

It's hard to say what the processing time will add to this. It looks pretty I/O bound. If processing time is an issue, then I'd look at reading the files a chunk at a time... but I'd have to be convinced there was a problem; and even then I'm not sure that processing chunks of the files in Perl would be quicker than using Perl to read a line at a time.

So... what are you expecting, and what do you get when you try the straightforward approach ?

Wishing to squeeze as much as possible out of the inner loop, I think the following may be faster:

my $atleastone = 1; while ($atleastone) { $atleastone = 0 ; my @l = () ; foreach my $op (@handles) { my $c2 ; ++$atleastone and (undef, $c2) = split if defined($_ = <$op>) ; + push @l, $c2 || "0" ; } ; print OUTFILE join("\t", @l), "\n" ; } ;
or possibly:
my $atleastone = 1; while ($atleastone) { $atleastone = 0 ; my $l = '' ; foreach my $op (@handles) { my $c2 ; ++$atleastone and (undef, $c2) = split if defined($_ = <$op>) ; + $l .= ($c2 || "0") . "\t" ; } ; chop $l ; # Discard trailing "\t" print OUTFILE "$l\n" ; } ;
You seem concerned that column 2 might be empty, or that there may be blank or whitespace only lines in the input files. If you could guarantee that no line would be empty and no line would be whitespace only, then:
my $atleastone = 1; while (defined($atleastone)) { $atleastone = undef ; my $l = '' ; foreach my $op (@handles) { my $c2 ; ($atleastone, $c2) = split if defined($_ = <$op>) ; $l .= ($c2 || "0") . "\t" ; } ; chop $l ; # Discard trailing "\t" print OUTFILE "$l\n" ; } ;
squeezes out a tiny fraction ! (Setting $atleastone to something (even 0 or "") if <$op> returns a line.)


In reply to Re: Merging Many Files Side by Side by gone2015
in thread Merging Many Files Side by Side by sesemin

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.