What most of the responses so far seem to have missed is that the cost of the in-RAM splitting and joining of records pales into insignificance when compared to the time required for the read-a-record, write-a-record flip-flopping of the read head back and forth across the disk.
There are two possible ways to speed up your processing:

1. Overlap your reading and writing.

   This can be achieved by using two or more threads or processes.

2. Read and write bigger chunks.

   Interleaving 80GB/4KB = 20 million reads and ~200,000 writes means (at least) 400,000 track-to-track seeks, because each buffered write drags the head away from the input track and back again. If you increase the read & write chunk sizes to 1MB each, that can be reduced to ~1,500 track-to-track seeks. (The arithmetic is worked through just below.)
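As a quick sanity check on those figures, here is a back-of-the-envelope Perl sketch; the 80GB input, the ~200,000 4KB writes, and the two-seeks-per-write assumption are taken from (or inferred from) the numbers above:

    #! perl -w
    use strict;

    my $in_bytes  = 80 * 2**30;        # 80GB of input
    my $out_bytes = 200_000 * 4096;    # ~200,000 writes of 4KB each

    for my $chunk ( 4 * 2**10, 2**20 ) {   # 4KB vs 1MB chunks
        my $reads  = $in_bytes  / $chunk;
        my $writes = $out_bytes / $chunk;
        # each write costs ~2 seeks: out to the write track and back
        printf "%7d-byte chunks: %8d reads, %6d writes, ~%d seeks\n",
            $chunk, $reads, $writes, 2 * $writes;
    }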
Try these:
    #! perl -sw
    # ibuf.pl - read STDIN in large chunks, emit only complete lines
    use strict;

    our $B //= 64;      # buffer size in KB, settable with -B=nnn
    $B *= 1024;

    my $ibuf = '';
    while( sysread( *STDIN, $ibuf, $B, length $ibuf ) ) {
        # split off any trailing partial line and keep it for next pass
        my $p = 1 + rindex( $ibuf, "\n" );
        my $rem = substr( $ibuf, $p );
        substr( $ibuf, $p ) = '';

        # treat the buffer of complete lines as an in-memory file
        open my $RAM, '<', \$ibuf;
        print while <$RAM>;

        $ibuf = $rem;
    }
    print $ibuf if length $ibuf;    # flush a final line that has no trailing newline
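The only non-obvious trick there is opening a reference to a scalar as a filehandle, which lets the buffered chunk be re-read line by line with the ordinary readline operator. A minimal standalone illustration (the sample string is just for demonstration):

    use strict; use warnings;
    my $buf = "alpha\tbeta\ngamma\tdelta\n";
    open my $RAM, '<', \$buf or die $!;   # read the string as if it were a file
    print while <$RAM>;                   # prints the two lines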
    #! perl -sw
    # obuf.pl - select and reorder columns, writing in large chunks
    use strict;

    our $B //= 64;      # buffer size in KB, settable with -B=nnn
    $B *= 1024;

    my $obuf = '';
    while( <> ) {
        chomp;                      # so a selected last column carries no newline
        my @f = split chr(9);       # split on tab
        $obuf .= join( chr(9), @f[2,0,5] ) . "\n";

        # only hit the disk once the buffer is full
        if( length( $obuf ) > $B ) {
            print $obuf;
            $obuf = '';
        }
    }
    print $obuf;    ## Corrected C&P error. See [lotus1]'s post below.
Usage: ibuf.pl -B=1024 < in.tsv | obuf.pl -B=1024 > out.tsv

Note that piping the two scripts together also buys you the read/write overlap from point 1 above, since the two processes run concurrently.
Based on my experiments, it might be possible to more than halve your processing time, though YMMV.
Experiment with adjusting the -B=nnn (KB) parameters up and down, both in unison and independently, to find the sweet spot on your system.
Be aware that bigger is not always better: 1024 for both seems to work quite well on my system; anything larger slows it down.
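If you want to automate that experiment, here is a rough sketch of a timing sweep; the script names, the in.tsv/out.tsv files, and the particular sizes tried are all assumptions to adapt:

    #! perl -w
    # sweep.pl - time the pipeline at several buffer sizes (sketch)
    use strict;
    use Time::HiRes qw( time );

    for my $kb ( 64, 256, 1024, 4096 ) {
        my $start = time;
        system( "perl ibuf.pl -B=$kb < in.tsv | perl obuf.pl -B=$kb > out.tsv" ) == 0
            or warn "pipeline failed for -B=$kb: $?";
        printf "-B=%-4d : %.2f seconds\n", $kb, time() - $start;
    }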