What most of the responses so far seem to have missed is that the cost of the in-RAM splitting and joining of records pales into insignificance when compared to the time required for the read-a-record, write-a-record flip-flopping of the read head back and forth across the disk.
There are two possible ways to speed up your processing:

1. Overlap your reading and writing.

   This can be achieved by using two or more threads or processes.

2. Read and write bigger chunks.

   Interleaving 80GB/4KB = 20 million reads and ~200,000 writes means (at least) 400,000 track-to-track seeks, because each buffered write drags the head away from the input track and back again. If you increase the read & write chunk sizes to 1MB each, that can be reduced to ~1,500 track-to-track seeks. (The arithmetic is worked through just below.)
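As a quick sanity check on those figures, here is a back-of-the-envelope Perl sketch; the 80GB input, the ~200,000 4KB writes, and the two-seeks-per-write assumption are taken from (or inferred from) the numbers above:

    #! perl -w
    use strict;

    my $in_bytes  = 80 * 2**30;        # 80GB of input
    my $out_bytes = 200_000 * 4096;    # ~200,000 writes of 4KB each

    for my $chunk ( 4 * 2**10, 2**20 ) {   # 4KB vs 1MB chunks
        my $reads  = $in_bytes  / $chunk;
        my $writes = $out_bytes / $chunk;
        # each write costs ~2 seeks: out to the write track and back
        printf "%7d-byte chunks: %8d reads, %6d writes, ~%d seeks\n",
            $chunk, $reads, $writes, 2 * $writes;
    }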
Try these:
    #! perl -sw
    # ibuf.pl - read STDIN in large chunks, emit only complete lines
    use strict;

    our $B //= 64;      # buffer size in KB, settable with -B=nnn
    $B *= 1024;

    my $ibuf = '';
    while( sysread( *STDIN, $ibuf, $B, length $ibuf ) ) {
        # split off any trailing partial line and keep it for next pass
        my $p = 1 + rindex( $ibuf, "\n" );
        my $rem = substr( $ibuf, $p );
        substr( $ibuf, $p ) = '';

        # treat the buffer of complete lines as an in-memory file
        open my $RAM, '<', \$ibuf;
        print while <$RAM>;

        $ibuf = $rem;
    }
    print $ibuf if length $ibuf;    # flush a final line that has no trailing newline
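The only non-obvious trick there is opening a reference to a scalar as a filehandle, which lets the buffered chunk be re-read line by line with the ordinary readline operator. A minimal standalone illustration (the sample string is just for demonstration):

    use strict; use warnings;
    my $buf = "alpha\tbeta\ngamma\tdelta\n";
    open my $RAM, '<', \$buf or die $!;   # read the string as if it were a file
    print while <$RAM>;                   # prints the two lines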
    #! perl -sw
    # obuf.pl - select and reorder columns, writing in large chunks
    use strict;

    our $B //= 64;      # buffer size in KB, settable with -B=nnn
    $B *= 1024;

    my $obuf = '';
    while( <> ) {
        chomp;                      # so a selected last column carries no newline
        my @f = split chr(9);       # split on tab
        $obuf .= join( chr(9), @f[2,0,5] ) . "\n";

        # only hit the disk once the buffer is full
        if( length( $obuf ) > $B ) {
            print $obuf;
            $obuf = '';
        }
    }
    print $obuf;    ## Corrected C&P error. See [lotus1]'s post below.
Usage: ibuf.pl -B=1024 < in.tsv | obuf.pl -B=1024 > out.tsv

Note that piping the two scripts together also buys you the read/write overlap from point 1 above, since the two processes run concurrently.
Based on my experiments, it might be possible to more than halve your processing time, though YMMV.
Experiment with adjusting the -B=nnn (KB) parameters up and down, both in unison and independently, to find the sweet spot on your system.
Be aware that bigger is not always better: 1024 for both seems to work quite well on my system; anything larger slows it down.
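If you want to automate that experiment, here is a rough sketch of a timing sweep; the script names, the in.tsv/out.tsv files, and the particular sizes tried are all assumptions to adapt:

    #! perl -w
    # sweep.pl - time the pipeline at several buffer sizes (sketch)
    use strict;
    use Time::HiRes qw( time );

    for my $kb ( 64, 256, 1024, 4096 ) {
        my $start = time;
        system( "perl ibuf.pl -B=$kb < in.tsv | perl obuf.pl -B=$kb > out.tsv" ) == 0
            or warn "pipeline failed for -B=$kb: $?";
        printf "-B=%-4d : %.2f seconds\n", $kb, time() - $start;
    }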