in reply to Huge data file and looping best practices
Try something like this:
#! perl -sw
use 5.010;
use strict;

my $i = 0;
my( @patNos, @data );

while( <> ) {
    chomp;
    my @bits = split ',';
    $patNos[ $i ] = shift @bits;
    # Join the 480 0/1 fields into one string and pack it into a 60-byte bit string.
    $data[ $i++ ] = pack 'b480', join '', @bits;
}

open OUT, '>', 'variances.csv' or die $!;

for my $first ( 0 .. $#patNos ) {
    for my $second ( 0 .. $#patNos ) {
        next if $first == $second;
        # XOR the two bit strings and count the set bits: one statement per pairing.
        say OUT "$patNos[ $first ], $patNos[ $second ], ",
            unpack '%32b*', ( $data[ $first ] ^ $data[ $second ] );
    }
}

close OUT;
By packing the 0s & 1s on input, each record compacts to 60 bytes (480 bits / 8), so the raw bit strings for your 8 million records come to roughly 480 MB; allowing for Perl's per-scalar overhead, expect the arrays to occupy somewhere around 1.2 GB in memory.
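For illustration, here's a minimal standalone sketch of what the packing step does to a single input line. The record and the ID PAT0001 are made up; the point is just that 480 comma-separated 0/1 attributes pack down to a 60-byte string:

#! perl -sw
use 5.010;
use strict;

# Build one fake input record: an ID followed by 480 random 0/1 attributes.
my $line = join ',', 'PAT0001', map { int rand 2 } 1 .. 480;

my @bits   = split ',', $line;
my $patNo  = shift @bits;
# Join the fields first; pack 'b' expects a single string of 0/1 characters.
my $packed = pack 'b480', join '', @bits;

say $patNo;              # PAT0001
say length $packed;      # 60 -- 480 bits / 8 bits per byte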
The second benefit is that the variance for each pairing can be calculated in a single statement, eliminating the expensive 480-iteration inner loop and speeding things up by close to three orders of magnitude.
That will probably make splitting the task across processes unnecessary, though it remains an option if it turns out to be needed.
Once the records are encoded as bit strings, unpack '%32b*', ( $data[ $first ] ^ $data[ $second ] ); compares all 480 attributes and counts the variance in one go: the XOR sets a bit at every position where the two records differ, and the %32b* checksum template sums those set bits.
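To see that idiom in isolation, here's a tiny sketch using made-up 8-bit records instead of 480-bit ones:

#! perl -sw
use 5.010;
use strict;

my $rec1 = pack 'b8', '10110010';
my $rec2 = pack 'b8', '10011010';

# XOR leaves a 1 bit at every position where the two records differ...
my $diff = $rec1 ^ $rec2;

# ...and the %32b* checksum template sums those bits.
say unpack '%32b*', $diff;    # 2 -- the records differ in two positions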