in reply to Huge data file and looping best practices

Try something like this:

#! perl -sw
use 5.010;
use strict;

my( $i, @patNos, @data ) = 0;

while( <> ) {
    my @bits = split ',';
    $patNos[ $i ] = shift @bits;
    # 'b480' consumes a single string of 0/1 characters, hence the join;
    # 480 bits pack down to 60 bytes per record.
    $data[ $i++ ] = pack 'b480', join '', @bits;
}

open OUT, '>', 'variances.csv' or die $!;

for my $first ( 0 .. $#patNos ) {
    for my $second ( 0 .. $#patNos ) {
        next if $first == $second;
        say OUT "$patNos[ $first ], $patNos[ $second ], ",
            unpack '%32b*', ( $data[ $first ] ^ $data[ $second ] );
    }
}

close OUT;

By packing the 0s & 1s on input, each record compacts to 60 bytes of bit data, so your 8 million records should occupy about 1.2 GB in memory once Perl's per-scalar overhead is counted.

The second benefit is that you can calculate the variance for each pairing in a single statement, eliminating the expensive 480-iteration inner loop and speeding things up by close to three orders of magnitude.

That will probably make splitting the task across processes unnecessary, though it would still be possible if needed.

Once the records are encoded as bitstrings, unpack '%32b*', ( $data[ $first ] ^ $data[ $second ] ); compares all 480 attributes and counts the variance in one go.
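
As a minimal, self-contained sketch of that round trip (scaled down to 8 attributes instead of 480), the packing, XOR and bit count look like this; note that pack's 'b' template wants a single string of 0/1 characters, hence the join:

#!/usr/bin/perl
# Toy illustration of pack / XOR / unpack '%32b*' on 8 attributes.
use strict;
use warnings;

my @recA = ( 1, 1, 0, 0, 0, 1, 1, 1 );
my @recB = ( 0, 1, 1, 0, 0, 1, 0, 1 );

my $bitsA = pack 'b8', join '', @recA;    # 8 bits -> 1 byte
my $bitsB = pack 'b8', join '', @recB;

# XOR sets a bit wherever the two records differ; '%32b*' sums those bits.
my $variance = unpack '%32b*', ( $bitsA ^ $bitsB );

print "variance: $variance\n";            # prints 3: the records differ in 3 attributes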


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: Huge data file and looping best practices
by AnomalousMonk (Archbishop) on Apr 26, 2009 at 19:25 UTC
    Would it be worthwhile to pre-allocate the @patNos and @data arrays with $#array = N_000_000; statements? (The exact number of lines/elements could be obtained with a `wc -l` call.)

    Even though it would lead to 8,000,000 redundant entries giving the variance of each patent with itself (i.e., 0), might it be profitable to eliminate the next if $first == $second; statement from the inner loop of the variance calculation section? Or, avoiding redundant entries, to break the inner loop into two consecutive loops with identical BLOCK contents: for my $second (0 .. $first-1) { ... } and then for my $second ($first+1 .. $#patNos) { ... }?
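
    For concreteness, a toy, self-contained sketch of that two-consecutive-loops variant (the patent numbers, attribute count and data below are made up purely for illustration):

    #!/usr/bin/perl
    use 5.010;
    use strict;
    use warnings;

    my @patNos = ( 111, 222, 333 );
    my @data   = map { pack 'b8', join '', @$_ }
                 [ 1,1,0,0,0,1,1,1 ], [ 0,1,1,0,0,1,0,1 ], [ 1,0,1,0,1,0,1,0 ];

    # With 8 million records, pre-allocating first ($#patNos = $#data = 8_000_000)
    # would avoid repeated array growth while reading the file.

    for my $first ( 0 .. $#patNos ) {
        for my $second ( 0 .. $first - 1 ) {            # everything before $first
            say "$patNos[$first], $patNos[$second], ",
                unpack '%32b*', ( $data[$first] ^ $data[$second] );
        }
        for my $second ( $first + 1 .. $#patNos ) {     # everything after $first
            say "$patNos[$first], $patNos[$second], ",
                unpack '%32b*', ( $data[$first] ^ $data[$second] );
        }
    }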

    Or is all this just premature optimization?

    Update: Slight wording changes.

      And now for the ultimate speedup.

      Instead of building an array of 8 million, 60-byte bitstrings and performing 32e12 xors and bitcounts, you build a single 480 MB bitstring by concatenating all the packed data.

      Now, you can perform the XOR of records 0 through 7,999,998 with records 1 through 7,999,999 in one operation, using substr to select the appropriate chunks of the big string. Then use substr with unpack to count the bits in each 60-byte chunk of the result.

      Then you repeat the process with substrs covering records 0 .. 7,999,997 and 2 .. 7,999,999. Rinse, repeat till done.

      8e6 XORs and 32e12 bitcounts and done.
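
      Scaled way down for illustration (4 records of 8 attributes, 1 byte each, standing in for 8 million records of 60 bytes), a self-contained sketch of that big-bitstring-plus-substr scheme might look like this:

      #!/usr/bin/perl
      # Toy sketch: one concatenated bitstring, XORed against a shifted copy of
      # itself via substr, then bit-counted chunk by chunk.
      use 5.010;
      use strict;
      use warnings;

      use constant REC => 1;    # bytes per packed record (60 for 480 attributes)

      my @patNos = ( 11, 22, 33, 44 );
      my @rows   = ( [1,1,0,0,0,1,1,1], [0,1,1,0,0,1,0,1],
                     [1,0,1,0,1,0,1,0], [0,0,0,0,1,1,1,1] );

      # One long string holding every packed record back to back.
      my $all = join '', map { pack 'b8', join '', @$_ } @rows;
      my $n   = @patNos;

      for my $k ( 1 .. $n - 1 ) {
          # XOR records 0 .. n-k-1 against records k .. n-1 in a single operation.
          my $x = substr( $all, 0, ($n - $k) * REC ) ^ substr( $all, $k * REC );

          for my $i ( 0 .. $n - $k - 1 ) {
              my $variance = unpack '%32b*', substr( $x, $i * REC, REC );
              say "$patNos[$i], $patNos[$i + $k], $variance";
          }
      }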

      My best guesstimate is that instead of the OP's code requiring upwards of 10 days on 8 cores, this'll take less than half a day on one core.


      Would it be worthwhile to pre-allocate the @patNos and @data arrays with $#array = N_000_000; statements?

      It wouldn't hurt, but would make little difference to the overall runtime given the scale of the processing.

      Even though it would lead to 8,000,000 redundant entries giving the variance of each patent with itself (i.e., 0), might it be profitable to eliminate the next if $first == $second; statement from the inner loop of the variance calculation section? Or, avoiding redundant entries, to break the inner loop into two consecutive loops with identical BLOCK contents: for my $second (0 .. $first-1) { ... } and then for my $second ($first+1 .. $#patNos) { ... } ?

      With time to give it a little more thought, there is no point in comparing both patent 1 with patent 2 and patent 2 with patent 1. With that insight, the outer loop runs from 0 .. $#patNos - 1 and the inner loop from $first + 1 .. $#patNos. That roughly halves the variance calculations, from ~64e12 to ~32e12, for a further halving of the runtime:

      #! perl -sw
      use 5.010;
      use strict;

      my( $i, @patNos, @data ) = 0;
      $#patNos = $#data = 8e6;    # pre-allocate both arrays up front

      while( <> ) {
          my @bits = split ',';
          $patNos[ $i ] = shift @bits;
          $data[ $i++ ] = pack 'b480', join '', @bits;
          print "\r$.\t" unless $. % 1000;    # progress report every 1000 lines
      }
      say "\n", time;

      open OUT, '>', 'variances.csv' or die $!;

      for my $first ( 0 .. $#patNos - 1 ) {
          for my $second ( $first + 1 .. $#patNos ) {
              say OUT "$patNos[ $first ], $patNos[ $second ], ",
                  unpack '%32b*', ( $data[ $first ] ^ $data[ $second ] );
          }
      }

      close OUT;

Re^2: Huge data file and looping best practices
by carillonator (Novice) on Apr 27, 2009 at 20:07 UTC
    Beautiful! Thank you, everyone, for the very thoughtful answers. I think I'm in over my head here, but the first step is going to be to reduce the 8 million lines to what we're guessing is 400,000 unique sets of characteristics, then go from there. This is easy enough, but the 400,000 unique sets will then all have to be compared to each other.

    Then I'll work through these methods of converting the data to bit strings to simplify the comparisons.

    Thank you thank you!

      the first step is going to be to reduce the 8 million lines to what we're guessing is 400,000 unique sets of characteristics,

      How long will it take you to do that reduction?

      Because with the additional efficiencies outlined in 760218 & 760226, and a little threading or forking, you could distribute this over your 8 processors and have the full cross product very quickly.
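
      As a rough, stand-alone sketch only (the worker count, record count, file names and the comparison body here are all assumptions, not code from this thread), forking the outer loop across cores could look something like:

      #!/usr/bin/perl
      # Split the outer loop across several forked workers, each writing its
      # own output file; the actual comparison work is elided.
      use 5.010;
      use strict;
      use warnings;

      my $KIDS = 8;           # assumed: one worker per core
      my $N    = 400_000;     # assumed record count after de-duplication

      my @pids;
      for my $kid ( 0 .. $KIDS - 1 ) {
          my $pid = fork;
          die "fork failed: $!" unless defined $pid;

          if( $pid == 0 ) {   # child: handle every $KIDS-th outer index
              open my $out, '>', "variances.$kid.csv" or die $!;
              for( my $first = $kid; $first < $N; $first += $KIDS ) {
                  # ... compare record $first against the later records here ...
              }
              close $out;
              exit 0;
          }
          push @pids, $pid;
      }
      waitpid $_, 0 for @pids;    # parent waits for all workers to finish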

      Of course, that would be a huge amount of data to further process, but maybe you could apply some (cheap) selection criteria prior to output.

      Anyway, good luck!


        ok I almost get it! (Update: I had a mistake here before)

        The reduction was easy in STATA, sorting all the data by characteristic columns (as if they were one big binary number), then eliminating duplicates. Down to 400,000 lines of data.

        Now, I've tried adapting your code and can't seem to get it to work. The XOR appears to be working, but somewhere between packing and unpacking something goes wrong.

        Here is a simplification I've written with two lines of data and 8 characteristics:

        #!/usr/bin/perl
        use strict;

        my (@patNos, @data) = 0;

        my $line1 = "44444,1,1,0,0,0,1,1,1";
        my @bits1 = split ',', $line1;
        $patNos[0] = shift @bits1;
        $data[0] = pack 'b8', @bits1;

        my $line2 = "55555,0,1,1,0,0,1,0,1";
        my @bits2 = split ',', $line2;
        $patNos[1] = shift @bits2;
        $data[1] = pack 'b8', @bits2;

        print "$patNos[0] - @bits1\n";
        print "$patNos[1] - @bits2\n";

        my $line1 = unpack 'b8', $data[0];
        my $line2 = unpack 'b8', $data[1];
        my $variance = unpack '%32b*', ($data[0] ^ $data[1]);

        print "\nline 1: $line1\n";
        print "line 2: $line2\n";
        print "\nvariance: $variance\n";
        Which returns:
        44444 - 1 1 0 0 0 1 1 1
        55555 - 0 1 1 0 0 1 0 1

        line 1: 10000000
        line 2: 00000000

        variance: 1
        I would expect $line1 and $line2 to contain the original '0' and '1' strings as characters, but they just return all '0's except for the lone '1' in $line1. I've been reading about pack and unpack all day, but can't figure out what I'm doing wrong. I also realized that if I do pack 's', the variance is calculated correctly, but I would expect this to stop working once there were more than 16 characteristics.
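
        For what it's worth, here is a small stand-alone illustration of where those zeros are probably coming from: pack's 'b' template consumes a single string of 0/1 characters, so handing it a list of separate bits only packs the first element.

        #!/usr/bin/perl
        use strict;

        my @bits = ( 1, 1, 0, 0, 0, 1, 1, 1 );

        my $listed = pack 'b8', @bits;             # only $bits[0] is used
        my $joined = pack 'b8', join '', @bits;    # all eight bits are used

        print unpack( 'b8', $listed ), "\n";       # 10000000
        print unpack( 'b8', $joined ), "\n";       # 11000111
        print unpack( '%32b*', $joined ), "\n";    # 5 (count of set bits)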

        I love the idea of making one huge bitstring and then using substr for the comparisons. What is the best way to concatenate the bitstrings? I'm assuming then that substr can work with a bitstring the same as it would a regular string?

        Also, a few questions:

        unpack '%32b*', ( $data[ $first ] ^ $data[ $second ] );
        What is the significance of 32 here? If I'm using a 64-bit machine, should I change it to 64? I'm writing this on a 32-bit machine, but it will run on a 64-bit one.

        Why use 5.010?

        What is the advantage of setting the length of the @patNos and @data arrays at the start?

        print "\r$.\t" unless $. % 1000;
        Is this just printing a status update every 1000 lines?

        say "\n", time;
        What is this doing?

        THANK YOU!!!

        BrowserUk, I wish I could buy you a beer or something, this has all come out perfectly! I put your subroutines into the bigger script and we had all the 400k by 400k comparisons in less than a day. Now to analyzing and plotting it... Thanks again.

        I'm not really sure about your approach. I'd really prefer to limit the data being processed as soon as possible, to save extra time later.

        From what the OP writes, it looks like his working model for this problem is rather an experimental one, and it is not known how many iterations of parsing the same data will be needed just to discover some patterns and regularities. In such a case, I would go with limiting the input as much as possible.