Re^2: Huge data file and looping best practices

Replies are listed 'Best First'.
Re^3: Huge data file and looping best practices by BrowserUk (Patriarch) on Apr 27, 2009 at 21:45 UTC
the first step is going to be to reduce the 8 million lines to what we're guessing is 400,000 unique sets of characteristics, How long wil it take you to do that reduction? Because with the additional efficiencies outlined in 760218 & 760226, and a little threading or forking, you could distribute this over your 8 processors and have the full cross product very quickly. Of course, that would be a huge amount of data to further process, but maybe you could apply some (cheap) selection criteria prior to output. Anyway, good luck! Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^4: Huge data file and looping best practices by carillonator (Novice) on Apr 28, 2009 at 17:27 UTC
ok I almost get it! (Update: I had a mistake here before) The reduction was easy in STATA, sorting all the data by characteristic columns (as if they were one big binary number), then eliminating duplicates. Down to 400,000 lines of data. Now, I've tried adapting your code and can't seem to get it to work. The XOR appears to be working, but somewhere between packing and unpacking something goes wrong. Here is a simplification I've written with two lines of data and 8 characteristics: #!/usr/bin/perl use strict; my (@patNos, @data) = 0; my $line1 = "44444,1,1,0,0,0,1,1,1"; my @bits1 = split ',',$line1; $patNos[0] = shift @bits1; $data[0] = pack 'b8', @bits1; my $line2 = "55555,0,1,1,0,0,1,0,1"; my @bits2 = split ',',$line2; $patNos[1] = shift @bits2; $data[1] = pack 'b8', @bits2; print "$patNos[0] - @bits1\n"; print "$patNos[1] - @bits2\n"; my $line1 = unpack 'b8', $data[0]; my $line2 = unpack 'b8', $data[1]; my $variance = unpack '%32b', ($data[0] ^ $data[1]); print "\nline 1: $line1\n"; print "line 2: $line2\n"; print "\nvariance: $variance\n"; [download] Which returns: `44444 - 1 1 0 0 0 1 1 1 55555 - 0 1 1 0 0 1 0 1 line 1: 10000000 line 2: 00000000 variance: 1` [download] I would expect $line1 and $line2 to contain the original '0' and '1' strings as characters, but they just return all '0's except for the lone '1' in $line1. I've been reading about pack and unpack all day, but can't figure out what I'm doing wrong. I also realized that if I do pack 's', the variance is calculated correctly, but I would expect this to stop working once there were more than 16 characteristics. I love the idea of making one huge bitstring and then using substr for the comparisons. What is the best way to concatenate the bitstrings? I'm assuming then that substr can work with a bitstring the same as it would a regular string? Also, a few questions: `unpack '%32b', ( $data[ $first ] ^ $data[ $second ] );` [download] What is the significance of 32 here? If I'm using a 64 bit machine, should I change it to 64? I'm writing this on a 32 bit, but it will run on a 64 bit? Why use 5.010? What is the advantage of setting the length of the @patNos and @data arrays at the start? `print "\r$.\t" unless $. % 1000;` [download] Is this just printing a status update every 1000 lines? `say "\n", time;` [download] What is this doing? THANK YOU!!!	[reply] [d/l] [select]
Re^5: Huge data file and looping best practices by BrowserUk (Patriarch) on Apr 28, 2009 at 19:31 UTC
One my problem, one yours. The code was untested, and I wrote as if `pack 'b', ...` & `unpack 'b', ...` worked they way I would have liked them to work, rather than the way they do. (They would have been more useful!) You need a couple of subroutines to based around vec to build the bitstrings. See below. Your mistake (masked totally by mine) was that you were reusing `@bits` before constructing the first bitstring. Try this: #!/usr/bin/perl use strict; sub toBitstring { my $bits = chr(0) x int( (($#_)/8)+1 ); vec( $bits, $_, 1 ) = $_[ $_ ] for 0 .. $#_; return $bits; } sub fromBitstring { join '', map vec( $_[0], $_, 1 ), 0 .. length( $_[ 0 ] ) * 8; } my (@patNos, @data) = 0; my $line = "44444,1,1,0,0,0,1,1,1"; my @bits = split ',',$line; $patNos[0] = shift @bits; print "$patNos[0] - @bits\n"; $data[0] = toBitstring @bits; $line = "55555,0,1,1,0,0,1,0,1"; @bits = split ',',$line; $patNos[1] = shift @bits; print "$patNos[1] - @bits\n"; $data[1] = toBitstring @bits; my $line1 = fromBitstring $data[0]; my $line2 = fromBitstring $data[1]; my $variance = unpack '%32b', ($data[0] ^ $data[1]); print "\nline 1: $line1\n"; print "line 2: $line2\n"; print "\nvariance: $variance\n"; [download] What is the significance of 32 here? If I'm using a 64 bit machine, should I change it to 64?* It is the length (in bits) of the accumulator used for the checksum calculated. You can use 8, 16, 32, 64. If the number of set bits in your bitstring exceeds the capacity of the accumulator, the result will be silently truncated to that number of bits (as with the `%8b'` example below): `[0] Perl> $x = join '', map chr( rand 256 ), 1 .. 1000;; [0] Perl> print unpack '%8b', $x;; 47 [0] Perl> print unpack '%16b', $x;; 3887 [0] Perl> print unpack '%32b', $x;; 3887 [0] Perl> print unpack '%64b', $x;; 3887` [download] If your strings are less than 0.5GB in length, 32-bits is sufficient. In the case of your 480 bits, 16 would suffice, but not if you move to eliminating the loop using the bigstring method I described. Using 64-bits won't hurt and may even be slightly quicker on a 64-bit machine. Very slightly though. Why use 5.010?* Because it enables the 5.10 special features like say, given/when and defined-OR `//`. Unnecessary if you do not use these features. What is the advantage of setting the length of the @patNos and @data arrays at the start? It pre-allocates the basic internal structures of the arrays and prevents a little memory thrash as they are populated. The benefit is insignificant for this particular application, but can help for applications where the time taken to build large arrays is a significant portion of the overall runtime. Is this just printing a status update every 1000 lines? Yes. I used it to get a feel for how long thngs would take to run, It outputs the currrent line being processed. Due to "\r" it overwrites the line number in place on the terminal rather than scrolling up the screen. say "\n", time; That prints a newline, followed by the current time in seconds. Again, just a part of the simplistic timing I did. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^4: Huge data file and looping best practices by carillonator (Novice) on Apr 30, 2009 at 18:20 UTC
BrowserUk, I wish I could buy you a beer or something, this has all come out perfectly! I put your subroutines into the bigger script and we had all the 400k by 400k comparisons in less than a day. Now to analyzing and plotting it... Thanks again.	[reply]
Re^5: Huge data file and looping best practices by BrowserUk (Patriarch) on May 01, 2009 at 02:10 UTC
YW. I put your subroutines It would really be nice to get `to/fromBitstring()` added to pack/unpack. Done in XS, they ought to be much quicker. Personally I'd happily replace the existing 'b' & 'B' templates as I've never found a good use for them--but that'd never fly for "backward" people :) So, it's a case of deciding which of the remaining template chars would be most (or even vaguely) mnemonic. They are e/E, g/G, k/K, m/M, o/O, r/R or t/T. And none of those really leap off the page at me as candidates? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l]
Re^4: Huge data file and looping best practices by przemo (Scribe) on Apr 28, 2009 at 14:23 UTC
I'm not really sure about your approach. I'd really prefer to limit the data being processed as soon as possible, to save extra time later. From what OP writes it looks like his working model on this problem is rather an experimental one and it is not known how many iterations of parsing the same data will be needed, just to discover some patterns and regularities. In such case I would go with limiting the input as much as possible.	[reply]