The data file has this layout (480 characteristics per line, about 8 million lines):

patent #, char1, char2, char3, ... , char480
1234567,1,0,1,0,1,0, ...

The script compares each binary characteristic of each patent with every other patent and counts the number of differences for each pair. My attempt at the improved code is below.
I see that the entire 6 GB data file is brought into memory, so I'm looking for the best way to go one line at a time. The program will run on an 8-core machine with 64 GB of memory. Notice that it takes arguments that limit execution to a certain range of iterations of the first loop, so I can run 7 different instances at the same time (one per core) on different parts of the data. Or is there a smarter way to allocate resources? O'Reilly's Perl Best Practices book says to use while instead of for loops when processing files, but I would like to keep the ability to limit iterations with command line arguments.
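To make the while-loop question concrete, here is a minimal sketch of reading one line at a time while still honoring a start/end range. The sub name lines_in_range and the in-memory sample data are made up for illustration; a real run would open patents.csv and take the range from @ARGV.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the lines whose 0-based index falls in [$startat, $endat],
# reading one line at a time instead of slurping the whole file.
# (lines_in_range is a hypothetical helper, not part of the original script.)
sub lines_in_range {
    my ($fh, $startat, $endat) = @_;
    my @picked;
    my $idx = -1;
    while (my $line = <$fh>) {
        $idx++;
        next if $idx < $startat;   # not yet in the requested range
        last if $idx > $endat;     # past the range: stop reading
        chomp $line;
        push @picked, $line;
    }
    return @picked;
}

# Demo on an in-memory file handle with invented sample records.
my $data = "100,1,0,1\n101,0,0,1\n102,1,1,1\n103,0,1,0\n";
open(my $fh, '<', \$data) or die "Could not open in-memory data: $!";
my @subset = lines_in_range($fh, 1, 2);
close($fh);

print "$_\n" for @subset;
```

This only covers the range limit on the outer loop. The inner comparison still needs every later record, so those would have to be re-read or kept in memory in some more compact form.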
Since it will take a VERY long time to run all of this program, the slightest improvements could save days or weeks. Any input on making this script as smart and efficient as possible would be greatly appreciated. Thanks in advance!!
    #!/usr/bin/perl
    use strict;

    # scalars for the patent numbers, arrays for the split records
    my ($patno1, $patno2, @record1, @record2);
    my $startat = $ARGV[0];
    my $endat   = $ARGV[1];

    open(OUT, "<patents.csv") || die("Could not open patents.csv file!\n");
    my @lines = <OUT>;
    close(OUT);

    # clear variance file if it exists
    open(OUT, ">variance.csv") || die("Could not open file variance.csv!\n");
    close(OUT);

    map(chomp, @lines);

    # iterate over all patents
    for (my $i = $startat; $i <= $endat; $i++) {
        @record1 = split(/\,/, $lines[$i]);
        $patno1  = shift(@record1);

        # iterate through other lines to compare
        # ($#lines is the last index, so use <= to include the final line)
        for (my $j = $i + 1; $j <= $#lines; $j++) {
            @record2 = split(/\,/, $lines[$j]);
            $patno2  = shift @record2;
            my $variance = 0;

            # iterate through each characteristic, including the last one
            for (my $k = 0; $k <= $#record1; $k++) {
                if ($record1[$k] != $record2[$k]) {
                    $variance++;
                }
            }

            open(OUT, ">>variance.csv") || die("Could not open file variance.csv!\n");
            print OUT $patno1 . "," . $patno2 . "," . $variance . "\n";
            close(OUT);
        }
    }
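Since every characteristic is 0 or 1, one possible way to speed up the innermost loop is to pack each record into a bit string and count differences with a string XOR. This is only a sketch of that idea, not a tested replacement for the script above; pack_bits and hamming are hypothetical helper names, and the two sample records are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pack a list of 0/1 characteristics into a compact bit string (8 per byte).
sub pack_bits {
    my @bits = @_;
    return pack('b*', join('', @bits));
}

# Count differing positions: XOR the two bit strings, then count the set bits
# with the checksum form of unpack.
sub hamming {
    my ($x, $y) = @_;
    return unpack('%32b*', $x ^ $y);
}

# Two invented 8-characteristic records that differ in three positions.
my $p1 = pack_bits(1, 0, 1, 0, 1, 0, 1, 1);
my $p2 = pack_bits(1, 1, 1, 0, 0, 0, 1, 0);

my $distance = hamming($p1, $p2);
print "$distance\n";
```

With 480 characteristics, each packed record would be 60 bytes, so 8 million records come to roughly 480 MB of raw data before Perl's per-scalar overhead. Both operands must be strings of equal length for the XOR to behave this way.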
In reply to Huge data file and looping best practices by carillonator