
Hello! I'm pretty new to Perl, but have experience with PHP. I have been asked to improve a Perl script written by someone else, which analyzes a set of data about patents. The data file has 8 million lines, which look like this:

    
patent #, char1, char2, char3, ... , char480 
 
1234567,1,0,1,0,1,0, ... (480 characteristics) 
      (x 8 million lines)

The script compares the binary characteristics of each patent with those of every other patent and counts the number of differences for each pair. My attempt at the improved code is below.

I see that the entire 6 GB data file is brought into memory, so I'm looking for the best way to go one line at a time. The program will be run on an 8-core machine with 64 GB of memory. Notice that it takes arguments that limit execution to a certain range of iterations of the first loop, so I can run 7 different instances at the same time (one per core) on different parts of the data. Or is there a smarter way to allocate resources? O'Reilly's Perl Best Practices says to use while instead of for loops when processing files, but I would like to keep the ability to limit iterations with command-line arguments.
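To make the while-versus-for point concrete, here is a minimal sketch of reading one line at a time while still honoring a start/end range. The name read_range and the sample data are hypothetical, and the in-memory file handle only stands in for patents.csv. Since every record must still be compared against all later records, this illustrates the reading pattern only, not the full pairwise job.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a file handle line by line and keep only the records whose
# zero-based index lies in [$start, $end]. (read_range is a hypothetical name.)
sub read_range {
    my ($fh, $start, $end) = @_;
    my @rows;
    my $index = 0;
    while (my $line = <$fh>) {
        last if $index > $end;      # stop reading once past the range
        chomp $line;
        push @rows, $line if $index >= $start;
        $index++;
    }
    return @rows;
}

# Demonstration on an in-memory file handle (made-up data).
my $data = "1000001,1,0,1\n1000002,0,0,1\n1000003,1,1,1\n1000004,0,1,0\n";
open(my $fh, '<', \$data) or die "Could not open in-memory data: $!\n";
my @rows = read_range($fh, 1, 2);
close($fh);
print "$_\n" for @rows;
```

With a real file, the same loop would take its bounds from $ARGV[0] and $ARGV[1], so the command-line limits survive the switch from for to while.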

Since it will take a VERY long time to run all of this program, the slightest improvements could save days or weeks. Any input on making this script as smart and efficient as possible would be greatly appreciated. Thanks in advance!!

    

#!/usr/bin/perl 
use strict; 
 
my($patno1,$patno2,@record1,@record2); 
 
my $startat=$ARGV[0]; 
my $endat=$ARGV[1]; 
 
open(IN, "<patents.csv")|| die("Could not open patents.csv file!\n");
my @lines=<IN>; 
close(IN); 
 
#clear variance file if it exists 
open(OUT, ">variance.csv")|| die("Could not open file variance.csv!\n"); 
close(OUT); 
 
chomp(@lines); 
 
# iterate over all patents 
for(my $i=$startat;$i<=$endat;$i++) 
{ 
    @record1=split(/\,/,$lines[$i]); 
    $patno1=shift(@record1);  
     
    # iterate through other lines to compare (including the last line) 
    for(my $j=$i+1;$j<=$#lines;$j++) 
    { 
        @record2=split(/\,/,$lines[$j]); 
        $patno2=shift @record2; 
         
        my $variance=0; 
         
        # iterate through each characteristic 
        for(my $k=0;$k<=$#record1;$k++) 
        { 
            if($record1[$k]!=$record2[$k]) 
            { 
                $variance++;                     
            } 
        } 
         
        open(OUT, ">>variance.csv")|| die("Could not open file variance.csv!\n"); 
        print OUT $patno1.",".$patno2.",".$variance."\n"; 
        close(OUT); 
             
         
    } 
}
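
One possible way to speed up the innermost loop, offered as a sketch rather than a tested replacement: pack each record's 0/1 characteristics into a bit string once, then count differences for a pair by XORing the two strings and counting the set bits with unpack's checksum form. The names pack_record and count_differences and the sample data are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn "1234567,1,0,1,..." into (patent number, packed bit string).
# Assumes every characteristic is exactly "0" or "1".
sub pack_record {
    my ($line) = @_;
    chomp $line;
    my ($patno, @chars) = split /,/, $line;
    return ($patno, pack('b*', join('', @chars)));
}

# Differences between two records: XOR the bit strings, then count set bits.
sub count_differences {
    my ($bits_a, $bits_b) = @_;
    return unpack('%32b*', $bits_a ^ $bits_b);
}

# Tiny demonstration with made-up records of 8 characteristics each.
my @sample = (
    "1000001,1,0,1,0,1,0,1,0",
    "1000002,1,1,1,0,0,0,1,0",
    "1000003,0,1,0,1,0,1,0,1",
);
my (@patnos, @bits);
for my $line (@sample) {
    my ($patno, $packed) = pack_record($line);
    push @patnos, $patno;
    push @bits,   $packed;
}

# Compare every pair once, as the nested loops above do.
for my $i (0 .. $#bits - 1) {
    for my $j ($i + 1 .. $#bits) {
        print join(',', $patnos[$i], $patnos[$j],
                   count_differences($bits[$i], $bits[$j])), "\n";
    }
}
```

Packed this way, 480 characteristics take 60 bytes per patent, so all 8 million records come to roughly 480 MB before Perl's per-scalar overhead, and each record is split only once instead of once per comparison.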

In reply to Huge data file and looping best practices by carillonator
