file comparisons: - making it faster

$new_guy has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am becoming a bit more daring in trying to understand and use perl. Now I wrote the script below, which does excatly what i want but at an alarmingly slow pace. Is there any way you think I could speed it up as well as learn, so I could do this again in the future. My script compares entities in lines between 2 files

#!/usr/bin/perl 

use Bio::Perl;
use IO::String;
use Bio::SeqIO;
use List::Util 'max';
use Text::CSV;
use Array::Utils qw(:all);

if (scalar(@ARGV) != 2) {
    
    print "\n";
    print "Usage: compare.pl <master_file> <query_file>\n" ;
    print "The master file is the annotated one\n";
    print "The query file is the non-annoted subset file\n";
    print "\n"; exit();
}

my ($file1,$file2) = @ARGV; #read in orthomcl results, gene-number and
+ taxa-number
open(INFILE, $file1);

##remove/overwrite on all similarly named output files
my $remove1 = "compared_annotations.txt";
if (unlink($remove1) == 1) {
                        print "Existing \"$remove1\" file was removed\
+n";
                          }

#Create an output file
my $outputfile = "compared_annotations.txt";
           if (! open(POS, ">>$outputfile") ) {
            print "Cannot open file \"$outputfile\" to write to!!\n\n"
+;
                exit;
        } 

# For each line in the input file (i.e. each ortholog group)...
while (my $line = <INFILE>) {
    chomp;

# First, get the cluster number.
    our ($cluster, $other) = split(/\s/, $line, 2);
    
   #print "$cluster : ************************************************
+*********************************************************************
+**************\n";
   #print "$other\n";
   
#declare variables   
    my $a;
    my $b;
    my $c;
    my $d;
    my $e;
    my $f;
    my $g;
    my $i;
    our @a = (); #make @a public
    my @c = ();
        
##remove white spaces in the data! (very annoying)
    $a = $other; 
    $a =~ s/[\t ]+/ /g;
    $b = $a;
    $b =~ s/^ //mg;
    $c = $b;
    $c =~ s/ $//mg;
    
    chomp $c; # remove trailing white space in $c ('ugly stuff')

#now break the entries into pieces and store them in an array!    
    $d = $c;
    
#remove comma's

    $d =~ s/,3/  3/g;
    $d =~ s/\),/)  /g;
     
    #@a = split(/(\))\s/, $d);
    #@a = split(/\)\s([^\)])/, $d);
    #@a = split(/(\))/, $d);

    $d =~ s/\)\s/)>/g;
    
    $e = $d;
    
    $e =~ s/\(unknown\)//g; #remove the unknown brackets from the draf
+t genomes 
    
    $f = $e;

    #push(@a, $f);
    
    @a = split(/>/, $f); # split the second half of the genome info in
+ a line into an array 
    
    #print "@a***\n\n\n";
    
    #foreach $g(@a){
        #print "$g~~\n";   #check out the split bits
    #print "$g\n";
    #}
    
       my $non_ommitted_pattern =~ /(\W+).+/;
    foreach (@a) {our @match = grep {$_ == $non_ommitted_pattern} @a;
    #print "@match\n\n\n\n";
    }
    
    my ($w1, @w1) = read_query_file($file2);  #pretty important - ie a
+llows printing in the sub

}
    
###################################################
sub read_query_file
### Opens and reads file data into an array ### 
    {
    my ($filename) = @_;    
    unless (open(FILEDATA, $filename))
        {print "\nCannot open file \"$filename\".\n";
        exit;
        }
    while (my $line2 = <FILEDATA>){
    chomp;
    # First, get the subset cluster number.
    our ($cluster1, $other1) = split(/\s/, $line2, 2);
    
    chomp $other1;
    
    #print "$cluster1  ###############################################
+#####################################################################
+#####\n";
    #print "$other1\n";
    
   my @A2 = split(/\s/, $other1);
    chomp @A2;
   
   #foreach my $A2(@A2){
    #print "$A2\n";
   #}
           
#now do the matching
        
    foreach $i(@a){
    foreach my $i2(@A2){
        if($i =~ m/$i2/){
        #print POS "$cluster $cluster1 ## $i  $i2 @match\n\n";
        print POS "$cluster,, $cluster1,, @match,,\n";
        next;
        }
        
        else{next;}
        }
    }
    }
    
           next;
        #return;
     #   }
    }
###################################################        
1;
[download]

Comment on file comparisons: - making it faster Download Code

Replies are listed 'Best First'.
Re: file comparisons: - making it faster by moritz (Cardinal) on Jun 20, 2011 at 14:05 UTC
Please read perlperf for general advice for tuning your code. But first you should focus on making your code correct and readable. For example `my $non_ommitted_pattern =~ /(\W+).+/; foreach (@a) {our @match = grep {$_ == $non_ommitted_pattern} @a;` [download] is almost certainly not what you want. `$non_ommitted_pattern` contains the result of matching $_ against the regex, and `==` does numeric coparison between the result of that regex match and the current item. If there is code that doesn't do what you want, and the program as a whole still works correctly, you might even be able to simply remove the non-functional code. So there might be a performance gain in making it correct. Concerning readability, a consistent indention style would help a lot, as well as using a single variable for a single thing (and not $a, $b, $c ... for the same thing). Perl 6 - second systems done right	[reply] [d/l] [select]
Re: file comparisons: - making it faster by roboticus (Chancellor) on Jun 20, 2011 at 14:59 UTC
$new_guy: A couple observations: You're reading a file inside a loop, so you can potentially read and discard it a large number of times. Perhaps it would be better to store the data and simply reuse it? You've got some nested loops: the number of times you execute the code in the innermost loop is typically the product of the number of iterations of each loop. So you need to: Remove nested loops when possible, and when it's not possible, move operations out of the innermost loop. Of course, this is assuming that you've run the appropriate profiling tools and determined that the nested loops are your problem. (Don't optimize what isn't a problem.) ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply]
Re: file comparisons: - making it faster by rev_1318 (Chaplain) on Jun 20, 2011 at 14:42 UTC
Some more general pointers: Start your code with: `use strict; use warnings;` [download] Code like `if (unlink($remove1) == 1) { ...` is better written as `if ( -e $remove ) { unlink $remove or die "unable to delete $remoe: $!\n"; }` [download] Use lexical file handles instead of global ones The call `my ($w1, @w1) = read_query_file($file2);` makes no sense; the sub is not returning anything, only the side effects inside the sub are used! The use of `$a, $b, $c` is not needed. Use describtive variable names and don't use you many... Paul	[reply] [d/l] [select]
Re: file comparisons: - making it faster by Marshall (Canon) on Jun 20, 2011 at 15:14 UTC
A few more suggestions for you... `This code.... # First, get the subset cluster number. our ($cluster1, $other1) = split(/\s/, $line2, 2); chomp $other1; my @A2 = split(/\s/, $other1); chomp @A2; is just this... my ($cluster1, @A2) = split(/\s+/,$line2); This line does not do what you think it does... chomp $c; # remove trailing white space in $c ('ugly stuff')` [download] Probably everywhere you split on \s, \s+ is what you need. This removes any sequence of space characters. Space characters are: the space,\t\f\r\n. So when you split on white spaces, there is no need to chomp. BTW, chomp does not remove trailing white space, only the trailing \n if present. There will not be any trailing \n or any other white space if you split on \s+. There is no need to do the split in two phases. In my line above, $other gets the first token and @A2 get all of the rest. An "our" variable is used to give a variable package scope and also causes it to be put in the symbol table for export. If you really mean to share a variable between two functions, declare a "my" variable at a higher level that is common to both. There is no need in your code for any "our" variables. It would help if you give better names to the variables. You are very close to having this work under "use strict; use warnings;". Go ahead and include those directives. Update: re: performance...I don't know enough about your application (like what input lines look like) and what the definition of "equivalent files" means to say anything other than some generic advice. Code like the below, will cause a*A2 comparisons. If these 2 arrays are big, then that's a lot of compares. Use the profiler to see if this is a bottleneck or not. A hash based approach might work out well. Do a search for "compare arrays". But don't spend time on it if this is not where the problem lies. `foreach $i(@a){ foreach my $i2(@A2){ if($i =~ m/$i2/){` [download]	[reply] [d/l] [select]