in reply to Code Optimization

G'day azheid,

Cutting your code down to a skeleton, here's where I think your two biggest problems are:

if(!($permute)){
    open(SEQ,'<',$sequence_fname)||die "Cannot open sequence file\n";
    while (my @line=split (/\t/,<SEQ>)){
        ...
    }
    ...
}
else{
    for (...) {
        ...
        open(SEQ,'<',$sequence_fname)||die "Cannot open sequence file\n";   # open the sequence file
        while (my @line=split (/\t/,<SEQ>)){   # for each sequence, record the nmer information
            ...
        }
        close SEQ;
        open(OUT,">>$out")||die "Cannot open $out\n";
        foreach my $key(keys %ktc){
            ...
            print OUT ...
            ...
        }
        close OUT;
    }
}

You open the SEQ file, read the data from disk, and parse it multiple times: you only need to do this once.

Also, you open and close the OUT file for appending multiple times: you only need to do this once.

Without changing your coding style, I suspect the following, which opens each of those files only once, would be substantially faster:

my @seq_data;
open(SEQ,'<',$sequence_fname)||die "Cannot open sequence file\n";
while (<SEQ>) {
    push @seq_data, [split /\t/];
}
close SEQ;

if(!($permute)){
    for (@seq_data) {
        my @line = @$_;
        ...
    }
    ...
}
else{
    open(OUT,">>$out")||die "Cannot open $out\n";
    for (...) {
        ...
        for (@seq_data) {
            my @line = @$_;
            ...
        }
        foreach my $key(keys %ktc){
            ...
            print OUT ...
            ...
        }
    }
    close OUT;
}

There may be other areas where substantial gains could be made, but I haven't looked beyond the two I/O ones at this point.

-- Ken

Re^2: Code Optimization
by azheid (Sexton) on Sep 10, 2013 at 09:53 UTC

    Originally, I had the program open the file in the manner you've coded (once only); however, the amount of data in question is massive. This caused the program to use up all the memory of the server I am using, and the OS would randomly kill the program in favor of OS processes. I, perhaps incorrectly, favored the line-by-line approach because, at runtime, the program uses only 0.3% of the RAM instead of 100%. Perhaps it would be worthwhile to break the files into chunks and read in maybe 100,000 or so lines at a time?
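    Something along those lines might look like this rough sketch (process_chunk() is just a stand-in for the real per-chunk nmer-counting work, and the chunk size is a guess):

    # Rough sketch: read and process the file in fixed-size chunks.
    # process_chunk() is a placeholder for the real per-chunk logic.
    my $chunk_size = 100_000;
    open(SEQ, '<', $sequence_fname) or die "Can't open $sequence_fname: $!";
    my @chunk;
    while (my $line = <SEQ>) {
        push @chunk, [split /\t/, $line];
        if (@chunk >= $chunk_size) {
            process_chunk(\@chunk);
            @chunk = ();                      # release the chunk before reading more
        }
    }
    process_chunk(\@chunk) if @chunk;         # don't forget the final partial chunk
    close SEQ;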

      In that case, I'd suggest using Tie::File:

      use Tie::File;
      ...
      tie my @seq_data, 'Tie::File', $sequence_fname
          or die "Can't open $sequence_fname: $!";

      if(!($permute)){
          for (@seq_data) {
              my @line = split /\t/;
              ...
          }
          ...
      }
      else{
          open(OUT,">>$out")||die "Cannot open $out\n";
          for (...) {
              ...
              for (@seq_data) {
                  my @line = split /\t/;
                  ...
              }
              foreach my $key(keys %ktc){
                  ...
                  print OUT ...
                  ...
              }
          }
          close OUT;
      }
      untie @seq_data;

      That'll be a little slower because you'll be repeating the split /\t/, but at least you won't have memory issues.

      Also, note the change I made to die. This is not about optimising the efficiency of your code; it's to improve feedback when things go wrong. When you terminate the die message with a newline, you suppress the file and line information Perl would otherwise append. Also, "$!" provides additional information about why open failed; see "perlvar: Error Variables". This is a good habit to get into; alternatively, consider using autodie.
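      To illustrate the difference (the exact $! text varies by OS):

      # Trailing newline: no "at script.pl line NN", and no reason for the failure.
      open(SEQ,'<',$sequence_fname) || die "Cannot open sequence file\n";

      # No trailing newline, plus $!: Perl appends the file and line,
      # and $! reports why open failed (e.g. "No such file or directory").
      open(SEQ,'<',$sequence_fname) or die "Can't open $sequence_fname: $!";

      # Alternatively, autodie generates detailed messages for you:
      use autodie;
      open(my $seq_fh, '<', $sequence_fname);   # dies with file, line and reason on failure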

      -- Ken

        Thanks, all. After implementing Ken's Tie::File suggestion, the code runs much, much faster. I learned something, and I appreciate everyone's comments.