Hi Monks,

I need your wisdom to improve my "nested for loop". I am trying to calculate correlations for about 50,000 categories, each of which has 100 values. Here is a snippet of my code:


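#!/tools/bin/perl

use strict;
use warnings;
use List::Util qw(sum); ## sum() is used by the mean subroutine

## Variables and Data Structures
my $count = 1;
my @probesArray;
my %probes;
my %calProbes; ## correlation for each pair, keyed by "name1-name2"
my $size = 100;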

## Reading the file
open my $FILE, '<', 'data.txt' or die "ERROR: cannot read file: $!\n";
while (my $line = <$FILE>) {
    chomp $line;
    my @line = split /\t/, $line;
    ## value of the hash is an array reference
    $probes{$line[0]} = [@line[1 .. $#line]];
    ## correlation between 1-2 and 2-1 is the same, so this array lets each pair be calculated only once
    $probesArray[$count] = $line[0];
    $count++;
}
close($FILE); ## Frankly speaking, reading the file takes less than 3 sec with 50,000 categories each having 100 values

## Correlation Calculation
## @probesArray is filled from index 1, so valid indices run from 1 to $#probesArray
my $pair = 0; ## number of pairs processed so far
for my $i (1 .. $#probesArray - 1) {
    for my $j ($i + 1 .. $#probesArray) {
        my @probe1 = @{$probes{$probesArray[$i]}}; ## Array of data
        my @probe2 = @{$probes{$probesArray[$j]}};
        my $cor = correlation(\@probe1, \@probe2, \$size); ## correlation is the subroutine
        $calProbes{$probesArray[$i]."-".$probesArray[$j]} = $cor;
#        print $pair,"\t",$probesArray[$i]."-".$probesArray[$j],"\t",$cor,"\n";
        $pair++;
    }
}

## Subroutines
sub mean {
    my ($arr1, $arr2, $size) = @_;
    my @arr1 = @$arr1;
    my @arr2 = @$arr2;
    my $mu_x = sum(@arr1) / $$size;
    my $mu_y = sum(@arr2) / $$size;
    return ($mu_x, $mu_y);
}
 
## Sum of Squared Deviations to the mean
sub ss {
    my ($arr1, $arr2, $mean_x, $mean_y) = @_;
    my @arr1 = @$arr1;
    my @arr2 = @$arr2;
    my ($ssxx, $ssxy, $ssyy) = (0) x 3;

    ## looping over all the samples
    for (my $i = 0; $i <= scalar(@arr1) - 1; $i++) {
        $ssxx = $ssxx + ($arr1[$i] - $mean_x)**2;
        $ssxy = $ssxy + ($arr1[$i] - $mean_x) * ($arr2[$i] - $mean_y);
        $ssyy = $ssyy + ($arr2[$i] - $mean_y)**2;
    }
    return ($ssxx, $ssxy, $ssyy);
}

## Pearson Correlation Coefficient
sub correlation {
    my ($arr1, $arr2, $size) = @_;
    my ($mean_x, $mean_y) = mean($arr1, $arr2, $size);
    my ($ssxx, $ssxy, $ssyy) = ss($arr1, $arr2, $mean_x, $mean_y);
    my $cor = $ssxy / sqrt($ssxx * $ssyy);
    return ($cor);
}

The correlation calculation is taking a lot of time: calculating the correlation between category 1 and all the others (i.e. 50,000) takes around 40 sec. At that speed, it will take more than 12 days to calculate the correlation coefficients for all 50k categories. The nested for loop is where the time goes. Is there a way to decrease the run time, or am I doing something wrong?

Any hints or advice would be highly appreciated. Thanks!
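
On the run-time question above, one direction that may help: the mean and the sum of squared deviations of a category do not depend on which category it is paired with, so they can be computed once per category instead of once per pair. If each row is centred on its mean and scaled to unit length up front, the Pearson correlation of a pair is just the dot product of the two standardized rows. The following is only a sketch under those assumptions; the helper names standardize and dot and the three-row data set are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

## Hypothetical helper: centre a row on its mean and scale it to unit length.
## Returns undef for a constant row, whose correlation is undefined.
sub standardize {
    my ($values) = @_;
    my $mean    = sum(@$values) / @$values;
    my @centred = map { $_ - $mean } @$values;
    my $norm    = sqrt(sum(map { $_**2 } @centred));
    return undef if $norm == 0;    ## explicit undef so the hash below keeps its pairs
    return [ map { $_ / $norm } @centred ];
}

## Hypothetical helper: dot product of two equal-length array references.
sub dot {
    my ($x, $y) = @_;
    my $total = 0;
    $total += $x->[$_] * $y->[$_] for 0 .. $#$x;
    return $total;
}

## Tiny made-up data set standing in for %probes.
my %probes = (
    A => [1, 2, 3, 4],
    B => [2, 4, 6, 8],
    C => [4, 3, 2, 1],
);

## Standardize each category once, outside the pair loop.
my @names = sort keys %probes;
my %unit  = map { $_ => standardize($probes{$_}) } @names;

## Each unordered pair is visited once, and no arrays are copied.
my %cor;
for my $i (0 .. $#names - 1) {
    for my $j ($i + 1 .. $#names) {
        my ($p, $q) = @names[$i, $j];
        next unless defined $unit{$p} && defined $unit{$q};
        $cor{"$p-$q"} = dot($unit{$p}, $unit{$q});
    }
}

printf "%s\t%.4f\n", $_, $cor{$_} for sort keys %cor;
```

In the made-up data, B is an exact multiple of A and C is A reversed, so A-B should come out at 1 and the other two pairs at -1, up to floating-point rounding. This removes the repeated mean and deviation work and the per-pair array copies, but 50,000 categories still give roughly 1.25 billion pairs of 100 multiplications each, so pure Perl may remain slow; a compiled numeric library such as PDL may be worth considering for the dot products.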

In reply to Improving the Nested For Loop by Anonymous Monk
