Hi Monks,

I need your wisdom to improve my nested for loop. I am trying to calculate correlations for about 50,000 categories, each with 100 values. Here is a snippet of my code:

#!/tools/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

## Variables and Data Structures
my $count = 1;     ## running number of the pair, used in the debug print
my @probesArray;   ## category names in file order
my %probes;        ## category name => array ref of values
my %calProbes;     ## "name1-name2" => correlation
my $size = 100;

## Reading the file
open my $FILE, '<', "data.txt" or die "ERROR: cannot read file $!\n";
while (my $line = <$FILE>){
    chomp $line;
    my @line = split(/\t/, $line);
    $probes{$line[0]} = [@line[1 .. $#line]];   ## value of the hash as an array
    push @probesArray, $line[0];   ## correlation between 1-2 or 2-1 will be same, so calculating only once can be done thru this array
}
close($FILE);
## Frankly speaking reading of the file takes less than 3 sec with 50,000 categories each having 100 values

## Correlation Calculation
for(my $i = 0; $i < $#probesArray; $i++){
    for(my $j = $i+1; $j <= $#probesArray; $j++){
        my @probe1 = @{$probes{$probesArray[$i]}};   ## Array of data
        my @probe2 = @{$probes{$probesArray[$j]}};
        my $cor = correlation(\@probe1, \@probe2, \$size);   ## correlation is the subroutine
        $calProbes{$probesArray[$i]."-".$probesArray[$j]} = $cor;
        # print $count,"\t",$probesArray[$i]."-".$probesArray[$j],"\t",$cor,"\n";
        $count++;
    }
}

## Subroutines
sub mean {
    my ($arr1, $arr2, $size) = @_;
    my @arr1 = @$arr1;
    my @arr2 = @$arr2;
    my $mu_x = sum(@arr1) / $$size;
    my $mu_y = sum(@arr2) / $$size;
    return($mu_x,$mu_y);
}

## Sum of Squared Deviations to the mean
sub ss {
    my ($arr1, $arr2, $mean_x,$mean_y) = @_;
    my @arr1 = @$arr1;
    my @arr2 = @$arr2;
    my ($ssxx, $ssxy, $ssyy) = (0) x 3;
    ## looping over all the samples
    for(my $i = 0; $i <= scalar(@arr1)-1; $i++){
        $ssxx = $ssxx + ($arr1[$i] - $mean_x)**2;
        $ssxy = $ssxy + ($arr1[$i] - $mean_x)*($arr2[$i] - $mean_y);
        $ssyy = $ssyy + ($arr2[$i] - $mean_y)**2;
    }
    return ($ssxx, $ssxy, $ssyy);
}

## Pearson Correlation Coefficient
sub correlation {
    my ($arr1, $arr2, $size) = @_;
    my ($mean_x,$mean_y) = mean($arr1, $arr2, $size);
    my ($ssxx, $ssxy, $ssyy) = ss($arr1, $arr2, $mean_x, $mean_y);
    my $cor = $ssxy/sqrt($ssxx*$ssyy);
    return($cor);
}
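
For reference, this is the formula the mean, ss and correlation subroutines are meant to compute, i.e. the usual Pearson correlation coefficient. The symbols are just my notation here: x and y are the two value arrays of a pair of categories and n is the number of values (100 in my data).

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \\
SS_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
r &= \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}
\end{align*}
```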

The correlation calculation is taking a lot of time: correlating category 1 against all the others (i.e. 50,000) takes around 40 sec. At that speed it will take on the order of 12 days to calculate the correlation coefficients for all 50k categories. The nested for loop is where the time goes. Is there a way to decrease the run time, or am I doing something wrong?
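
Spelling that estimate out as a back-of-the-envelope calculation, assuming the cost per pair stays constant and each pair is computed only once (these are my assumptions, not measurements beyond the 40 sec figure):

```latex
\begin{align*}
\text{pairs} &= \frac{50{,}000 \times 49{,}999}{2} = 1{,}249{,}975{,}000 \\
\text{time per pair} &\approx \frac{40\ \text{s}}{49{,}999} \approx 0.8\ \text{ms} \\
\text{total} &\approx 1{,}249{,}975{,}000 \times \frac{40\ \text{s}}{49{,}999}
  = 1{,}000{,}000\ \text{s} \approx 11.6\ \text{days}
\end{align*}
```

So the total is in the same range as the figure quoted above, and almost all of it is the roughly 1.25 billion calls to correlation.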

Any hints or advice would be highly appreciated. Thanks!


In reply to Improving the Nested For Loop by Anonymous Monk
