replicant4 has asked for the wisdom of the Perl Monks concerning the following question:

Mighty perl monks, I have the following script, (uses a few bioperl modules, but they are not important to my question.
#A perl script that takes in as arguments, the names of two alignments + files in msf format, #compares the sites in all of them, and calculates mutual information +between sites. #The files have to be in msf format. the calculations are done under t +he assumption that each site #evolved independently to the other sites in its alignmnet. # #Author : Kostantinos Gkogkoglou #e-mail: kg66@le.ac.uk #!/usr/bin/perl -w use warnings; use strict; use Mutual; #load the necessary packages #use Switch; #use Math::ElFun; #use POSIX qw(log10); use Bio::AlignIO; use Bio::Align::AlignI; use Bio::SeqIO; use Bio::Seq; use Mutual; use List::Util; #initializing the necessary variables. my $inputfilename=''; #inputfile for 1st sequence my $inputfilename2=''; my $outputfilename=''; #outputfile for both sequences my $TIM ; #the TIM alignment my $PER ; my $aln=''; #the first alignment my $aln2=''; my $out=''; #the output my $out2=''; my @seq = (); #the array that stores all the sequ +ence objects from the first alignment my @seq2 = (); my $site; #positions in the alignments #creating inputfiles for the user to type the alignment filenames and +the name of the outputfile. $inputfilename = $ARGV[0]; $inputfilename2 = $ARGV[1]; # Using the new file method from the Bio::AlignIO to load the sequence +s from the allignment file. $TIM = Bio::AlignIO->new(-file => $inputfilename ); $PER = Bio::AlignIO->new(-file => $inputfilename2); #Creating an AlignI object from the ALignIO input. Now we can manipula +te the sequence in many possible ways $aln= $TIM->next_aln; $aln2 =$PER->next_aln; #Sorting the sequences alphabetically #$aln->sort_alphabetically; $aln2->sort_alphabetically; #each_seq method. Gets an array of sequences for each sequence of the +array. @seq = $aln->each_seq(); @seq2 = $aln2->each_seq(); #calculating the length of the alignment(length of sequences). It will + be used in order to calculate the frequencies later on. my $no_of_sequences= $aln->no_sequences; #will be used to calculate th +e frequencies #my $no_of_sequences2 = $aln2->no_sequences; my $length1 = $aln->length; my $length2 = $aln2->length; my $no_of_residues = $aln->no_residues; my $no_of_residues2 = $aln2->no_residues; #the following method displays the sequence as a string . Just doing t +hat in order to see what's inside my array. my $a; for ($a=1; $a<=1000; $a++){ print "CURRENT ROUND IS :$a\n"; my @shuffled = fisher_yates_shuffle(\@seq); #my @shuffled = fisher_yates_shuffle(\@seq); foreach my $seq(@seq){ my $id = $seq->display_id; print "$id\n"; } #accecsing all the individual aminoacids of the 1st alignment sequence +. Print all aminoacids for each site of the alignment #my $seq; my $j; my @site1; for ($j=1;$j<= $length1;$j++){ my $site; foreach my $seq(@seq){ my $res = $seq->subseq($j,$j); $site .=$res; } push (@site1, $site); } #doing the same for the second alignment file. my $i; my @site2; for ($i=1;$i<= $length2;$i++){ my $site; foreach my $seq(@seq2){ my $res = $seq->subseq($i,$i); $site .=$res; } push (@site2, $site); } my $site1; my $site2; my $l=0; foreach $site1 (@site1) { my @site1_parts = split(//, $site1); $l++; my $j=0; my $mutual; my $Hx= get_Hx($site1); #print "$Hx\n"; foreach $site2 (@site2) {#begining of loop2 for each site 2 $j++; my $Hy = get_Hx($site2); my $Hxy; my %counts; my @count_collection; my @site2_parts = split(//, $site2); my $i = @site1_parts; $counts{get_index($site1_parts[$i], $site2_parts[$i])}++ while + ($i--); push(@count_collection, \%counts); foreach my $counts (@count_collection) { my $total; foreach (sort keys %$counts) { #print "$counts{$_}\n"; my $freq = $counts->{$_}/length($site1); my $MI = -$freq * log_2$freq; #$total +=$freq; $Hxy +=$MI; } #print "$Hxy \ $Hx\ $Hy \n"; $mutual = $Hx + $Hy - $Hxy; } print " $l,$j,$mutual\n"; #print " $l,$j\n"; #print($_, ': ', $counts{$_}, $/) foreach (sort keys %counts); #print($/); }#end of loop2 for each site2 } }#end of the procedure of shuffling exit; # #

Ignore the first 50 lines, where the bioperl is used. At the end what I do is that I take a sequence of letters and I calculate the value of mutual information($mutual) which is then printed on screen. It does that 1000 times, that is it runs 1000 loops. What I need is that for exaple this script generate 482*482 values per loop, and I only want to keep the highest value per loop, and store it into an array, which will then be printed. That is that in the end I should have 1000 values into my array. How can I select only the highest value $mutual per loop. Also find below the module used in the code.
package Mutual; use strict; require Exporter; our @ISA = qw(Exporter); our @EXPORT = qw(log_2 get_index get_Hx fisher_yates_shuffle); # fisher_yates_shuffle( \@array ) : # generate a random permutation of @array in place sub fisher_yates_shuffle { my $array = shift; my $i; for ($i = @$array; --$i; ) { my $j = int rand ($i+1); next if $i == $j; @$array[$i,$j] = @$array[$j,$i]; } } # fisher_yates_shuffle( \@array ); # permutes @array in place sub log_2 { my $base = 2; my $value = $_[0]; return log($value)/log($base); } sub get_index{ $_[0] lt $_[1] ? ($_[0].$_[1]) : ($_[1].$_[0]) } sub get_Hx{ my $site = $_[0]; my $no_of_sequences =length($site); my $count_A=0 ; my $count_C=0 ; my $count_D=0 ; my $count_E=0 ; my $count_F=0 ; my $count_G=0 ; my $count_H=0 ; my $count_I=0 ; my $count_J=0 ; my $count_K=0 ; my $count_L=0 ; my $count_M=0 ; my $count_N=0 ; my $count_P=0 ; my $count_Q=0 ; my $count_R=0 ; my $count_S=0 ; my $count_T=0 ; my $count_V=0 ; my $count_W=0 ; my $count_Y=0 ; my $count_ =0 ; my $sum1=0; my $sum2=0; my $sum3=0; my $sum4=0; my $sum5=0; my $sum6=0; my $sum7=0; my $sum8=0; my $sum9=0; my $sum10=0; my $sum11=0; my $sum12=0; my $sum13=0; my $sum14=0; my $sum15=0; my $sum16=0; my $sum17=0; my $sum18=0; my $sum19=0; my $sum20=0; my $sum21=0; while($site =~m/A/g){ $count_A++; } while($site =~m/C/g){ $count_C++; } while($site =~m/D/g){ $count_D++; } while($site =~m/E/g){ $count_E++; } while($site =~m/F/g){ $count_F++; } while($site =~m/G/g){ $count_G++; } while($site =~m/H/g){ $count_H++; } while($site =~m/I/g){ $count_I++; } while($site =~m/K/g){ $count_K++; } while($site =~m/L/g){ $count_L++; } while($site =~m/M/g){ $count_M++; } while($site =~m/N/g){ $count_N++; } while($site =~m/P/g){ $count_P++; } while($site =~m/Q/g){ $count_Q++; } while($site =~m/R/g){ $count_R++; } while($site =~m/S/g){ $count_S++; } while($site =~m/T/g){ $count_T++; } while($site =~m/V/g){ $count_V++; } while($site =~m/W/g){ $count_W++; } while($site =~m/Y/g){ $count_Y++; } while($site =~m/\./g){ $count_++; } while($site =~m/\_/g){ $count_++; } #if there is a counter with +ve value in any position of the array, th +en calculate the mutual information #value and store it there. if ($count_A > 0){ $sum1= -($count_A/$no_of_sequences)*log_2($count_A/$no_of_sequenc +es); } if ($count_C > 0){ $sum2= -($count_C/$no_of_sequences)*log_2($count_C/$no_of_sequenc +es); } if ($count_D > 0){ $sum3= -($count_D/$no_of_sequences)*log_2($count_D/$no_of_sequenc +es); } if ($count_E > 0){ $sum4= -($count_E/$no_of_sequences)*log_2($count_E/$no_of_sequenc +es); } if ($count_F > 0){ $sum5= -($count_F/$no_of_sequences)*log_2($count_F/$no_of_sequenc +es); } if ($count_G > 0){ $sum6= -($count_G/$no_of_sequences)*log_2($count_G/$no_of_sequenc +es); } if ($count_H > 0){ $sum7= -($count_H/$no_of_sequences)*log_2($count_H/$no_of_sequenc +es) ; } if ($count_I > 0){ $sum8= -($count_I/$no_of_sequences)*log_2($count_I/$no_of_sequenc +es) ; } if ($count_J > 0){ $sum9= -($count_J/$no_of_sequences)*log_2($count_J/$no_of_sequenc +es); } if ($count_K > 0){ $sum10= -($count_K/$no_of_sequences)*log_2($count_K/$no_of_sequen +ces); } if ($count_L > 0){ $sum11=-($count_L/$no_of_sequences)*log_2($count_L/$no_of_sequenc +es); } if ($count_M > 0){ $sum12=-($count_M/$no_of_sequences)*log_2($count_M/$no_of_sequenc +es); } if ($count_N > 0){ $sum13= -($count_N/$no_of_sequences)*log_2($count_N/$no_of_sequen +ces); } if ($count_P > 0){ $sum14= -($count_P/$no_of_sequences)*log_2($count_P/$no_of_sequen +ces); } if ($count_R > 0){ $sum15= -($count_R/$no_of_sequences)*log_2($count_R/$no_of_sequen +ces); } if ($count_S > 0){ $sum16= -($count_S/$no_of_sequences)*log_2($count_S/$no_of_sequen +ces); } if ($count_T > 0){ $sum17= -($count_T/$no_of_sequences)*log_2($count_T/$no_of_sequen +ces); } if ($count_V > 0){ $sum18= -($count_V/$no_of_sequences)*log_2($count_V/$no_of_sequen +ces); } if ($count_W > 0){ $sum19= -($count_W/$no_of_sequences)*log_2($count_W/$no_of_sequen +ces); } if ($count_Y > 0){ $sum20= -($count_Y/$no_of_sequences)*log_2($count_Y/$no_of_sequen +ces); } if ($count_ > 0){ $sum21= -($count_/$no_of_sequences)*log_2($count_/$no_of_sequence +s); } my $Hx = $sum1+$sum2+$sum3+$sum4+$sum5+$sum6+$sum7+$sum8+$sum9+$sum10 ++$sum11+$sum12+$sum13+$sum14 +$sum15+$sum16+$sum17+$sum18+$sum19+$sum20+$sum21; return $Hx; }#end of foreach 1;

Thanks in advance for anyone's help.

Janitored by Arunbear - added readmore tags, as per Monastery guidelines

Replies are listed 'Best First'.
Re: I need to sort out some values and get only the highest generated by my perl script
by PodMaster (Abbot) on Oct 29, 2004 at 09:05 UTC
    `perldoc perldata'

    Why it's stupid to `use a variable as a variable name' part 1 2 3

    `perldoc -q count'

    my %count; for my $letter ( 'A' .. 'Z', '_', '.' ){ $count{$letter} = () = $site =~ m/\Q$letter/g; }
    One you clean up your program to eliminate all that repetitive scripting, it'll be easier to answer your own question.

    update: noticed '_' and '.' were in the mix and updated example accordingly (added \Q).

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

      Or perhaps just:

      my %count; map { ++$count{$_} } ( $site =~ /([A-WY._])/g );

      Minor note: there is no X or Z.

      Regards,

      PN5

Re: I need to sort out some values and get only the highest generated by my perl script
by reneeb (Chaplain) on Oct 29, 2004 at 09:08 UTC
    you can short your sub a lot:
    sub get_Hx{ my $site = $_[0]; my $no_of_sequences =length($site); my $sum = 0; foreach(qw/A C D E F G H I J K L M N P Q R S T V W Y _ \./){ my $count = () = $site =~ /$_/g; $sum += -($count/$no_of_sequences)*log_2($count/$no_of_sequences) +if($count > 0); } return $sum; }
Re: I need to sort out some values and get only the highest generated by my perl script
by Eimi Metamorphoumai (Deacon) on Oct 29, 2004 at 15:04 UTC
    Without really digging that deep into your code, the task of finding the largest value is usually best handled by (pseudocode):
    my ($largest_index, $largest_value) = (-1E6, undef); #make sure $largest_index starts below the lowest possible #value you can get for my $data (@some_big_long_list){ my $this_index = index_from($data); if ($this_index > $largest_index){ $largest_index = $this_index; $largest_value = $data; } } #do something with largest data
    That is, just keep track of what the largest you've seen so far is as you're going through them, and when you reach the end, it's guaranteed to be the largest overall.
      If all he is doing is printing the largest value, would not the following suffice?

      program.pl|sort -n|tail -1

      No extra coding needed, and it uses programs that already exist on the system.

        If that's what's happening, yes. I read his message as "I have a loop that goes 1000 times, and each time it generates a bunch of data. I want the largest value produced from each run of the loop, ie, 1000 largest values," which wouldn't work from your model. Also, it's almost certain to be less efficient (O(n log n) instead of O(n) to do the sort, then throw away all the extra sort data that was just computed).