kulls has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have computed one equation based on the given string.
Can anyone please review my code.
#!c:\perl\bin\perl use strict; use warnings; use Math::BigInt; use Data::Dumper; my %index_val = ( A=> 2, C => 3, D => 4, N => 2, M => 2, T => 1, G => 4); my $sequence="ACTGACN"; my $distance_index= int(rand(4)); my $window=int(rand(3)); my $arr_ref = &alpha_index({ sequence => $sequence, distance => $distance_index, aa_index => \%index_val, window_size => $window }); print STDERR Dumper($arr_ref); sub alpha_index { my $self=shift; my $sequence=$self->{sequence}; my $distance=$self->{distance}; my $aa_index=$self->{aa_index}; my ($left,$middle,$right,@carbon_index); my $window_size= $self->{window_size}; if($self->{window_size} ==1) { $window_size=1;} elsif($self->{window_size} > 1 ) { $window_size = int(($self->{window_size}-1)/2); } else { $window_size=0; } foreach (0.. (length($sequence)-1)) { my %carbon_aa=(); ($left,$middle,$right) = $sequence =~ m/(\w{$_})?(\w{1})?(\w*)/; $carbon_aa{$middle}=$aa_index->{$middle} if($middle); $left=reverse($left); my ($left_aa,$right_aa,$distance_index,$power)=0; my $initial_index=2; my $length_index= ($window_size) ? $window_size:length($left); if($self->{window_size} ==1 ) { push @carbon_index, \%carbon_aa if(%carbon_aa); next; } foreach my $increment (0..$length_index-1) { $left_aa=substr($left,$increment,1); next if not exists $aa_index->{$left_aa}; $power=Math::BigInt->new($initial_index); $distance_index=($power->bpow($distance))->numify(); $carbon_aa{$middle}+= ( $aa_index->{$left_aa} * (1/$distance_index)); $initial_index++; } $initial_index=2; $length_index= ($window_size)?$window_size:length($right); foreach my $increment (0..$length_index-1) { $right_aa=substr($right,$increment,1); next if not exists $aa_index->{$right_aa}; $power=Math::BigInt->new($initial_index); $distance_index=($power->bpow($distance))->numify(); $carbon_aa{$middle}+= ( $aa_index->{$right_aa} * ( 1/$distance_index)); $initial_index++; } push @carbon_index, \%carbon_aa if(%carbon_aa); } return \@carbon_index; }

Thanks in advance.
- kulls

Replies are listed 'Best First'.
Re: Equation - code review
by GrandFather (Saint) on Aug 16, 2006 at 03:10 UTC

    Does it work against your test suite?

    Update

    Why &alpha_index({...?

    Why my $self=... when alpha_index doesn't seem to be a method?

    Why my ($left,$middle,$right when they are used locally to the foreach (@carbon_index is correctly declared)

    Why the assignment in my %carbon_aa=(); when it is empty to start with anyway?

    Why my ($left_aa,$right_aa,$distance_index,$power)=0; when only $left_aa gets a value?

    Why are $left_aa,$right_aa,$distance_index,$power not declared local to their foreach loops?


    DWIM is Perl's answer to Gödel
      Hi,
      Thanks for your valuable comments.
      For your first two question,
      >>
      I just copied the method from my module and created a dummy arguments around it. I can't paste the entire module, since dependency is more and the security. I apologize for that.

      Here I have Updated my code based on your comments, please review it
      #!c:\perl\bin\perl use strict; use warnings; use Math::BigInt; use Data::Dumper; my %index_val = ( A=> 2, C => 3, D => 4, N => 2, M => 2, T => 1, G => 4); my $sequence="ACTGACN"; my $distance_index= int(rand(4)); my $window=int(rand(3)); my $arr_ref = alpha_index({ sequence => $sequence, distance => $distance_index, aa_index => \%index_val, window_size => $window }); print STDERR Dumper($arr_ref); sub alpha_index { my $self=shift; my $sequence=$self->{sequence}; my $distance=$self->{distance}; my $aa_index=$self->{aa_index}; my @carbon_index; my $window_size= $self->{window_size}; if($self->{window_size} ==1) { $window_size=1;} elsif($self->{window_size} > 1 ) { $window_size = int(($self->{window_size}-1)/2); } else { $window_size=0; } foreach (0.. (length($sequence)-1)) { my %carbon_aa; my ($left,$middle,$right) = $sequence =~ m/(\w{$_})?(\w{1} +)?(\w*)/; $carbon_aa{$middle}=$aa_index->{$middle} if($middle); $left=reverse($left); my $initial_index=2; my $length_index= ($window_size) ? $window_size:length($left); if($self->{window_size} ==1 ) { push @carbon_index, \%carbon_aa if(%carbon_aa); next; } foreach my $increment (0..$length_index-1) { my $left_aa=substr($left,$increment,1); next if not exists $aa_index->{$left_aa}; my $power=Math::BigInt->new($initial_index); my $distance_index=($power->bpow($distance))->numify() +; $carbon_aa{$middle}+= ( $aa_index->{$left_aa} * (1/$distance_index)); $initial_index++; } $initial_index=2; $length_index= ($window_size)?$window_size:length($right); foreach my $increment (0..$length_index-1) { my $right_aa=substr($right,$increment,1); next if not exists $aa_index->{$right_aa}; my $power=Math::BigInt->new($initial_index); my $distance_index=($power->bpow($distance))->numify() +; $carbon_aa{$middle}+= ( $aa_index->{$right_aa} * ( 1/$distance_index)); $initial_index++; } push @carbon_index, \%carbon_aa if(%carbon_aa); } return \@carbon_index; }

      Thanks in advance.
      - kulls
        Before you start looking for areas that you can optimize you should look for areas that should be optimized. You can do this using Benchmark and splitting the function into working pieces, and then testing those pieces.

        This line strikes me as possibly inefficient because it creates a new regex for each iteration, which means 1000 regex for a 1000 character sequence.

        ($left,$middle,$right) = $sequence =~ m/(\w{$_})?(\w{1})?(\w*)/;
        So I wrote a benchmark to try a few strategies: Results for running 500 iterations with length 7:
                  Rate re_orig  re_set bsubstr
        re_orig  681/s      --    -12%    -92%
        re_set   774/s     14%      --    -91%
        bsubstr 8475/s   1144%    995%      --
        
        Results for running 500 iterations with length 100:
                  Rate re_orig  re_set bsubstr
        re_orig 48.1/s      --    -12%    -93%
        re_set  54.9/s     14%      --    -92%
        bsubstr  718/s   1392%   1209%      --
        
        Results for running 100 iterations with length 1000:
                  Rate  re_set bsubstr
        re_set  4.00/s      --    -93%
        bsubstr 59.5/s   1387%      --
        
        

        With a little tidying (to my eye anyway) and applying suggestions from imp and BrowserUK the following code gives the same answers as the original for the small range of values I've tested and ought to be somewhat faster.

        #!c:\perl\bin\perl use strict; use warnings; my %index_val = ( A=> 2, C => 3, D => 4, N => 2, M => 2, T => 1, G => 4); my $sequence="ACTGACN"; my $distance_index = 2; my $window = 2; my $arr_ref = &alpha_index({ sequence => $sequence, distance => $distance_index, aa_index => \%index_val, window_size => $window }); for my $hashRef (@$arr_ref) { print join "\n", map {"$_ => $hashRef->{$_}"} keys %$hashRef; print "\n"; } sub alpha_index { my $self=shift; my $sequence=$self->{sequence}; my $distance=$self->{distance}; my $aa_index=$self->{aa_index}; my @carbon_index; my $window_size = int $self->{window_size}; $window_size = int ($window_size - 1) / 2 if $window_size > 1; foreach my $len (0 .. (length($sequence)-1)) { my %carbon_aa; my ($left, $middle, $right) = splitStr ($sequence, $len); $carbon_aa{$middle} = $aa_index->{$middle} if $middle; $left = reverse $left; my $initial_index = 2; my $length_index= $window_size ? $window_size : length $left; if($self->{window_size} == 1 ) { push @carbon_index, \%carbon_aa if %carbon_aa; next; } foreach my $increment (0 .. $length_index-1) { my $left_aa=substr($left,$increment,1); next if not exists $aa_index->{$left_aa}; my $distance_index= $initial_index ** $distance; $carbon_aa{$middle} += $aa_index->{$left_aa} * (1/$distanc +e_index); $initial_index++; } $initial_index=2; $length_index= $window_size ? $window_size : length $right; foreach my $increment (0 .. $length_index-1) { my $right_aa=substr ($right, $increment, 1); next if not exists $aa_index->{$right_aa}; my $distance_index= $initial_index ** $distance; $carbon_aa{$middle} += $aa_index->{$right_aa} * (1 / $dist +ance_index); $initial_index++; } push @carbon_index, \%carbon_aa if %carbon_aa; } return \@carbon_index; } sub splitStr { return ( substr($_[0], 0, $_[1]), substr($_[0], $_[1], 1), substr($_[0], $_[1] + 1) ); }

        DWIM is Perl's answer to Gödel
Re: Equation - code review
by BrowserUk (Patriarch) on Aug 16, 2006 at 08:06 UTC

    The simplest way to speed this up would be to abandon the use of Math::BigInt. You use this in two places to raise numbers to a power:

    $power=Math::BigInt->new($initial_index); $distance_index=($power->bpow($distance))->numify();

    but you then immediately retrieve the result as a standard scalar without performing any check for overflow.

    Assuming the value '4' is typical scale of magnitude for $distance aka $self{ distance } aka $distance_index,

    then using Perl's built-in exponentiation operator,  $power ** $distance

    cmpthese -3, { BigInt => q[ my $x = Math::BigInt->new( $_ )->bpow( 4 )->numify() for 2 .. +1e6; ], standard => q[ my $x = $_ ** 4 for 2 .. 1e6; ], };; (warning: too few iterations for a reliable count) s/iter BigInt standard BigInt 172 -- -100% standard 0.625 27472% --

    will be at least 250 times quicker.

    For sequences of length 10,000, standard math will safely accomodate distances of up to 78.

    • 1e4 => 78
    • 1e5 => 61
    • 1e6 => 51
    • 1e7 => 44

    If you really need to use Math::BigInt for this calculation, then you should be doing the 1 / $distance_index part of your calculation before you numify the result of the previous step.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hi,
      Thanks for your reply.
      literally i don't know how to handle the built-ins exponent function.I can able to find only  Math::BigInt package. However I can't able to overright the object , instead of creating new objects every time in that module.This makes me very bad situation if my length of string id very high.
      . I have tested my code with in-built function and it is working fine.
      -kulls
Re: Equation - code review
by imp (Priest) on Aug 16, 2006 at 03:21 UTC
    You need to be a little more specific.

    Do you want to know whether it is an efficient algorithm? Whether it works correctly? In both of these cases we need a definition of what the correct behaviour is.

    I can say the following things about the code you provided:

    • It's perl
    • It could use inline comments or refactoring to make the behaviour more intuitive
    • It could use a block of comments before alpha_index
    • More whitespace would make it easier to read
      Hi,
      Thanks for your reply.
      The output of the program is exactly correct which I have checked against the manual work. The performance issue can arise when my given input string is more than 1000 characters.
      Can you please suggest me, how to improve the performance inside the loops ?
      - kulls
        You have been very polite and my inclination is to help you. Please make it easier to do so.

        You have provided us with your current solution to a problem, but you haven't specified what that problem is. It is difficult to identify what areas might be unneccesary or wasteful in this situation.

        Some inline documentation would be useful, as would some additional comments in this thread. Here are a few things that would be nice to know:

        • Why is $distance_index and $window random
        • More sample data. The $sequence provided is 7 characters, but in this comment you state that it can be over 1000 characters in length. The size of the input data will help determine which algorithm is appropriate.
        • How often will this function be called
        Before you start optimizing your code I recommend writing a test suite that verifies the current behaviour. That way when you introduce errors during optimization they will be easier to spot. You have done the manual verification, put that in a test script.

        For example you could do this (and leave alpha_index as it is):

        #!c:\perl\bin\perl use strict; use warnings; use Math::BigInt; use Data::Dumper; use Test::More qw(no_plan); my %index_val = ( A => 2, C => 3, D => 4, N => 2, M => 2, T => 1, G => 4); my $sequence = "ACTGACN"; my $distance_index = 3; my $window = 2; my $arr_ref = alpha_index({ sequence => $sequence, distance => $distance_index, aa_index => \%index_val, window_size => $window }); ok($arr_ref); diag("Testing alpha_index with window: $window, distance: $distance_in +dex, sequence: $sequence"); is(ref($arr_ref),'ARRAY','alpha index returned an array'); is(@$arr_ref,7,'7 entries are in the array'); is($arr_ref->[0]{A},2.51025682971601);
        Take a look at Test::More for other ideas on what you can test.
Re: Equation - code review
by starbolin (Hermit) on Aug 16, 2006 at 05:12 UTC

    Um... every time I run the program I get a different output. Is there some reason for the random number generator??

    As far as coding style, there are no validity checks. If I feed garbage into the subroutine, it quietly gives me garbage out.

    my %index_val = ( F => -1.8, O => 3, O => 4, B => 'quxx', A => 2, R => 1, P=>'foo');
    Gives:
    $VAR1 = [ { 'A' => '2.25' }, { 'C' => '0.5' }, { 'T' => '0.5' }, { 'G' => '0.5' }, { 'A' => '2.25' }, { 'C' => '0.324074074074074' }, { 'N' => '0.324074074074074' } ];

    Definitely "Not ready for Prime-Time."

    Update: I don't mind if a program fails on garbage input, but it should fail noisily.


    s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}
Re: Equation - code review
by ikegami (Patriarch) on Aug 16, 2006 at 05:35 UTC
    You have a space before the "#!", but nothing can preceed the "#!" for it to keep its meaning.
Re: Equation - code review
by cdarke (Prior) on Aug 16, 2006 at 06:55 UTC
    A small thing, but you do not seem to use the loop variable ($_) value in your foreach loops. Therefore it might be better to write the sequence starting from 1 and remove the subtraction, for example:
    foreach (0.. (length($sequence)-1))
    becomes:
    foreach (1.. (length $sequence))
      It's used here actually:
      my ($left,$middle,$right) = $sequence =~ m/(\w{$_})?(\w{1})?(\w*)/; ^
      The result for 0..6 would be:
      A CTGACN A C TGACN AC T GACN ACT G ACN ACTG A CN ACTGA C N ACTGAC N