Equation - code review

kulls has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Equation - code review by GrandFather (Saint) on Aug 16, 2006 at 03:10 UTC
Does it work against your test suite? Update Why `&``alpha_index({...`? Why `my $self=...` when alpha_index doesn't seem to be a method? Why `my ($left,$middle,$right` when they are used locally to the foreach (@carbon_index is correctly declared) Why the assignment in `my %carbon_aa=();` when it is empty to start with anyway? Why `my ($left_aa,$right_aa,$distance_index,$power)=0;` when only $left_aa gets a value? Why are `$left_aa,$right_aa,$distance_index,$power` not declared local to their foreach loops? DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re^2: Equation - code review by kulls (Hermit) on Aug 16, 2006 at 04:29 UTC
Hi, Thanks for your valuable comments. For your first two question, >> I just copied the method from my module and created a dummy arguments around it. I can't paste the entire module, since dependency is more and the security. I apologize for that. Here I have Updated my code based on your comments, please review it #!c:\perl\bin\perl use strict; use warnings; use Math::BigInt; use Data::Dumper; my %index_val = ( A=> 2, C => 3, D => 4, N => 2, M => 2, T => 1, G => 4); my $sequence="ACTGACN"; my $distance_index= int(rand(4)); my $window=int(rand(3)); my $arr_ref = alpha_index({ sequence => $sequence, distance => $distance_index, aa_index => \%index_val, window_size => $window }); print STDERR Dumper($arr_ref); sub alpha_index { my $self=shift; my $sequence=$self->{sequence}; my $distance=$self->{distance}; my $aa_index=$self->{aa_index}; my @carbon_index; my $window_size= $self->{window_size}; if($self->{window_size} ==1) { $window_size=1;} elsif($self->{window_size} > 1 ) { $window_size = int(($self->{window_size}-1)/2); } else { $window_size=0; } foreach (0.. (length($sequence)-1)) { my %carbon_aa; my ($left,$middle,$right) = $sequence =~ m/(\w{$_})?(\w{1} +)?(\w)/; $carbon_aa{$middle}=$aa_index->{$middle} if($middle); $left=reverse($left); my $initial_index=2; my $length_index= ($window_size) ? $window_size:length($left); if($self->{window_size} ==1 ) { push @carbon_index, \%carbon_aa if(%carbon_aa); next; } foreach my $increment (0..$length_index-1) { my $left_aa=substr($left,$increment,1); next if not exists $aa_index->{$left_aa}; my $power=Math::BigInt->new($initial_index); my $distance_index=($power->bpow($distance))->numify() +; $carbon_aa{$middle}+= ( $aa_index->{$left_aa} (1/$distance_index)); $initial_index++; } $initial_index=2; $length_index= ($window_size)?$window_size:length($right); foreach my $increment (0..$length_index-1) { my $right_aa=substr($right,$increment,1); next if not exists $aa_index->{$right_aa}; my $power=Math::BigInt->new($initial_index); my $distance_index=($power->bpow($distance))->numify() +; $carbon_aa{$middle}+= ( $aa_index->{$right_aa} * ( 1/$distance_index)); $initial_index++; } push @carbon_index, \%carbon_aa if(%carbon_aa); } return \@carbon_index; } [download] Thanks in advance. - kulls	[reply] [d/l]
Re^3: Equation - code review by imp (Priest) on Aug 16, 2006 at 08:53 UTC
Before you start looking for areas that you can optimize you should look for areas that should be optimized. You can do this using Benchmark and splitting the function into working pieces, and then testing those pieces. This line strikes me as possibly inefficient because it creates a new regex for each iteration, which means 1000 regex for a 1000 character sequence. `($left,$middle,$right) = $sequence =~ m/(\w{$_})?(\w{1})?(\w*)/;` [download] So I wrote a benchmark to try a few strategies: Read more... Benchmark hidden behind readmore (1301 Bytes) Results for running 500 iterations with length 7: Rate re_orig re_set bsubstr re_orig 681/s -- -12% -92% re_set 774/s 14% -- -91% bsubstr 8475/s 1144% 995% -- Results for running 500 iterations with length 100: Rate re_orig re_set bsubstr re_orig 48.1/s -- -12% -93% re_set 54.9/s 14% -- -92% bsubstr 718/s 1392% 1209% -- Results for running 100 iterations with length 1000: Rate re_set bsubstr re_set 4.00/s -- -93% bsubstr 59.5/s 1387% --	[reply] [d/l] [select]
Re^3: Equation - code review by GrandFather (Saint) on Aug 16, 2006 at 11:13 UTC
With a little tidying (to my eye anyway) and applying suggestions from imp and BrowserUK the following code gives the same answers as the original for the small range of values I've tested and ought to be somewhat faster. #!c:\perl\bin\perl use strict; use warnings; my %index_val = ( A=> 2, C => 3, D => 4, N => 2, M => 2, T => 1, G => 4); my $sequence="ACTGACN"; my $distance_index = 2; my $window = 2; my $arr_ref = &alpha_index({ sequence => $sequence, distance => $distance_index, aa_index => \%index_val, window_size => $window }); for my $hashRef (@$arr_ref) { print join "\n", map {"$_ => $hashRef->{$_}"} keys %$hashRef; print "\n"; } sub alpha_index { my $self=shift; my $sequence=$self->{sequence}; my $distance=$self->{distance}; my $aa_index=$self->{aa_index}; my @carbon_index; my $window_size = int $self->{window_size}; $window_size = int ($window_size - 1) / 2 if $window_size > 1; foreach my $len (0 .. (length($sequence)-1)) { my %carbon_aa; my ($left, $middle, $right) = splitStr ($sequence, $len); $carbon_aa{$middle} = $aa_index->{$middle} if $middle; $left = reverse $left; my $initial_index = 2; my $length_index= $window_size ? $window_size : length $left; if($self->{window_size} == 1 ) { push @carbon_index, \%carbon_aa if %carbon_aa; next; } foreach my $increment (0 .. $length_index-1) { my $left_aa=substr($left,$increment,1); next if not exists $aa_index->{$left_aa}; my $distance_index= $initial_index ** $distance; $carbon_aa{$middle} += $aa_index->{$left_aa} * (1/$distanc +e_index); $initial_index++; } $initial_index=2; $length_index= $window_size ? $window_size : length $right; foreach my $increment (0 .. $length_index-1) { my $right_aa=substr ($right, $increment, 1); next if not exists $aa_index->{$right_aa}; my $distance_index= $initial_index ** $distance; $carbon_aa{$middle} += $aa_index->{$right_aa} * (1 / $dist +ance_index); $initial_index++; } push @carbon_index, \%carbon_aa if %carbon_aa; } return \@carbon_index; } sub splitStr { return ( substr($_[0], 0, $_[1]), substr($_[0], $_[1], 1), substr($_[0], $_[1] + 1) ); } [download] DWIM is Perl's answer to Gödel	[reply] [d/l]
Re: Equation - code review by BrowserUk (Patriarch) on Aug 16, 2006 at 08:06 UTC
The simplest way to speed this up would be to abandon the use of Math::BigInt. You use this in two places to raise numbers to a power: `$power=Math::BigInt->new($initial_index); $distance_index=($power->bpow($distance))->numify();` [download] but you then immediately retrieve the result as a standard scalar without performing any check for overflow. Assuming the value '4' is typical scale of magnitude for `$distance` aka `$self{ distance }` aka `$distance_index`, then using Perl's built-in exponentiation operator, `$power $distance` `cmpthese -3, { BigInt => q[ my $x = Math::BigInt->new( $_ )->bpow( 4 )->numify() for 2 .. +1e6; ], standard => q[ my $x = $_ 4 for 2 .. 1e6; ], };; (warning: too few iterations for a reliable count) s/iter BigInt standard BigInt 172 -- -100% standard 0.625 27472% --` [download] will be at least 250 times quicker. For sequences of length 10,000, standard math will safely accomodate distances of up to 78. 1e4 => 78 1e5 => 61 1e6 => 51 1e7 => 44 If you really need to use Math::BigInt for this calculation, then you should be doing the `1 / $distance_index` part of your calculation before you `numify` the result of the previous step. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^2: Equation - code review by kulls (Hermit) on Aug 16, 2006 at 13:04 UTC
Hi, Thanks for your reply. literally i don't know how to handle the built-ins exponent function.I can able to find only `Math::BigInt` package. However I can't able to overright the object , instead of creating new objects every time in that module.This makes me very bad situation if my length of string id very high. . I have tested my code with in-built function and it is working fine. -kulls	[reply] [d/l]
Re: Equation - code review by imp (Priest) on Aug 16, 2006 at 03:21 UTC
You need to be a little more specific. Do you want to know whether it is an efficient algorithm? Whether it works correctly? In both of these cases we need a definition of what the correct behaviour is. I can say the following things about the code you provided: It's perl It could use inline comments or refactoring to make the behaviour more intuitive It could use a block of comments before alpha_index More whitespace would make it easier to read	[reply]
Re^2: Equation - code review by kulls (Hermit) on Aug 16, 2006 at 05:57 UTC
Hi, Thanks for your reply. The output of the program is exactly correct which I have checked against the manual work. The performance issue can arise when my given input string is more than 1000 characters. Can you please suggest me, how to improve the performance inside the loops ? - kulls	[reply]
Re^3: Equation - code review by imp (Priest) on Aug 16, 2006 at 07:43 UTC
You have been very polite and my inclination is to help you. Please make it easier to do so. You have provided us with your current solution to a problem, but you haven't specified what that problem is. It is difficult to identify what areas might be unneccesary or wasteful in this situation. Some inline documentation would be useful, as would some additional comments in this thread. Here are a few things that would be nice to know: Why is $distance_index and $window random More sample data. The $sequence provided is 7 characters, but in this comment you state that it can be over 1000 characters in length. The size of the input data will help determine which algorithm is appropriate. How often will this function be called	[reply]
Re^4: Equation - code review by kulls (Hermit) on Aug 16, 2006 at 08:40 UTC
Re^3: Equation - code review by imp (Priest) on Aug 16, 2006 at 08:00 UTC
Before you start optimizing your code I recommend writing a test suite that verifies the current behaviour. That way when you introduce errors during optimization they will be easier to spot. You have done the manual verification, put that in a test script. For example you could do this (and leave alpha_index as it is): #!c:\perl\bin\perl use strict; use warnings; use Math::BigInt; use Data::Dumper; use Test::More qw(no_plan); my %index_val = ( A => 2, C => 3, D => 4, N => 2, M => 2, T => 1, G => 4); my $sequence = "ACTGACN"; my $distance_index = 3; my $window = 2; my $arr_ref = alpha_index({ sequence => $sequence, distance => $distance_index, aa_index => \%index_val, window_size => $window }); ok($arr_ref); diag("Testing alpha_index with window: $window, distance: $distance_in +dex, sequence: $sequence"); is(ref($arr_ref),'ARRAY','alpha index returned an array'); is(@$arr_ref,7,'7 entries are in the array'); is($arr_ref->[0]{A},2.51025682971601); [download] Take a look at Test::More for other ideas on what you can test.	[reply] [d/l]
Re: Equation - code review by starbolin (Hermit) on Aug 16, 2006 at 05:12 UTC
Um... every time I run the program I get a different output. Is there some reason for the random number generator?? As far as coding style, there are no validity checks. If I feed garbage into the subroutine, it quietly gives me garbage out. `my %index_val = ( F => -1.8, O => 3, O => 4, B => 'quxx', A => 2, R => 1, P=>'foo');` [download] Gives: `$VAR1 = [ { 'A' => '2.25' }, { 'C' => '0.5' }, { 'T' => '0.5' }, { 'G' => '0.5' }, { 'A' => '2.25' }, { 'C' => '0.324074074074074' }, { 'N' => '0.324074074074074' } ];` [download] Definitely "Not ready for Prime-Time." Update: I don't mind if a program fails on garbage input, but it should fail noisily. s//----->\t/;$~="JAPH";s//\r<$~~/;{s\|~$~-\|-~$~\|\|\|s \|-$~~\|$~~-\|\|\|s,<$~~,<~$~,,s,~$~>,$~~>,, $\|=1,select$,,$,,$,,1e-1;print;redo}	[reply] [d/l] [select]
Re: Equation - code review by ikegami (Patriarch) on Aug 16, 2006 at 05:35 UTC
You have a space before the "#!", but nothing can preceed the "#!" for it to keep its meaning.	[reply]
Re: Equation - code review by cdarke (Prior) on Aug 16, 2006 at 06:55 UTC
A small thing, but you do not seem to use the loop variable ($_) value in your foreach loops. Therefore it might be better to write the sequence starting from 1 and remove the subtraction, for example: `foreach (0.. (length($sequence)-1))` becomes: `foreach (1.. (length $sequence))`	[reply] [d/l] [select]
Re^2: Equation - code review by imp (Priest) on Aug 16, 2006 at 07:49 UTC
It's used here actually: `my ($left,$middle,$right) = $sequence =~ m/(\w{$_})?(\w{1})?(\w*)/; ^` [download] The result for 0..6 would be: `A CTGACN A C TGACN AC T GACN ACT G ACN ACTG A CN ACTGA C N ACTGAC N` [download]	[reply] [d/l] [select]