in reply to Re^2: Refactoring a large script
in thread Refactoring a large script

Ah. Then here are some more potentially useful links for you:

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re^4: Refactoring a large script
by mdunnbass (Monk) on Jan 19, 2007 at 18:27 UTC
    Thanks for the links!

    I've finished debugging the vast majority of the script, and I finally have it to the point where it will run to completion without getting stuck in a loop or encountering an unrecoverable error of some kind.

    So, I just ran it under -d:DProf, and I am pleased with the output:

    Total Elapsed Time = 1135.701 Seconds
      User+System Time = 770.4718 Seconds
    Exclusive Times
    %Time ExclSec CumulS #Calls sec/call Csec/c Name
     63.8   491.8  491.86      1  491.86  491.86  main::WEED
     32.5   251.1  251.10  22012  0.0114  0.0114  main::SEARCHFASTA
     2.10   16.20  16.209    164  0.0988  0.0988  main::GET_TEXT
     0.71   5.460   5.460      1  5.4600  5.4600  main::INDEX_FASTA
     0.40   3.089   3.089    164  0.0188  0.0188  main::ADD_SPAN
     0.15   1.140   1.140      1  1.1400  1.1400  main::WEEDNUM
     0.08   0.599  770.48      2  0.2994  385.24  main::MAIN
     0.05   0.380   0.380      1  0.3800  0.3800  main::OVERLAP
     0.05   0.350   0.350      1  0.3500  0.3500  main::CLUSTER
     0.02   0.130   0.130      2  0.0650  0.0650  main::GET_ENDS
     0.01   0.048   3.176      1  0.0482  3.1764  main::HTML_FORMAT
     0.01   0.040   0.040      1  0.0400  0.0400  main::TABLE_IT
     0.01   0.040   0.420      1  0.0400  0.4200  main::SORT_HITS
     0.00   0.020   0.020      2  0.0100  0.0100  main::WEED_HEADERS
     0.00   0.010   0.010      1  0.0100  0.0100  warnings::BEGIN
    Obviously, a 20 minute run-time is far longer than I would have liked. But 96% of that run-time was in 2 subs. The SEARCHFASTA sub searches (in this run) a 0.5 GB file of DNA sequences, looking for every occurrence of multiple different strings and saving the matches to a large hash. So, in that case, the time is going to depend on the size of the file being searched, and will be large no matter what I do.
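    Purely for illustration (this is not the actual SEARCHFASTA code, just the general shape of such a search), something like the following shows why the cost grows with both the file size and the number of search strings:

        # Illustrative sketch only; the patterns and sequence are made up.
        use strict;
        use warnings;

        my %matches;
        my @patterns = qw( GAATTC GGATCC AAGCTT );         # hypothetical search strings
        my $sequence = 'ACGGAATTCTTAAGCTTGGATCCGAATTC';    # stands in for the 0.5 GB of sequence

        for my $pat ( @patterns ) {
            my $pos = 0;
            while( ( $pos = index( $sequence, $pat, $pos ) ) >= 0 ) {
                push @{ $matches{ $pat } }, $pos;          # record every occurrence
                $pos++;                                    # keep scanning past this hit
            }
        }

    Every pattern forces another pass over the whole sequence, so the time is roughly proportional to the file size times the number of search strings, however the loop itself is written.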

    The WEED sub, which is taking 65% of the time, is the one I have to work on, then. Its purpose is to organize the results of the SEARCHFASTA sub based on certain criteria, and that is taking forever. So, I'll have to see what I can do to tighten it up. I may post it to the Seekers of Perl Wisdom section in a bit.

    Thanks to everyone for their help and advice. I am taking much of it to heart, and will be keeping a lot of your tips in mind from now on.

    Thanks
    Matt

      Here's my attempt at refactoring your WEED() sub. As I don't have data, this has been done blind, i.e. without the tests you would normally use to ensure that it continues to function correctly and that changes aimed at improving performance actually achieve that goal. So I'll step through the changes I've made, so that you can apply and test them as you go. It should also give you some ideas for refactoring the other subs that need it.

      The first thing I did was some trivial reformatting of the code to make the indentation consistent, add some horizontal whitespace and reduce the vertical whitespace. These changes allow me to get a clearer overview of the structure of the sub and (I find) make it easier to read. Along the way, I've also removed the comments, as I find them a distraction. Obviously, you don't need to make these changes.

      One anomaly that jumped out at me as I manually reformatted the code I got from your scratchpad was this line:

      WEEDLOW: for my $low ( @{ $matchesfasta_id{ $site } } ) {

      As there is no other reference to a variable named $matchesfasta_id in the file, and there is this line a couple of lines above it:

      last unless @{$matches{$fasta_id}{$site}};

      I've assumed that this is a typo/transcription error and should read:

      WEEDLOW: for my $low ( @{ $matches{ $fasta_id }{ $site } } ) {

      I've posted my starting point after the reformatting below:

      The first things I did were simple clean ups of various things in the code. Starting from the top:

      1.    my( $site, $fasta, $sitekey) = '';

        The first and last of these declared variables are used as loop iterators, so it is much better to declare them in-line with the loops.

        The middle one is not used and can be done away with.

      2.  my( $setscounter, $lowerlimit, $upperlimit, $i, $set, $yes ) = 0;

        The first three of these are used only within the loops and are never referred to outside of the scope at which they first appear, so they again can be declared inline with their first usage.

        The last three are not used anywhere in the sub and can be dropped.

      3.  my %sets = ();

        Lexical variables are initialised when they come into being, so the initialisation is unnecessary and simply clutters the code. A short sketch pulling these three clean-ups together follows below.
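      Taken together, a minimal sketch of those three clean-ups might look like this (the loop bounds are made up purely for illustration):

        my %sets;                                   # no '= ()' needed; lexicals start empty

        for my $site ( 1 .. 3 ) {                   # iterator declared in the loop itself
            my $lowerlimit = $site * 10;            # declared at first use, in the scope that needs it
            my $upperlimit = $lowerlimit + 5;
            for my $sitekey ( 'a' .. 'c' ) {        # likewise declared in-line
                $sets{ $site }{ $sitekey } = [ $lowerlimit, $upperlimit ];
            }
        }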

      Now I started looking at the function of the code and the first thing I noticed is that you sort the keys of %matches in the for loop at A below, and again each time around the nested inner loop at B.

      for my $site ( sort { $a <=> $b } keys %{ $matches{ $fasta_id } } ) {    ## A
          last unless @{ $matches{ $fasta_id }{ $site } };
          WEEDLOW: for my $low ( @{ $matches{ $fasta_id }{ $site } } ) {
              my $lowerlimit = $low + 0;
              my $upperlimit = $span + $lowerlimit;
              for my $sitekey ( sort { $a <=> $b } keys %{ $matches{ $fasta_id } } ) {    ## B

      Since you are not modifying %matches between these two points (or at all), it means that you are sorting and re-sorting these keys many times. It's not easy to tell in the absence of test data how big this hash is, or how many times these loops iterate, but given the sizes of the data files mentioned in the OP, and the timing information you've supplied above, it appears that they may be substantial. I could well see this needless re-sorting being the source of a substantial amount of the runtime of the program.

      Eliminating this duplication is easy and should produce a substantial performance benefit. Replacing the above lines with these should do the trick:

      my @sortedKeys = sort { $a <=> $b } keys %{ $matches{ $fasta_id } };
      for my $site ( @sortedKeys ) {
          last unless @{ $matches{ $fasta_id }{ $site } };
          WEEDLOW: for my $low ( @{ $matches{ $fasta_id }{ $site } } ) {
              my $lowerlimit = $low + 0;
              my $upperlimit = $span + $lowerlimit;
              for my $sitekey ( @sortedKeys ) {

      The next thing I looked at was the section of code where you 'band pass' filter the matches into a temporary array. You then conditionally copy that temporary array into your results array, or discard the entire super set if nothing made it through the filter:

      for my $hit ( @{ $matches{ $fasta_id }{ $sitekey } } ) {
          next unless $hit >= $lowerlimit;
          last unless $hit <= $upperlimit;
          push @arrayA, $hit + 0;
      }
      if( @arrayA ) {
          @{ $sets{ $fasta_id }[ $setscounter ]{ $sitekey } } = @arrayA;
      }
      else {
          %{ $sets{ $fasta_id }[ $setscounter ] } = ();
      }

      Whilst there is nothing inherently wrong with this, Perl has a general-purpose array filtering mechanism, grep, designed specifically for this purpose, and it is usually considerably quicker than an explicit loop:

      @{ $sets{ $fasta_id }[ $setscounter ]{ $sitekey } }
          = grep { $_ >= $lowerlimit and $_ <= $upperlimit }
            @{ $matches{ $fasta_id }{ $sitekey } };

      unless( @{ $sets{ $fasta_id }[ $setscounter ]{ $sitekey } } ) {
          undef $sets{ $fasta_id }[ $setscounter ];
          next WEEDLOW;
      }

      There is a possible caveat with this change! Your explicit loop does allow you to short-circuit at the upper limit, which isn't possible with grep. However, as shown above, using grep does allow you to avoid the creation and later copying of the temporary array. This, combined with grep's inherently better performance, may outweigh the benefit of that short-circuiting. Or it may not.

      The only way to tell will be to make the change and benchmark. If the size of the array being filtered is sufficiently large, and the pass band sufficiently narrow and low down in the range that the short circuit is beneficial, then it would be better to use a binary search algorithm to discover the low and high range limits and then copy the values across to the results set using an array slice.

      I haven't coded that as I have no way to make the test.
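      For completeness, an untested sketch of that binary-search idea might look like the following (the sub and data names are purely illustrative, and it assumes the hit list holds integer positions already sorted ascending, as the early-exit loop above implies):

        use strict;
        use warnings;

        sub lower_bound {                           # index of first element >= $target
            my( $aref, $target ) = @_;
            my( $lo, $hi ) = ( 0, scalar @$aref );
            while( $lo < $hi ) {
                my $mid = int( ( $lo + $hi ) / 2 );
                if( $aref->[ $mid ] < $target ) { $lo = $mid + 1 }
                else                            { $hi = $mid }
            }
            return $lo;
        }

        my @hits = ( 3, 8, 15, 22, 27, 40, 41, 55 );            # assumed sorted ascending
        my( $lowerlimit, $upperlimit ) = ( 10, 41 );

        my $first = lower_bound( \@hits, $lowerlimit );         # first hit >= lower limit
        my $last  = lower_bound( \@hits, $upperlimit + 1 ) - 1; # last hit  <= upper limit
        my @inRange = $first <= $last ? @hits[ $first .. $last ] : ();
        print "@inRange\n";                                     # 15 22 27 40 41, copied with one slice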

      Finally (for now), you use this construct:

      %{ $sets{ $fasta_id }[ $setscounter ] } = ();

      To empty these hashes in several places. Whilst there is nothing wrong with this, I think it is clearer, and may be slightly quicker, to use

      undef $sets{ $fasta_id }[ $setscounter ];

      and I've made those changes also.

      The final code is the second of the two code blocks below.

      There are various other things that look suspect.

      • You are obviously converting this code to use strict and get away from using global variables, but there are still several of these in this subroutine:
        • @fastarray which logic suggests you could pass a reference to and assign to directly, rather than having to do wholesale copying at the end.
          # @fastarray = ();
          @fastarray = @newfast;

          BTW: There is no point in emptying the array before copying over it. The first line above is redundant.

        • The mysterious $h which is the only thing ever assigned to @newfast/@fastarray
          push @newfast, $h;
        • The enigmatic $num
          if( scalar keys %{ $sets{ $fasta_id }[ $setscounter ] } < $num ) {

          This should be passed in as a parameter, and should certainly have a better name.

        • Passing hashes in and out of your sub is costing you a substantial amount of memory reallocation (and therefore time).

          This is probably a holdover from the pervasive use of globals in its previous incarnation, and avoiding these globals makes good sense. However, passing entire hashes is expensive if they are large (as these appear to be).

          It would be more efficient to pass hash references in and out, but there is a downside to that in the additional complexity it creates in the referencing of these complex structures. There is also a small performance hit as a result.

          There is a solution to this which involves using a localised glob to alias the hash references, and so gain the benefit of avoiding the copying whilst retaining the simpler referencing of the structures. However, for reasons that completely bewilder me, this use of globs is generally seen as a 'bad thing', so I won't recommend it here (a sketch of the idea, for reference, follows this list).

          There are also various cpan modules that can provide a similar aliasing facility. If you can get one of those to build and install on your system, something which I have failed to achieve on mine, then you could investigate using one of them.

        • There is a lot of logic in the following lines, particularly to do with the variable $setscounter which I do not fully understand but that looks like it ought to be possible to simplify:
          for my $checkhash ( keys %{ $sets{ $fasta_id }[ $setscounter ] } ) {
              unless( defined $sets{ $fasta_id }[ $setscounter ]{ $checkhash }[ 0 ] ) {
                  undef $sets{ $fasta_id }[ $setscounter ];
              }
              last;
          }
      }
      if( scalar keys %{ $sets{ $fasta_id }[ $setscounter ] } < $num ) {
          undef $sets{ $fasta_id }[ $setscounter ];
      }
      $setscounter++ if scalar %{ $sets{ $fasta_id }[ $setscounter ] };
    }
  }
  if( @{ $sets{ $fasta_id } } ) {
      pop @{ $sets{ $fasta_id } }
          unless scalar %{ $sets{ $fasta_id }[ $#{ $sets{ $fasta_id } } ] };
  }

          If the changes I've made so far haven't broken anything and achieve the greater than 50% speedup of this sub that I estimate, then I would look carefully at these lines to see if what they are doing can be improved.
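        As promised above, a bare-bones sketch of the glob-aliasing idea from the 'passing hashes' point (the sub and variable names are invented for the example, not taken from the real script):

          use strict;
          use warnings;

          our %matches;                          # package variable backing the glob

          sub weed_demo {
              my( $matchesRef ) = @_;
              local *matches = $matchesRef;      # %matches now aliases the caller's hash: no copy made
              for my $site ( sort { $a <=> $b } keys %matches ) {
                  print "$site => @{ $matches{ $site } }\n";
              }
          }

          my %data = ( 10 => [ 1, 2 ], 20 => [ 3 ] );
          weed_demo( \%data );                   # the sub reads %data in place via the alias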

        HTH

        My starting point after the trivial reformatting I mentioned at the top:

        And the refactored code as far as I've taken it:


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        Wow. Thank you BrowserUK, you made my week!

        Re: $matchesfasta vs. $matches{$fasta}

        That was a copy-and-paste problem on my part, as was the 'enigmatic' $h. I was originally using $h as the variable that became $fasta_id, but I replaced it because I realized no one else would have a clue what it was for. So, that's two mysteries resolved.

        As for the sorting issue you brought up next - I had no idea that all that repeated sorting was so expensive. Thanks for clearing that up for me.

        It's not easy to tell in the absence of test data how big this hash is, or how many times these loops iterate, but given the sizes of the data files mentioned in the OP, and the timing information you've supplied above, it appears that they may be substantial.

        Yeah, these hashes get big. The input into this sub is a HoAoA. In a typical run of the program, that HoAoA has about 22,000 key/value pairs; each upper-level array has only around 3-5 elements, but the lower-level arrays have several hundred elements each. So, it's safe to say the hash is rather large.

        The next thing I looked at was the section of code where you 'band pass' filter the matches into a temporary array.
        Thanks for the grep idea! I hadn't even thought of that. As for the caveat that it may not beat the early loop exit, I'll try them both and benchmark it, as you suggested. That said, I think many of your other changes may help enough that it won't be necessary, but we'll see.
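        A quick way to run that comparison is the core Benchmark module's cmpthese; a rough sketch, with made-up data standing in for the real hit lists, might look like this:

          use strict;
          use warnings;
          use Benchmark qw( cmpthese );

          my @hits = sort { $a <=> $b } map { int rand 10_000 } 1 .. 5_000;   # fake hit positions
          my( $lowerlimit, $upperlimit ) = ( 2_000, 2_500 );

          cmpthese( -5, {                        # run each variant for at least 5 CPU seconds
              loop => sub {
                  my @out;
                  for my $hit ( @hits ) {
                      next unless $hit >= $lowerlimit;
                      last unless $hit <= $upperlimit;
                      push @out, $hit;
                  }
              },
              grep => sub {
                  my @out = grep { $_ >= $lowerlimit and $_ <= $upperlimit } @hits;
              },
          } );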

        Finally (for now), you use this construct: %{ $sets{ $fasta_id }[ $setscounter ] } = (); to empty these hashes in several places. Whilst there is nothing wrong with this, I think it is clearer, and may be slightly quicker, to use undef $sets{ $fasta_id }[ $setscounter ];

        Actually, I was deliberately trying to avoid the undef specifically. The reason being, I am still slightly confused about how to remove array elements that are undef. When I was originally using the undef line you suggested, the Data::Dumper output showed that some arrays consisted only of undef, a significant number of others had undef as their last element, etc. So, the roundabout way I settled on was the best I could jerry-rig with my minimal understanding of exists/undef at the time. If you (or anyone else) can think of an efficient way to excise any undef elements from an array, and additionally remove elements that are themselves arrays containing only a single undef, I would love it.
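        For what it's worth, the usual idioms for that kind of clean-up look something like the following (the array name and sample data are only illustrative):

          my @array = ( [ 1, 2 ], undef, [ undef ], [ 3 ] );        # sample data

          my @clean = grep { defined } @array;                      # drop plain undef elements

          @clean = grep {                                           # drop refs to arrays holding only a single undef
              !( ref $_ eq 'ARRAY' && @$_ == 1 && !defined $_->[0] )
          } @clean;
          # @clean is now ( [ 1, 2 ], [ 3 ] )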

        Finally, the global variables that looked suspect. I mentioned $h earlier. $num is a user-defined number that winds up being the number of elements in the upper-level arrays of the HoAoA; in the output of the sub, an HoAoHoA, it is the number of key/value pairs in the subhash. The @fastarray array is essentially identical to the @sortedKeys array you created for me. The reason I kept each of those two variables global was that essentially every one of my 30-ish subs uses them as-is, so I felt it would be better to keep them global rather than pass and dereference them in every sub. Now, considering that @fastarray starts off with around 22,000 elements and eventually only gets reduced to about 2,000 elements, I don't know which would be faster: passing it constantly, or keeping it global. Thoughts?

        Anyway, this response is getting rambling, as I wrote it in chunks over the course of a day, so I'll wrap it up here. Just let me say again, Thank you so much for your help!

        Matt

        P.S. - I am replacing the huge block of code on my scratchpad with a heavily commented version of this particular sub, though without the changes you have mentioned here. I'll be trying those out tomorrow.

        OK, so here's an interesting tidbit. I tried out your modifications to my WEED() sub, and it wound up doubling the time it took to run! Weird. Here's the (truncated) DProf output of my original sub:
        Total Elapsed Time = 1135.701 Seconds
          User+System Time = 770.4718 Seconds
        Exclusive Times
        %Time ExclSec CumulS #Calls sec/call Csec/c Name
         63.8   491.8  491.86      1  491.86  491.86  main::WEED
         32.5   251.1  251.10  22012  0.0114  0.0114  main::SEARCHFASTA
         2.10   16.20  16.209    164  0.0988  0.0988  main::GET_TEXT
         0.71   5.460   5.460      1  5.4600  5.4600  main::INDEX_FASTA
         0.40   3.089   3.089    164  0.0188  0.0188  main::ADD_SPAN
         0.15   1.140   1.140      1  1.1400  1.1400  main::WEEDNUM
         (......)

        And here's the output with the modifications you suggested:

        Total Elapsed Time = 1292.196 Seconds
          User+System Time = 1200.096 Seconds
        Exclusive Times
        %Time ExclSec CumulS #Calls sec/call Csec/c Name
         76.9   923.0  923.08      1  923.08  923.08  main::WEED
         20.6   247.6  247.61  22012  0.0112  0.0112  main::SEARCHFASTA
         1.30   15.57  15.579    164  0.0950  0.0950  main::GET_TEXT
         0.60   7.240   7.240      1  7.2400  7.2400  main::INDEX_FASTA
         0.17   2.059   2.059    164  0.0126  0.0126  main::ADD_SPAN
         0.14   1.729   3.848      1  1.7292  3.8483  main::HTML_FORMAT
         (......)

        So, I am guessing that the grep wound up being less efficient than the 'band pass' filter I had set up. Weird. That said, I have not yet tried a combination of our two different methods. I have a feeling most of your comments are going to wind up helping, but that in this case, the nature of the data just lends itself better to the filter approach.

        I hope that I can still shave significant time off this sub, so I'll post it to the "Seekers of Perl Wisdom" forum for additional help.

        Thank you so much though for taking the time to try and help me fix this. I really appreciate it.

        Matt