Debugging help please?

mdunnbass has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I've got a snippet of code below that I'd like some help debugging please, if anyone's up to it.

Here's the skinny. This is a subset of a subroutine. Yes, I am using use strict and use warnings. Every single print statement in the code is put there solely for debugging purposes. Eventually, all of the data will be written to a file, with no need for screen output.

What I'm trying to do

I have gathered a series of endpoints to a large text file (containing a genome split into FASTA format records). The endpoints are stored in a %HoAoA, where the keys of the hash correspond to the ID lines of each FASTA record, and the values are arrays of arrays. Each (first level) array corresponds to a particular sequence I want to extract from the genome file. Each (second level) array contains three elements: [0] is the starting point, [1] is the ending point, and [2] is the length of DNA contained between them, defined as [1] - [0]. At this point in the script, I am trying to refer to the %HoAoA to find out where each sequence is, and then extract that sequence from the genome file, and place it into a corresponding element in an %HoA.

How I am trying to do it

I've pasted the code below. For page-length's sake, it's behind a readmore cut. I have a series of comments in the code that are to help me keep track of brackets more than anything else.

For those that don't know, FASTA format sequences start with an ID line, which typically begins with a '>', followed by text and a newline character. What follows that is sequence data. In this case, I have saved the ID lines as elements in @fastarray in the order the arose in my original searches, this way, I can go back to them in order, rather than scanning the entire file for every FASTA sequence (there are ~30,000 in this file). I scaan through the file, looking for the next line that matches /^>/, and compare that to the desired FASTA ID, if it's the match, I start countign characters until I get to the starting point ofthe desired sequence, and use substr to keep it.

open(DNA,"<$filename") or croak("Cannot open $filename:$!\n");
FASTAKEY:  for my $g (@fastarray){
 print "\$g is: $g\n";
 if (@{$sets{$g}}){
   print "This \$g, $g, has a set.\n";
  $buffer = '';
  my $skip = 0;
  my $i = 0;
  $position = 0;
  $num = (@{$sets{$g}});
  LINE: while ($newline = <DNA> ) {
    
    if ($skip == 1) {  
      unless ($newline =~ /^>/) {
        next LINE;
        }
      if ($newline =~ /^>/) {
        $skip = 0;
        goto NOSKIP;
        }
      }
      
      else {
      # If new fasta header, 
        NOSKIP: if($newline =~ /^>/) { 
          print "....We've hit a newline, which is: $newline";
          if ($buffer eq '') { 
            chomp $newline;
            print "We are in the 'if buffer eq' loop.\n";
            print "..\$g is: $g, and \$newline is $newline\n";
            if ($newline ne $g) {
              $skip = 1;
              next LINE;
              }
            if ($newline eq $g) {
              $header = $newline;
              $g = $header;
              $skip = 0;
              print "Gathering sequence data from sequence $g\n";
              next LINE;
              }
              
            # closes if buffer eq ''
            }
            
          else { # ($buffer ne '')
            FINISH: print "......At FINISH, \$num is: $num, and \$i is
+: $i\n"; 
            if ($i == $num) {
              print "in FINISH, \$i is: $i, and \$num is $num.  They s
+hould be equal\n";
              $position = 0;
              $skip = 1;
              seek(DNA,(0-length($buffer)),1);
              $len = 0;
              $buffer = '';
              next FASTAKEY;
              }    
              else {
                print "in FINISH, \$i is: $i, and \$num is $num.  They
+ should not be equal\n";
                print "\$i is: $i, \$HoA{$g}[$i][0] is: $HoA{$g}[$i][0
+], \$HoA{$g}[$i][1] is: $HoA{$g}[$i][1], \$position is: $position and
+ length(\$buffer) is: ".length($buffer)."\n";
                if ($HoA{$g}[$i][0] >= ($position + length($buffer))) 
+{print "The position of \$lowest is past the end of the buffer!  How 
+did that happen?!\n";}
                if ($HoA{$g}[$i][1] < $position) {print "The position 
+of \$highest is below that of \$position!  How did this happen?!\n";}
                
                if ($HoA{$g}[$i][1] >= ($position + length($buffer))) 
+{
                  $match{$g}[$i] = uc(substr($buffer, ($HoA{$g}[$i][0]
+ - $position)));
                  }
                  else {  
                    $match{$g}[$i] = uc(substr($buffer, ($HoA{$g}[$i][
+0] - $position), $HoA{$g}[$i][2]));  
                    }
                
                $i++; 
                goto FINISH;
                # close uncommented else
                }
                
              #closes else  ($buffer ne ''  )
              }
            # closes if newline    
            }
            
          # closes else statement  
          }

        
    chomp $newline;
    $buffer .= $newline;
    $len = length($newline);
    
    if (($position + $span) < $HoA{$g}[$i][0] && length($buffer) > $sp
+an) {
    # Reset the position counter
    # (will be accurate after you reset the buffer, next)
      $position += $span;
     
    # Reset the buffer
    # Discard the first $max worth of data in the buffer
      $buffer = substr($buffer, $span);
      }
    
    unless (length($buffer) >= $HoA{$g}[$i][2] && $position <= $HoA{$g
+}[$i][0] && (length($buffer) + $position) >= $HoA{$g}[$i][1]) {
      next LINE;
      }
    
    if ($i < $num) {  
      $match{$g}[$i] = uc(substr($buffer, ($HoA{$g}[$i][0] - $position
+), $HoA{$g}[$i][2]));  
      $i++;
      }
    if ($i == $num) {
      seek(DNA,(0-length($buffer)),1);
      $buffer = '';
      $position = 0;
      $skip = 1;
      $len = 0;
      next FASTAKEY;
      }
      else {
        if ($position + $len < $HoA{$g}[$i][0]) {
          # Reset the position counter
          # (will be accurate after you reset the buffer, next)
          $position += $len;
     
          # Reset the buffer
          # Discard the first $max worth of data in the buffer
          $buffer = substr($buffer, $len); 
          }
        $len = 0;
        next LINE;   
        }
    # closes LINE
    }
   }
  # closes FASTAKEY
  }   
close(DNA) or croak("Cannot close $filename:$!\n");    
    
print "Finished gathering all the sequences, now moving on.\n";
[download]

So, that's the relevant portion of the code.

What's going wrong

Some sample

$g is: >LG_I
This $g, >LG_I, has a set.
....We've hit a newline, which is: >LG_I
We are in the 'if buffer eq' loop.
..$g is: >LG_I, and $newline is >LG_I
Gathering sequence data from sequence >LG_I
....We've hit a newline, which is: >LG_II
......At FINISH, $num is: 73, and $i is: 68
in FINISH, $i is: 68, and $num is 73.  They should not be equal
$i is: 68, $HoA{>LG_I}[68][0] is: 22325699, $HoA{>LG_I}[68][1] is: 223
+26343, $position is: 33722650 and length($buffer) is: 1848919
The position of $highest is below that of $position!  How did this hap
+pen?!
substr outside of string at Test3 line 1483, <DNA> line 508168.
......At FINISH, $num is: 73, and $i is: 69
in FINISH, $i is: 69, and $num is 73.  They should not be equal
$i is: 69, $HoA{>LG_I}[69][0] is: 27156174, $HoA{>LG_I}[69][1] is: 271
+56931, $position is: 33722650 and length($buffer) is: 1848919
The position of $highest is below that of $position!  How did this hap
+pen?!
substr outside of string at Test3 line 1483, <DNA> line 508168.
......At FINISH, $num is: 73, and $i is: 70
in FINISH, $i is: 70, and $num is 73.  They should not be equal
$i is: 70, $HoA{>LG_I}[70][0] is: 10031764, $HoA{>LG_I}[70][1] is: 100
+32474, $position is: 33722650 and length($buffer) is: 1848919
The position of $highest is below that of $position!  How did this hap
+pen?!
substr outside of string at Test3 line 1483, <DNA> line 508168.
......At FINISH, $num is: 73, and $i is: 71
in FINISH, $i is: 71, and $num is 73.  They should not be equal
$i is: 71, $HoA{>LG_I}[71][0] is: 22783500, $HoA{>LG_I}[71][1] is: 227
+84101, $position is: 33722650 and length($buffer) is: 1848919
The position of $highest is below that of $position!  How did this hap
+pen?!
substr outside of string at Test3 line 1483, <DNA> line 508168.
......At FINISH, $num is: 73, and $i is: 72
in FINISH, $i is: 72, and $num is 73.  They should not be equal
$i is: 72, $HoA{>LG_I}[72][0] is: 28700514, $HoA{>LG_I}[72][1] is: 287
+01277, $position is: 33722650 and length($buffer) is: 1848919
The position of $highest is below that of $position!  How did this hap
+pen?!
substr outside of string at Test3 line 1483, <DNA> line 508168.
......At FINISH, $num is: 73, and $i is: 73
in FINISH, $i is: 73, and $num is 73.  They should be equal
$g is: >LG_II
This $g, >LG_II, has a set.
....We've hit a newline, which is: >LG_II
......At FINISH, $num is: 61, and $i is: 4
in FINISH, $i is: 4, and $num is 61.  They should not be equal
$i is: 4, $HoA{>LG_II}[4][0] is: 2635020, $HoA{>LG_II}[4][1] is: 26352
+37, $position is: 1822784 and length($buffer) is: 86
The position of $lowest is past the end of the buffer!  How did that h
+appen?!
substr outside of string at Test3 line 1480, <DNA> line 534211.

...and so on.  it continues all the way up to $i is 61.
[download]

, followed by why it is bothersome:

Here are the problems. First off, when I get towards the end of a particular FASTA sequence, I try to finish off saving matching sequences while there is still buffer containing matches. But, I am getting out-of-bounds errors.

Second, Even though I reset the variable $buffer = '' twice, upon reacing the second FASTA ID line, it skips right past the if ($buffer eq '') { line. This results in the first few sequences of >LG_II actually saving from the data for >LG_I. I don't understand why that it, but it is. I assume that since the buffer variable should be empty at this point, at the NOSKIP label, flow should be directed into the 'if bufer eq' loop, but it isn't. That's my biggest concern.

Any help at all would be greatly appreciated.

Thanks everyone. This thing is driving me batty.
Matt

Comment on Debugging help please? Select or Download Code

Replies are listed 'Best First'.
Re: Debugging help please? by shmem (Chancellor) on Dec 05, 2006 at 22:08 UTC
Hmm. Here's the skinny. This is a subset of a subroutine. Yes, I am using use strict and use warnings. Every single print statement in the code is put there solely for debugging purposes. For being a subset of a subroutine, this is waaay too long. You are giving yourself a clue writing I have a series of comments in the code that are to help me keep track of brackets more than anything else. Anything else would be far better. Do you think we are so much smarter than you following the code flow, and spotting the bug? Try to break up your long loop into smaller chunks, and put those into subroutines. recently there was a discussion about variable naming. `FASTAKEY: for my $g (@fastarray){` [download] This one strikes me for it's lousieness. The Thing All The Loop Is About is just named `$g`! And nobody gets to know what the hell it is, because there is no comment. avoid goto. Avoid goto! Avoid nested gotos going back and forth within a loop like the sin! Almost everytime you feel like needing goto, or if you are using excessive calls to next and last, you really need to rethink the logic of your program flow. Go back. GOTO your starting point: What I'm trying to do I have gathered a series of endpoints to a large text file (containing a genome split into FASTA format records). The endpoints are stored in a %HoAoA, where the keys of the hash correspond to the ID lines of each FASTA record, and the values are arrays of arrays. Each (first level) array corresponds to a particular sequence I want to extract from the genome file. Each (second level) array contains three elements: [0] is the starting point, [1] is the ending point, and [2] is the length of DNA contained between them, defined as [1] - [0]. At this point in the script, I am trying to refer to the %HoAoA to find out where each sequence is, and then extract that sequence from the genome file, and place it into a corresponding element in an %HoA. Start from there. Put this block as a comment into your code. Program what you're stating here. Tweak the formulation of the problem 'til it fits with the code you program for it. Each statement of what I'm trying to do should be followed with the code which is meant doing what is stated, i.e. translate the requirements into perl, and re-formulate the requirements to be translatable. Take care that the outcome of both ways to express the same thing - i.e. comment and code - remain understandable. (In an ideal program a bug would then be obvious, because the description in program code would be a contradiction to what is expressed in words) Bottom line, rather than debugging, I would rewrite that code. --shmem _($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}	[reply] [d/l] [select]
Re: Debugging help please? by osunderdog (Deacon) on Dec 05, 2006 at 18:54 UTC
Have you actually tried bringing this up in the perl debugger? You can get a lot of information about the state of things approaching the error using the debugger. In this particular case, I would probably put a `$DB::single=1;` near where you think it's going awry and then check out the values at that point using the debugger's inspection commands. See an old thread of mine for more information: Neat Debugger tricks Hazah! I'm Employed!	[reply] [d/l]
Re: Debugging help please? by jbert (Priest) on Dec 05, 2006 at 19:56 UTC
Are any of the FASTA modules on cpan of use to you? Using someone else's parser might be easier than rolling your own.	[reply]