comment on

Greetings Monks,

I've got a snippet of code below that I'd like some help debugging please, if anyone's up to it.

Here's the skinny. This is a subset of a subroutine. Yes, I am using use strict and use warnings. Every single print statement in the code is put there solely for debugging purposes. Eventually, all of the data will be written to a file, with no need for screen output.

What I'm trying to do

I have gathered a series of endpoints to a large text file (containing a genome split into FASTA format records). The endpoints are stored in a %HoAoA, where the keys of the hash correspond to the ID lines of each FASTA record, and the values are arrays of arrays. Each (first level) array corresponds to a particular sequence I want to extract from the genome file. Each (second level) array contains three elements: [0] is the starting point, [1] is the ending point, and [2] is the length of DNA contained between them, defined as [1] - [0]. At this point in the script, I am trying to refer to the %HoAoA to find out where each sequence is, and then extract that sequence from the genome file, and place it into a corresponding element in an %HoA.

How I am trying to do it

I've pasted the code below. For page-length's sake, it's behind a readmore cut. I have a series of comments in the code that are to help me keep track of brackets more than anything else.

For those that don't know, FASTA format sequences start with an ID line, which typically begins with a '>', followed by text and a newline character. What follows that is sequence data. In this case, I have saved the ID lines as elements in @fastarray in the order the arose in my original searches, this way, I can go back to them in order, rather than scanning the entire file for every FASTA sequence (there are ~30,000 in this file). I scaan through the file, looking for the next line that matches /^>/, and compare that to the desired FASTA ID, if it's the match, I start countign characters until I get to the starting point ofthe desired sequence, and use substr to keep it.

open(DNA,"<$filename") or croak("Cannot open $filename:$!\n");
FASTAKEY:  for my $g (@fastarray){
 print "\$g is: $g\n";
 if (@{$sets{$g}}){
   print "This \$g, $g, has a set.\n";
  $buffer = '';
  my $skip = 0;
  my $i = 0;
  $position = 0;
  $num = (@{$sets{$g}});
  LINE: while ($newline = <DNA> ) {
    
    if ($skip == 1) {  
      unless ($newline =~ /^>/) {
        next LINE;
        }
      if ($newline =~ /^>/) {
        $skip = 0;
        goto NOSKIP;
        }
      }
      
      else {
      # If new fasta header, 
        NOSKIP: if($newline =~ /^>/) { 
          print "....We've hit a newline, which is: $newline";
          if ($buffer eq '') { 
            chomp $newline;
            print "We are in the 'if buffer eq' loop.\n";
            print "..\$g is: $g, and \$newline is $newline\n";
            if ($newline ne $g) {
              $skip = 1;
              next LINE;
              }
            if ($newline eq $g) {
              $header = $newline;
              $g = $header;
              $skip = 0;
              print "Gathering sequence data from sequence $g\n";
              next LINE;
              }
              
            # closes if buffer eq ''
            }
            
          else { # ($buffer ne '')
            FINISH: print "......At FINISH, \$num is: $num, and \$i is
+: $i\n"; 
            if ($i == $num) {
              print "in FINISH, \$i is: $i, and \$num is $num.  They s
+hould be equal\n";
              $position = 0;
              $skip = 1;
              seek(DNA,(0-length($buffer)),1);
              $len = 0;
              $buffer = '';
              next FASTAKEY;
              }    
              else {
                print "in FINISH, \$i is: $i, and \$num is $num.  They
+ should not be equal\n";
                print "\$i is: $i, \$HoA{$g}[$i][0] is: $HoA{$g}[$i][0
+], \$HoA{$g}[$i][1] is: $HoA{$g}[$i][1], \$position is: $position and
+ length(\$buffer) is: ".length($buffer)."\n";
                if ($HoA{$g}[$i][0] >= ($position + length($buffer))) 
+{print "The position of \$lowest is past the end of the buffer!  How 
+did that happen?!\n";}
                if ($HoA{$g}[$i][1] < $position) {print "The position 
+of \$highest is below that of \$position!  How did this happen?!\n";}
                
                if ($HoA{$g}[$i][1] >= ($position + length($buffer))) 
+{
                  $match{$g}[$i] = uc(substr($buffer, ($HoA{$g}[$i][0]
+ - $position)));
                  }
                  else {  
                    $match{$g}[$i] = uc(substr($buffer, ($HoA{$g}[$i][
+0] - $position), $HoA{$g}[$i][2]));  
                    }
                
                $i++; 
                goto FINISH;
                # close uncommented else
                }
                
              #closes else  ($buffer ne ''  )
              }
            # closes if newline    
            }
            
          # closes else statement  
          }

        
    chomp $newline;
    $buffer .= $newline;
    $len = length($newline);
    
    if (($position + $span) < $HoA{$g}[$i][0] && length($buffer) > $sp
+an) {
    # Reset the position counter
    # (will be accurate after you reset the buffer, next)
      $position += $span;
     
    # Reset the buffer
    # Discard the first $max worth of data in the buffer
      $buffer = substr($buffer, $span);
      }
    
    unless (length($buffer) >= $HoA{$g}[$i][2] && $position <= $HoA{$g
+}[$i][0] && (length($buffer) + $position) >= $HoA{$g}[$i][1]) {
      next LINE;
      }
    
    if ($i < $num) {  
      $match{$g}[$i] = uc(substr($buffer, ($HoA{$g}[$i][0] - $position
+), $HoA{$g}[$i][2]));  
      $i++;
      }
    if ($i == $num) {
      seek(DNA,(0-length($buffer)),1);
      $buffer = '';
      $position = 0;
      $skip = 1;
      $len = 0;
      next FASTAKEY;
      }
      else {
        if ($position + $len < $HoA{$g}[$i][0]) {
          # Reset the position counter
          # (will be accurate after you reset the buffer, next)
          $position += $len;
     
          # Reset the buffer
          # Discard the first $max worth of data in the buffer
          $buffer = substr($buffer, $len); 
          }
        $len = 0;
        next LINE;   
        }
    # closes LINE
    }
   }
  # closes FASTAKEY
  }   
close(DNA) or croak("Cannot close $filename:$!\n");    
    
print "Finished gathering all the sequences, now moving on.\n";
[download]

So, that's the relevant portion of the code.

What's going wrong

Some sample

$g is: >LG_I
This $g, >LG_I, has a set.
....We've hit a newline, which is: >LG_I
We are in the 'if buffer eq' loop.
..$g is: >LG_I, and $newline is >LG_I
Gathering sequence data from sequence >LG_I
....We've hit a newline, which is: >LG_II
......At FINISH, $num is: 73, and $i is: 68
in FINISH, $i is: 68, and $num is 73.  They should not be equal
$i is: 68, $HoA{>LG_I}[68][0] is: 22325699, $HoA{>LG_I}[68][1] is: 223
+26343, $position is: 33722650 and length($buffer) is: 1848919
The position of $highest is below that of $position!  How did this hap
+pen?!
substr outside of string at Test3 line 1483, <DNA> line 508168.
......At FINISH, $num is: 73, and $i is: 69
in FINISH, $i is: 69, and $num is 73.  They should not be equal
$i is: 69, $HoA{>LG_I}[69][0] is: 27156174, $HoA{>LG_I}[69][1] is: 271
+56931, $position is: 33722650 and length($buffer) is: 1848919
The position of $highest is below that of $position!  How did this hap
+pen?!
substr outside of string at Test3 line 1483, <DNA> line 508168.
......At FINISH, $num is: 73, and $i is: 70
in FINISH, $i is: 70, and $num is 73.  They should not be equal
$i is: 70, $HoA{>LG_I}[70][0] is: 10031764, $HoA{>LG_I}[70][1] is: 100
+32474, $position is: 33722650 and length($buffer) is: 1848919
The position of $highest is below that of $position!  How did this hap
+pen?!
substr outside of string at Test3 line 1483, <DNA> line 508168.
......At FINISH, $num is: 73, and $i is: 71
in FINISH, $i is: 71, and $num is 73.  They should not be equal
$i is: 71, $HoA{>LG_I}[71][0] is: 22783500, $HoA{>LG_I}[71][1] is: 227
+84101, $position is: 33722650 and length($buffer) is: 1848919
The position of $highest is below that of $position!  How did this hap
+pen?!
substr outside of string at Test3 line 1483, <DNA> line 508168.
......At FINISH, $num is: 73, and $i is: 72
in FINISH, $i is: 72, and $num is 73.  They should not be equal
$i is: 72, $HoA{>LG_I}[72][0] is: 28700514, $HoA{>LG_I}[72][1] is: 287
+01277, $position is: 33722650 and length($buffer) is: 1848919
The position of $highest is below that of $position!  How did this hap
+pen?!
substr outside of string at Test3 line 1483, <DNA> line 508168.
......At FINISH, $num is: 73, and $i is: 73
in FINISH, $i is: 73, and $num is 73.  They should be equal
$g is: >LG_II
This $g, >LG_II, has a set.
....We've hit a newline, which is: >LG_II
......At FINISH, $num is: 61, and $i is: 4
in FINISH, $i is: 4, and $num is 61.  They should not be equal
$i is: 4, $HoA{>LG_II}[4][0] is: 2635020, $HoA{>LG_II}[4][1] is: 26352
+37, $position is: 1822784 and length($buffer) is: 86
The position of $lowest is past the end of the buffer!  How did that h
+appen?!
substr outside of string at Test3 line 1480, <DNA> line 534211.

...and so on.  it continues all the way up to $i is 61.
[download]

, followed by why it is bothersome:

Here are the problems. First off, when I get towards the end of a particular FASTA sequence, I try to finish off saving matching sequences while there is still buffer containing matches. But, I am getting out-of-bounds errors.

Second, Even though I reset the variable $buffer = '' twice, upon reacing the second FASTA ID line, it skips right past the if ($buffer eq '') { line. This results in the first few sequences of >LG_II actually saving from the data for >LG_I. I don't understand why that it, but it is. I assume that since the buffer variable should be empty at this point, at the NOSKIP label, flow should be directed into the 'if bufer eq' loop, but it isn't. That's my biggest concern.

Any help at all would be greatly appreciated.

Thanks everyone. This thing is driving me batty.
Matt

In reply to Debugging help please? by mdunnbass

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.