Regexp and newlines, what am I missing?

mdunnbass has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I thought I had a lock on this concept, which is definitive proof that I was missing something. I am getting unexpected results from a sub that FASTA format files, i.e.
>identifier (newline) lots_of_text_no_whitespace (newline)
that's a rough approximation, but you get the point.

The sub in question is behind the

Disclaimer: use strict and warnings are in use here. any variables used but not declared are (gasp!) global. %index_hash is a has hstoring the position in the file or each $fasta_id, for easier lookup later, since I have to go back to them 3 or 4 times per $fasta_id through the run of the script. And @pattern contains short text sequences for searching.

####
#  This subroutine takes a file containing any number of FASTA sequenc
+es, and
#  by reading the sequences in a line at a time, searches all the sequ
+ences for
#  matches to binding sites entered.  The header line of the FASTA seq
+uence is
#  set to a key in a hash, and each site searched for is set to a key 
+in a sub hash
#  for each Fasta ID.  The position of the 1st base of the matches is 
+set to the 
#  values of each subhash.  Only the position of the matches is
#  saved, no actual text.  Any formatting or output of the matches nee
+ds to be 
#  done elsewhere.
####

sub SEARCHFASTA {
  my $site = 0; 
  my $fasta_id = shift;
  my $position = 0; 
  my $buffer = ''; 
  my %matches = ();
 
  open(DNA,"<", "$filename") or croak("Cannot open $filename:$!\n");
  seek(DNA,$index_hash{$fasta_id},0);
    
  GETSLINE: while($newline = <DNA> ) {
   # commented out debugging statement: print "\$newline is: $newline"
+;
 
  # If new fasta header or blank line,
    if($newline =~ /(^>)|(^$)/) {  
      chomp $newline;
      if ($newline eq $fasta_id) {
        print "Currently searching sequence:\n     $newline\n";
        unless ($buffer eq '') {
          $i = 0;
          # commented out debugging statement: print "in unless buffer
+ loop\n";
          while ($i < $num) {
            while($buffer =~ /$pattern[$i]/gxsi) {
              # commented out debugging statement: print "  match is $
+1\n";
              push @{$matches{$i}}, $position + $-[0]; 
              }
            $i++;
            } 
          $buffer = '';
          $position = 0;
          }
        }
      # if the newline is not the fasta_id, then we finished searching
+ it, and
      # need to exit the loop structure, and close the file.
      if ($newline ne $fasta_id) {
        last GETSLINE;
        }
      }       
    
    chomp $newline;
    $buffer .= $newline;
    
    if (length($buffer) < $span ) {
      next GETSLINE;
      }
    
    # Search for the DNA pattern
    $i = 0;
    while ($i < $num) {
      while($buffer =~ /$pattern[$i]/gsxi) {
        push @{$matches{$i}}, $position + $-[0]; 
        }
      $i++;
      } 
      
    # Reset the position counter
    # (will be accurate after you reset the buffer, next)
    $position += (length($buffer) - $lg[0]+1);
  
    # Reset the buffer
    # Discard the first $max worth of data in the buffer
    $buffer = substr($buffer, (length($buffer) - $lg[0] + 1));     
    } 
      
  # This erases some of the larger variables from the search routine  
+ 
  $buffer = '';
  $position = 0; 
   
  for my $tfbs (keys %matches) {
    foreach (@{$matches{$tfbs}}) {
      $_ -= length($fasta_id);
      }
    } 
  close(DNA) or croak("Cannot close $filename:$!\n");
  return %matches; 
}
[download]

Most of the code was yoinked directly from Tisdall's 'Beginning Perl for Bioinformatics', and has been working beautifully for me (or so I thought)

The problem I'm getting is when I start playing around with files with different types of carriage returns/newlines. I think. I say this, because for most files I use, everything works great. But when i cut and paste together FASTA sequences from certain editors, the sequences are completely skipped (i.e., Data::Dumper output of %matches is has $fasta_id as key, but values are empty. I threw in a few print statements to see where the problems were occurring, and here's what I got:

Sample sequence:
>LepenhancerForward
TTTATGCACTCATGTTTAGACATATTTCCTACACCCATATTTGAAGACCA
AAATAAACTGCATGTTATTTCATTGTAGGCATAGTACCTGTAGTATAATC
TTATATCTCATGCAGTGGTTCAGTCAGAAAAAAAAAGAAAAAAAGGGTGT
TTCTCAACATTTTTTCATAATTTGATTTTTTTGTAAAAGAAGAGGGCTTG
CGATCCCAGTAACCCCTTGCAGCCCCACTGATCTCATGTATATTTCAGAT
CACTACCAAGCTGTATTAGAATGCTACATGGCATGCCCTGATGAAGCAGC
TAGAGTAAACGCTAATTCATTCATCAACAATTTTGTGCCAAAAGTTTATC
ATTACCTGCAGTTTGCTTACTACCATGGTAGGGTATTTACCCAAGTTAGT
CACTTTATATTTCATCGCCACATATACATTTGTTGTTCCAATTCCGTGTA
TTATTTTAAACATTGTATCTAAGTGTGATACTTTGTGTAAACTCTAAAAG
TAGATTCTGTAAACTGTGTCCATTTCTGTGCCTATTTACTTGGTATTTGT
TAACTCTTAGTACACATAAGTTTACGTACC

print and Dumper statements then output:

$newline is: >LepenhancerForward

Currently searching sequence:
     >LepenhancerForward
TAACTCTTAGTACACATAAGTTTACGTACCCCTATTTACTTGGTATTTGTTATTTGAAGACCA

$newline is: 

Dumper output is:
$VAR1 = '>LepenhancerForward';
$VAR2 = {};
[download]

The problem lies with the "Currently searching sequence named" bit. That should be the $fasta_id only, but somehow, it's taking the last line of sequence along with part of the next-to-last and first lines too. I am confused.

I have a feeling the problem is with the initial /(^>)|(^$)/, but I want to make sure that in addition to looking for ^>, which denotes a new fasta id, it also keeps an eye out for blank lines as well, and knows to go to last GETSLINE when it hits one.

If someone could please point out my folly, I'd really appreciate it.

Thanks
Matt

P.S. -
if I wanted to slightly tweak this sub so that it runs a bit faster, would passing the global variables (even if for instance %index_hash is very large) make it go any quicker, you think? I suppose I can just try it myself and benchmark, but I was just wondering if there was something glaring off the bat you guys saw that threw up red flags, etc.

Comment on Regexp and newlines, what am I missing? Select or Download Code

Replies are listed 'Best First'.
Re: Regexp and newlines, what am I missing? by almut (Canon) on Feb 06, 2007 at 22:14 UTC
I guess what's causing you problems is that on some lines, there's only a carriage return at the end of line (in your example, it's those lines with the sequence data). So what happens is that all those lines (up to the next proper newline (i.e. the one used on your platform)) are being read in as one line; and when you print that string, the carriage returns reposition the terminal cursor to the beginning of the line, so most of data is overprinted... The "weird" result you're seeing is what remains. The following snippet demonstrates the effect (note the "\r" at the end of the substrings): `my $newline = "TTTATGCACTCATGTTTAGACATATTTCCTACACCCATATTTGAAGACCA\r" . "... other lines (overprinted) ...\r" . "TAGATTCTGTAAACTGTGTCCATTTCTGTGCCTATTTACTTGGTATTTGT\r" . "TAACTCTTAGTACACATAAGTTTACGTACC\r"; print "\$newline is: $newline\n"; # (this is your debug print)` [download] In other words, you have to make sure that - after cut-n-pasting from one platform to another - the resulting lines are all conforming to the same newline style (preferably the one being used by your OS :) Several tools can help here, for example `dos2unix, unix2dos, ...`, or Perl's `tr///`, or `s///` (e.g. `tr/\r/\n/`). Some editors can do auto-conversions, too. But you have to fix things before you process the data with your program. Good luck.	[reply] [d/l] [select]
Re: Regexp and newlines, what am I missing? by graff (Chancellor) on Feb 07, 2007 at 02:11 UTC
If you're having trouble with wierd and uncontrolled whitespace, it might be a good idea to put together a proper diagnosis tool, to at least warn you when the whitespace might cause trouble for the assumptions that are implicit in your script. Something like: #!/usr/bin/perl =head1 NAME space-hist -- tabulate histogram of whitespace patterns =head1 SYNOPSIS space-hist file.name =head1 DESCRIPTION This will read the full content of the given text file into memory, tokenize the data on whitespace, and then count how many times each distinct whitespace string occurs. The various whitespace string patterns are listed on STDOUT as strings of hex codes, with the frequency of occurrence for each, in descending order of frequency. =cut use strict; die "Usage: $0 file.txt\n" unless ( @ARGV == 1 and -f $ARGV[0] ); $/ = undef; # slurp mode open( IN, "<", $ARGV[0] ) or die "open failed for $ARGV[0]: $!"; $_ = <IN>; @_ = split /(\s+)/; my %whitespace; for my $tkn ( @_ ) { next unless ( $tkn =~ /^\s+$/ ); my $wshex = join " ", map { sprintf "%02x", ord($_) } split //, $t +kn; $whitespace{$wshex}++; } print "Whitespace tokens found in $ARGV[0]:\n"; for ( sort {$whitespace{$b} <=> $whitespace{$a}} keys %whitespace ) { printf "%4d %s\n", $whitespace{$_}, $_; } [download] (updated to print "Usage" when @ARGV is not as expected) The point is: you either need to adapt your script to the range of whitespace patterns that actually occur in your data, or else you need to doctor your data so that its whitespace patterns conform to the limitations of your script. Looking at what's in the data is only the first step, and it's totally worthwhile to make that step as thorough as possible. BTW, using my script as input to itself, I get the following (because I created it with emacs on macosx): `$ space-hist space-hist Whitespace tokens found in space-hist: 125 20 12 0a 0a 11 0a 4 0a 20 20 20 20 2 20 20` [download]	[reply] [d/l] [select]
Re: Regexp and newlines, what am I missing? by samtregar (Abbot) on Feb 06, 2007 at 22:24 UTC
You posted too much code. I think that's the reason no-one else has responded to your question - we just don't have time to read that much code. You need to boil down your problem to just a few lines of code - you might find the problem yourself during this process, but if not you'll have something we can help you with. But since I'm here anyway, I've got a random guess: something else is setting $/ in your program, messing up <> and causing you to get multiple lines in the (oddly named) $newline variable. I don't see how else that line could print two lines of text - but I do note that the code you posted isn't what produced the output you show... Maybe something changed aside from commenting some stuff out? -sam	[reply]