mdunnbass has asked for the wisdom of the Perl Monks concerning the following question:
I thought I had a lock on this concept, which is definitive proof that I was missing something. I am getting unexpected results from a sub that FASTA format files, i.e.
>identifier (newline) lots_of_text_no_whitespace (newline)
that's a rough approximation, but you get the point.
The sub in question is behind the
Most of the code was yoinked directly from Tisdall's 'Beginning Perl for Bioinformatics', and has been working beautifully for me (or so I thought)#### # This subroutine takes a file containing any number of FASTA sequenc +es, and # by reading the sequences in a line at a time, searches all the sequ +ences for # matches to binding sites entered. The header line of the FASTA seq +uence is # set to a key in a hash, and each site searched for is set to a key +in a sub hash # for each Fasta ID. The position of the 1st base of the matches is +set to the # values of each subhash. Only the position of the matches is # saved, no actual text. Any formatting or output of the matches nee +ds to be # done elsewhere. #### sub SEARCHFASTA { my $site = 0; my $fasta_id = shift; my $position = 0; my $buffer = ''; my %matches = (); open(DNA,"<", "$filename") or croak("Cannot open $filename:$!\n"); seek(DNA,$index_hash{$fasta_id},0); GETSLINE: while($newline = <DNA> ) { # commented out debugging statement: print "\$newline is: $newline" +; # If new fasta header or blank line, if($newline =~ /(^>)|(^$)/) { chomp $newline; if ($newline eq $fasta_id) { print "Currently searching sequence:\n $newline\n"; unless ($buffer eq '') { $i = 0; # commented out debugging statement: print "in unless buffer + loop\n"; while ($i < $num) { while($buffer =~ /$pattern[$i]/gxsi) { # commented out debugging statement: print " match is $ +1\n"; push @{$matches{$i}}, $position + $-[0]; } $i++; } $buffer = ''; $position = 0; } } # if the newline is not the fasta_id, then we finished searching + it, and # need to exit the loop structure, and close the file. if ($newline ne $fasta_id) { last GETSLINE; } } chomp $newline; $buffer .= $newline; if (length($buffer) < $span ) { next GETSLINE; } # Search for the DNA pattern $i = 0; while ($i < $num) { while($buffer =~ /$pattern[$i]/gsxi) { push @{$matches{$i}}, $position + $-[0]; } $i++; } # Reset the position counter # (will be accurate after you reset the buffer, next) $position += (length($buffer) - $lg[0]+1); # Reset the buffer # Discard the first $max worth of data in the buffer $buffer = substr($buffer, (length($buffer) - $lg[0] + 1)); } # This erases some of the larger variables from the search routine + $buffer = ''; $position = 0; for my $tfbs (keys %matches) { foreach (@{$matches{$tfbs}}) { $_ -= length($fasta_id); } } close(DNA) or croak("Cannot close $filename:$!\n"); return %matches; }
The problem I'm getting is when I start playing around with files with different types of carriage returns/newlines. I think. I say this, because for most files I use, everything works great. But when i cut and paste together FASTA sequences from certain editors, the sequences are completely skipped (i.e., Data::Dumper output of %matches is has $fasta_id as key, but values are empty. I threw in a few print statements to see where the problems were occurring, and here's what I got:
The problem lies with the "Currently searching sequence named" bit. That should be the $fasta_id only, but somehow, it's taking the last line of sequence along with part of the next-to-last and first lines too. I am confused.Sample sequence: >LepenhancerForward TTTATGCACTCATGTTTAGACATATTTCCTACACCCATATTTGAAGACCA AAATAAACTGCATGTTATTTCATTGTAGGCATAGTACCTGTAGTATAATC TTATATCTCATGCAGTGGTTCAGTCAGAAAAAAAAAGAAAAAAAGGGTGT TTCTCAACATTTTTTCATAATTTGATTTTTTTGTAAAAGAAGAGGGCTTG CGATCCCAGTAACCCCTTGCAGCCCCACTGATCTCATGTATATTTCAGAT CACTACCAAGCTGTATTAGAATGCTACATGGCATGCCCTGATGAAGCAGC TAGAGTAAACGCTAATTCATTCATCAACAATTTTGTGCCAAAAGTTTATC ATTACCTGCAGTTTGCTTACTACCATGGTAGGGTATTTACCCAAGTTAGT CACTTTATATTTCATCGCCACATATACATTTGTTGTTCCAATTCCGTGTA TTATTTTAAACATTGTATCTAAGTGTGATACTTTGTGTAAACTCTAAAAG TAGATTCTGTAAACTGTGTCCATTTCTGTGCCTATTTACTTGGTATTTGT TAACTCTTAGTACACATAAGTTTACGTACC print and Dumper statements then output: $newline is: >LepenhancerForward Currently searching sequence: >LepenhancerForward TAACTCTTAGTACACATAAGTTTACGTACCCCTATTTACTTGGTATTTGTTATTTGAAGACCA $newline is: Dumper output is: $VAR1 = '>LepenhancerForward'; $VAR2 = {};
I have a feeling the problem is with the initial /(^>)|(^$)/, but I want to make sure that in addition to looking for ^>, which denotes a new fasta id, it also keeps an eye out for blank lines as well, and knows to go to last GETSLINE when it hits one.
If someone could please point out my folly, I'd really appreciate it.
Thanks
Matt
P.S. -
if I wanted to slightly tweak this sub so that it runs a bit faster, would passing the global variables (even if for instance %index_hash is very large) make it go any quicker, you think? I suppose I can just try it myself and benchmark, but I was just wondering if there was something glaring off the bat you guys saw that threw up red flags, etc.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Regexp and newlines, what am I missing?
by almut (Canon) on Feb 06, 2007 at 22:14 UTC | |
|
Re: Regexp and newlines, what am I missing?
by graff (Chancellor) on Feb 07, 2007 at 02:11 UTC | |
|
Re: Regexp and newlines, what am I missing?
by samtregar (Abbot) on Feb 06, 2007 at 22:24 UTC |