No, a gene looks like this: it is preceded by the string TATAAT, and after that string there can be one or more of the letters A, C, G, T. Then the string ATG follows, then again any number of A, C, G, T's, and the gene ends with one of the strings TAA, TGA or TAG. For example, given the line TATAATATTACAATGGATCATACAGTTAG ... our gene is the part from ATG through TAG (ATGGATCATACAGTTAG here), but we also have to make sure it is preceded by a TATAAT. I have to print out the genes in the txt file according to these rules.
Incorporating citromatik's suggestion, we have now:
use strict;
use warnings;
my $filename = 'input.txt';
open(my $IN, '<', $filename) or die "Can't open file $filename: $!";
my $text; # no need to initialize it to the empty string!
while (my $line = <$IN>) { # $line must be declared with my under use strict
chomp $line; ## <---- !!!
$text .= $line;
}
while($text =~ m/TATAAT[ACGT]+?(ATG[ACGT]+?(?:TGA|TAG|TAA))/g) {
print "$1\n";
}
I have added some non-capturing parentheses (the (?: ... ) around the TGA|TAG|TAA) and, most importantly, made the + quantifiers non-greedy by adding a ?. Otherwise they would match too much and return only the last gene in your string.
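To see the difference the ? makes, here is a small sketch that runs both versions of the regex on the same string. The test string is made up for illustration: it is the example line from the question with a second, hypothetical TATAAT...ATG...TGA stretch appended.

```perl
use strict;
use warnings;

# Hypothetical input: the example line from the question, followed by
# a second made-up promoter/gene pair.
my $dna = 'TATAATATTACAATGGATCATACAGTTAG' . 'TATAATCCATGAAACCCTGA';

# Non-greedy quantifiers: each match stops at the first ATG after TATAAT
# and at the first stop codon after that ATG.
my @lazy = $dna =~ m/TATAAT[ACGT]+?(ATG[ACGT]+?(?:TGA|TAG|TAA))/g;

# Greedy quantifiers: the first [ACGT]+ runs to the end of the string and
# backtracks only as far as it must, so it skips ahead to the last ATG.
my @greedy = $dna =~ m/TATAAT[ACGT]+(ATG[ACGT]+(?:TGA|TAG|TAA))/g;

print "non-greedy: @lazy\n";   # ATGGATCATACAGTTAG ATGAAACCCTGA
print "greedy:     @greedy\n"; # ATGAAACCCTGA
```

On this input the non-greedy version finds both genes, while the greedy one swallows the first gene and reports only the second.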
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James