Extracting a substring according to some criteria

Benson has asked for the wisdom of the Perl Monks concerning the following question:

Friends,
I have a problem extracting substrings according to some criteria. My input sequence is made up of the characters A, T, G and C.

ALLOWED=("AA","AG","GC","GT","CA","CG","TT","TC");
DISALLOWED=("AC","AT","GG","GA","CC","CT","TG","TA");

Using the DISALLOWED pairs above, the program should break (chop) the sequence between two adjacent characters whenever they form a disallowed pair, and then continue reading. Inside ALLOWED pairs it should not break; a piece keeps growing until a disallowed pair is found. The same applies at line ends: if the last character of one line and the first character of the next line form an ALLOWED pair, there should be no break between them. If the character N occurs inside the sequence, it should also be split off as its own piece. For example:

CTGTCAGCNNNCCGGTTTTCAAGNNGAGCACACACCAAAAATGCACCAAAGCTTNACATCCATACAAA

For the above input sequence the output should be

C
T 
GTCAGC 
NNN
C 
CG 
GTTTTCAAG 
NN
G 
AGCA 
CA 
CA 
C 
CAAAAA 
T 
GCA 
C 
CAAAGC 
TT 
N
A 
CA 
TC 
CA 
T 
A 
CAAA
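
As a quick sanity check on the ALLOWED and DISALLOWED lists in the question, the sketch below (not part of the original post) tests whether the two lists together cover each of the 16 two-letter combinations of A, C, G and T exactly once. The helper name `lists_are_complementary` is made up for illustration, and the `//` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my @allowed    = qw(AA AG GC GT CA CG TT TC);
my @disallowed = qw(AC AT GG GA CC CT TG TA);

# Hypothetical helper: true when the two lists together contain each of
# the 16 two-letter combinations of A, C, G, T exactly once.
sub lists_are_complementary {
    my ($allowed, $disallowed) = @_;
    my %seen;
    $seen{$_}++ for @$allowed, @$disallowed;
    for my $first (qw(A C G T)) {
        for my $second (qw(A C G T)) {
            # every pair must appear once, in one list or the other
            return 0 unless ( $seen{"$first$second"} // 0 ) == 1;
        }
    }
    # reject lists that also contain entries outside the 16 valid pairs
    return keys(%seen) == 16 ? 1 : 0;
}

print lists_are_complementary( \@allowed, \@disallowed )
    ? "complementary\n"
    : "not complementary\n";
```

If the check passes, only one of the two lists has to be kept in a program, since every pair that is not allowed is disallowed.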

Edit: g0n - added code tags

2005-10-19 Retitled by g0n, as per Monastery guidelines
Original title: 'substring'


Replies are listed 'Best First'.
Re: Extracting a substring according to some criteria by blazar (Canon) on Oct 19, 2005 at 12:07 UTC
If I understand your question correctly, which I'm not completely sure about, and taking Skeeve's comment into account, here is a (hopefully correct) hint towards a solution, not a complete one:

    my $allow    = join '|', qw/AA AG GC GT CA CG TT TC/;
    my $disallow = join '|', qw/AC AT GG GA CC CT TG TA/;
    # ...
    my @chunks = /(?:$allow).*?(?=$disallow)/g;
Re: Extracting a substring according to some criteria by blokhead (Monsignor) on Oct 19, 2005 at 16:08 UTC
Here's how to do it with a regex. You inch along the string: as long as the next two characters form an allowed pair, keep matching. As soon as the next two characters aren't allowed, take one more character and stop. (BTW, you don't need to list @disallowed explicitly if it is the complement of @allowed.)

    my @allowed = qw[ AA AG GC GT CA CG TT TC ];
    my $allowed = join "|", @allowed;
    my $regex   = qr/ N+ | (?: (?=$allowed) . )* . /x;

    my $data = "CTGTCAGCNNNCCGGTTTTCAAGNNGAGCACACACCAAAAATGCACCAAAGCTTNACATCCATACAAA";
    print "$_\n" for $data =~ m/$regex/g;

The regex also has to match /N+/ sequences, so that is added as a special case (Update: adding "NN" to @allowed would have the same effect).

This doesn't work for multi-line data. You can either remove all the newlines from the data first or, if you don't want to have the whole file in memory at once, do this "inching along" process manually. Take one character at a time, keeping track of the last one you've seen as well. If the last character and the new one form an allowed pair, add the new character to a buffer. If they form a disallowed pair, print the buffer (it is a maximal allowed string) and restart the buffer with the new character.

Update: Something like this:

    my @allowed = qw[ AA AG GC GT CA CG TT TC NN ];
    my %allowed = map { $_ => 1 } @allowed;

    my $buf = '';    # start empty so substr() always sees a defined string
    # getc(STDIN) stands in for "get next input character"; adapt to your input
    while ( defined( my $c = getc(STDIN) ) ) {
        next if $c =~ /\s/;    # skip newlines so a run can continue across lines
        if ( $buf eq '' or $allowed{ substr($buf,-1).$c } ) {
            $buf .= $c;
        } else {
            print "$buf\n";
            $buf = $c;
        }
    }
    print "$buf\n" if length $buf;    # don't forget the last one

blokhead
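A minimal sketch of the buffer approach described above, reading line by line instead of character by character so that the whole file never has to be in memory. It assumes "NN" is treated as an allowed pair (as in the update above) and that whitespace carries no meaning; the function name `split_runs` and the two-line demo input are made up for illustration.

```perl
use strict;
use warnings;

my %allowed = map { $_ => 1 } qw(AA AG GC GT CA CG TT TC NN);

# Hypothetical helper: reads a filehandle line by line and returns the
# maximal runs. The buffer survives line boundaries, so a run may span lines.
sub split_runs {
    my ($fh) = @_;
    my @runs;
    my $buf = '';
    while ( my $line = <$fh> ) {
        $line =~ s/\s+//g;    # drop the newline and any stray spaces
        for my $c ( split //, $line ) {
            if ( $buf eq '' or $allowed{ substr( $buf, -1 ) . $c } ) {
                $buf .= $c;       # first character, or allowed pair: extend run
            }
            else {
                push @runs, $buf; # disallowed pair: close the current run
                $buf = $c;
            }
        }
    }
    push @runs, $buf if $buf ne '';    # flush the final run
    return @runs;
}

# Demo on an in-memory filehandle; this two-line input is invented.
my $text = "CTGTCAG\nCNNNCC\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
print "$_\n" for split_runs($fh);
```

In the demo, the G that ends the first line and the C that starts the second stay in the same run, because the buffer is not reset between lines.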
Re^2: Extracting a substring according to some criteria by Benson (Initiate) on Oct 20, 2005 at 07:19 UTC
Friends, I have run into a problem with the input below. The last character of the first line, G, and the first character of the next line, C, form an @allowed pair, but in the output G and C are not kept together. This is because there is a space after the last character G.

    my @allowed = qw[ AA AG GC GT CA CG TT TC ];
    my $allowed = join "|", @allowed;
    my $regex   = qr/ N+ | (?: (?=$allowed) . )* . /x;

    my $data = "TGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGATAG C";
    print "$_\n" for $data =~ m/$regex/g;

That is, instead of being printed as GC, I get

    G
    C

which is wrong. Please give a solution.

Edit: g0n - code tags
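One possible fix, sketched here under the assumption that spaces and newlines in the input never carry meaning: remove all whitespace before applying the regex, as suggested earlier in the thread for newlines. The function name `chunks` and the shortened test string are made up for illustration.

```perl
use strict;
use warnings;

my $allowed = join '|', qw(AA AG GC GT CA CG TT TC);
my $regex   = qr/ N+ | (?: (?=$allowed) . )* . /x;

# Hypothetical helper: strips whitespace, then splits with the thread's regex.
sub chunks {
    my ($data) = @_;
    $data =~ s/\s+//g;    # "G C" and "G\nC" both become "GC"
    return $data =~ m/$regex/g;
}

print "$_\n" for chunks("GATAG C");
```

With the whitespace gone, the trailing G and the following C are adjacent again, so the regex sees the allowed pair GC and keeps them in one piece.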
Re: Extracting a substring according to some criteria by Skeeve (Parson) on Oct 19, 2005 at 11:42 UTC
Do you want us to do your homework? ;-) Show us some code and most monks will be more than willing to solve any problem you encountered.