Benson has asked for the wisdom of the Perl Monks concerning the following question:

Friends,
I am having trouble extracting substrings according to certain criteria. My input sequence contains only the characters A, T, G and C.

ALLOWED=("AA","AG","GC","GT","CA","CG","TT","TC");
DISALLOWED=("AC","AT","GG","GA","CC","CT","TG","TA")

When the program reads the input file, it should break (chop) the sequence at every DISALLOWED pair and continue. Within ALLOWED pairs it should not break until it reaches a disallowed pair. If the last character of one line and the first character of the next line form an ALLOWED pair, it should not break at the line end either. If the character N occurs inside the sequence, it should be split off as well. For example:

CTGTCAGCNNNCCGGTTTTCAAGNNGAGCACACACCAAAAATGCACCAAAGCTTNACATCCATACAAA

For the above input sequence the output should be

C T GTCAGC NNN C CG GTTTTCAAG NN G AGCA CA CA C CAAAAA T GCA C CAAAGC TT N A CA TC CA T A CAAA

Edit: g0n - added code tags

2005-10-19 Retitled by g0n, as per Monastery guidelines
Original title: 'substring'

Replies are listed 'Best First'.
Re: Extracting a substring according to some criteria
by blazar (Canon) on Oct 19, 2005 at 12:07 UTC
    IIUC your question, which I'm not completely sure about, and taking into account Skeeve's comment, I'm giving you a (hopefully correct) hint/incomplete solution:
    my $allow=join '|', qw/AA AG GC GT CA CG TT TC/;
    my $disallow=join '|', qw/AC AT GG GA CC CT TG TA/;
    # ...
    my @chunks=/(?:$allow).*?(?=$disallow)/g;
Re: Extracting a substring according to some criteria
by blokhead (Monsignor) on Oct 19, 2005 at 16:08 UTC
    Here's how to do it with a regex. You inch along the string, and if the next two characters are allowed, then keep matching. As soon as the next two characters aren't allowed, take one more character and stop. (BTW, you don't need to explicitly list @disallowed if it is the complement of @allowed).
    my @allowed = qw[ AA AG GC GT CA CG TT TC ];
    my $allowed = join "|", @allowed;
    my $regex = qr/ N+ | (?: (?=$allowed) . )* . /x;

    my $data = "CTGTCAGCNNNCCGGTTTTCAAGNNGAGCACACACCAAAAATGCACCAAAGCTTNACATCCATACAAA";
    print "$_\n" for $data =~ m/$regex/g;
    The regex also has to match /N+/ sequences, so that is added as a special case (Update: Adding "NN" to @allowed would also have the same effect).
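
    As a quick check of the two variants described above, here is a minimal sketch that runs both against the sample sequence from the question and compares the result with the output the question asks for. The helper name chunks and the variable names are made up for illustration; only the regex idea comes from the post.

```perl
use strict;
use warnings;

# Allowed two-letter steps, as listed in the question.
my @allowed = qw[ AA AG GC GT CA CG TT TC ];
my $allowed = join "|", @allowed;

# Variant 1: runs of N are handled by an explicit N+ branch.
my $with_branch = qr/ N+ | (?: (?=$allowed) . )* . /x;

# Variant 2: NN is treated as one more allowed pair, no special branch.
my $with_pair = qr/ (?: (?=$allowed|NN) . )* . /x;

# Hypothetical helper: return every chunk the regex matches in $seq.
sub chunks {
    my ($regex, $seq) = @_;
    return $seq =~ m/$regex/g;
}

# Sample input and wanted output, copied from the question.
my $data     = "CTGTCAGCNNNCCGGTTTTCAAGNNGAGCACACACCAAAAATGCACCAAAGCTTNACATCCATACAAA";
my $expected = "C T GTCAGC NNN C CG GTTTTCAAG NN G AGCA CA CA C CAAAAA T GCA C CAAAGC TT N A CA TC CA T A CAAA";

for my $regex ($with_branch, $with_pair) {
    my $got = join " ", chunks($regex, $data);
    print $got eq $expected ? "matches the wanted output\n" : "differs: $got\n";
}
```

    Both variants assume the data is a single line without whitespace, since "." does not match a newline and a space is not part of any allowed pair.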

    This doesn't work for multi-line data. You can either first remove all the newlines from the data, or if you don't want to have the whole file in memory at once, you can do this "inching along" process manually. Take one character at a time, keeping track of the last one you've seen as well. If the last two characters are an allowed sequence, then add the new character to a buffer. If the last two characters are disallowed, then print the buffer (it is a maximal allowed string), and restart the buffer starting with this new character.

    Update: Something like this:

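    my @allowed = qw[ AA AG GC GT CA CG TT TC NN ];
    my %allowed = map { $_ => 1 } @allowed;

    my $buf = '';
    # Read STDIN one character at a time; whitespace is skipped so that
    # a line break does not split an otherwise allowed pair.
    while ( defined( my $c = getc(STDIN) ) ) {
        next if $c =~ /\s/;
        if ( $buf eq '' or $allowed{ substr($buf, -1) . $c } ) {
            $buf .= $c;
        }
        else {
            print "$buf\n";
            $buf = $c;
        }
    }
    print "$buf\n" if length $buf;    # don't forget last one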
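
    To try the character-at-a-time idea on multi-line data without a real file, the loop can be wrapped in a function and fed an in-memory filehandle. This is only a sketch: the name split_stream and the short three-line test input are invented for the example, and it assumes whitespace in the input carries no meaning.

```perl
use strict;
use warnings;

# Allowed pairs from the question, plus NN so runs of N stay together.
my %allowed = map { $_ => 1 } qw[ AA AG GC GT CA CG TT TC NN ];

# Hypothetical helper: read a filehandle one character at a time and
# return the maximal chunks in which every adjacent pair is allowed.
sub split_stream {
    my ($fh) = @_;
    my @chunks;
    my $buf = '';
    while ( defined( my $c = getc($fh) ) ) {
        next if $c =~ /\s/;    # ignore newlines and stray spaces
        if ( $buf eq '' || $allowed{ substr($buf, -1) . $c } ) {
            $buf .= $c;        # still inside an allowed run
        }
        else {
            push @chunks, $buf;    # disallowed pair: close the chunk
            $buf = $c;
        }
    }
    push @chunks, $buf if length $buf;    # last chunk
    return @chunks;
}

# Made-up multi-line input, including a trailing space on one line.
my $text = "CTGTCAGCNNNCCGG\nTTTTCAAGNNGAG \nCACA\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
my @result = split_stream($fh);
close $fh;

print join(" ", @result), "\n";
```

    Because only one character of history is kept, the whole file never has to be in memory, and the G at the end of the first line joins the T at the start of the second.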

    blokhead

      Friends,

      I have run into a problem with the input below. The last character G of the first line and the first character C of the next line form an @allowed pair, but in the output G and C are split apart. This is because there is a space after the last character G.

      my @allowed = qw[ AA AG GC GT CA CG TT TC ];
      my $allowed = join "|", @allowed;
      my $regex = qr/ N+ | (?: (?=$allowed) . )* . /x;

      $data="TGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGATAG C";
      print "$_\n" for $data =~ m/$regex/g;

      That is, instead of printing GC, I get

      G C
      which is wrong. Please suggest a solution.

      Edit: g0n - code tags
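
      One possible way around the stray space, assuming whitespace never carries meaning in the sequence data, is to strip all whitespace before matching. The sketch below uses a shortened, made-up input with the same trailing "G", space, line break, "C" pattern; the variable names $clean and @chunks are illustrative.

```perl
use strict;
use warnings;

my @allowed = qw[ AA AG GC GT CA CG TT TC ];
my $allowed = join "|", @allowed;
my $regex   = qr/ N+ | (?: (?=$allowed) . )* . /x;

# Shortened example: a space and a line break sit between G and C.
my $data = "GGATAG \nC";

# Copy the data and remove every whitespace character from the copy.
( my $clean = $data ) =~ s/\s+//g;

# Match on the cleaned string, so G and C are adjacent again.
my @chunks = $clean =~ m/$regex/g;
print "$_\n" for @chunks;
```

      With the whitespace gone, the final G and C are seen as the allowed pair GC and stay in the same chunk.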

Re: Extracting a substring according to some criteria
by Skeeve (Parson) on Oct 19, 2005 at 11:42 UTC
    Do you want us to do your homework? ;-) Show us some code and most monks will be more than willing to solve any problem you encountered.

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e