ihperlbeg has asked for the wisdom of the Perl Monks concerning the following question:

I am reading the following sequence from a file:

.........RKRMMWW*VWMWRYHDWMH*HR*DRMDMWHMWYVMWVRWMVBHWKVYWSMHYWY*HWVMVSKDHMDBYKMWRSMDSD*...**Y*WD*VWDRYHHYRYKRWWDDKDDH*DV**HYW*RW*WMYMRV*BWB

Each character in the sequence stands for something. I am interested in locating '*'. I want to split the above sequence (its length could be in the hundreds) into subsequences 10 to 25 characters long, chosen so that each subsequence covers as many '*' as possible; failing that, each subsequence should contain at least one '*'. Finally, I want to print the subsequences.

I am just starting in Perl. I have some idea about basic search patterns, but I would really appreciate any kind of help here!

Replies are listed 'Best First'.
Re: How to parse string to substrings based on character occurence in the string
by ikegami (Patriarch) on Mar 09, 2010 at 05:45 UTC

    Can the subsequences overlap?

    If not, what should

    aaaaaaaaaaa*aa*aaaaaaaaaaaaaaa*aa*aaaaaaaaaaa

    return? The following all match some interpretation of your spec:

    aaaaaaaaaaa*aa*aaaaaaaa 2
    aaaaaaa*aa*aaaaaaaaaaa 2

    or

    aaaaaaaaaaa*aa*aaaaaaaaaa 2
    aaaaa*aa*aaaaaaaaaaa 2

    or

    aaaaaaaaaa 0
    a*aa*aaaaaaaaaaaaaaa*aa*a 4
    aaaaaaaaaa 0

    or

    aaaaaaaaaaa 0
    *aa*aaaaaaaaaaaaaaa*aa* 4
    aaaaaaaaaaa 0
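
The numbers beside each piece above are its '*' count. As a minimal sketch, assuming a helper named `count_stars` (a hypothetical name, not from the thread), the counts and lengths of any proposed split can be checked like this:

```perl
use strict;
use warnings;

# Hypothetical helper: count the '*' characters in a string.
# tr/// with an empty replacement list counts matches without modifying $s.
sub count_stars {
    my ($s) = @_;
    return ( $s =~ tr/*// );
}

# The first candidate split from the post above.
my @pieces = ( 'aaaaaaaaaaa*aa*aaaaaaaa', 'aaaaaaa*aa*aaaaaaaaaaa' );

printf "%s %d (length %d)\n", $_, count_stars($_), length($_) for @pieces;
```

Printing the length next to the count makes it easy to see whether a piece falls inside the 10-25 range.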
Re: How to parse string to substrings based on character occurence in the string
by BrowserUk (Patriarch) on Mar 09, 2010 at 05:31 UTC

    What's more important?

    1. That the subsequences contain as many * as possible?
    2. That they are as long as possible? (Or short?)
    3. Must all the subsequences recombine to form the original string?

      E.g. if there is a 50 char section without a *, do you make the subsequences longer to cover it, or omit a bit of the string from the set of substrings?

    4. Is the range 10-25 hard coded, or could you accept 1 or 2 shorter or longer?

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      1. Yes, each subsequence should contain as many * as possible. That is the main priority.

      2. The range is 10-25. As long as the length of the subsequence falls in this range, it is accepted.

      3. No, not required.

      If there is a 50 char section without a *, do you make the subsequences longer to cover it? No.

      Omit a bit of the string from the set of substrings? Yes.

      4. Yes, but only if it is 1 or 2 shorter or longer.

Re: How to parse string to substrings based on character occurence in the string
by BioLion (Curate) on Mar 09, 2010 at 09:47 UTC

    It would really help us to help you in the right direction if you could give us some input and the accompanying expected/desired output. <plug>Also check out Perl and Bioinformatics</plug>

    Just a something something...

      Let me clarify the input and the output that I want.

      Input: a sequence

      .........RKRMMWW*VWMWRYHDWMH*HR*DRMDMWHMWYVMWVRWMVBHWKVYWSMHYWY*HWVMVSKDHMDBYKMWRSMDSD*...**Y*WD*VWDRYHHYRYKRWWDDKDDH*DV**HYW*RW*WMYMRV*BWBWWDMVSYWDBDWWYSMKW*YRVWVYYRMV*KRK*WWDMRMWR*KR**YWHHWH...DYD*MWKKKKWS

      Here are the potential subsequences that I am looking for from this sequence.

      Output, in the order of occurrence in the sequence (from left to right):

      1. *VWMWRYHDWMH*HR* (3 *, 16 length)
      2. *HWVMVSKDHMDBYKMWRSMDSD* (2 *, 24 length)
      3. *...**Y*WD* (5 *, 11 length)
      4. *HDV**HYW*RW*WMYMRV* (6 *, 20 length)
      5. *YRVWVYYRMV*KRK*WWDMRMWR* (4 *, 25 length)
      6. *KRK*WWDMRMWR*KR** (5 *, 18 length)
      7. R*KR**YWHHWH...DYD* (4 *, 19 length)

      does this make sense?

      I do not want the subsequences to overlap necessarily. The number of * matters more than the length. For example, a subseq with 6 * and length 10 is as good as a subseq with 6 * and length 20. So a subseq may be as short or as long as it likes, provided it has the maximum number of * and its length stays in the range 10-25.
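
One possible reading of the rules above can be sketched as follows. The assumptions are mine, not the poster's: every candidate starts and ends on a '*', is 10 to 25 characters long, and for each starting '*' only the longest such span is kept (a longer span that still ends on '*' can never hold fewer '*'). The name `star_spans` and its parameters are hypothetical.

```perl
use strict;
use warnings;

# Sketch under the stated assumptions: for each '*' taken as a start,
# return the longest span of $min..$max chars that also ends on a '*'.
sub star_spans {
    my ( $seq, $min, $max ) = @_;

    # Collect the offset of every '*' in the sequence.
    my @pos;
    push @pos, pos($seq) - 1 while $seq =~ /\*/g;

    my @spans;
    for my $i ( 0 .. $#pos ) {
        my $best;
        for my $j ( $i + 1 .. $#pos ) {
            my $len = $pos[$j] - $pos[$i] + 1;
            last if $len > $max;          # later stars are even further away
            $best = $j if $len >= $min;   # remember the furthest valid end
        }
        next unless defined $best;
        push @spans, {
            text  => substr( $seq, $pos[$i], $pos[$best] - $pos[$i] + 1 ),
            stars => $best - $i + 1,
        };
    }
    return @spans;
}

# The start of the input sequence from the post.
my $seq = '.........RKRMMWW*VWMWRYHDWMH*HR*DRMDMWHMW';

printf "%s (%d *, %d length)\n", $_->{text}, $_->{stars}, length( $_->{text} )
    for star_spans( $seq, 10, 25 );
```

On this fragment the sketch reports `*VWMWRYHDWMH*HR*` with 3 * and length 16, which agrees with item 1 of the expected output. It makes no attempt to drop overlapping candidates, so on the full sequence it may list more spans than the seven given above.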

        does this make sense?

        Not completely.

        1. *HDV**HYW*RW*WMYMRV* doesn't appear in your input.

          Did you mean *DV**HYW*RW*WMYMRV* (6*, 19) or H*DV**HYW*RW*WMYMRV* (6*,20)?

          And if the latter, why?

        2. Why R*KR**YWHHWH...DYD* (4*, 19 length) instead of *KR**YWHHWH...DYD* (4*, 18 length)?

        Maybe this is something like your goal?

        #! perl -slw
        use strict;

        my $seq = 'RKRMMWW*VWMWRYHDWMH*HR*DRMDMWHMWYVMWVRWMVBHWKVYWSMHYWY*HWVMVSKD'
                . 'HMDBYKMWRSMDSD*...**Y*WD*VWDRYHHYRYKRWWDDKDDH*DV**HYW*RW*WMYMRV*BWB';

        # Slide a 25-char window along the sequence. Within each window, take the
        # leftmost greedy match that starts and ends on '*' and is 10-25 chars long.
        # %uniq ensures each distinct match is printed only once.
        my %uniq;
        substr( $seq, $_, 25 ) =~ m[(\*.{8,23}\*)]
            and ++$uniq{ $1 } == 1
            and print "'$1'"
                for 0 .. length( $seq )-1;

        __END__
        C:\test>827470
        '*VWMWRYHDWMH*HR*'
        '*HWVMVSKDHMDBYKMWRSMDSD*'
        '*...**Y*WD*'
        '*WD*VWDRYHHYRYKRWWDDKDDH*'
        '*VWDRYHHYRYKRWWDDKDDH*'
        '*VWDRYHHYRYKRWWDDKDDH*DV*'
        '*DV**HYW*RW*'
        '*DV**HYW*RW*WMYMRV*'
        '**HYW*RW*WMYMRV*'
        '*HYW*RW*WMYMRV*'
        '*RW*WMYMRV*'
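
Since the number of '*' was stated to be the main priority, the candidates found by the windowed match in the snippet above could also be ranked by their '*' count. This is only a sketch: `ranked_candidates` is a hypothetical name, and the tie-breaking order (shorter first, then alphabetical) is my assumption rather than anything specified in the thread.

```perl
use strict;
use warnings;

# Sketch: same 25-char sliding window and pattern as the snippet above,
# but each distinct match is stored with its '*' count for ranking.
sub ranked_candidates {
    my ($seq) = @_;

    my %stars;
    for my $start ( 0 .. length($seq) - 1 ) {
        next unless substr( $seq, $start, 25 ) =~ m[(\*.{8,23}\*)];
        my $hit = $1;
        $stars{$hit} = ( $hit =~ tr/*// );   # count '*' in this candidate
    }

    # Most '*' first; ties go to the shorter string, then alphabetical order.
    my @ranked = sort {
             $stars{$b} <=> $stars{$a}
          or length($a) <=> length($b)
          or $a cmp $b
    } keys %stars;

    return @ranked;
}

my $seq = 'RKRMMWW*VWMWRYHDWMH*HR*DRMDMWHMWYVMWVRWMVBHWKVYWSMHYWY*HWVMVSKD'
        . 'HMDBYKMWRSMDSD*...**Y*WD*VWDRYHHYRYKRWWDDKDDH*DV**HYW*RW*WMYMRV*BWB';

print "$_\n" for ranked_candidates($seq);
```

Ranking does not remove overlapping candidates; it only puts the ones richest in '*' at the top of the list.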
