How to parse string to substrings based on character occurence in the string

ihperlbeg has asked for the wisdom of the Perl Monks concerning the following question:

Re: How to parse string to substrings based on character occurence in the string by ikegami (Patriarch) on Mar 09, 2010 at 05:45 UTC
Can the subsequences overlap? If not, what should
`aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa`
return? The following all match some interpretation of your spec:
`aaaaaaaaaaaaaaaaaaaaa 2 aaaaaaaaaaaaaaaaaaaa 2`
or
`aaaaaaaaaaaaaaaaaaaaaaa 2 aaaaaaaaaaaaaaaaaa 2`
or
`aaaaaaaaaa 0 aaaaaaaaaaaaaaaaaaaaa 4 aaaaaaaaaa 0`
or
`aaaaaaaaaaa 0 aaaaaaaaaaaaaaaaaaa 4 aaaaaaaaaaa 0`
Re: How to parse string to substrings based on character occurence in the string by BrowserUk (Patriarch) on Mar 09, 2010 at 05:31 UTC
What's more important? That the subsequences contain as many * as possible? That they are as long as possible? (Or short?)
Must all the subsequences recombine to form the original string? E.g., if there is a 50-char section without a *, do you make the subsequences longer to cover it, or omit a bit of the string from the set of substrings?
Is the range 10-25 hard-coded, or could you accept 1 or 2 shorter or longer?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "I'd rather go naked than blow up my ass"
Re^2: How to parse string to substrings based on character occurence in the string by ihperlbeg (Novice) on Mar 09, 2010 at 14:48 UTC
1. Yes, the subsequences should contain as many * as possible. That is the main concern.
2. The range is 10-25. Any subsequence whose length falls in this range is accepted.
3. No, the subsequences do not have to recombine into the original string. If there is a 50-char section without a *, do not make the subsequences longer to cover it; omit that bit of the string from the set of substrings.
4. Yes, but only if it is 1 or 2 shorter or longer.
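The criteria in this reply (as many * as possible, length within 10-25, no need to cover the whole string) could be sketched as a brute-force scan over every window. This is only an illustration under assumptions: the helper name `best_window`, the tie-break rule (shorter window first, then leftmost), and the demo string are hypothetical and not taken from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return (start, length, stars) for the window of
# $min..$max characters that holds the most '*' characters.
# Assumed tie-break (not from the thread): shorter window, then leftmost.
# Returns (-1, 0, -1) if the string is shorter than $min.
sub best_window {
    my ( $seq, $min, $max ) = @_;
    my @best = ( -1, 0, -1 );    # start, length, star count
    for my $start ( 0 .. length($seq) - $min ) {
        for my $len ( $min .. $max ) {
            last if $start + $len > length($seq);
            my $window = substr( $seq, $start, $len );
            my $stars  = ( $window =~ tr/*// );    # tr counts without modifying
            @best = ( $start, $len, $stars )
                if $stars > $best[2]
                || ( $stars == $best[2] && $len < $best[1] );
        }
    }
    return @best;
}

# Made-up input: stars at offsets 4, 6, 9 and 26.
my $demo = 'AAAA*A*AA*' . ( 'A' x 16 ) . '*';
my ( $start, $len, $stars ) = best_window( $demo, 10, 25 );
print "start=$start length=$len stars=$stars\n";    # start=4 length=23 stars=4
```

The scan is quadratic in the worst case, which should be fine for sequences of a few hundred characters; it reports only the single best window, not every candidate.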
Re: How to parse string to substrings based on character occurence in the string by BioLion (Curate) on Mar 09, 2010 at 09:47 UTC
It would really help us to help you in the right direction if you could give us some input and the accompanying expected/desired output. < plug > Also check out Perl and Bioinformatics <\plug>
Just a something something...
Re^2: How to parse string to substrings based on character occurence in the string by ihperlbeg (Novice) on Mar 09, 2010 at 15:31 UTC
Let me clarify the input and the output that I want.
Input, a sequence:
`.........RKRMMWWVWMWRYHDWMHHRDRMDMWHMWYVMWVRWMVBHWKVYWSMHYWYHWVMVSKDHMDBYKMWRSMDSD...YWDVWDRYHHYRYKRWWDDKDDHDV*HYWRWWMYMRVBWBWWDMVSYWDBDWWYSMKWYRVWVYYRMVKRKWWDMRMWRKR*YWHHWH...DYDMWKKKKWS`
Here are the potential subsequences that I am looking for in this sequence, in order of occurrence (from left to right):
`1. VWMWRYHDWMHHR* (3 , 16 length)`
`2. HWVMVSKDHMDBYKMWRSMDSD* (2 , 24 length)`
`3. ...*YWD* (5 , 11 length)`
`4. HDV*HYWRWWMYMRV (6 , 20 length)`
`5. YRVWVYYRMVKRKWWDMRMWR* (4, 25 length)`
`6. KRKWWDMRMWRKR** (5 , 18 length)`
`7. RKR*YWHHWH...DYD (4, 19 length)`
Does this make sense? I do not necessarily want the subsequences to overlap. The number of * matters more than the length. For example, a subsequence with 6 * and length 10 is as good as one with 6 * and length 20. So a subsequence can be shorter or longer, as long as it has the maximum number of * and its length stays in the range 10-25.
Re^3: How to parse string to substrings based on character occurence in the string by BrowserUk (Patriarch) on Mar 09, 2010 at 21:59 UTC
"does this make sense?"
Not completely. `HDVHYWRWWMYMRV` doesn't appear in your input. Did you mean `DVHYWRWWMYMRV (6, 19)` or `HDV*HYWRWWMYMRV (6,20)`? And if the latter, why?
Why `RKR*YWHHWH...DYD (4, 19 length)` instead of `KR*YWHHWH...DYD (4, 18 length)`?
Maybe this is something like your goal?

    #! perl -slw
    use strict;

    my $seq = 'RKRMMWWVWMWRYHDWMHHRDRMDMWHMWYVMWVRWMVBHWKVYWSMHYWYHWVMVSKD'
            . 'HMDBYKMWRSMDSD...*YWDVWDRYHHYRYKRWWDDKDDHDV*HYWRWWMYMRVBWB';

    my %uniq;
    # Slide a 25-char window along the sequence. In each window, capture a
    # run that starts and ends with * (10-25 chars) and print it once.
    substr( $seq, $_, 25 ) =~ m[(\*.{8,23}\*)]
        and ++$uniq{ $1 } == 1
        and print "'$1'"
        for 0 .. length( $seq )-1;

    __END__
    C:\test>827470
    'VWMWRYHDWMHHR'
    'HWVMVSKDHMDBYKMWRSMDSD'
    '...*YWD'
    'WDVWDRYHHYRYKRWWDDKDDH'
    'VWDRYHHYRYKRWWDDKDDH'
    'VWDRYHHYRYKRWWDDKDDHDV'
    'DV*HYWRW'
    'DV*HYWRWWMYMRV'
    '*HYWRWWMYMRV'
    'HYWRWWMYMRV'
    'RWWMYMRV*'
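As a possible variant of the windowed approach in this reply, a zero-width lookahead lets a single regex pass visit every starting position, so overlapping candidates are all seen, and `tr` can report each candidate's * count in the "(count, length)" form used earlier in the thread. This is a minimal sketch, assuming a candidate must start and end with *. The helper name `star_candidates` and the demo string are made up for illustration and are not the OP's data.

```perl
use strict;
use warnings;

# Hypothetical helper: for each '*' in $seq, capture the longest substring
# that starts there, ends with '*', and is $min..$max characters long.
# Candidates may overlap. Returns [ offset, substring, star_count ] triples
# in left-to-right order.
sub star_candidates {
    my ( $seq, $min, $max ) = @_;
    my ( $lo, $hi ) = ( $min - 2, $max - 2 );    # chars between the end stars
    my @found;
    # The lookahead consumes nothing, so the next attempt starts one
    # character further on instead of skipping past the previous match.
    while ( $seq =~ /(?=(\*.{$lo,$hi}\*))/g ) {
        my $sub   = $1;
        my $start = $-[1];                       # offset of capture group 1
        my $stars = ( $sub =~ tr/*// );          # count the stars
        push @found, [ $start, $sub, $stars ];
    }
    return @found;
}

my $demo = 'AB*CDEFGHIJ*KLMNOPQRS*TU';    # made-up input
printf "%s (%d, %d length) at offset %d\n",
    $_->[1], $_->[2], length( $_->[1] ), $_->[0]
    for star_candidates( $demo, 10, 25 );
# *CDEFGHIJ*KLMNOPQRS* (3, 20 length) at offset 2
# *KLMNOPQRS* (2, 11 length) at offset 11
```

Because the quantifier is greedy, only the longest candidate per starting * is reported; the 10-character run from offset 2 to the second * is not listed separately. Whether shorter candidates should also be listed depends on how the OP's still-open questions are answered.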