Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

Thank you for all your previous help.

As you know, a protein sequence is a string of amino acids, for example:

AACCCDGYAEELPSWWYAOOLLLSSBBBDDD

I want to split it into overlapping pieces of a given length i. i=2: AA AC CC CC CD DG ... i=3: AAC ACC CCC CCD ... i=4: ... and so on.

I started to write it, but I thought it was impossible or beyond my abilities! Please help me.

Replies are listed 'Best First'.
Re: separation a string
by BrowserUk (Patriarch) on Nov 03, 2011 at 11:00 UTC

    Update: Improved the performance.

    sub ntuples{
        my( $n, $s ) = @_;
        my $b = $n -1;
        my $n2 = length( $s ) - $n +1;
        return unpack "(A$n X$b)$n2", $s;
    };;

    $genome = 'AACCCDGYAEELPSWWYAOOLLLSSBBBDDD';;

    print join ' - ', ntuples( $_, $genome ) for 2 .. 5;;
    AA - AC - CC - CC - CD - DG - GY - YA - AE - EE - EL - LP - PS - SW - WW - WY - YA - AO - OO - OL - LL - LL - LS - SS - SB - BB - BB - BD - DD - DD
    AAC - ACC - CCC - CCD - CDG - DGY - GYA - YAE - AEE - EEL - ELP - LPS - PSW - SWW - WWY - WYA - YAO - AOO - OOL - OLL - LLL - LLS - LSS - SSB - SBB - BBB - BBD - BDD - DDD
    AACC - ACCC - CCCD - CCDG - CDGY - DGYA - GYAE - YAEE - AEEL - EELP - ELPS - LPSW - PSWW - SWWY - WWYA - WYAO - YAOO - AOOL - OOLL - OLLL - LLLS - LLSS - LSSB - SSBB - SBBB - BBBD - BBDD - BDDD
    AACCC - ACCCD - CCCDG - CCDGY - CDGYA - DGYAE - GYAEE - YAEEL - AEELP - EELPS - ELPSW - LPSWW - PSWWY - SWWYA - WWYAO - WYAOO - YAOOL - AOOLL - OOLLL - OLLLS - LLLSS - LLSSB - LSSBB - SSBBB - SBBBD - BBBDD - BBDDD
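
The unpack template in ntuples() is terse, so here is a minimal sketch of how it seems to work: `A$n` reads $n characters, `X$b` steps back $n-1 bytes, and the parenthesised group is repeated once per window. The name `windows_unpack` and the guard against a window longer than the string are my own additions for illustration, not part of the post. It also assumes a plain ASCII sequence with no spaces, since `A` strips trailing spaces on unpack (`a` would keep them).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): same idea as ntuples(), but with
# the template built in a named variable so it can be inspected.
sub windows_unpack {
    my ( $n, $s ) = @_;
    my $count = length($s) - $n + 1;      # number of windows of width $n
    return () if $n < 1 or $count < 1;    # nothing to extract
    my $back = $n - 1;                    # bytes to step back after each read
    # A$n reads $n characters; X$back rewinds $back bytes (omitted when 0).
    my $step     = $back ? "A$n X$back" : "A$n";
    my $template = "($step)$count";       # e.g. "(A3 X2)5"
    return unpack $template, $s;
}

my @w = windows_unpack( 3, 'AACCCDG' );
print "@w\n";    # AAC ACC CCC CCD CDG
```

For n=3 on a 7-character string the template is `(A3 X2)5`: each pass reads three characters and rewinds two, so the read position moves forward by exactly one character per window.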

Re: separation a string
by Ratazong (Monsignor) on Nov 03, 2011 at 09:39 UTC

    substr could be your friend: it gets you substrings, starting at a defined offset and with a defined length. Now all you have to do is to create suitable loops for the required offsets and lengths.

    HTH, Rata

      Yes, substr was the only thing I used, but my bigger problem is that I don't want to just split the string into separate chunks of 2 or 3. I need all the 2-letter or 3-letter words, as in my example. Thank you again.

        I need all the 2-letter or 3-letter words

        You can call substr() repeatedly in a loop (as Ratazong pointed out), which gives you all 2/3/...-substrings.  I'm not 100% sure what your task is, but judging from the sample output, you seem to want something like this:

        my $s = "AACCCDGYAEELPSWWYAOOLLLSSBBBDDD";

        for my $len (2..4) {
            my @parts;
            for my $offs (0..length($s)-$len) {
                push @parts, substr($s, $offs, $len);
            }
            print "i=$len: @parts\n";
        }

        __END__
        i=2: AA AC CC CC CD DG GY YA AE EE EL LP PS SW WW WY YA AO OO OL LL LL LS SS SB BB BB BD DD DD
        i=3: AAC ACC CCC CCD CDG DGY GYA YAE AEE EEL ELP LPS PSW SWW WWY WYA YAO AOO OOL OLL LLL LLS LSS SSB SBB BBB BBD BDD DDD
        i=4: AACC ACCC CCCD CCDG CDGY DGYA GYAE YAEE AEEL EELP ELPS LPSW PSWW SWWY WWYA WYAO YAOO AOOL OOLL OLLL LLLS LLSS LSSB SSBB SBBB BBBD BBDD BDDD
Re: separation a string
by AnomalousMonk (Archbishop) on Nov 03, 2011 at 12:06 UTC

    BrowserUk's unpack approach is probably a bit faster, but here's the 'standard' regex approach:

    >perl -wMstrict -le
    "my $s = 'AACCCDGYAEELPSWWYA';
     ;;
     for my $n (2 .. 5) {
         my @subseqs = $s =~ m{ (?= (.{$n})) }xmsg;
         print qq{n $n: @subseqs};
         }
    "
    n 2: AA AC CC CC CD DG GY YA AE EE EL LP PS SW WW WY YA
    n 3: AAC ACC CCC CCD CDG DGY GYA YAE AEE EEL ELP LPS PSW SWW WWY WYA
    n 4: AACC ACCC CCCD CCDG CDGY DGYA GYAE YAEE AEEL EELP ELPS LPSW PSWW SWWY WWYA
    n 5: AACCC ACCCD CCCDG CCDGY CDGYA DGYAE GYAEE YAEEL AEELP EELPS ELPSW LPSWW PSWWY SWWYA
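
It may not be obvious why the capture is wrapped in a lookahead. The following sketch contrasts a plain capture with the lookahead form on a shortened sequence. The variable names and the shortened input are chosen here only for illustration.

```perl
use strict;
use warnings;

my $s = 'AACCCDGY';
my $n = 2;

# Plain capture: each match consumes $n characters, so the pieces do not overlap.
my @chunks = $s =~ m{ (.{$n}) }xmsg;

# Capture inside a zero-width lookahead: nothing is consumed, so the regex
# engine advances one character per match and the windows overlap.
my @windows = $s =~ m{ (?= (.{$n}) ) }xmsg;

print "chunks:  @chunks\n";     # AA CC CD GY
print "windows: @windows\n";    # AA AC CC CC CD DG GY
```

The plain capture gives the "2-2" split the original poster did not want. The lookahead version gives every overlapping substring of length $n.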