in thread how do i count the 22 selected di-peptides from a multifasta file separately for each sequence

Sir, dipeptides are combinations of two amino acids. There are generally 20 amino acids, so 400 possible combinations. Here I want to find, for each sequence of the input (multifasta) file, how many of the 22 selected dipeptides are present and how many are absent. For example:

Input file:
>seq1, complete cds
AAALVDENEC
>seq2, complete cds
AATLVDEGDG

Observed output:
>seq1, complete cds sum = 4, abs = 18,
>seq2, complete cds sum = 2, abs = 20,

Expected output:
>seq1, complete cds sum = 5, abs = 17,
>seq2, complete cds sum = 3, abs = 19,

I think the wrong result comes from the while loop, which fails to count the overlapping dipeptides present in the sequence.

Re^3: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence
by Corion (Patriarch) on Apr 26, 2015 at 09:46 UTC

    If you want to count overlapping matches, a plain regular expression as you wrote it isn't the easiest way to approach the problem.

    Personally, I would simply iterate over the string either for each character or by resetting pos and using \G as documented in perlre:

    use strict;
    my $line = 'AAALVDENEC';
    while ( $line =~ /\G.*?(AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|NV|VP|FV|SS|WK|KK)/igc ) {
        printf "Matched [%s] at %d\n", $1, pos($line);
        pos($line)--;    # step back one character so the next match may overlap
    }
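The other option mentioned above, iterating over the string character by character, could look roughly like the sketch below. It assumes the 22 dipeptides are kept as hash keys and checks every two-character window with `substr`; the helper name `count_dipeptides` is made up for illustration and is not from the original post.

```perl
use strict;
use warnings;

# The 22 dipeptides from the thread, stored as hash keys for quick lookup.
my %wanted = map { $_ => 1 } qw(
    AA AL DA DE DV VD DW QD SD HD ED DY VE EN EI KE NV VP FV SS WK KK
);

# Hypothetical helper: returns a hash ref of dipeptide => occurrence count.
# Every two-character window is inspected, so overlapping hits are counted.
sub count_dipeptides {
    my ($seq) = @_;
    $seq = uc $seq;
    my %count;
    for my $i ( 0 .. length($seq) - 2 ) {
        my $pair = substr $seq, $i, 2;
        $count{$pair}++ if $wanted{$pair};
    }
    return \%count;
}

for my $seq ( 'AAALVDENEC', 'AATLVDEGDG' ) {
    my $count = count_dipeptides($seq);
    print "$seq: ", join( ', ', map { "$_=$count->{$_}" } sort keys %$count ), "\n";
}
```

Because the window only ever moves one character at a time, no regex position bookkeeping is needed.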
Re^3: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence
by Anonymous Monk on Apr 26, 2015 at 10:58 UTC

    Here's one way to match overlapping patterns:

    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1124725
    use Data::Dump qw(pp);
    use strict;
    $| = 1;
    my %answer;
    for my $line ( 'AAALVDENEC', 'AATLVDEGDG' ) {
        $answer{$line}{$_}++ for $line =~ /(?=(AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|NV|VP|FV|SS|WK|KK))/g;
    }
    pp \%answer;

    which produces:

    {
      AAALVDENEC => { AA => 2, AL => 1, DE => 1, EN => 1, VD => 1 },
      AATLVDEGDG => { AA => 1, DE => 1, VD => 1 },
    }

    The inner hashes have 5 and 3 keys respectively, which are the 5 and 3 you are looking for.
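To get from that hash to the "sum" and "abs" figures in the question, one possibility is to count the distinct keys and subtract from 22. This is only a sketch under that reading of the question (sum = number of distinct dipeptides present, not total occurrences); the helper name `present_absent` is invented here.

```perl
use strict;
use warnings;

my @dipeptides = qw(
    AA AL DA DE DV VD DW QD SD HD ED DY VE EN EI KE NV VP FV SS WK KK
);
my $alternation = join '|', @dipeptides;

# Hypothetical helper: returns (sum, abs) for one sequence, where sum is the
# number of distinct listed dipeptides found and abs is the number not found.
sub present_absent {
    my ($seq) = @_;
    my %seen;
    # The lookahead consumes nothing, so matches may overlap.
    $seen{$_}++ for $seq =~ /(?=($alternation))/g;
    my $sum = keys %seen;
    return ( $sum, @dipeptides - $sum );
}

for my $seq ( 'AAALVDENEC', 'AATLVDEGDG' ) {
    my ( $sum, $abs ) = present_absent($seq);
    print "$seq sum = $sum, abs = $abs\n";
}
```

For the two sample sequences this prints sum = 5, abs = 17 and sum = 3, abs = 19.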

      Thank you, it worked. Thanks for the help.
Re^3: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence
by Anonymous Monk on Apr 26, 2015 at 20:48 UTC

    There is an alternative to matching with overlapping patterns:

    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1124725
    use strict;
    my @dipeptides = split /\|/, 'AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|NV|VP|FV|SS|WK|KK';
    for my $line ( 'AAALVDENEC', 'AATLVDEGDG' ) {
        my $sum = grep $line =~ $_, @dipeptides;
        my $abs = @dipeptides - $sum;
        print "$line sum: $sum abs: $abs\n";
    }
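Since the original question was about a multifasta file, here is a rough sketch of how the same per-dipeptide test might be applied record by record. It makes a few assumptions not stated in the thread: header lines start with `>`, a sequence may be wrapped over several lines, and the output should look like the "Expected output" in the question. The helper name `fasta_report` is made up, and `index` is swapped in for the regex match, which should give the same result for plain two-letter strings. The demo reads from an in-memory string instead of a real file.

```perl
use strict;
use warnings;

my @dipeptides = qw(
    AA AL DA DE DV VD DW QD SD HD ED DY VE EN EI KE NV VP FV SS WK KK
);

# Hypothetical helper: reads FASTA records from a filehandle and returns one
# "header sum = N, abs = M" string per record.
sub fasta_report {
    my ($fh) = @_;
    my @records;    # each entry is [ header, sequence ]
    while ( my $line = <$fh> ) {
        $line =~ s/\s+\z//;    # strip newline and trailing whitespace
        next unless length $line;
        if ( $line =~ /^>/ ) {
            push @records, [ $line, '' ];
        }
        elsif (@records) {
            # Join wrapped sequence lines so dipeptides spanning a line
            # break are still seen.
            $records[-1][1] .= uc $line;
        }
    }
    my @report;
    for my $record (@records) {
        my ( $header, $seq ) = @$record;
        # grep in scalar context counts the dipeptides found at least once.
        my $sum = grep { index( $seq, $_ ) >= 0 } @dipeptides;
        my $abs = @dipeptides - $sum;
        push @report, "$header sum = $sum, abs = $abs";
    }
    return @report;
}

my $fasta = ">seq1, complete cds\nAAALVDENEC\n>seq2, complete cds\nAATLVDEGDG\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
print "$_\n" for fasta_report($fh);
close $fh;
```

To run it on a real file, open that file for reading and pass its handle to `fasta_report` instead.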