in reply to how do i count the 22 selected di-peptides from a multifasta file separately for each sequence

Please help us help you better!

I don't know what a di-peptide is and how it appears in your program. Maybe you can show us relevant input data and the output data you get. Please also explain what you expect as output data.


Re^2: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence
by SOMEN (Initiate) on Apr 26, 2015 at 09:01 UTC
    Sir, dipeptides are combinations of two amino acids. There are generally 20 amino acids, so 400 possible combinations. Here I want to find the count of the 22 selected combinations that are present, and the number of di-peptides that are absent, in each sequence of the input (multifasta) file. For example:

    Input file:
    >seq1, complete cds
    AAALVDENEC
    >seq2, complete cds
    AATLVDEGDG

    Observed output:
    >seq1, complete cds
    sum = 4, abs = 18,
    >seq2, complete cds
    sum = 2, abs = 20,

    Expected output:
    >seq1, complete cds
    sum = 5, abs = 17,
    >seq2, complete cds
    sum = 3, abs = 19,

    I think the wrong result comes from the while loop, which fails to count the overlapping di-peptides present in the sequence.

      If you want to count overlapping matches, a plain regular expression as you wrote it isn't the easiest way to approach the problem.
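To illustrate why the plain scan undercounts: a /g match resumes after the end of the previous match, so a di-peptide that shares a character with the previous hit is skipped. The sketch below is an assumption about what the original loop did (the poster's code is not shown in this thread); the sub name plain_scan is made up for illustration.

```perl
use strict;
use warnings;

# The 22 di-peptides from the thread, joined into one alternation.
my $alt = join '|', qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN EI KE NV VP FV SS WK KK);

# Hypothetical helper: count the distinct di-peptides found by a plain
# /g scan, which never revisits characters it has already consumed.
sub plain_scan {
    my ($seq) = @_;
    my %seen;
    $seen{$1}++ while $seq =~ /($alt)/g;
    return scalar keys %seen;
}

print "$_: ", plain_scan($_), "\n" for 'AAALVDENEC', 'AATLVDEGDG';
```

For the two sample sequences this gives 4 and 2, the same as the observed output: in AAALVDENEC the match on VD consumes the D, so DE is never seen.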

      Personally, I would simply iterate over the string either for each character or by resetting pos and using \G as documented in perlre:

      use strict;

      my $line = 'AAALVDENEC';
      while ( $line =~ /\G.*?(AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|NV|VP|FV|SS|WK|KK)/igc ) {
          printf "Matched [%s] at %d\n", $1, pos($line);
          pos($line)--;    # step back one character so the next match may overlap
      }
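The other option mentioned, iterating over the string character by character, could look roughly like this. It is only a sketch: the sub name count_by_window and the lookup hash %wanted are invented here, and uc is used to mimic the /i flag of the regex version.

```perl
use strict;
use warnings;

# Lookup table of the 22 selected di-peptides.
my %wanted = map { $_ => 1 }
    qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN EI KE NV VP FV SS WK KK);

# Hypothetical helper: slide a two-character window over the sequence
# and return a hash ref of di-peptide => number of occurrences.
# Adjacent windows share a character, so overlaps are counted.
sub count_by_window {
    my ($seq) = @_;
    my %count;
    for my $i ( 0 .. length($seq) - 2 ) {
        my $pair = uc substr $seq, $i, 2;
        $count{$pair}++ if $wanted{$pair};
    }
    return \%count;
}

my $c = count_by_window('AAALVDENEC');
print join( ', ', map { "$_=$c->{$_}" } sort keys %$c ), "\n";
```

This avoids regex position bookkeeping entirely, at the cost of being tied to fixed-length (two-character) patterns.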

      Here's one way to match overlapping patterns:

      #!/usr/bin/perl
      # http://perlmonks.org/?node_id=1124725
      use Data::Dump qw(pp);
      use strict;

      $| = 1;
      my %answer;
      for my $line ( 'AAALVDENEC', 'AATLVDEGDG' ) {
          # The lookahead (?=...) matches without consuming characters,
          # so overlapping di-peptides are all found.
          $answer{$line}{$_}++ for $line =~
              /(?=(AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|NV|VP|FV|SS|WK|KK))/g;
      }
      pp \%answer;

      which produces:

      {
        AAALVDENEC => { AA => 2, AL => 1, DE => 1, EN => 1, VD => 1 },
        AATLVDEGDG => { AA => 1, DE => 1, VD => 1 },
      }

      That has the 5 and 3 you are looking for: the number of distinct di-peptides (hash keys) found in each sequence.

        Thank you it worked.......thanks for the help

      There is an alternative to matching with overlapping patterns:

      #!/usr/bin/perl
      # http://perlmonks.org/?node_id=1124725
      use strict;

      my @dipeptides = split /\|/,
          'AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|NV|VP|FV|SS|WK|KK';
      for my $line ( 'AAALVDENEC', 'AATLVDEGDG' ) {
          # Test each di-peptide on its own; grep in scalar context
          # returns how many of them occur in the sequence.
          my $sum = grep $line =~ $_, @dipeptides;
          my $abs = @dipeptides - $sum;
          print "$line sum: $sum abs: $abs\n";
      }
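The snippets in this thread use hard-coded sequences, while the question was about a multifasta file. A minimal sketch of how the per-dipeptide test might be applied to each FASTA record follows. The sub names read_fasta and present_absent are invented for illustration, and the demo reads from an in-memory string; a real script would presumably open the input file instead.

```perl
use strict;
use warnings;

my @dipeptides = qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN EI KE NV VP FV SS WK KK);

# Hypothetical helper: read FASTA records from a filehandle and
# return a list of [header, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        $line =~ s/\s+\z//;              # strip newline and trailing whitespace
        next unless length $line;
        if ( $line =~ /^>/ ) {
            push @records, [ $line, '' ];
        }
        elsif (@records) {
            $records[-1][1] .= $line;    # a sequence may span several lines
        }
    }
    return @records;
}

# Hypothetical helper: return (present, absent) counts for one sequence.
sub present_absent {
    my ($seq) = @_;
    my $sum = grep { index( $seq, $_ ) >= 0 } @dipeptides;
    return ( $sum, @dipeptides - $sum );
}

# Demo input held in a string, using the sample data from the question.
my $demo = ">seq1, complete cds\nAAALVDENEC\n>seq2, complete cds\nAATLVDEGDG\n";
open my $fh, '<', \$demo or die "Cannot open in-memory file: $!";
for my $rec ( read_fasta($fh) ) {
    my ( $sum, $abs ) = present_absent( $rec->[1] );
    print "$rec->[0]\nsum = $sum, abs = $abs\n";
}
close $fh;
```

With the sample input this prints sum = 5, abs = 17 for seq1 and sum = 3, abs = 19 for seq2, matching the expected output in the question. index is used instead of a regex because each di-peptide is a fixed string.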