SOMEN has asked for the wisdom of the Perl Monks concerning the following question:

this code fails to count the dipeptides globally. gives count less than the actual value. Unable to figure it out...anyone plese help......thanks in advance

#!/usr/bin/perl use strict; use warnings; use List::Util qw(sum); print "Please type the filename of the DNA sequence MMta: "; my $file = <STDIN>; chomp $file; # Use three arg open and test for errors open my $fh_in, '<', $file or die "Unable to open '$file': $!"; my %count; my $seq; # loop on every lines of input file while(my $line = <$fh_in>) { chomp $line; # Find the sequence name $seq = $1 if $line =~ /(>.*?cds)/; # Count while($line =~ /(AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|N +V|VP|FV|SS|WK|KK)/ig) { $count{$seq}{$1}++; } } open my $fh_out, '>', 'countbase.txt' or die "Unable to open 'countbas +e.txt': $!"; # Display the result separately for each sequence foreach(keys %count) { my $sum = sum(values %{$count{$_}}); my $abs = 22 - scalar keys %{$count{$_}}; print $fh_out $_,"\nsum = ",$sum,"\nabs = ", $abs,"\n"; }
  • Comment on how do i count the 22 selected di-peptides from a multifasta file separately for each sequence
  • Download Code

Replies are listed 'Best First'.
Re: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence
by Corion (Patriarch) on Apr 26, 2015 at 08:39 UTC

    Please help us help you better!

    I don't know what a di-peptide is and how it appears in your program. Maybe you can show us relevant input data and the output data you get. Please also explain what you expect as output data.

      Sir dipeptides are the combination of two amino acids. Generally there are 2o amino acids, so 400 combinations. Here i want to find out the count of the 22 combinations present and number of di-peptides absent in each sequence of the input (multifasta) file. For eg. Input file : >seq1, complete cds AAALVDENEC >seq2, complete cds AATLVDEGDG observed output: >seq1, complete cds sum = 4, abs = 18, >seq2, complete cds sum = 2, abs = 20, Expected output: >seq1, complete cds sum = 5, abs = 17, >seq2, complete cds sum = 3, abs = 19, I THINK THE ERROR RESULT COMES DUE TO THE WHILE LOOP WHICH FAILS TO COUNT THE OVERLAPPING DI-PEPTIDE PRESENT IN THE SEQUENCE

        If you want to count overlapping matches, a plain regular expression as you wrote it isn't the easiest way to approach the problem.

        Personally, I would simply iterate over the string either for each character or by resetting pos and using \G as documented in perlre:

        use strict; my $line= 'AAALVDENEC'; while( $line =~ /\G.*?(AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE +|NV|VP|FV|SS|WK|KK)/igc ) { print sprintf 'Matched [%s] at %d', $1, pos($line); pos( $line )--; }
        A reply falls below the community's threshold of quality. You may see it by logging in.

        Here's one way to match overlapping patterns:

        #!/usr/bin/perl # http://perlmonks.org/?node_id=1124725 use Data::Dump qw(pp); use strict; $| = 1; my %answer; for my $line ( 'AAALVDENEC', 'AATLVDEGDG' ) { $answer{$line}{$_}++ for $line =~ /(?=(AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|NV|VP|FV|SS|W +K|KK))/g; } pp \%answer;

        which produces:

        { AAALVDENEC => { AA => 2, AL => 1, DE => 1, EN => 1, VD => 1 }, AATLVDEGDG => { AA => 1, DE => 1, VD => 1 }, }

        That has the 5 and 3 you are looking for.

        There is an alternative to matching with overlapping patterns:

        #!/usr/bin/perl # http://perlmonks.org/?node_id=1124725 use strict; my @dipeptides = split /\|/, 'AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|NV|VP|FV|SS|WK|KK'; for my $line ( 'AAALVDENEC', 'AATLVDEGDG' ) { my $sum = grep $line =~ $_, @dipeptides; my $abs = @dipeptides - $sum; print "$line sum: $sum abs: $abs\n"; }
Re: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence
by kcott (Archbishop) on Apr 26, 2015 at 10:48 UTC

    G'day SOMEN,

    Welcome to the Monastery.

    [Assumption: $sum is the global dipeptide count.]

    Your problem lies here:

    my $sum = sum(values %{$count{$_}});

    You're counting the number of unique dipeptides found; not a sum of their values.

    Look at how you populate %count:

    $count{$seq}{$1}++

    So, %count has a series of keys representing sequences ($seq). Each of those has a series of keys representing a dipeptide ($_).

    When you access %count, i.e. with $count{$_}, you're only drilling down to the sequence. You need to go one more level to access the dipeptides.

    -- Ken

Re: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence
by hdb (Monsignor) on Apr 26, 2015 at 14:11 UTC

    Why bother searching for specific pairs of letters? Just count them all, there are not so many of them, and then pick the counts you are interested in:

    use strict; use warnings; my $line= 'AAALVDENEC'; my %counts; $counts{substr $line, $_, 2}++ for 0..length($line)-2; my @wanted = qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN EI KE NV VP +FV SS WK KK); for (@wanted) { print "$_ => $counts{$_}\n" if defined $counts{$_}; }
Re: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence
by AnomalousMonk (Archbishop) on Apr 26, 2015 at 14:54 UTC

    It seems to me that the count of individual di-peptides in each sequence does not matter, but only the number of unique (overlapping) di-peptides found. If I understand this correctly, here's another approach, including some approaches from other replies (with some debug statements):

    c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump; ;; use constant DIPEPS => qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN + EI KE NV VP FV SS WK KK); use constant N_DIPEPS => scalar DIPEPS; ;; my ($dipep) = map qr{ (?i) (?: $_) }xms, join '|', DIPEPS ; print $dipep; ;; my %count; for my $seq (qw(AAALVDENEC AATLVDEGDG)) { $count{$seq}{$1} = 1 while $seq =~ m{ (?= ($dipep)) }xmsg; } dd \%count; ;; for my $seq (keys %count) { my $sum = keys %{ $count{$seq} }; my $abs = N_DIPEPS - $sum; print qq{$seq: sum = $sum, abs = $abs}; } " (?^msx: (?i) (?: AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|NV|VP +|FV|SS|WK|KK) ) { AAALVDENEC => { AA => 1, AL => 1, DE => 1, EN => 1, VD => 1 }, AATLVDEGDG => { AA => 1, DE => 1, VD => 1 }, } AAALVDENEC: sum = 5, abs = 17 AATLVDEGDG: sum = 3, abs = 19

    Update: And another approach, based on the same (mis?)understanding that only the number of unique di-peptides matter and not their individual counts:

    c:\@Work\Perl\monks>perl -wMstrict -le "use List::MoreUtils qw(uniq); use Data::Dump; ;; use constant DIPEPS => qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN + EI KE NV VP FV SS WK KK); use constant N_DIPEPS => scalar DIPEPS; ;; my ($dipep) = map qr{ (?i) (?: $_) }xms, join '|', DIPEPS ; ;; my %count; for my $seq (qw(AAALVDENEC AATLVDEGDG)) { $count{$seq} = uniq $seq =~ m{ (?= ($dipep)) }xmsg; } dd \%count; ;; for my $seq (keys %count) { my $sum = $count{$seq}; my $abs = N_DIPEPS - $sum; print qq{$seq: sum = $sum, abs = $abs}; } " { AAALVDENEC => 5, AATLVDEGDG => 3 } AAALVDENEC: sum = 5, abs = 17 AATLVDEGDG: sum = 3, abs = 19


    Give a man a fish:  <%-(-(-(-<

Re: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence
by Laurent_R (Canon) on Apr 26, 2015 at 09:18 UTC
    Please read Markup in the Monastery and reformat your posts accordingly. That will really help us to help you.

    Je suis Charlie.