how do i count the 22 selected di-peptides from a multifasta file separately for each sequence

SOMEN has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by Corion (Patriarch) on Apr 26, 2015 at 08:39 UTC
Please help us help you better! I don't know what a di-peptide is and how it appears in your program. Maybe you can show us relevant input data and the output data you get. Please also explain what you expect as output data.	[reply]
Re^2: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by SOMEN (Initiate) on Apr 26, 2015 at 09:01 UTC
Sir dipeptides are the combination of two amino acids. Generally there are 2o amino acids, so 400 combinations. Here i want to find out the count of the 22 combinations present and number of di-peptides absent in each sequence of the input (multifasta) file. For eg. Input file : >seq1, complete cds AAALVDENEC >seq2, complete cds AATLVDEGDG observed output: >seq1, complete cds sum = 4, abs = 18, >seq2, complete cds sum = 2, abs = 20, Expected output: >seq1, complete cds sum = 5, abs = 17, >seq2, complete cds sum = 3, abs = 19, I THINK THE ERROR RESULT COMES DUE TO THE WHILE LOOP WHICH FAILS TO COUNT THE OVERLAPPING DI-PEPTIDE PRESENT IN THE SEQUENCE	[reply]
Re^3: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by Corion (Patriarch) on Apr 26, 2015 at 09:46 UTC
If you want to count overlapping matches, a plain regular expression as you wrote it isn't the easiest way to approach the problem. Personally, I would simply iterate over the string either for each character or by resetting pos and using `\G` as documented in perlre: `use strict; my $line= 'AAALVDENEC'; while( $line =~ /\G.*?(AA\|AL\|DA\|DE\|DV\|VD\|DW\|QD\|SD\|HD\|ED\|DY\|VE\|EN\|EI\|KE +\|NV\|VP\|FV\|SS\|WK\|KK)/igc ) { print sprintf 'Matched [%s] at %d', $1, pos($line); pos( $line )--; }` [download]	[reply] [d/l] [select]
A reply falls below the community's threshold of quality. You may see it by logging in.
Re^3: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by Anonymous Monk on Apr 26, 2015 at 10:58 UTC
Here's one way to match overlapping patterns: `#!/usr/bin/perl # http://perlmonks.org/?node_id=1124725 use Data::Dump qw(pp); use strict; $\| = 1; my %answer; for my $line ( 'AAALVDENEC', 'AATLVDEGDG' ) { $answer{$line}{$_}++ for $line =~ /(?=(AA\|AL\|DA\|DE\|DV\|VD\|DW\|QD\|SD\|HD\|ED\|DY\|VE\|EN\|EI\|KE\|NV\|VP\|FV\|SS\|W +K\|KK))/g; } pp \%answer;` [download] which produces: `{ AAALVDENEC => { AA => 2, AL => 1, DE => 1, EN => 1, VD => 1 }, AATLVDEGDG => { AA => 1, DE => 1, VD => 1 }, }` [download] That has the 5 and 3 you are looking for.	[reply] [d/l] [select]
Re^4: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by SOMEN (Initiate) on Apr 26, 2015 at 11:16 UTC
Re^3: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by Anonymous Monk on Apr 26, 2015 at 20:48 UTC
There is an alternative to matching with overlapping patterns: `#!/usr/bin/perl # http://perlmonks.org/?node_id=1124725 use strict; my @dipeptides = split /\\|/, 'AA\|AL\|DA\|DE\|DV\|VD\|DW\|QD\|SD\|HD\|ED\|DY\|VE\|EN\|EI\|KE\|NV\|VP\|FV\|SS\|WK\|KK'; for my $line ( 'AAALVDENEC', 'AATLVDEGDG' ) { my $sum = grep $line =~ $_, @dipeptides; my $abs = @dipeptides - $sum; print "$line sum: $sum abs: $abs\n"; }` [download]	[reply] [d/l]
Re: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by kcott (Archbishop) on Apr 26, 2015 at 10:48 UTC
G'day SOMEN, Welcome to the Monastery. [Assumption: `$sum` is the global dipeptide count.] Your problem lies here: `my $sum = sum(values %{$count{$_}});` [download] You're counting the number of unique dipeptides found; not a sum of their values. Look at how you populate `%count`: `$count{$seq}{$1}++` [download] So, `%count` has a series of keys representing sequences (`$seq`). Each of those has a series of keys representing a dipeptide (`$_`). When you access `%count`, i.e. with `$count{$_}`, you're only drilling down to the sequence. You need to go one more level to access the dipeptides. -- Ken	[reply] [d/l] [select]
Re: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by hdb (Monsignor) on Apr 26, 2015 at 14:11 UTC
Why bother searching for specific pairs of letters? Just count them all, there are not so many of them, and then pick the counts you are interested in: `use strict; use warnings; my $line= 'AAALVDENEC'; my %counts; $counts{substr $line, $_, 2}++ for 0..length($line)-2; my @wanted = qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN EI KE NV VP +FV SS WK KK); for (@wanted) { print "$_ => $counts{$_}\n" if defined $counts{$_}; }` [download]	[reply] [d/l]
Re: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by AnomalousMonk (Archbishop) on Apr 26, 2015 at 14:54 UTC
It seems to me that the count of individual di-peptides in each sequence does not matter, but only the number of unique (overlapping) di-peptides found. If I understand this correctly, here's another approach, including some approaches from other replies (with some debug statements): c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dump; ;; use constant DIPEPS => qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN + EI KE NV VP FV SS WK KK); use constant N_DIPEPS => scalar DIPEPS; ;; my ($dipep) = map qr{ (?i) (?: $_) }xms, join '\|', DIPEPS ; print $dipep; ;; my %count; for my $seq (qw(AAALVDENEC AATLVDEGDG)) { $count{$seq}{$1} = 1 while $seq =~ m{ (?= ($dipep)) }xmsg; } dd \%count; ;; for my $seq (keys %count) { my $sum = keys %{ $count{$seq} }; my $abs = N_DIPEPS - $sum; print qq{$seq: sum = $sum, abs = $abs}; } " (?^msx: (?i) (?: AA\|AL\|DA\|DE\|DV\|VD\|DW\|QD\|SD\|HD\|ED\|DY\|VE\|EN\|EI\|KE\|NV\|VP +\|FV\|SS\|WK\|KK) ) { AAALVDENEC => { AA => 1, AL => 1, DE => 1, EN => 1, VD => 1 }, AATLVDEGDG => { AA => 1, DE => 1, VD => 1 }, } AAALVDENEC: sum = 5, abs = 17 AATLVDEGDG: sum = 3, abs = 19 [download] Update: And another approach, based on the same (mis?)understanding that only the number of unique di-peptides matter and not their individual counts: c:\@Work\Perl\monks>perl -wMstrict -le "use List::MoreUtils qw(uniq); use Data::Dump; ;; use constant DIPEPS => qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN + EI KE NV VP FV SS WK KK); use constant N_DIPEPS => scalar DIPEPS; ;; my ($dipep) = map qr{ (?i) (?: $_) }xms, join '\|', DIPEPS ; ;; my %count; for my $seq (qw(AAALVDENEC AATLVDEGDG)) { $count{$seq} = uniq $seq =~ m{ (?= ($dipep)) }xmsg; } dd \%count; ;; for my $seq (keys %count) { my $sum = $count{$seq}; my $abs = N_DIPEPS - $sum; print qq{$seq: sum = $sum, abs = $abs}; } " { AAALVDENEC => 5, AATLVDEGDG => 3 } AAALVDENEC: sum = 5, abs = 17 AATLVDEGDG: sum = 3, abs = 19 [download] Give a man a fish: `<%-(-(-(-<`	[reply] [d/l] [select]
Re: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by Laurent_R (Canon) on Apr 26, 2015 at 09:18 UTC
Please read Markup in the Monastery and reformat your posts accordingly. That will really help us to help you. Je suis Charlie.	[reply]