Re: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence

Re^2: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by SOMEN (Initiate) on Apr 26, 2015 at 09:01 UTC
Sir, dipeptides are combinations of two amino acids. There are generally 20 amino acids, so 400 combinations. Here I want to find, for each sequence in the input (multifasta) file, how many of the 22 selected dipeptides are present and how many are absent. For example:

Input file:
>seq1, complete cds
AAALVDENEC
>seq2, complete cds
AATLVDEGDG

Observed output:
>seq1, complete cds sum = 4, abs = 18,
>seq2, complete cds sum = 2, abs = 20,

Expected output:
>seq1, complete cds sum = 5, abs = 17,
>seq2, complete cds sum = 3, abs = 19,

I think the wrong result comes from the while loop, which fails to count the overlapping dipeptides present in the sequence.
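The script that produced the observed output is not shown in this reply, so the following is only a sketch of the suspected cause, assuming the loop was a plain `/g` scan over an alternation of the 22 dipeptides. The helper name `distinct_non_overlapping` is made up for illustration. A `/g` scan resumes after the end of each match, so a dipeptide that starts inside the previous match (such as DE right after VD) is never seen.

```perl
use strict;
use warnings;

# The 22 dipeptides of interest, as listed in the replies to this node.
my @dipeptides = qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN EI KE NV VP FV SS WK KK);
my $alt = join '|', @dipeptides;

# Hypothetical helper: number of distinct dipeptides found by a plain /g
# scan. Each new search starts where the previous match ended, so
# overlapping dipeptides are skipped.
sub distinct_non_overlapping {
    my ($seq) = @_;
    my %seen;
    $seen{$1}++ while $seq =~ /($alt)/g;
    return scalar keys %seen;
}

for my $seq (qw(AAALVDENEC AATLVDEGDG)) {
    printf "%s: %d distinct dipeptides found without overlap\n",
        $seq, distinct_non_overlapping($seq);
}
```

Under that assumption the sketch reports 4 and 2, the same as the observed sums, which is consistent with the overlap explanation.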
Re^3: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by Corion (Patriarch) on Apr 26, 2015 at 09:46 UTC
If you want to count overlapping matches, a plain regular expression as you wrote it isn't the easiest way to approach the problem. Personally, I would simply iterate over the string, either character by character or by resetting pos and using `\G` as documented in perlre:

```perl
use strict;
use warnings;

my $line = 'AAALVDENEC';
while ( $line =~ /\G.*?(AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|NV|VP|FV|SS|WK|KK)/igc ) {
    printf "Matched [%s] at %d\n", $1, pos($line);
    pos($line)--;    # step back one character so the next match may overlap
}
```
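The reply mentions iterating over the string character by character but only shows the `\G` variant. A minimal sketch of the first option, assuming a two-character window taken with `substr` and a hash lookup, could look like this. The helper name `scan_windows` and the `%wanted` hash are illustrative, not part of the original post; `uc` stands in for the `/i` flag used above.

```perl
use strict;
use warnings;

# Lookup table of the 22 dipeptides of interest.
my %wanted = map { $_ => 1 }
    qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN EI KE NV VP FV SS WK KK);

# Hypothetical helper: slide a two-character window over the sequence and
# return a [dipeptide, offset] pair for every window that is in %wanted.
sub scan_windows {
    my ($seq) = @_;
    my @hits;
    for my $i ( 0 .. length($seq) - 2 ) {
        my $pair = uc substr( $seq, $i, 2 );
        push @hits, [ $pair, $i ] if $wanted{$pair};
    }
    return @hits;
}

printf "Matched [%s] at offset %d\n", @$_ for scan_windows('AAALVDENEC');
```

Because every offset is visited, overlapping dipeptides are reported without any manipulation of `pos`. Note that the offsets here are start positions, whereas `pos` in the `\G` version reports the end of each match.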
Re^3: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by Anonymous Monk on Apr 26, 2015 at 10:58 UTC
Here's one way to match overlapping patterns:

```perl
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1124725
use Data::Dump qw(pp);
use strict;

$| = 1;

my %answer;
for my $line ( 'AAALVDENEC', 'AATLVDEGDG' ) {
    $answer{$line}{$_}++
        for $line =~ /(?=(AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|NV|VP|FV|SS|WK|KK))/g;
}
pp \%answer;
```

which produces:

```
{
  AAALVDENEC => { AA => 2, AL => 1, DE => 1, EN => 1, VD => 1 },
  AATLVDEGDG => { AA => 1, DE => 1, VD => 1 },
}
```

That has the 5 and 3 you are looking for.
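To get from those per-dipeptide counts to the `sum` and `abs` figures in the question, one option is to count the keys of the per-sequence hash. This is a sketch only; the helper name `summarise` is made up here, and it assumes `sum` means the number of distinct dipeptides present, which is what the expected values 5 and 3 suggest.

```perl
use strict;
use warnings;

my @dipeptides = qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN EI KE NV VP FV SS WK KK);
my $alt = join '|', @dipeptides;

# Hypothetical helper: returns (sum, abs, \%count) for one sequence, where
# sum is the number of distinct dipeptides present and abs the number absent.
sub summarise {
    my ($seq) = @_;
    my %count;
    # The zero-width lookahead lets consecutive matches overlap.
    $count{$_}++ for $seq =~ /(?=($alt))/g;
    my $sum = keys %count;            # keys in scalar context gives the count
    my $abs = @dipeptides - $sum;
    return ( $sum, $abs, \%count );
}

for my $seq (qw(AAALVDENEC AATLVDEGDG)) {
    my ( $sum, $abs ) = summarise($seq);
    print "$seq sum = $sum, abs = $abs\n";
}
```

For the two example sequences this prints `sum = 5, abs = 17` and `sum = 3, abs = 19`. The returned hash reference still carries the individual occurrence counts (for example AA twice in the first sequence) in case they are needed later.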
Re^4: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by SOMEN (Initiate) on Apr 26, 2015 at 11:16 UTC
Thank you, it worked. Thanks for the help.
Re^3: how do i count the 22 selected di-peptides from a multifasta file separately for each sequence by Anonymous Monk on Apr 26, 2015 at 20:48 UTC
There is an alternative to matching with overlapping patterns:

```perl
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1124725
use strict;

my @dipeptides = split /\|/,
    'AA|AL|DA|DE|DV|VD|DW|QD|SD|HD|ED|DY|VE|EN|EI|KE|NV|VP|FV|SS|WK|KK';

for my $line ( 'AAALVDENEC', 'AATLVDEGDG' ) {
    my $sum = grep $line =~ $_, @dipeptides;
    my $abs = @dipeptides - $sum;
    print "$line sum: $sum abs: $abs\n";
}
```
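The examples in this thread use hard-coded sequences, while the question is about a multifasta file. A hedged sketch of applying the `grep` idea per record might look like the following. The helper name `tally_fasta` is hypothetical, `index` is swapped in for the regex match (the dipeptides are literal strings), and the code assumes header lines start with `>` and that a sequence may be wrapped over several lines.

```perl
use strict;
use warnings;

my @dipeptides = qw(AA AL DA DE DV VD DW QD SD HD ED DY VE EN EI KE NV VP FV SS WK KK);

# Hypothetical helper: read FASTA records from a filehandle and return one
# [header, sum, abs] triple per record.
sub tally_fasta {
    my ($fh) = @_;
    my ( @records, $header, $seq );
    my $flush = sub {
        return unless defined $header;
        # grep in scalar context counts the dipeptides present at least once.
        my $sum = grep { index( $seq, $_ ) >= 0 } @dipeptides;
        push @records, [ $header, $sum, @dipeptides - $sum ];
    };
    while ( my $line = <$fh> ) {
        $line =~ s/\s+\z//;              # strip newline and any trailing CR
        next unless length $line;
        if ( $line =~ /^>/ ) {
            $flush->();                  # finish the previous record
            ( $header, $seq ) = ( $line, '' );
        }
        elsif ( defined $header ) {
            $seq .= uc $line;            # join wrapped sequence lines
        }
    }
    $flush->();                          # finish the last record
    return @records;
}

# Demo on an in-memory file; seq1 is wrapped so that VD straddles a line break.
my $fasta = ">seq1, complete cds\nAAALV\nDENEC\n>seq2, complete cds\nAATLVDEGDG\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory handle: $!";
printf "%s sum = %d, abs = %d\n", @$_ for tally_fasta($fh);
close $fh;
```

Joining the lines of a record before testing matters: if each line were checked on its own, a dipeptide split across a line break would be missed. To read a real file, open it by name instead of the in-memory string.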